Hive Data Storage

Hive provides several data storage components, which are explained below.

1. Metastore

* It is the central repository (the location where data is stored and managed) for Hive metadata.

* It stores the metadata of Hive tables, such as their schema and location.

* The Hive Metastore keeps track of the metadata for databases, tables, columns, data types, and so on.

* It also keeps track of the HDFS mapping, i.e., where each table's data is stored in HDFS.

* Clients can access schema and location information through the metastore service API.
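
As a rough illustration of the points above, the following Python sketch models a metastore as an in-memory lookup from table name to schema and HDFS location. The class names, the `sales` table, and the path are hypothetical; this is a toy model and not the Hive Metastore API.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical record of what a metastore tracks for one table."""
    database: str
    name: str
    columns: dict   # column name -> data type
    location: str   # HDFS directory that holds the table's data


@dataclass
class ToyMetastore:
    """Toy in-memory stand-in for a metastore (not the real Hive API)."""
    tables: dict = field(default_factory=dict)

    def register(self, meta: TableMetadata) -> None:
        # Store the metadata under a (database, table) key.
        self.tables[(meta.database, meta.name)] = meta

    def get_table(self, database: str, name: str) -> TableMetadata:
        # Look up schema and location by database and table name.
        return self.tables[(database, name)]


metastore = ToyMetastore()
metastore.register(TableMetadata(
    database="default",
    name="sales",
    columns={"id": "int", "city": "string", "amount": "double"},
    location="hdfs:///user/hive/warehouse/sales",  # assumed example path
))

# A client asks for schema and location information by table name.
meta = metastore.get_table("default", "sales")
print(meta.columns)
print(meta.location)
```

Note that only metadata lives in this structure; the table's rows stay in the directory that `location` points to.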

2. Tables

* Hive tables are similar to the tables in a relational database.

* A Hive table consists of logically related data, together with metadata that describes the layout of that data.

* Filter, project, join, and union operations can be performed on tables.

* Hive has two types of tables:

  1. Internal table (Managed table)
  2. External table

Internal table
  • It is also called a managed table.
  • When an internal table is created, its data is stored by default in the warehouse directory (internal tables are covered briefly in the Hive commands topic).
  • Dropping an internal table also deletes its data, so there is a risk of data loss.
  • The data is not meant to be used outside Hive (no global usage).

External table
  • When an external table is created, its data is stored outside the warehouse directory (external tables are covered briefly in the Hive commands topic).
  • Dropping an external table does not delete its data, so there is no data loss problem.
  • The data can also be used outside Hive (global usage).

Note: The key difference between the two table types is what happens when the table is deleted. When an external table is dropped, only its metadata is removed and its data remains in HDFS. When an internal table is dropped, its data is deleted along with the table.
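
The drop behaviour described in the note can be sketched with a small Python simulation. The `catalog` dictionary, the `drop_table` helper, and the directory names are all hypothetical; local temporary directories stand in for HDFS, and this is a toy model rather than Hive's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(catalog: dict, name: str) -> None:
    """Toy model of dropping a table (not Hive's implementation)."""
    entry = catalog.pop(name)           # metadata is removed for both table types
    if entry["type"] == "managed":      # only managed (internal) tables lose their data
        shutil.rmtree(entry["location"])


root = Path(tempfile.mkdtemp())
warehouse_dir = root / "warehouse" / "orders_managed"   # inside the warehouse directory
external_dir = root / "shared" / "orders_external"      # outside the warehouse directory
for d in (warehouse_dir, external_dir):
    d.mkdir(parents=True)
    (d / "part-00000").write_text("1,Pune,250.0\n")

catalog = {
    "orders_managed": {"type": "managed", "location": warehouse_dir},
    "orders_external": {"type": "external", "location": external_dir},
}

drop_table(catalog, "orders_managed")
drop_table(catalog, "orders_external")

managed_data_exists = warehouse_dir.exists()
external_data_exists = external_dir.exists()
print(managed_data_exists)    # False: the data was deleted with the table
print(external_data_exists)   # True: the data survives the drop

shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the catalog is empty, but only the external table's files are still on disk.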

3. Partitions

* It is a method of separating a table into related parts based on the values of columns such as date, city, and department.

* The data for each partition is stored in a separate subdirectory, as shown in the diagram below.

* It improves query performance, because a query that filters on a partition column reads only the matching partitions.

* Partitioned tables are easy to query.
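
The subdirectory layout can be sketched in Python as follows. Hive names each partition directory `column=value`; the table location, the sample rows, and the helper names here are assumptions made for the example.

```python
from collections import defaultdict

rows = [
    {"id": 1, "amount": 250.0, "city": "Pune"},
    {"id": 2, "amount": 120.0, "city": "Delhi"},
    {"id": 3, "amount": 90.0, "city": "Pune"},
]

TABLE_DIR = "/user/hive/warehouse/sales"  # assumed example location


def partition_path(table_dir: str, column: str, value: str) -> str:
    """Build the column=value subdirectory used for one partition."""
    return f"{table_dir}/{column}={value}"


# Group rows by the partition column; each group maps to its own subdirectory.
partitions = defaultdict(list)
for row in rows:
    partitions[partition_path(TABLE_DIR, "city", row["city"])].append(row["id"])

for path, ids in sorted(partitions.items()):
    print(path, ids)


def paths_to_scan(city: str) -> list:
    """A filter on the partition column needs only the matching subdirectory."""
    wanted = partition_path(TABLE_DIR, "city", city)
    return [p for p in partitions if p == wanted]


print(paths_to_scan("Pune"))
```

A query that filters on `city` only has to read one of the two subdirectories, which is where the performance gain comes from.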

4. Buckets

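* Partitioning divides a table into a number of partitions, and these partitions can be further subdivided into more manageable parts known as buckets.

* Bucketing speeds up joins and data sampling.

* Bucketing is based on a hash function: a row's bucket is determined by the hash of the bucketing column value modulo the number of buckets.

* It supports testing and debugging on huge data sets, since a query can be tried on a sample of the data instead of the full table.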
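
The hash-based assignment can be sketched in Python as follows. The bucket count, the sample IDs, and the `bucket_for` helper are assumptions for the example. The sketch treats the hash of an integer key as the value itself, and the hash function used by a given Hive version may differ.

```python
NUM_BUCKETS = 4  # assumed bucket count for the example


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a row to a bucket as hash(key) mod num_buckets.

    Assumption: for integer keys the hash is taken to be the value itself.
    The mask keeps the result non-negative. Real Hive versions may use a
    different hash function.
    """
    return (key & 0x7FFFFFFF) % num_buckets


user_ids = [101, 102, 103, 104, 105, 106, 107, 108]

# Place every row into the bucket chosen by the hash of its key.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for uid in user_ids:
    buckets[bucket_for(uid)].append(uid)

for b, ids in buckets.items():
    print(b, ids)

# Sampling: reading one bucket touches roughly 1/NUM_BUCKETS of the rows.
sample = buckets[1]
print(sample)
```

Because the same key always lands in the same bucket, two tables bucketed on the same column with the same bucket count can be matched bucket by bucket, which is what makes joins cheaper.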


In summary, Hive uses the metastore, tables, partitions, and buckets to store and organize data.