Hive Partition and Bucketing with example

Let us discuss about Hive partition and bucketing.

Hive Partition

  • It is a way of dividing a table into related parts based on the values of partitioned columns. example date, city and department.
  • It is helpful when the table has one or more Partition keys.
  • Hive partition is a sub-directory in the table directory.
  • It is a basic unit of data storage method used in Apache hive (explained in the previous article).

Advantages      

  • It reduces the query latency (delay) by scanning only relevant partitioned data instead of the whole data set. i.e No need to search entire table columns for a single record.
  • It distributes execution load horizontally.
  • Hive partition reduces the query processing time on the partitioned data set.
  • It is suitable for low volume of data. e.g. Calculate the population of “Bangalore city” is very fast instead of searching entire population of world.

Disadvantages

  • Partition is effective for low volume data. But there some queries like group by on high volume of data take a long time to execute. For example, grouping population of India will take a long time as compared to a grouping of the population in Goa.
  • It results in too many directories. This is extra burden for namenode metadata.
  • It is not suitable for high volume of data, it take long time to execute. e.g. Grouping of population of China will take long time compared to grouping of population in Vatican city.
  • We cannot guarantee for query optimization for all the times.

Example

    Consider an Indian company ABC, which is having its branches in different cities (more then 10 cities) of India. Particularly we need to calculate number of employees working in Bangalore, in this case we can use hive partition.

Here we take city column as partition key and perform partitions on that ABC Indian company data as a whole, we can get number of employees working in ABC Bangalore.

 Table Creation

hive> create table ABC_emp(id int,name string,city string)

> row format delimited

>fields terminated by ‘,’;

 Loading data into ABC_emp table
hive> Load data local inpath ‘ABC_emp.csv’ into table ABC_emp;
 Creation of partition table
hive> create table ABC_part(id int,name string)

>partitioned by (city string);

 For partition we have to set this property
set hive.exec.dynamic.partition.mode=nonstrict
Loading data into partition table
hive>insert overwrite table ABC_part partition (city)

select id,name,city from ABC_emp;

Queries

Renaming a Partition

Here we are renaming the partition column.

hive> alter table ABC_part partition (city)

> rename to partition (city1);

Dropping a Partition

Here we are dropping partition column.

hive> alter table ABC_part DROP [IF EXISTS]

> partition (city1);

Bucketing

  • Hive bucketing is a method for dividing the data into number of equal parts.
  • CLUSTERED BY command is used While creating bucketing in hive.
  • Hive bucketing concept can be performed on internal tables or External tables.
  • Hive bucketing can perform only on one column to get best result.
  • In hive command line, below command is used to enable buckets.
set.hive.enforce.bucketing=true;

Advantages

  • Hive bucketing is used for efficient querying.
  • It is used in Map side joining because data is equally distributed.

Disadvantage

  • Loading of equal volume of data has to be done by programmer manually.

Example

Creating Bucket
hive>create table empbucket(name string,id int,city string,salary string)

> clustred by (city) into 4 buckets

> row format delimited

>fields terminated by ‘,’;

Loading Data

From the existing table employee we are loading data to the empbucket.

Note: Existing table employee should contain similar columns.

hive> from employee

> insert overwrite table empbucket

> select name,id,city,salary;

Display

Here we are displaying the four buckets.

$ hadoop fs –ls /user/hive/

References

https://cwiki.apache.org