Hive View and Index

Let us discuss Hive View and Index.

Hive View

Objective

The main objective of creating hive view is to simplify the complexities of a larger table into a more Flat structure. For example, if you have a table that has 100 columns, but you are only interested in 10 columns, you could create a View with those 10 columns.

Features

  • Hive Views are similar to tables or “copy” of the table.
  • Hive Views are generated based on the user requirements.
  • It is similar to SQL view.
  • View Support All DML operations.
  • Subsequent changes in the below table will not be reflected in the View, and if the table is dropped, the view will fail.

Advantage       

  • Hive View simplifying the query.
  • It can be used as staging tables (database tables containing business data) to process more complex processes.

Disadvantage             

  • It does not support for materialized views (it is a database object that contains the results of a query).
  • Views support read-only and cannot use these commands LOAD/INSERT/ALTER directly for changing metadata.

Examples          

Let us take an example for view. Consider an employee table as given below, with the columns Id, Name, Salary, and Dept.

IdNameSalaryDept
101Rosy30000Revenue
102Leela23000Medical
103Sitha25000Teaching
104Ravi35000Admin
105Meena20000Tax

Create View

In this example, we are creating view called “emp_25000” where it will display all the row values with salary field greater than 25000.

hive> create view emp_25000 AS

>select * from employee

>where salary>25000;

Display the View

hive>select * from emp_25000;

Dropping View      

This quiery drops a view named called as “emp_25000”

hive> drop view emp_25000;

Index

  • Hive Index is nothing but a pointer on a particular column of a table.
  • The user has to manually define the index.
  • Indexes maintain the reference of the records. So that it is easy to search for a record with minimum overhead.
  • Indexes also speed up the searching of data.

Need of indexing in Hive

Apache Hive is a data warehousing tool build on the top of Hadoop, it provides the SQL kind of interface to perform queries on large data sets. Since Hive deals with Big Data, the size of files is very large and can expand up to Terabytes and Petabytes. Now if we want to perform any operation or a query on this huge amount of data it will take more time.

In a Hive table, there are many numbers of rows and columns. If we want to perform queries only on some columns without indexing, it will take more time because queries will be executed on all the columns present in the table.

Advantage

  • Hive index supports partition, depending on the size of the data.
  • It simply the searching of data in hive query i.e whenever we perform a query on a table that has an index, there is no need to scan all the rows in the table. Further, it checks the index first and then goes to the particular column and performs the operation.

Application

  • Mainly used when dataset is very large.
  • It is used for building a data model.

Example

Let’s take an example for index. Use the above employee table with the columns Id, Name, Salary, and Dept. Create an index named “indx_salary” on the salary column of the employee table.

Create index

The following query creates an index

hive> create index indx_salary

>on table employee(salary)

> AS ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’

> WITH DEFERRED REBUILD;

Here indx_salary is a pointer to the salary column. If the column is modified then changes are stored using an index value.

Note:
  1. The ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ means built in CompactIndexHandler helps to create an index, i.e creating a compact index for the table.
  2. ‘WITH DEFERRED REBUILD’ is a keyword statement used in hive index table, to alter the created index table in later stages using this statement i.e (rebuilds a previously created index on the table)
  3. Deferred Rebuild creates a directory under /user/hive/warehouse, instead of own user ID. When rebuilding index, we can’t drop index table storage under /user/hive/warehouse due to its sticky bit setting.
  4. It is also used to create empty index i.e (without loading any values).

 Alter an index

hive>alter index indx_salary on employee REBUILD;

Display an index              

Using below command we can check the available indexes for the table “employee”

hive> show formatted index on employee;

Dropping an Index

The following query used to drops an index named “index_salary”

hive> drop index index_salary ON employee;

References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView