Hive ETL (Extract, Transform, Load)

Let us study Hive as an ETL (Extract, Transform, Load) tool.

Introduction

* Apache Hive is an ETL tool built on top of the Hadoop ecosystem.

* It provides functionalities such as:

  1. Data modeling (representing a company's data structures as tables in a database)
  2. Data manipulation (inserting, deleting, and updating data in a database)
  3. Data processing (a sequence of operations performed to convert raw data into a usable form)
  4. Data querying (the process of asking questions of data in search of a specific answer).

* Creating tables, loading structured and semi-structured data into them, and querying that data based on the requirements is referred to as data extraction.

1. Working on Structured data

Structured data is data stored in a tabular format, i.e. rows and columns, similar to data in an RDBMS. Example: delimited text data.

Example

Steps to process structured data in Hive

  1. Create a table called employee with the columns name, id, age and salary.
  2. Load the data from emp.txt into the employee table.
  3. Display the content stored in the employee table using the select command.
Queries

hive> create table employee (name string, id int, age int, salary float)
    > row format delimited
    > fields terminated by ',';

hive> load data local inpath '/user/beyondCorner/emp.txt' into table employee;

hive> select * from employee;

Note: In this example the text file may contain CSV (comma-separated values) or TSV (tab-separated values) data; the delimiter in the fields terminated by clause must match the one used in the file.
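As a sketch of querying the loaded table, assume emp.txt contains comma-separated rows like those shown in the comments below (the sample names and the salary threshold are illustrative assumptions, not part of the original example):

```sql
-- Hypothetical contents of emp.txt (comma-separated):
--   John,1,30,45000.0
--   Asha,2,28,52000.0

-- Project specific columns and filter rows, just as in an RDBMS:
select name, salary
from employee
where salary > 50000;

-- Aggregate over the structured data:
select avg(salary) as avg_salary
from employee;
```

Once the data is loaded, any standard HiveQL projection, filter, or aggregation works on it without further parsing.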

2. Working on Semi-Structured data

It is similar to structured data, but it does not conform to the formal structure of tables. Instead, it contains tags or markers to separate semantic elements and to enforce hierarchies of records and fields within the data.

Examples: XML, JSON.

Example

Steps to process XML (extensible markup language) data in Hive

  1. Create a table called xmlemply with a single string column to hold each raw XML record.
  2. Load the data from emp.xml into the xmlemply table.
  3. Display the raw XML data stored in the xmlemply table using the select command.
Queries

hive> create table xmlemply (str string);

hive> load data local inpath '/user/beyondCorner/emp.xml' overwrite into table xmlemply;

hive> select * from xmlemply;
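To go beyond displaying the raw strings, Hive provides built-in xpath UDFs (xpath_string, xpath_int, xpath_double, etc.) that extract individual fields from each record. A minimal sketch, assuming each row of xmlemply holds one <emp> element with <name>, <id>, <age> and <salary> children (this record layout is an assumption, not given in the source):

```sql
-- Assumed shape of each row:
--   <emp><name>..</name><id>..</id><age>..</age><salary>..</salary></emp>
select
  xpath_string(str, 'emp/name')   as name,
  xpath_int(str, 'emp/id')        as id,
  xpath_int(str, 'emp/age')       as age,
  xpath_double(str, 'emp/salary') as salary
from xmlemply;
```

The XPath expressions must be adjusted to match the actual element names in emp.xml.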

Steps to process JSON (JavaScript Object Notation) in Hive.

  1. Create a table called json_emply with a single string column to hold each raw JSON record.
  2. Load the data from emp.json into the json_emply table.
  3. Display the raw JSON records stored in the json_emply table using the select command.
Queries

hive> create table json_emply (str string);

hive> load data inpath '/user/beyondCorner/emp.json' into table json_emply;

hive> select * from json_emply;
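Individual JSON fields can then be pulled out with Hive's built-in get_json_object UDF. A minimal sketch, assuming each row holds one JSON object with name, id, age and salary keys (an assumed layout, not given in the source):

```sql
-- Assumed shape of each row:
--   {"name": "John", "id": 1, "age": 30, "salary": 45000.0}
select
  get_json_object(str, '$.name')   as name,
  get_json_object(str, '$.id')     as id,
  get_json_object(str, '$.age')    as age,
  get_json_object(str, '$.salary') as salary
from json_emply;
```

For extracting several keys from the same record, Hive's json_tuple UDF avoids parsing the string once per field.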

Note: Data from Twitter and many websites is delivered in JSON format.

Conclusion

From the Hive ETL topic we can conclude that Hive, as an ETL tool, can be used to process both structured and semi-structured data in big data environments.