Hive ETL (Extract, Transform, Load)
Let us study Hive as an ETL (Extract, Transform, Load) tool.
* Apache Hive is an ETL tool built on top of the Hadoop ecosystem.
* It provides functionality such as:
- Data modeling (representing a company's data structures as tables in a database)
- Data manipulation (inserting, updating, and deleting data in a database)
- Data processing (a sequence of operations performed to convert raw data into a usable form)
- Data querying (the process of asking questions of data in search of a specific answer).
* Creating tables, loading structured and semi-structured data, and querying that data based on the requirements is collectively referred to as data extraction.
1. Working on Structured data
Structured data is data stored in a tabular format, i.e. rows and columns, similar to RDBMS data. Example: delimited text data.
Steps to process structured data in Hive
- Create a table called employee with the columns name, id, age, and salary.
- Load the data from emp.txt into the employee table.
- Display the contents of the employee table using the select command.
hive> create table employee (name string, id int, age int, salary float)
    > row format delimited
    > fields terminated by ',';
hive> load data local inpath '/user/beyondCorner/emp.txt' into table employee;
hive> select * from employee;
Note: In this example the text file may contain CSV (comma-separated values) or TSV (tab-separated values) data; the delimiter in fields terminated by must match the file's separator (',' for CSV, '\t' for TSV).
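Once the data is loaded, the employee table can serve ad hoc queries as well. As a sketch (the salary threshold of 50000 here is a hypothetical value, not part of the original example):

hive> select name, salary
    > from employee
    > where salary > 50000
    > order by salary desc;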
2. Working on Semi-Structured data
Semi-structured data is similar to structured data, but it lacks a formal table structure; instead, it contains tags or markers that separate semantic elements and enforce hierarchies of records and fields within the data.
Examples: XML, JSON.
Steps to process XML (extensible markup language) data in Hive
- Create a table called xmlemply with a single string column to hold the raw XML.
- Load the data from emp.xml into the xmlemply table.
- Display the raw XML data stored in the xmlemply table using the select command.
hive> create table xmlemply (str string);
hive> load data local inpath '/user/beyondCorner/emp.xml' overwrite into table xmlemply;
hive> select * from xmlemply;
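The raw XML strings can then be parsed inside Hive using its built-in xpath UDFs. The sketch below assumes each line of emp.xml holds one complete <employee> record with <name>, <id>, <age>, and <salary> child elements:

hive> select xpath_string(str, 'employee/name') as name,
    >        xpath_int(str, 'employee/id') as id,
    >        xpath_int(str, 'employee/age') as age,
    >        xpath_float(str, 'employee/salary') as salary
    > from xmlemply;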
- Create a table called json_emply with a single string column to hold the raw JSON.
- Load the data from emp.json into the json_emply table.
- Display the raw JSON data stored in the json_emply table using the select command.
hive> create table json_emply (str string);
hive> load data local inpath '/user/beyondCorner/emp.json' into table json_emply;
hive> select * from json_emply;
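Individual fields can be extracted from the raw JSON strings with Hive's built-in get_json_object function. This sketch assumes each line of emp.json is a JSON object with name, id, age, and salary keys:

hive> select get_json_object(str, '$.name') as name,
    >        get_json_object(str, '$.id') as id,
    >        get_json_object(str, '$.age') as age,
    >        get_json_object(str, '$.salary') as salary
    > from json_emply;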
Note: Data from Twitter and many other web services is delivered in JSON format.
From this Hive ETL topic we can conclude that Hive can be used as an ETL tool to process both structured and semi-structured data in big data environments.