Big Data Technologies

List of major frameworks available for Big Data

Apache Hadoop

Apache Hive

Apache Pig

Apache HBase

Apache Sqoop

Apache Flume

Apache Storm

Apache Spark

Apache Flink

Apache Mahout

Apache Oozie

Apache Zookeeper

Apache Ambari


Hadoop     (Open source)


  1. Hadoop is an open-source Big Data platform used for storing data in a distributed environment and for processing very large data sets.
  2. Hadoop's processing model is based on MapReduce.
  3. A MapReduce job usually splits the input data set into independent chunks, which mapper tasks process in parallel on different machines. After the mappers have processed their chunks, the Hadoop framework sorts the mapper outputs and passes them as input to the reducers, which generate the final output.
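
The flow described in point 3 can be sketched in plain Python. This is only an illustration of the map, sort and reduce steps on a word-count task; it does not use Hadoop, and the function names (mapper, shuffle_and_sort, reducer, run_job) and the sample chunks are assumptions made for this example.

```python
from collections import defaultdict


def mapper(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_outputs):
    """Group the mapper outputs by key and sort them by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(chunks):
    # In Hadoop the mappers would run in parallel on different machines;
    # here they simply run one after another in a single process.
    mapped = [mapper(chunk) for chunk in chunks]
    grouped = shuffle_and_sort(mapped)
    return dict(reducer(key, values) for key, values in grouped)


# Hypothetical input, already split into independent chunks.
chunks = ["big data big", "data tools", "big tools"]
result = run_job(chunks)
print(result)  # {'big': 3, 'data': 2, 'tools': 2}
```

Each chunk is mapped independently, which is what lets a real cluster spread the work across machines; only the grouping step needs to see the output of every mapper.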

Hive   (Open source)

  1. Hive is built on top of Apache Hadoop.
  2. Hive was developed for SQL developers, so that they can analyse data stored in Hadoop.
  3. Analysis is performed using SQL-like queries, so working with Hive is easier for anyone with prior knowledge of SQL.
  4. Hive can access data either directly in HDFS (the Hadoop Distributed File System, which is part of a Hadoop cluster) or in other storage systems such as Apache HBase.
  5. Hive can be used for data warehousing tasks such as data analysis, data reporting and ETL (Extract/Transform/Load).
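
To give an idea of the kind of SQL analysis meant here, the sketch below runs a simple aggregation query. It uses Python's built-in sqlite3 module purely as a stand-in so that the example is runnable without a cluster; it is not Hive, and the table name page_views and its rows are made up for illustration. Against Hive, a query of a similar shape would be submitted through a Hive client instead.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 10), ("about", 3), ("home", 5)],  # hypothetical sample data
)

# A typical reporting query: total visits per page, highest first.
rows = conn.execute(
    "SELECT page, SUM(visits) AS total "
    "FROM page_views GROUP BY page ORDER BY total DESC"
).fetchall()
conn.close()

print(rows)  # [('home', 15), ('about', 3)]
```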


Pig   (Open source)

  1. Pig runs on top of Apache Hadoop.
  2. Pig has its own high-level scripting language, called Pig Latin. It is a simple language that is somewhat similar to SQL.
  3. Pig scripts are translated internally into MapReduce jobs, and those jobs run on the Apache Hadoop cluster.
  4. Pig is easy to work with for developers who are already familiar with scripting languages.


HBase     (Open source)

  1. HBase is a non-relational (NoSQL) database that runs on top of HDFS (Hadoop Distributed File System).
  2. HBase provides real-time read/write access to large data sets.
  3. HBase can handle huge data sets with billions of rows and columns, and it can combine data with a wide variety of structures and schemas for analysis.


Sqoop    (Open Source)

  1. Sqoop is an application for transferring data between Hadoop and relational databases.
  2. Sqoop is used for both import and export of data.
  3. Data can be exported from Hadoop to an RDBMS (such as Oracle or other SQL databases) using Sqoop.
  4. Data can be imported into Hadoop from an RDBMS (such as Oracle or other SQL databases) using Sqoop.


Flume    (Open Source)

  1. Flume provides a service for efficiently collecting, aggregating and moving large amounts of log data.
  2. It is useful for log analysis: Flume continuously picks up log data from a log folder and puts it into an HDFS directory for analysis.


Apache Storm (Open Source)

  1. Apache Storm is mainly used for real-time processing.
  2. It can be used with any programming language.
  3. It is extremely fast: it can process millions of records per second on one node.
  4. Real-time analytics example: suppose you are a new customer who visits a site frequently. The site can show you different offers in real time.


Apache Oozie (Open Source)

  1. It is used to define the workflow of jobs in the Hadoop environment.
  2. In other words, if you have multiple jobs, they can be pipelined in a desired order to run in the Hadoop environment.
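
The idea of pipelining jobs in a desired order can be sketched as a small dependency graph. This is not Oozie's own format (Oozie workflows are defined in XML); it is a minimal Python illustration, and the job names and the execution_order helper are hypothetical. It relies on graphlib, which is available in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs that must finish first.
workflow = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "aggregate": {"clean_data"},
    "export_report": {"aggregate"},
}


def execution_order(jobs):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(workflow)
print(order)  # ['import_data', 'clean_data', 'aggregate', 'export_report']
```

A workflow scheduler does essentially this ordering and then launches each job once the jobs it depends on have completed.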


Apache Ambari (Open Source)

  1. It monitors the Hadoop cluster.
  2. It manages the Hadoop cluster.
  3. It provides provisioning and security for the Hadoop cluster.
