HBase Data Flow Mechanism Architecture

Let’s look at the HBase architecture and its data flow mechanism.

HBase Architecture

HBase consists of four major components:

  1. HMaster
  2. Region server
  3. Regions
  4. Zookeeper

HMaster

* HMaster monitors all the Region Servers present in the cluster.

* In a distributed cluster environment, HMaster coordinates and manages the Region Servers, much as the NameNode manages DataNodes in HDFS.

* It performs Data Definition Language (DDL) operations, such as creating and deleting tables, and assigns regions to the Region Servers.

* It provides high availability by controlling failovers, and performs recovery whenever a Region Server goes down.

* HMaster is responsible for schema changes and metadata operations requested by clients.

* It also performs administrative tasks such as load balancing, and creating, updating, and deleting tables.
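To make the bookkeeping concrete, here is a toy sketch (not the real HBase balancer or API; all class and variable names are invented) of a master assigning regions across Region Servers and reassigning them on failure:

```python
# Toy sketch of HMaster-style bookkeeping: assigning regions to the
# least-loaded Region Server and reassigning regions on server failure.

class ToyMaster:
    def __init__(self, region_servers):
        # map: region server name -> list of regions it currently hosts
        self.assignments = {rs: [] for rs in region_servers}

    def assign(self, region):
        # pick the server currently holding the fewest regions
        target = min(self.assignments, key=lambda rs: len(self.assignments[rs]))
        self.assignments[target].append(region)
        return target

    def on_server_failure(self, dead_rs):
        # reassign the failed server's regions, as the master does on failover
        orphaned = self.assignments.pop(dead_rs)
        return [self.assign(r) for r in orphaned]

master = ToyMaster(["rs1", "rs2", "rs3"])
for r in ["region-%d" % i for i in range(6)]:
    master.assign(r)
print({rs: len(regions) for rs, regions in master.assignments.items()})
# each server ends up with 2 regions
```

The real HBase balancer weighs many more factors (locality, load, request rates), but the core idea of the master tracking and redistributing region assignments is the same.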

Region server

* A Region Server is responsible for serving and managing regions.

* It is a worker node in the cluster that handles the read, write, update, and delete requests received from clients.

* A Region Server is a lightweight process that runs on every node of the Hadoop cluster.

* Its main task is to store data in regions and serve the requests received from client applications.

* A Region Server running on an HDFS DataNode has the following components:

  1. BlockCache is the read cache. Frequently read data is stored here, and when the cache is full, the least recently used data is evicted.
  2. MemStore is the write cache and holds new data that has not yet been written to disk. Each column family in a region has its own MemStore.
  3. The Write Ahead Log (WAL) is a file that records new data before it is persisted to permanent storage, so that it can be replayed after a crash.
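A minimal sketch of the BlockCache's least-recently-used eviction policy (capacity and names are invented for illustration, not taken from HBase's configuration):

```python
from collections import OrderedDict

# Toy sketch of an LRU read cache like the BlockCache: the least
# recently used block is evicted when the cache exceeds capacity.

class BlockCache:
    def __init__(self, capacity=2):
        self.blocks = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)         # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = BlockCache(capacity=2)
cache.put("b1", "data1")
cache.put("b2", "data2")
cache.get("b1")            # b1 becomes most recently used
cache.put("b3", "data3")   # cache full: b2 (least recently used) is evicted
print(cache.get("b2"))     # None
```

The point to notice is that reading a block refreshes its position, so hot data survives eviction while cold data is dropped first.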

Region

* A region is the basic building block of an HBase cluster.

* Regions are nothing but tables that are split up by row-key range and spread across the Region Servers.

* A region has one store per column family.

* A region mainly contains two components:

  1. MemStore – holds the in-memory modifications to the store; modifications are stored as KeyValues.
  2. HFile (StoreFile) – the file where the data actually lives on disk.
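Since regions partition a table by row-key range, locating the region for a given row is a search over sorted region start keys. A toy sketch (the region names and key ranges here are made up):

```python
import bisect

# Toy sketch: a table split into three regions by row-key range.
# Finding the region for a row key is a binary search over the
# sorted region start keys.

region_start_keys = ["", "g", "p"]   # regions cover [..g), [g..p), [p..)
region_names = ["region-1", "region-2", "region-3"]

def locate_region(row_key):
    # the region whose start key is the greatest one <= row_key
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]

print(locate_region("apple"))   # region-1
print(locate_region("mango"))   # region-2
print(locate_region("zebra"))   # region-3
```

In real HBase this mapping lives in the META table, but the lookup idea is the same: sorted start keys determine which region serves which rows.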

Zookeeper

* ZooKeeper is a centralized coordination service that maintains configuration information, handles naming, and provides distributed synchronization.

* It keeps track of all the Region Servers in the HBase cluster: how many there are, and which one is serving which regions.

* ZooKeeper provides services such as:

  1. Establishing client communication with Region Servers.
  2. Tracking server failures and network partitions.
  3. Maintaining configuration information.
  4. Providing ephemeral nodes that represent the live Region Servers. (Ephemeral nodes are temporary: they exist only for the duration of a session, and are deleted automatically as soon as the session ends.)
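The ephemeral-node idea can be sketched in a few lines (this is a toy model, not the ZooKeeper API; the paths and session names are invented):

```python
# Toy sketch of ephemeral nodes: entries tied to a session that vanish
# when the session ends, which is how ZooKeeper notices dead servers.

class ToyZooKeeper:
    def __init__(self):
        self.ephemeral = {}   # node path -> owning session id

    def register(self, session_id, path):
        self.ephemeral[path] = session_id

    def session_expired(self, session_id):
        # delete every ephemeral node the dead session owned
        gone = [p for p, s in self.ephemeral.items() if s == session_id]
        for p in gone:
            del self.ephemeral[p]
        return gone   # watchers (e.g. HMaster) would be notified of these

zk = ToyZooKeeper()
zk.register("session-1", "/hbase/rs/rs1")
zk.register("session-2", "/hbase/rs/rs2")
print(zk.session_expired("session-1"))   # ['/hbase/rs/rs1']
```

Because a Region Server's node disappears the moment its session dies, the HMaster can watch those nodes to detect failures without polling each server.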

Data Flow Mechanism

Let’s walk through the data flow in HBase for read and write operations.


Read operation

In HBase, a read operation involves the following steps.

  1. The client first sends the read request to ZooKeeper, which provides the location of the META table; from META the client learns which Region Server holds the required row.
  2. The request then goes to that Region Server, which first checks the BlockCache. If the data is found there, it is returned to the client; otherwise the search continues.
  3. Next the MemStore is searched. If the data is found, it is returned to the client; otherwise the search continues. (The WAL is not consulted on reads; it exists only for crash recovery.)
  4. Finally the HFiles are searched. Once the required data is found, it is returned to the client along with an acknowledgment (ACK).
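The lookup order on the Region Server can be sketched as a simple fall-through (the stores are modeled as plain dicts here, which glosses over merging results across stores in real HBase):

```python
# Toy sketch of the read lookup order described above:
# BlockCache first, then MemStore, then HFiles.

def read(row_key, block_cache, memstore, hfiles):
    for name, store in (("block_cache", block_cache),
                        ("memstore", memstore),
                        ("hfile", hfiles)):
        if row_key in store:
            return store[row_key], name   # value plus where it was found
    return None, "miss"

block_cache = {"row1": "cached"}
memstore = {"row2": "fresh write"}
hfiles = {"row3": "on disk"}

print(read("row2", block_cache, memstore, hfiles))
# ('fresh write', 'memstore')
```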

Write operation

In HBase, a write operation involves the following steps.

  1. When a client issues a write request, the edit is first appended to the Write Ahead Log (WAL).
  2. Once the log entry is made, the data is written to the MemStore, the in-memory write buffer on the Region Server. Because writes are absorbed in memory, they are fast.
  3. Later, when the MemStore fills up to a configured threshold, its contents are flushed to an HFile, the actual data file stored in HDFS.
  4. Once the write completes, an ACK is sent back to the client as confirmation.
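The steps above can be sketched as a toy region (the flush threshold and class names are invented; real HBase flushes on MemStore size in bytes, not entry count):

```python
# Toy sketch of the write path described above: append to the WAL first,
# then buffer in the MemStore, flushing to an "HFile" when the buffer fills.

class ToyRegion:
    def __init__(self, flush_threshold=2):
        self.wal = []          # write-ahead log: every edit lands here first
        self.memstore = {}     # in-memory write buffer
        self.hfiles = []       # each flush produces one immutable "HFile"
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # step 1: log for durability
        self.memstore[row_key] = value      # step 2: buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                    # step 3: spill to an HFile

    def flush(self):
        self.hfiles.append(dict(self.memstore))
        self.memstore.clear()

region = ToyRegion(flush_threshold=2)
region.put("row1", "a")
region.put("row2", "b")    # hits the threshold: MemStore flushed to an HFile
print(len(region.hfiles), len(region.memstore))  # 1 0
```

The WAL-before-MemStore order is the key design choice: if the server crashes before a flush, the buffered edits can be replayed from the log.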

Reference

http://hbase.apache.org/0.94/book/architecture.html