Apache HDFS Features

HDFS offers a number of features; let's discuss each of them below.

HDFS Distributed Storage:

HDFS helps the client divide a large volume of data into small blocks, distributes those blocks for storage across multiple machines/DataNodes, and provides easy access to the stored data sets. Because the data is replicated across multiple DataNodes, data is not lost even if one DataNode or machine fails. This storage model also lets MapReduce work to its full potential.
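To make this concrete, here is a minimal sketch of a client storing a file in HDFS through the Java FileSystem API. The NameNode address and the paths are placeholders, not values from this article; HDFS itself takes care of splitting the stream into blocks and distributing the replicas to DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            // The client only writes a stream of bytes; HDFS splits it into blocks
            // and distributes the replicas to DataNodes behind the scenes.
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```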

HDFS Blocks:

As seen above under distributed storage, HDFS splits a large volume of data into small pieces.

  • These small pieces of data are referred to as blocks in Hadoop. The NameNode has complete control over these blocks; it allocates their location by deciding on which DataNode each block will be stored. The HDFS default block size is 128 MB, and it can be altered based on requirements.
  • If a client asks HDFS to store 140 MB of data, the data will be stored in 2 blocks: one of 128 MB and another of 12 MB. So the size of the last block depends on the size of the data; instead of creating a second block at the default 128 MB, HDFS generates a block of just 12 MB.

Once the data is split into multiple blocks, the blocks are stored on different DataNodes, with a default of 3 replicas/duplicates of each.

For example, a block stored on DataNode 1 will also be copied to DataNode 2 and DataNode 3. One block of data therefore exists on three DataNodes, which makes the storage fault tolerant.
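This block layout can be inspected from the client side. Below is a small sketch, reusing the placeholder path from the earlier example, that prints each block's offset, length, and the DataNodes holding its replicas; for the 140 MB example it would report two blocks, each on three hosts with the default replication factor.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");  // placeholder path
            FileStatus status = fs.getFileStatus(file);

            System.out.println("Block size : " + status.getBlockSize());   // 128 MB by default
            System.out.println("Replication: " + status.getReplication()); // 3 by default

            // One BlockLocation per block; getHosts() lists the DataNodes holding replicas.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```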

Replication:

As the title indicates, replication is the duplication of data. Every data block stored on a machine/node has duplicate copies on at least two other DataNodes, placed in different racks across the cluster.

  • HDFS keeps creating replicas of user data on different machines present in the cluster.
  • The HDFS default replication factor is 3, and it can be altered to suit requirements. The replication factor is changed by editing the configuration files, or per file from client code (see the sketch below). This process of duplicating data blocks is called replication.
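The cluster-wide default comes from the dfs.replication property in hdfs-site.xml. A replication factor can also be set from client code, as in this sketch (the value 2 and the path are purely illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as dfs.replication in hdfs-site.xml, but only for files
        // created by clients using this Configuration.
        conf.set("dfs.replication", "2");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of an existing file (path is illustrative).
            // Shell equivalent: hdfs dfs -setrep -w 2 /user/demo/sample.txt
            fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
        }
    }
}
```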

Important point: DataNodes are arranged in racks, and an HDFS cluster has multiple racks containing DataNodes. All the DataNodes in a single rack are connected to a single switch, so if a switch or a complete rack goes down, the same data can still be accessed from another rack. This is possible only because data is replicated across multiple racks.

Availability, Reliability & Scalability:

Storing data on multiple DataNodes and racks makes it highly available. Even when a machine, DataNode, or network link goes down, the client can easily retrieve the data because it is duplicated on at least two other DataNodes. This makes data stored in HDFS highly available.

Replication is also a big reason why HDFS storage is so reliable. Even if hardware fails or a node crashes, the same data can be retrieved from another DataNode placed in another rack. Replication and high availability of data blocks together make HDFS a highly reliable storage system.

Scalability is the ability to increase or decrease the size of the cluster. HDFS can be scaled in two ways.

  • The first method is to add more disks to the existing nodes of the cluster. In practice this is done by editing the configuration files to add entries for the newly added disks. This method requires downtime, although it may be very small.
  • Because of that downtime, the second method, horizontal scaling, has become the preferred way to scale up. In this method more nodes are added to the cluster on the fly, without any downtime; as many nodes as needed can be added in real time. This is a unique feature provided by Hadoop, and a quick way to check the resulting capacity is sketched below.
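Once new disks or nodes are added, the extra capacity is visible to any client. Here is a minimal sketch, using the same placeholder cluster configuration as before, that reports the overall capacity and free space of the file system:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Aggregate view of the cluster; these numbers grow as DataNodes or disks are added.
            FsStatus status = fs.getStatus();
            System.out.println("Capacity  (bytes): " + status.getCapacity());
            System.out.println("Used      (bytes): " + status.getUsed());
            System.out.println("Remaining (bytes): " + status.getRemaining());
        }
    }
}
```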

HDFS Fault Tolerance:

The features explained above make HDFS a fault tolerant storage system for Hadoop. HDFS works with commodity hardware, which is nothing but systems of average configuration (you could even say home PCs). There is therefore always a fair chance of a system crashing at any time and data being lost, but HDFS replicates data on at least 3 different machines, so even if two machines fail, the data can be retrieved from the third. There is thus little chance of losing data, and this replication is what makes Hadoop fault tolerant.

High throughput access to application data:

Throughput is the amount of work done per unit of time. As we know from the previous topics, in HDFS a requested task is divided and shared among the different systems in the cluster. Once the systems are assigned their portion of the task, they execute it independently and in parallel, so the turnaround time of a task shared across multiple resources working in parallel is very low. Reading and writing data in parallel makes HDFS a highly efficient framework that provides high-throughput access to application data.

Some of the other salient Apache HDFS features:

  1. The HDFS default configuration is well suited to many installations.
  2. The NameNode and DataNode have built-in servers that help you check the current status of the cluster.
  3. Some other useful HDFS features include:
    1. File Permissions & Authentication
    2. Safemode: as the name indicates, an administrative mode for maintenance. It is essentially a read-only mode for the HDFS cluster; while it is active, no modifications are allowed to the file system or blocks (it is queried in the sketch after this list).
    3. Rack awareness: allocating storage and scheduling tasks while taking a node's physical location into account.
    4. Rebalancer: a tool to rebalance the cluster when data is unevenly distributed among DataNodes.
    5. Secondary NameNode: a helper to the master NameNode that performs periodic checks on the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
    6. Fsck: a doctor – a utility that checks the health of the file system and finds any missing blocks or files.
    7. Fetchdt: the lock and key – a utility that fetches a delegation token (in other words, authentication) and stores it in a file on the local system.
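Several of these utilities can be reached both from the shell and from the Java client. The sketch below queries safemode through the DistributedFileSystem API; note that the SafeModeAction enum shown here lives in org.apache.hadoop.hdfs.protocol.HdfsConstants in Hadoop 2.x and may sit elsewhere in other releases, so treat the exact class names as an assumption. The usual shell equivalents are shown as comments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class SafemodeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // SAFEMODE_GET only queries the state; ENTER/LEAVE would change it.
            boolean inSafemode = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);
            System.out.println("Safemode is " + (inSafemode ? "ON" : "OFF"));
        }

        // The same checks are usually done from the shell:
        //   hdfs dfsadmin -safemode get       (query safemode)
        //   hdfs fsck / -files -blocks        (file system health, missing blocks)
        //   hdfs balancer                     (rebalance block distribution)
        fs.close();
    }
}
```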

That's all about the Apache HDFS features; hopefully you got the idea. These features are what attract developers to this technology. Now, let's move on to Apache HDFS Operations.
