HDFS Master-Slave Topology
An HDFS cluster follows a master-slave topology with two types of nodes: the master (Name Node) and the slaves (Data Nodes).
Master Node also known as Name Node:
- The client contacts the HDFS master (Name Node) to access files in the cluster.
- The Name Node holds the metadata and takes care of client authentication, allocating space for the actual data, and tracking where the data is actually stored.
- The Name Node also manages the slave nodes, assigns tasks to them, and keeps track of slave node performance and failures.
Since the Name Node is the master of the cluster, it should be deployed on reliable hardware.
Slave Node also known as Data Node:
- Data storage in HDFS is managed by a number of HDFS slaves (Data Nodes).
- Data Nodes are the worker nodes that carry out the work assigned by the master node.
- A Data Node serves read and write requests from the file system’s clients.
- It also creates and deletes blocks, and replicates each block as many times as the master (Name Node) instructs.
- Data Nodes can be deployed on commodity hardware and do not need very reliable hardware, since the data on each slave node is replicated on other Data Nodes. If one machine fails, the data can be retrieved from Data Nodes running on different hardware.
HDFS has two daemons that run to store data.
- Name Node: This is the daemon that runs on the master. It stores metadata such as file names, the number of replicas, the number of blocks, block IDs and block locations. Keeping this metadata on the master makes data retrieval faster. Because the metadata is held in memory, the Name Node machine needs enough RAM for the size of the cluster's namespace.
- Data Node: This is the daemon that runs on the slaves. These are the actual worker nodes that store the data.
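To make the Name Node's role more concrete, here is a minimal sketch of the kind of metadata it keeps. This is a toy model and not the real HDFS data structures: the class names `BlockInfo` and `FileMeta`, the `locate` helper, the file path and the Data Node names (`dn1`, `dn2`, ...) are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BlockInfo:
    block_id: int
    locations: List[str]  # hypothetical Data Node names holding a replica


@dataclass
class FileMeta:
    replication: int  # number of replicas requested for this file
    blocks: List[BlockInfo] = field(default_factory=list)


# Toy stand-in for the Name Node's in-memory metadata: file path -> metadata.
namespace: Dict[str, FileMeta] = {
    "/logs/app.log": FileMeta(
        replication=3,
        blocks=[
            BlockInfo(1001, ["dn1", "dn2", "dn3"]),
            BlockInfo(1002, ["dn2", "dn3", "dn4"]),
        ],
    )
}


def locate(path: str) -> List[Tuple[int, List[str]]]:
    """Return (block ID, replica locations) for every block of a file."""
    meta = namespace[path]
    return [(b.block_id, b.locations) for b in meta.blocks]
```

A client would ask the master something like `locate("/logs/app.log")` and then read the actual bytes from the listed Data Nodes. The file data itself never lives in this structure.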
HDFS breaks each file to be written into pieces of data known as blocks. The default block size is 128 MB, which can be changed as required. These blocks are stored across the cluster in a distributed manner on different nodes, and each stored block is also replicated to multiple Data Nodes.
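The block arithmetic can be sketched in a few lines. This is only an illustration of how a file size maps to blocks under the 128 MB default mentioned above; the function name `split_into_blocks` is hypothetical and nothing here talks to a real cluster.

```python
from typing import List

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size described in the text


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


sizes = split_into_blocks(500 * MB)
print(len(sizes), sizes[-1] // MB)  # prints "4 116"
```

So a 500 MB file becomes three full 128 MB blocks plus one 116 MB block.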
This replication of stored data is done to avoid data loss if a Data Node fails. Spreading the blocks over many nodes also lets MapReduce (which we will discuss in the future) process the data in parallel across the cluster. HDFS splits a large file into N blocks and stores them on different nodes across the cluster in a distributed manner. By default, HDFS replicates each block 3 times across different nodes in the cluster.
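The effect of replication on a node failure can be sketched as follows. The placement here is a simple round-robin chosen only to keep the example short; it is an assumption for illustration and not the policy HDFS actually uses. The function names and Data Node names are hypothetical.

```python
from typing import Dict, List


def place_replicas(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def after_failure(placement: Dict[int, List[str]], failed_node: str) -> Dict[int, List[str]]:
    """Return the replicas that are still reachable once one node is down."""
    return {
        block: [node for node in locations if node != failed_node]
        for block, locations in placement.items()
    }


placement = place_replicas(4, ["dn1", "dn2", "dn3", "dn4"])
remaining = after_failure(placement, "dn2")
print(min(len(locs) for locs in remaining.values()))  # prints "2"
```

Even after `dn2` goes down, every block still has at least two live copies, which is why losing a single Data Node does not lose data.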
HDFS Rack Awareness:
It's as important as the HDFS Master-Slave Topology topic. As we can see from the above picture, a Hadoop cluster contains multiple Data Nodes, which are commonly spread across many racks. To improve fault tolerance, the HDFS master (Name Node) places the replicas of each data block on multiple racks. The Name Node places at least one replica on a different rack, so even if a complete rack crashes, the blocks are still available from another rack. So the purpose of rack-aware replica placement is to increase the reliability and availability of the stored data.
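The idea of rack-aware placement can be sketched like this. It is a simplified, deterministic model that only guarantees what the paragraph above describes, namely that the replicas of a block span more than one rack. The function names, rack names and Data Node names are hypothetical, and the real Name Node considers more factors than this.

```python
from typing import Dict, List, Set


def rack_aware_placement(topology: Dict[str, List[str]], replication: int = 3) -> List[str]:
    """Pick nodes for one block so that its replicas sit on two racks."""
    racks = sorted(topology)
    if len(racks) < 2:
        raise ValueError("need at least two racks")
    first_rack, second_rack = racks[0], racks[1]
    chosen = [topology[first_rack][0]]  # first replica on the first rack
    for node in topology[second_rack]:  # remaining replicas on another rack
        if len(chosen) == replication:
            break
        chosen.append(node)
    if len(chosen) < replication:
        raise ValueError("second rack has too few nodes for this sketch")
    return chosen


def racks_used(chosen: List[str], topology: Dict[str, List[str]]) -> Set[str]:
    """Return the set of racks that hold at least one chosen node."""
    return {rack for rack, nodes in topology.items() if set(nodes) & set(chosen)}


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
chosen = rack_aware_placement(topology)
print(chosen)  # prints "['dn1', 'dn3', 'dn4']"
```

If `rack1` fails completely, the copies on `dn3` and `dn4` in `rack2` are still there, so the block stays available.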
Hope you have got the idea of HDFS Rack Awareness along with the HDFS Master-Slave Topology. Now let's move on to HDFS Architecture.