Apache HDFS Read Write Operations
HDFS supports two operations:
- HDFS Read Operation
- HDFS Write Operation
Let's start with the Apache HDFS read operation.
Apache HDFS File Read Operation
As you know, HDFS follows a master-slave architecture: the Name Node acts as the master and the Data Nodes act as the slaves.
- If a client wants to read data from HDFS, it first contacts the Name Node to request the location of the data blocks and permission to access the Data Nodes where those blocks are stored.
- Once the Name Node has provided the location and an access token, the client interacts directly with the Data Node and reads the data. It is like getting directions and permission from a gatekeeper before fetching things from the area he guards.
In more detail: when the client wants to read data stored in HDFS, it sends a request for the data block locations to the Name Node through the distributed file system API. After checking that the client has permission to access the requested data, the Name Node sends back the addresses of the Data Nodes holding each block, along with a security token for accessing the data.
The client can now fetch the data directly from a Data Node by presenting the security token. If that machine crashes while the client is reading a data block, the client falls back to another Data Node that stores a replica of the same block, using the replica locations supplied by the Name Node, and fetches the data from that machine.
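The read flow above can be sketched as a small in-memory simulation. This is only a toy model and not the Hadoop client API: the class names, the `read_file` and `build_demo_cluster` helpers, the shared token set, and the block IDs are all assumptions made for illustration. It shows the three ideas from the text: the Name Node checks permission and hands out locations plus a token, the client reads directly from a Data Node, and the client moves on to another replica if a Data Node is down.

```python
import secrets


class DataNode:
    """Toy Data Node: stores blocks and checks the access token on reads."""

    def __init__(self, name, trusted_tokens):
        self.name = name
        self.blocks = {}                      # block_id -> bytes
        self.alive = True
        self.trusted_tokens = trusted_tokens  # assumption: set shared with the Name Node

    def read_block(self, block_id, token):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        if token not in self.trusted_tokens:
            raise PermissionError("invalid access token")
        return self.blocks[block_id]


class NameNode:
    """Toy Name Node: holds only metadata, never the file data itself."""

    def __init__(self, trusted_tokens):
        self.block_map = {}   # path -> [(block_id, [DataNode, ...]), ...]
        self.allowed = {}     # path -> set of user names
        self.trusted_tokens = trusted_tokens

    def get_block_locations(self, path, user):
        # Check the client's credentials before revealing any location.
        if user not in self.allowed.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        token = secrets.token_hex(8)
        self.trusted_tokens.add(token)
        return self.block_map[path], token


def read_file(name_node, path, user):
    """Client side: ask the Name Node once, then read from the Data Nodes."""
    blocks, token = name_node.get_block_locations(path, user)
    data = b""
    for block_id, replicas in blocks:
        for node in replicas:
            try:
                data += node.read_block(block_id, token)
                break                      # this block is done
            except ConnectionError:
                continue                   # try the next replica
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


def build_demo_cluster():
    """Hypothetical two-node cluster holding one two-block file."""
    tokens = set()
    dn1, dn2 = DataNode("dn1", tokens), DataNode("dn2", tokens)
    for dn in (dn1, dn2):
        dn.blocks["blk_1"] = b"hello "
        dn.blocks["blk_2"] = b"hdfs"
    nn = NameNode(tokens)
    nn.block_map["/demo.txt"] = [("blk_1", [dn1, dn2]), ("blk_2", [dn2, dn1])]
    nn.allowed["/demo.txt"] = {"alice"}
    return nn, dn1, dn2
```

For example, after `nn, dn1, dn2 = build_demo_cluster()`, calling `read_file(nn, "/demo.txt", "alice")` returns the file contents, and it still does so after setting `dn1.alive = False`, because the client falls back to the replica on `dn2`.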
Apache HDFS File Write Operation
Writing data to HDFS is similar to reading data from HDFS.
- The client first contacts the Name Node through the distributed file system API to get the locations of the slave/Data Nodes where the data blocks should be written.
- The Name Node sends the client the locations where the data has to be written. The client then interacts with the Data Node and starts writing the data through the FS data output stream (FSDataOutputStream).
- The data written to the first Data Node is copied to a second Data Node, and from that node to a third. In other words, the slaves do the job of replicating the written data to the other nodes. If the replication factor is configured as 4, the data is stored on 4 Data Nodes. After replication is done, acknowledgements travel back along the same path: from the third Data Node to the second, from the second to the first, and then to the client. The Name Node is also informed that the blocks have been stored.
- The client writes only one copy of the data into HDFS, no matter how many times it is replicated across the Data Nodes. The number of copies is determined by the configured replication factor (3 by default).
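The write pipeline described in the steps above can be sketched the same way. Again, this is a toy simulation under stated assumptions, not the Hadoop API: `PipelineDataNode`, `PipelineNameNode`, `write_file`, the round-robin choice of Data Nodes, and the tiny block size are all made up for illustration (real HDFS uses a rack-aware placement policy and much larger blocks). It shows that the client sends each block once, each Data Node forwards it to the next one, and the acknowledgements come back in reverse order.

```python
class PipelineDataNode:
    """Toy Data Node that stores a block and forwards it down the pipeline."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block_id -> bytes

    def write_block(self, block_id, data, downstream):
        self.blocks[block_id] = data
        if downstream:
            # Forward the same data to the next Data Node in the pipeline.
            acks = downstream[0].write_block(block_id, data, downstream[1:])
        else:
            acks = []
        # The last node acknowledges first; each node appends its own ack
        # on the way back towards the client.
        return acks + [self.name]


class PipelineNameNode:
    """Toy Name Node that picks a pipeline of Data Nodes for every block."""

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("not enough Data Nodes for this replication factor")
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}    # path -> [(block_id, [node names]), ...]
        self.next_block = 0

    def allocate_block(self, path):
        # Assumption: simple round-robin placement, purely for illustration.
        start = self.next_block % len(self.data_nodes)
        pipeline = [
            self.data_nodes[(start + i) % len(self.data_nodes)]
            for i in range(self.replication)
        ]
        block_id = f"blk_{self.next_block}"
        self.next_block += 1
        self.block_map.setdefault(path, []).append(
            (block_id, [node.name for node in pipeline])
        )
        return block_id, pipeline


def write_file(name_node, path, data, block_size=4):
    """Client side: split the data into blocks and send each block once."""
    bytes_sent = 0
    ack_log = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id, pipeline = name_node.allocate_block(path)
        # The client talks only to the first Data Node of the pipeline.
        acks = pipeline[0].write_block(block_id, chunk, pipeline[1:])
        ack_log.append(acks)
        bytes_sent += len(chunk)
    return bytes_sent, ack_log
```

With four Data Nodes and a replication factor of 3, writing 8 bytes with `block_size=4` makes the client send 8 bytes in total, while the Data Nodes together end up storing 24 bytes. The first entry of `ack_log` lists the third node of the pipeline first and the first node last, matching the acknowledgement order described above.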
Writing data to HDFS is therefore efficient: the client sends each block only once, the blocks of a file are spread across different Data Nodes, and replication happens internally between the Data Nodes. We hope this makes the Apache HDFS read and write operations easy to understand.
“That’s all about the read and write operations of HDFS.”