FAQ in Hadoop

Let’s have a look at frequently asked questions in Hadoop.

Questions with Answers

Easy Level

1. What is Big Data?

* Big Data refers to the large data sets produced by human actions every day, hour and second.

* It consists of large data sets of both structured and unstructured data.

* It can be analysed to find patterns, trends and associations, mostly relating to human behaviour and interactions, which lead to strategic decision making.

2. What are the sources of Big Data? Or, what comes under Big Data?

Big Data is collected mostly from the internet and from different devices and applications. Typical sources include:

  • Social Media Data
  • Black Box Data
  • Stock Exchange Data
  • Search Engine Data
  • Power Grid Data
  • Transport Data

3. What are the types of Big Data?

This huge volume and extensive variety of data sets is classified into three types.

  1. Structured Data: relational data.
  2. Semi-Structured Data: XML data.
  3. Unstructured Data: Word, PDF, text, media logs.

4. What are the major challenges faced in analysing Big Data?

The major challenges associated with Big Data are as follows:

  • Capturing data
  • Curation
  • Storage
  • Searching
  • Sharing
  • Transfer
  • Analysis
  • Presentation

5. What is Apache Hadoop?

Hadoop is an open-source software framework which has become a solution to “Big Data” problems. Hadoop is sponsored by the Apache Software Foundation (ASF) as one of its Apache projects. It is used for the distributed storage and distributed processing of large data sets.

6. What is the basic advantage of Hadoop?

* It is an open-source framework, so it is freely available and we can even change its source code as per our requirements.

* An application can run on a system with thousands of commodity hardware nodes.

* It enables data to be transferred rapidly among nodes.

* A process can continue operating even in the case of a node failure.

7. Why do we need Hadoop?

The challenges we face in dealing with Big Data are as follows:

  • Storage, security and quality of the data
  • Analysing the data and discovering algorithms to create strategies

Hadoop came into the picture to deal with these Big Data challenges. It is well suited to storing and processing such data because of the following significant features:

  • High Scalability
  • Reliable & High availability
  • High Storage capacity and Economic

8. What are the core components of Hadoop and its functions?

There are three main layers in Hadoop, which are as follows:

  1. Storage Layer – HDFS
  2. Processing layer – MapReduce
  3. Resource Management Layer – YARN

HDFS – HDFS is primarily used by Hadoop applications for distributed storage and is one of the most reliable storage systems available. The basic purpose of HDFS is to divide huge amounts of data, store them across multiple machines and provide easy access to those data sets. HDFS makes sure that no data is lost in case of system failure and makes the data available for parallel processing. HDFS is written entirely in Java and is based on the Google File System (GFS).

MapReduce – MapReduce is the Java-based processing layer of Hadoop. The processing of the data sets is taken care of by MapReduce, which is one of the most widely used data processing frameworks and is capable of dealing with large volumes of data. MapReduce is a programming model that divides a job into a set of independent tasks and in this way processes large volumes of data in parallel. Map and Reduce are the two major phases performed by the MapReduce framework: Map is the first phase and Reduce is the second (see the word-count sketch after the YARN description below).

YARN – Yet Another Resource Negotiator. YARN is the resource management layer for the processing part of Hadoop 2.x and 3.x. As the name suggests, it deals with resources and their negotiation. In earlier versions of Hadoop the “Job Tracker” was responsible for managing the jobs submitted to Hadoop, but it had functional and performance disadvantages, so YARN was introduced in Hadoop 2.0 to overcome those drawbacks of the Job Tracker.
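To make the Map and Reduce phases concrete, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. This is only an outline of the standard example; the class names and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1)
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts received for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, such a job would typically be submitted with a command like hadoop jar wordcount.jar WordCount /input /output.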

9. What are the features of Hadoop?

Open Source – It is an open-source framework, so it is freely available and we can even change its source code as per our requirements.

Fault Tolerant – HDFS is a fault-tolerant storage system for Hadoop. HDFS works with commodity hardware, which is nothing but systems with an average configuration, so there is always a high chance of a system crashing and data being lost. However, HDFS replicates data on at least 3 different machines, so even if two machines fail, the data can be retrieved from the third one.

High Availability – Storing data on multiple data nodes and racks makes it highly available. Even when a machine, data node or network link goes down, data can easily be retrieved by the client, as it is duplicated on at least two other data nodes.

Reliable – Replication is one big reason why HDFS storage is highly reliable. Even if hardware fails or a node crashes, the same data can be retrieved from another data node placed in another rack.

High Scalability – Scalability is nothing but increasing or decreasing the size of the cluster. In HDFS, scaling is done in two ways: the first method is adding more disks to the nodes of the cluster, and the second, which has become the preferred method, is horizontal scaling, i.e. adding more nodes to the cluster.

Economic – Hadoop runs on commodity hardware; we do not need any specialised machines for it, so it is not very expensive.

Easy to Use – It is easy to use because the framework itself takes care of the distributed computing; the client does not need to deal with it.

Intermediate Level

10. What are the modes in which Hadoop runs?

Apache Hadoop runs in three modes:

Local (Standalone) Mode – By default Hadoop runs on a single node in a non-distributed mode, as a single Java process. It does not use HDFS; it works directly with the local file system.

Pseudo-Distributed Mode – It is similar to Standalone Mode in that it runs on a single node, but each daemon runs in a separate Java process. Since all daemons run on a single node, the same machine acts as both master and slave.

Fully-Distributed Mode – All daemons are executed on separate nodes, forming a multi-node cluster. Different nodes act as the Master and the Slaves.
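As an illustration, a pseudo-distributed setup is usually configured with entries along these lines in core-site.xml and hdfs-site.xml (the localhost:9000 address and the replication factor of 1 follow the standard single-node setup guide and may differ in your environment):

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>   <!-- all daemons talk to HDFS on the local machine -->
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>   <!-- only one node, so keep a single replica -->
  </property>
</configuration>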

11. What are the features of Fully-Distributed Mode?

It has a Name Node and Data Nodes. The Name Node holds the metadata of all the data stored in HDFS; it acts as the Master, while the Data Nodes act as Slaves. The Hadoop daemons run on a cluster of machines. A Node Manager is installed on every data node and is responsible for the execution of tasks on that node. The Resource Manager manages the Node Managers, and the client contacts the Resource Manager for job execution.

12. What is Safemode in Hadoop?

As the name indicates, it is an administrative mode for maintenance. It is basically a read-only mode for the HDFS cluster, during which no modifications to the file system or blocks are allowed. At the start-up of the Name Node:

  • It loads the file system namespace from the last saved FsImage into its main memory, along with the edits log file.
  • It merges the edits log file with the FsImage, so that a new file system namespace is generated.
  • It receives information about block locations from all the data nodes in the form of block reports.

During its start-up the NameNode enters Safemode automatically, and it leaves Safemode once the DataNodes have reported that most blocks are available. Use the commands:
hadoop dfsadmin -safemode get: to know the status of Safemode
bin/hadoop dfsadmin -safemode enter: to enter Safemode
hadoop dfsadmin -safemode leave: to come out of Safemode
The NameNode front page also shows whether Safemode is on or off.
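In newer releases the hadoop dfsadmin form is deprecated in favour of hdfs dfsadmin, so the same operations are usually written as:

hdfs dfsadmin -safemode get     # report whether Safemode is ON or OFF
hdfs dfsadmin -safemode enter   # put the NameNode into Safemode manually
hdfs dfsadmin -safemode leave   # bring the NameNode out of Safemode
hdfs dfsadmin -safemode wait    # block until Safemode is exited (handy in scripts)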

13. What is “Distributed Cache” in Apache Hadoop?

In Hadoop, data chunks stored on multiple Data Nodes are processed in parallel by a user-written program. Files that the program needs on every data node during the execution of the application are kept in the distributed cache.

Distributed Cache is a facility provided by the Hadoop MapReduce framework to cache and access files which are needed by the application during its execution. It can cache read-only text files, jar files, archives, etc. Once a file is cached, it is copied to the local file system of every slave node before the MapReduce tasks are executed on that node.

This increases the performance of the task by saving the time and resources required for input/output operations. In some cases every Mapper has to read a particular file, which can now be done by reading it from the cache.

For an application to use the distributed cache, the required file first needs to be added to the distributed cache. We should make sure that the file is accessible through a URL, which can be either hdfs:// or http://. Once the file is present at such a URL, the user registers it (through the job configuration or command line) as a cache file for the distributed cache; a short sketch using the MapReduce Job API follows.
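A minimal sketch of this, assuming the newer org.apache.hadoop.mapreduce Job API; the HDFS path, the lookup.txt alias and the class names are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

  public static class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // URIs registered in the driver are visible to every task...
      URI[] cached = context.getCacheFiles();
      // ...and, thanks to the "#lookup.txt" fragment, the file is also linked
      // into the task's local working directory under that name:
      try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
        // load the lookup data into memory for use in map()
      }
    }
  }

  // In the driver, before job.waitForCompletion():
  static void registerCacheFile(Job job) throws Exception {
    job.addCacheFile(new URI("hdfs:///user/hadoop/cache/lookup.txt#lookup.txt"));
  }
}

The same effect can also be achieved from the command line with the -files generic option when submitting the job.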

Difficult Level

14. Why does one remove or add nodes in a Hadoop cluster frequently?

The most important feature of Hadoop is that it uses commodity hardware to run applications; however, this leads to frequent data node crashes in a Hadoop cluster.

In Hadoop we also have the advantage of scaling the data nodes up or down to match the volume of data to be stored.

For these two reasons, the administrator has to add or remove data nodes in a Hadoop cluster quite frequently.

15. What is “throughput” in Hadoop?

Throughput is defined as the amount of work done in a unit of time. Hadoop has high throughput for the following reasons:

Hadoop HDFS follows a Write Once, Read Many model. Data written once cannot be modified, which simplifies data coherency issues and hence provides high-throughput data access.

Hadoop and HDFS also work on the Data Locality principle: the computation is moved to the location of the data instead of moving the data to the computation. This reduces network congestion, resulting in high system throughput.

16. How to restart NameNode or all the daemons in Hadoop?

 A NameNode can be restarted using the following methods:

  • The NameNode can be stopped individually using the /sbin/hadoop-daemon.sh stop namenode command and then started again with /sbin/hadoop-daemon.sh start namenode.
  • To stop all the daemons first and then restart them, use the /sbin/stop-all.sh and /sbin/start-all.sh commands.

These script files are stored in the sbin directory inside the Hadoop installation directory.
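In Hadoop 3.x the per-daemon and *-all.sh scripts are superseded by the hdfs and yarn commands, so the equivalent operations would typically be:

hdfs --daemon stop namenode      # stop only the NameNode
hdfs --daemon start namenode     # start it again
stop-dfs.sh && start-dfs.sh      # restart all HDFS daemons
stop-yarn.sh && start-yarn.sh    # restart all YARN daemons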

17. What does JPS command do in Hadoop?

JPS is a command used to check whether Hadoop daemons such as the NameNode, DataNode, ResourceManager and NodeManager are running or not. It reports all the Hadoop daemons (Java processes) running on that machine.
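For example, on a pseudo-distributed node where everything runs on one machine, the output might look something like this (the process IDs are purely illustrative):

$ jps
4368 NameNode
4523 DataNode
4771 SecondaryNameNode
4998 ResourceManager
5150 NodeManager
5423 Jps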

18. What does the fsck command do in Hadoop?

It is the File System Check command, used to check HDFS files for inconsistencies and to report problems such as missing blocks or under-replicated blocks. It does not correct the issues it detects.

The file system check provides options to report on all files or to ignore open files during reporting. Most recoverable failures are corrected automatically by the NameNode. HDFS fsck is not a Hadoop shell command; it is run as bin/hdfs fsck. It can be run on the whole file system or on a subset of files.
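Some commonly used forms of the command (the paths are illustrative):

hdfs fsck /                                      # check the whole file system
hdfs fsck /user/data -files -blocks -locations   # show each file, its blocks and where they are stored
hdfs fsck / -list-corruptfileblocks              # list files that have corrupt blocks
hdfs fsck / -move                                # move corrupted files to /lost+found
hdfs fsck / -delete                              # delete corrupted files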

19. What is Hadoop Streaming?

Hadoop Streaming is a utility which permits the user to write MapReduce programs in any language and run them as Map or Reduce jobs. The program should read from standard input and write to standard output, and it can then be used for Map and Reduce tasks. As we know, the core architecture of Hadoop is to have a mapper and a reducer. Hadoop Streaming supports languages such as Python, Ruby, PHP, Perl, bash, etc. Earlier versions of Hadoop supported only text processing, whereas the latest versions support both binary and text data.

What do the mapper and reducer scripts do? They read the input from standard input (line by line) and produce the output on standard output. Streaming is therefore a utility which creates a MapReduce job from such scripts.

When an executable or a script is specified for the mappers, each mapper launches the script as a separate process when it is initialised; as we have learnt in previous topics, there are multiple mappers running on the slave nodes. The reducer works almost the same way: when an executable or a script is specified for the reducers, each reducer launches a separate reduce process when it is initialised.
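A typical Streaming job is submitted through the hadoop-streaming jar shipped with Hadoop; the exact jar path depends on the installed version, and mapper.py/reducer.py stand for the user's own scripts:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py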

20. How to debug Hadoop code?

Check the list of currently running MapReduce jobs. If orphaned jobs are running, you need to determine the location of the ResourceManager (RM) logs.

  • First of all, run “ps -ef | grep -i ResourceManager” and look for the log directory in the displayed result. From the job list, find out the job-id.
  • Then check any error messages associated with that job.
  • Now, with the help of the RM logs, identify the worker node that was involved in the execution of the task.
  • Log in to that node and run “ps -ef | grep -i NodeManager”.
  • Then examine the NodeManager log.
  • For each map-reduce job, the majority of errors come from the user-level (task) logs; the yarn logs command shown below is a convenient way to collect them.
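On a YARN cluster the aggregated logs of a job can also be pulled directly from the command line (the application id below is illustrative):

yarn application -list -appStates ALL                     # find the application id of the job
yarn logs -applicationId application_1700000000000_0001   # dump the container logs for that job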

21. What is HDFS – Hadoop Distributed File System?

HDFS is primarily used by Hadoop applications for distributed storage and is one of the most reliable storage systems available. The basic purpose of HDFS is to divide huge amounts of data, store them across multiple machines and provide easy access to those data sets. HDFS makes sure that no data is lost in case of system failure and makes the data available for parallel processing. HDFS is written entirely in Java and is based on the Google File System (GFS).

As we have seen earlier in the topic “Hadoop architecture”, HDFS works on the Master-Slave principle. An HDFS cluster primarily consists of a:

Master – Name Node, which manages the file system metadata.

Slave – Data Nodes, which store the actual data.

HDFS has demonstrated production scalability of up to 200 PB of storage, with a single cluster of 4,500 servers supporting close to a billion files and blocks. HDFS follows a write-once, read-many model that enables high-throughput access.
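Clients interact with this storage layer through the hdfs dfs shell; a few everyday commands (the paths and file names are illustrative):

hdfs dfs -mkdir -p /user/hadoop/data              # create a directory in HDFS
hdfs dfs -put sales.csv /user/hadoop/data/        # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/data                    # list the directory
hdfs dfs -cat /user/hadoop/data/sales.csv         # print the file contents
hdfs dfs -get /user/hadoop/data/sales.csv .       # copy the file back to the local file system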

22. Explain NameNode and DataNode in HDFS?

Master Node (Name Node):

The client contacts the HDFS Master (Name Node) to access the files in the cluster. The Name Node holds the metadata. Since the Name Node is the master of the cluster, it should be deployed on reliable hardware.

  • It takes care of client authentication, space allocation for the actual data, details about the actual storage locations, etc.
  • The Name Node also maintains the slave nodes, assigns tasks to them and keeps track of slave node performance, failures, etc.

Slave Node (Data Node):

Data storage in HDFS is managed by a number of HDFS Slaves (Data Nodes). The Data Nodes are the worker nodes which do the work assigned by the master node. Data Nodes can be deployed on commodity hardware and need not be deployed on very reliable hardware, since the data on a slave node is replicated on other data nodes. So in case of a hardware failure, the data can be retrieved from other data nodes placed on different hardware.

  • A Data Node can serve read and write requests from the file system’s clients.
  • It can also perform block creation, deletion and replication of blocks, as instructed by the Master Name Node (see the example command below).
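The Name Node's view of its Data Nodes can be inspected from the command line, for example:

hdfs dfsadmin -report          # total capacity, remaining space and the list of live/dead DataNodes
hdfs dfsadmin -report -live    # restrict the report to live DataNodes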

23. Explain HDFS Data Storage?

HDFS breaks a file to be written into small pieces of data known as blocks. The default block size is 128 MB, which can be changed as per requirements. These blocks are stored in a distributed manner on different nodes of the cluster, and each stored block is also replicated to multiple data nodes. This replication of stored data is done to avoid any data loss in case of a data node failure, and it also provides a mechanism for MapReduce to process the data in parallel across the cluster. In short, HDFS splits a large file into a number of small blocks and stores them on different nodes across the cluster in a distributed manner, and by default it replicates each block 3 times on different nodes.
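Both values are controlled in hdfs-site.xml; a sketch of the relevant properties with their usual defaults:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>   <!-- 128 MB; a suffix form such as 128m is also accepted -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>           <!-- default number of replicas per block -->
</property>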

24. What is Rack Awareness in HDFS?

Hadoop contains multiple data nodes in a cluster of computers, which are commonly spread across many racks. To improve fault tolerance, the HDFS Master (Name Node) places replicas of data blocks on multiple racks, with at least one replica on a different rack, so that even if a complete rack crashes the blocks are still available from another rack. The purpose of rack-aware replica placement is therefore to increase the reliability and availability of the stored data.
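Rack assignments come from a topology script that the administrator configures through the net.topology.script.file.name property, and the NameNode's current view of the racks can be printed with:

hdfs dfsadmin -printTopology   # lists each rack and the DataNodes assigned to it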

25. What are the main features of HDFS?

The main features of HDFS are as follows:

  • Distributed Storage
  • Blocks
  • Replication
  • Fault Tolerant
  • High throughput access to application data
  • High Availability, Reliability & Scalability

26. Explain Distributed Storage in HDFS?

HDFS helps the client divide a large volume of data into small blocks, distributes them for storage across multiple machines (data nodes) and provides easy access to those data sets. Because the stored data is replicated on multiple data nodes, loss of data is avoided even if one data node or machine fails. This distributed storage also allows MapReduce to work to its full potential.

27. Explain Blocks in HDFS?

As seen before under distributed storage, HDFS splits a large volume of data into small pieces, which are referred to as blocks in Hadoop. The Name Node has complete control over these blocks; it allocates their location by deciding on which Data Nodes each block is to be stored. The HDFS default block size is 128 MB, and it can be altered based on requirements. If a client asks HDFS to store 140 MB of data, the data is stored in 2 blocks: one block of 128 MB and another of 12 MB. So the size of the last block depends on the size of the data; instead of creating a second block of the default 128 MB, HDFS generates a block of just 12 MB.

Once the data is split into multiple blocks, it is stored on different data nodes with a default of 3 replicas of each block. For example, a block stored on data node 1 will also be copied to data node 2 and data node 3. Duplicating one block of data on three data nodes is what makes the storage fault tolerant.

28. What is Replication in HDFS?

As the title indicates, replication is nothing but duplication of data. Every data block stored on a machine/node has duplicate copies on at least two other data nodes placed in different racks across the cluster. HDFS keeps creating replicas of user data on different machines in the cluster. The HDFS default replication factor is 3, and it can be altered to suit requirements by editing the configuration files. The process of duplicating data blocks is called replication.

Data Nodes are arranged in racks, and HDFS has multiple racks containing data nodes. All the data nodes in a single rack are connected by a single switch, so if a switch or a complete rack goes down, the same data can still be accessed from another rack. This is possible only because of the principle of replicating data across multiple racks.
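Besides the cluster-wide dfs.replication setting, the replication factor of existing files can be changed from the shell (the paths and values are illustrative):

hdfs dfs -setrep -w 2 /user/hadoop/data/sales.csv   # -w waits until the target replication is reached
hdfs dfs -setrep 3 /user/hadoop/data                # applies to every file under the directory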

29. How Hadoop becomes highly Available, Reliable & Scalable?

Availability – Storing data on multiple data nodes and racks makes it highly available. Even when a machine, data node or network link goes down, data can easily be retrieved by the client, as it is duplicated on at least two other data nodes. This in turn makes the data stored in HDFS highly available.

Reliability – Replication is one big reason why HDFS storage is highly reliable. Even if hardware fails or a node crashes, the same data can be retrieved from another data node placed in another rack. So replication and high availability of data blocks make HDFS a highly reliable data storage system.

Scalability – Scalability is nothing but increasing or decreasing the size of the cluster. In HDFS, scaling is done in two ways. The first method is adding more disks to the nodes of the cluster; practically this is done by editing the configuration files to add entries for the newly added disks, which requires a down time, although it may be very small. Because of that down time, the second method has become the preferred way of scaling: horizontal scaling, in which more nodes are added to the cluster on the go, in real time, without any down time. This is a notable strength of Hadoop.
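As an example of the first method, a newly mounted disk is typically added to the comma-separated list of storage directories in hdfs-site.xml on that DataNode (the mount points are illustrative):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk1/hdfs,/data/disk2/hdfs,/data/disk3/hdfs</value>
</property>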

That’s all about the FAQ in Hadoop. I hope these questions help freshers to clear their interviews.