Hadoop YARN Architecture

You already have the idea behind YARN in Hadoop 2.x. Let's now look at the Hadoop YARN architecture.

The basic components of the Hadoop YARN architecture are as follows:

  • Resource Manager (one per cluster) – Master
  • Node Manager (one per data node) – Slave
  • Application Master (one per application or job)

YARN has a dedicated, independent machine running the Resource Manager. The main idea of YARN is to negotiate resources. Every slave (data node) runs a daemon called the Node Manager.

So in YARN as well, we have a master machine and slave machines.

Details of the Resource Manager, Node Manager & Application Master in Hadoop:

  • Assume the data to be processed is stored on data nodes 2 & 3.
  • You submit a job to the Resource Manager. The Resource Manager is connected to all the nodes, and it can contact any Node Manager to run the job (a minimal submission sketch follows this list).
  • The Resource Manager does not have to contact the node where the data to be processed is stored. Remember, our data is stored on data nodes 2 & 3, but the Resource Manager contacted the Node Manager on data node 1, because it can contact any data node irrespective of where the data blocks to be processed are stored.
  • Now that the Resource Manager has contacted data node 1, the Node Manager on data node 1 will launch a daemon called the Application Master (App Master). It is the responsibility of the Application Master to run the job.
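A job reaches the Resource Manager through the YARN client API. Below is a minimal sketch of this submission step, assuming a reachable cluster with yarn-site.xml on the classpath; the application name, the placeholder launch command, and the 1 GB / 1 vCore sizing are illustrative assumptions, not values from this article.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToResourceManager {
    public static void main(String[] args) throws Exception {
        // The client talks only to the Resource Manager (the cluster master).
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-job"); // illustrative name

        // Describe the container that will run the Application Master.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("echo 'App Master would start here'"), // placeholder command
                null, null, null);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vCore for the App Master

        // The Resource Manager picks a Node Manager and launches the App Master there.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```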

One dedicated Application Master is launched for every single job. If I had submitted three jobs, three App Masters would have been launched.

  • The Application Master will then contact the Resource Manager. By contacting the Resource Manager, the App Master comes to know that the data is actually stored on data nodes 2 & 3.
  • Once the App Master knows from the Resource Manager that the actual data is stored on data nodes 2 & 3, it contacts the Node Managers on those two data nodes and launches something called a container on each of them.
  • A container is a simple Java process, i.e. a JVM. The job/program gets executed inside the container.
  • If it is a MapReduce program, the map tasks and reduce tasks will be executed inside these containers.
  • When a job is running on a data node and needs more resources, such as RAM or more processing power, the App Master contacts the Resource Manager and asks it to allocate those resources (see the sketch after this list). So the Resource Manager is a global entity that monitors the jobs running across the entire cluster and allocates the required resources.
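To make the container negotiation concrete, here is a minimal sketch of the Application Master side using the AMRMClient API: register with the Resource Manager, request containers on the nodes holding the data, and collect what was granted. The host names (datanode2, datanode3), the 2 GB / 1 vCore container size, and the priority are illustrative assumptions.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // The App Master registers itself with the Resource Manager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for containers on the nodes where the data blocks live
        // (data nodes 2 & 3 in the article's example; host names are assumptions).
        String[] dataNodes = {"datanode2", "datanode3"};
        Resource capability = Resource.newInstance(2048, 1); // 2 GB RAM, 1 vCore per container
        for (String node : dataNodes) {
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, new String[]{node}, null, Priority.newInstance(0)));
        }

        // Heartbeat/allocate call: the Resource Manager hands back granted containers,
        // each of which runs as a JVM on the chosen Node Manager.
        AllocateResponse response = rmClient.allocate(0.1f);
        for (Container container : response.getAllocatedContainers()) {
            System.out.println("Got container " + container.getId() + " on " + container.getNodeId());
        }

        // When the job finishes, the App Master unregisters from the Resource Manager.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    }
}
```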

Node Manager

The Node Manager has one more responsibility: the Node Manager on a data node continuously monitors the resources of its data node, such as used RAM, total RAM capacity, unused RAM and available storage space, and sends this status back to the Resource Manager.
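One way to see the status that the Node Managers report is to ask the Resource Manager for node reports through the client API. A minimal sketch, assuming a reachable cluster:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeStatusSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Each report reflects what the Node Manager has heartbeated to the Resource Manager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capacity=" + node.getCapability()   // total memory / vCores on the node
                    + " used=" + node.getUsed()             // resources currently allocated
                    + " containers=" + node.getNumContainers());
        }
        yarnClient.stop();
    }
}
```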

Scheduler

The Scheduler is part of the Resource Manager. It can schedule jobs on a first-in, first-out (FIFO) basis, but that is not the best way to schedule work. So there are also a Fair Scheduler and a Capacity Scheduler, which do the job better than the FIFO scheduler.
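Which scheduler the Resource Manager uses is controlled by the yarn.resourcemanager.scheduler.class property in yarn-site.xml. Below is a minimal sketch that simply reads the configured value; the class names in the comments are the standard FIFO, Fair and Capacity scheduler implementations shipped with Hadoop.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
    public static void main(String[] args) {
        // Loads yarn-site.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        // YarnConfiguration.RM_SCHEDULER == "yarn.resourcemanager.scheduler.class"
        String scheduler = conf.get(YarnConfiguration.RM_SCHEDULER);
        System.out.println("ResourceManager scheduler: " + scheduler);

        // Typical values for this property:
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
        //   org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    }
}
```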