Hadoop Distributed Cache

This tutorial covers the Distributed Cache in Hadoop.

Overview

In this part of the tutorial we will answer the following questions:

  • What is Distributed Cache in Hadoop?
  • What are its features?

This topic covers in detail the working and implementation of the distributed cache in the Hadoop framework. It also covers the advantages and disadvantages of the Distributed Cache.

Introduction

Before we move on to the Distributed Cache, let us recall the Hadoop framework from the previous topics.

The Hadoop framework follows a master-slave architecture. A single NameNode stores the metadata, and multiple DataNodes store the actual data to be processed.

Now let us look at the role the distributed cache plays in the framework.

* In Hadoop, data is split into chunks that are stored on multiple DataNodes and processed in parallel by a user-written program.

* Files that the application needs to access from all DataNodes during its execution are kept in the distributed cache.

Distributed Cache

* Distributed Cache is a facility provided by the Hadoop MapReduce framework to cache files that the application needs during its execution.

* It can cache read-only text files, JAR files, archives, etc. Once a file is cached, it is copied to the local file system of each slave node before any MapReduce task of the job runs on that node. This improves task performance by saving the time and resources otherwise spent on repeated input/output operations.

* In some cases every Mapper needs to read a particular file. It can then read that file from the cache.
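
The per-Mapper read described above usually amounts to loading the cached file into memory once and consulting it for every input record. The sketch below shows that logic in plain Java so that it runs without a cluster. The class name CachedLookup, the tab-separated file layout and the sample data are assumptions made for illustration. In a real job the loading step would typically sit in the Mapper's setup() method and read the localized copy of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: stands in for the lookup logic a Mapper would run.
final class CachedLookup {
    private final Map<String, String> table = new HashMap<>();

    // Load the read-only side file once (in a Mapper: inside setup()).
    // Assumed layout: one "key<TAB>value" pair per line; other lines are skipped.
    static CachedLookup load(Path localCopy) throws IOException {
        CachedLookup lookup = new CachedLookup();
        try (BufferedReader reader =
                 Files.newBufferedReader(localCopy, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.table.put(parts[0], parts[1]);
                }
            }
        }
        return lookup;
    }

    // Called once per input record (in a Mapper: inside map()).
    String resolve(String key) {
        return table.getOrDefault(key, "UNKNOWN");
    }

    public static void main(String[] args) throws IOException {
        // A temporary file plays the role of the localized cache file.
        Path file = Files.createTempFile("countries", ".tsv");
        Files.write(file, "IN\tIndia\nFR\tFrance\n".getBytes(StandardCharsets.UTF_8));

        CachedLookup lookup = CachedLookup.load(file);
        System.out.println(lookup.resolve("IN")); // prints "India"
        System.out.println(lookup.resolve("XX")); // prints "UNKNOWN"

        Files.delete(file);
    }
}
```

Loading the table once per task, rather than once per record, is what makes the local copy pay off: every call to resolve() is an in-memory lookup.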

Working and Implementation

For an application to use the distributed cache, the required file must first be made available to it. Make sure that the file exists, that it is accessible, and that it can be reached through a URL.

Note: The URL used to access the file can be either hdfs:// or http://

Once the file is available at such a URL, the user registers it as a cache file for the job. The MapReduce framework then copies the cache file to each slave node before the first task of the job starts on that node. While the job is executing, the cache file must not be modified by the application or externally.
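
Because a cache file is identified by a URL, a quick check before submitting the job can catch a wrong scheme early. The following is a minimal sketch that uses only java.net.URI. The class CacheUriCheck, the host namenode:8020 and the file names are hypothetical. The optional #name fragment at the end of the URL is the link name under which a task sees the file. Falling back to the last path segment when there is no fragment is an assumption of this sketch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-submission check; not part of the Hadoop API.
final class CacheUriCheck {
    // The two schemes named in the note above.
    private static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("hdfs", "http"));

    // True if the URI has one of the supported schemes.
    static boolean isSupported(URI uri) {
        String scheme = uri.getScheme();
        return scheme != null && ALLOWED.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // Link name for the file: the fragment if present,
    // otherwise (assumed default) the last path segment.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI jar = URI.create(
            "hdfs://namenode:8020/user/beyondcorner/lib/jar_file.jar#lib.jar");
        System.out.println(isSupported(jar)); // prints "true"
        System.out.println(linkName(jar));    // prints "lib.jar"

        URI local = URI.create("file:///tmp/lookup.txt");
        System.out.println(isSupported(local)); // prints "false"
    }
}
```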

The Hadoop distributed cache has a default size of 10 GB. The size can be controlled in the mapred-site.xml configuration file (in classic Hadoop 1.x MapReduce, through the local.cache.size property).

The Process is as Follows

1. Copy the requisite file to HDFS
$ hdfs dfs -put jar_file.jar /user/beyondcorner/lib/jar_file.jar
2. Set up the application's JobConf
DistributedCache.addFileToClassPath(new Path("/user/beyondcorner/lib/jar_file.jar"), conf);
3. Make this call in the Driver class, before the job is submitted

Private and public Distributed Caches

Distributed cache files are classified as either private or public. The type determines how they are shared on the slave nodes.

1. Private distributed cache

* As the name suggests, the distributed cache files are cached in a local directory that is private to the user whose jobs need these files.

* These files can be shared by all the tasks that belong to that particular user.

* They cannot be used by the jobs of other users on the slave nodes.

* For a distributed cache file to be deemed private, its permissions must restrict access to that user. A file becomes private if it has no world-readable access, or if any directory on the path leading to it has no world-executable access.

2. Public distributed cache

* As the name suggests, the distributed cache files are cached in a global directory, and their permissions are set up so that they are publicly visible to all users.

* These files can be shared by all the tasks of all the users on the slave nodes.

* For a distributed cache file to be deemed public, its permissions must allow access to all users. A file becomes public if it has world-readable access and every directory on the path leading to it has world-executable access.
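
The permission rule can be written down as a small function. The sketch below is an illustration, not Hadoop's own code. It assumes that a file counts as public only when it is world-readable and every directory leading to it is world-executable, and that it is private otherwise. The class CacheVisibility and the permission strings are hypothetical. Permissions are passed in as rwxrwxrwx-style strings so that the example runs without touching a file system.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical classifier for the public/private rule.
final class CacheVisibility {
    enum Visibility { PUBLIC, PRIVATE }

    // filePerms: permissions of the cache file, e.g. "rw-r--r--".
    // ancestorDirPerms: permissions of every directory on the path to the file.
    static Visibility classify(String filePerms, List<String> ancestorDirPerms) {
        Set<PosixFilePermission> file = PosixFilePermissions.fromString(filePerms);
        if (!file.contains(PosixFilePermission.OTHERS_READ)) {
            return Visibility.PRIVATE; // the file is not world-readable
        }
        for (String dirPerms : ancestorDirPerms) {
            Set<PosixFilePermission> dir = PosixFilePermissions.fromString(dirPerms);
            if (!dir.contains(PosixFilePermission.OTHERS_EXECUTE)) {
                return Visibility.PRIVATE; // a directory blocks traversal by others
            }
        }
        return Visibility.PUBLIC;
    }

    public static void main(String[] args) {
        List<String> openDirs = Arrays.asList("rwxr-xr-x", "rwxr-xr-x");
        List<String> closedDir = Arrays.asList("rwxr-xr-x", "rwx------");

        System.out.println(classify("rw-r--r--", openDirs));  // prints "PUBLIC"
        System.out.println(classify("rw-------", openDirs));  // prints "PRIVATE"
        System.out.println(classify("rw-r--r--", closedDir)); // prints "PRIVATE"
    }
}
```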

Advantages of Distributed Cache

Here are a few of the most significant benefits of using the distributed cache:

1. It distributes complex file types such as JARs and archives. After distribution, these archives can be un-archived on the slave node. It also distributes simple read-only text files.

2. The Hadoop Distributed Cache tracks the modification timestamps of cache files. This is how it detects a changed file, since cache files must not be modified while the job is executing.

3. The cache engine can always determine on which node a particular key-value pair resides by using a hashing algorithm. The cache cluster always has a single state, which keeps it consistent.

4. No single point of failure – The distributed cache runs across many nodes as an independent process. The failure of any one DataNode does not result in the failure of the complete cache.
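
The timestamp tracking mentioned in point 2 can be pictured as recording a file's modification time up front and comparing it again later. The sketch below illustrates that idea with the standard java.nio.file API. It is not Hadoop's internal implementation, and the class name CacheTimestampGuard is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical guard: remembers a file's timestamp and detects later changes.
final class CacheTimestampGuard {
    private final Path file;
    private final FileTime recorded;

    // Record the modification time when the file is registered.
    CacheTimestampGuard(Path file) throws IOException {
        this.file = file;
        this.recorded = Files.getLastModifiedTime(file);
    }

    // True while the file still carries the recorded timestamp.
    boolean unchanged() throws IOException {
        return recorded.equals(Files.getLastModifiedTime(file));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("cache", ".txt");
        CacheTimestampGuard guard = new CacheTimestampGuard(file);
        System.out.println(guard.unchanged()); // prints "true"

        // Simulate an external modification by moving the timestamp forward.
        FileTime later = FileTime.fromMillis(
            Files.getLastModifiedTime(file).toMillis() + 5000);
        Files.setLastModifiedTime(file, later);
        System.out.println(guard.unchanged()); // prints "false"

        Files.delete(file);
    }
}
```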

Disadvantages of Distributed Cache

The distributed cache also has a few disadvantages, which are listed below.

Object Serialisation:

Objects in a distributed cache need to be serialised. Two major problems of the distributed cache lie in this serialisation mechanism.

  • Very slow – Serialisation uses reflection to inspect type information at runtime, which is slower than pre-compiled code.
  • Very bulky – Serialised data takes up a lot of space and storage because it includes extra information such as the class name, cluster and assembly details, and references to other instances held in member variables.

Conclusion

The Hadoop MapReduce framework supports a distributed cache mechanism. The distributed cache is used to broadcast small or moderately sized read-only files to all the worker nodes. Once the job has run successfully, the distributed cache files are deleted from the worker nodes.