Spark RDD (Resilient Distributed Datasets)
The main objective of RDD is to achieve faster and more efficient MapReduce-style operations in Spark.
Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. An RDD is immutable: it is a read-only, partitioned collection of records. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain objects of any type, including user-defined classes, in Python, Java, and Scala.
RDD (Resilient Distributed Dataset) stands for:
- Resilient – fault-tolerant with the help of the RDD lineage graph (DAG); Spark can recompute missing or damaged partitions caused by node failures.
- Distributed – the data resides on multiple nodes of the cluster.
- Datasets – the records of the data being worked on. Users can load a dataset externally from a JSON file, CSV file, text file, or a database via JDBC, without imposing a specific data structure.
There are two ways to create an RDD:
- Parallelizing – distributing a collection that already exists in the driver program.
- Referencing – pointing to a dataset in an external storage system such as a shared file system, HDFS, or HBase.
- RDDs improve performance by keeping data in memory.
- RDDs provide fault tolerance efficiently by exposing a coarse-grained programming interface.
- RDDs save time and improve efficiency, because computation happens only when it is needed.
- RDDs support interactive data-mining tools and iterative algorithms.
- RDDs are immutable, so their content cannot be modified; this keeps the level of consistency high.
Features of RDD
- Fault Tolerance
- Spark RDDs have the capability to operate on data and to recover it after a failure occurs.
- They rebuild lost data on failure using lineage: each RDD remembers how it was created from other datasets, so it can recreate itself.
- In-memory Computation
- Spark RDDs have a feature of in-memory computation: intermediate results are stored in RAM instead of on disk.
- Lazy Evaluations
- In Apache Spark all transformations are lazy: they do not compute their results right away. Instead, Spark just remembers the transformations applied to the base dataset.
- Apache Spark computes the transformations only when an action requires a result for the driver program.
- Immutability
- Once created, an RDD cannot be changed, because of its read-only abstraction.
- An RDD can be transformed from one form to another using transformations such as map, filter, join, and cogroup.
- The immutable nature of RDDs helps Spark maintain a high level of consistency.
- Partitioning
- In Spark RDDs, the partition is the fundamental unit of parallelism.
- Each partition is one logical division of the data and is itself immutable; new partitions are created through transformations on existing ones.
- Partitions of an RDD are distributed across the nodes of the cluster.
- Persistence
- The persistence of RDDs enables fast computations.
- Users can mark an RDD for reuse and choose where to store it, either on disk or in memory.
- Parallel Operation
- Spark RDDs process data in parallel.
- Coarse-grained Operations
- These operations are applied to every element in a dataset, for example map, filter, and groupBy.
- Typed
- Spark RDDs can hold values of various data types, such as Int, Long, and String.
- Placement Preference
- Spark RDDs can define a placement preference for computing partitions, i.e. information about the preferred location of the RDD's data.
Figure: Features of Spark RDD
RDDs offer two types of operations:
- Transformations
- Transformations are functions that take an RDD as input and produce one or more RDDs as output.
- A transformation cannot change its input RDD, because RDDs are immutable.
- A transformation creates a new dataset from an existing one.
- There are two kinds of transformations: narrow transformations and wide transformations.
- Actions
- An action in Spark returns the final result of RDD computations.
- Actions produce non-RDD values.
- An action is one of the ways to send a result from the executors to the driver.
- An action stores its value either in the driver or in an external storage system.
Note: Transformations are used to create an RDD from an existing one, but to get results out of the datasets an action is needed.
There are also some limitations of Apache Spark RDDs:
- RDDs do not support automatic optimization; there is no query optimizer for RDD code.
- RDDs do not offer run-time type safety.
- RDDs have storage limitations: performance degrades when data does not fit in memory.
- RDDs are not well suited to handling structured data, since they carry no schema.
- RDDs have performance limitations, largely due to object serialization and garbage-collection overhead.
Note: Spark DataFrames resolve most of these drawbacks of RDDs.
From the above we conclude that RDDs are immutable, partitioned collections of objects spread across a cluster. They are stored in RAM or on disk, built through lazy parallel transformations, and automatically rebuilt on failure.