Apache Spark Components – SparkR

Objective

The main idea behind Apache Spark Components – SparkR is to support large-scale analytics with an R package and to bring the scalability of Spark to R users.

Introduction

          SparkR is an R package that provides a light-weight frontend for using Apache Spark from R, combining the advantages of both Spark and R. It supports distributed machine learning through MLlib. The key component of SparkR is the DataFrame.
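
A minimal sketch of getting started (assuming a local Spark installation; the built-in R dataset faithful is used only for illustration):

  # Load the SparkR package and start a local Spark session.
  library(SparkR)
  sparkR.session(master = "local[*]", appName = "SparkR-intro")

  # Convert a local R data.frame into a distributed SparkR DataFrame.
  df <- createDataFrame(faithful)
  head(df)          # first rows of the distributed DataFrame
  printSchema(df)   # inferred schema

  sparkR.session.stop()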

Figure: Combination of Spark and R

DataFrame

  • It is the fundamental data structure for data processing in SparkR.
  • It supports operations such as selection, filtering, and aggregation; a minimal sketch of these operations follows this list.
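
A sketch of these DataFrame operations, again using the built-in faithful dataset for illustration:

  library(SparkR)
  sparkR.session(appName = "SparkR-dataframe")

  df <- createDataFrame(faithful)

  # Selection: keep only the eruptions column.
  head(select(df, df$eruptions))

  # Filtering: rows whose waiting time is below 50 minutes.
  head(filter(df, df$waiting < 50))

  # Aggregation: count observations per waiting time and sort by frequency.
  waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
  head(arrange(waiting_counts, desc(waiting_counts$count)))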

Advantages of SparkR

  • The SparkR Data Source API can read data from different sources, for example Hive tables, JSON files, and Parquet files (see the sketch after this list).
  • SparkR DataFrames rely on Spark SQL's optimized execution engine for code generation and memory management.
  • SparkR is scalable because DataFrames are distributed across the Spark cluster, so it can process terabytes of data.
  • SparkR lets R code benefit from the performance of Apache Spark.
  • SparkR is used in different areas such as time series forecasting and web analytics.
  • Using SparkR, programmers and data scientists can turn R into a tool for big data analytics by taking advantage of parallel processing.
  • SparkR performs lazy evaluation on DataFrame operations.
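
A sketch of the Data Source API and of lazy evaluation; the file paths and the people Hive table below are hypothetical placeholders:

  library(SparkR)
  sparkR.session(appName = "SparkR-datasources", enableHiveSupport = TRUE)

  # The same logical data can be read from different sources.
  people_json    <- read.df("data/people.json", source = "json")
  people_parquet <- read.df("data/people.parquet", source = "parquet")
  people_hive    <- sql("SELECT * FROM people")   # Hive table, if one exists

  # Transformations are lazy: this line only builds a logical plan...
  adults <- filter(people_json, people_json$age >= 18)

  # ...nothing is computed until an action such as count() runs.
  count(adults)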

SparkR Architecture

Figure: SparkR Architecture

  • The SparkR architecture consists of two main components:
  1. Bridging R and the JVM
  2. Spawning R workers
  • The bridge between R and the JVM on the driver allows R programs to submit jobs to a Spark cluster, while the second component supports running R on the Spark executors.
  • In SparkR, operations run on DataFrames are automatically distributed across all the nodes available in the Spark cluster.
  • A socket-based API is used to invoke functions on the JVM from R. Sockets are supported across platforms in both Java and R, and they are available without any external libraries in either language.
  • The overhead of using sockets is low compared with other inter-process communication approaches.
  • The SparkR JVM backend supports two types of RPCs:
  1. Method invocation
  2. Creating new objects
  • The second component of SparkR launches R processes on the Spark executor machines; a sketch of this follows the list.
  • SparkR automatically serializes the variables necessary to execute a function on the cluster.
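
A sketch of the second component in action: spark.lapply() and dapply() run R processes on the executors, and variables captured from the driver (base_rate below is a made-up example value) are serialized automatically:

  library(SparkR)
  sparkR.session(appName = "SparkR-workers")

  # A driver-side variable; SparkR serializes it so the R workers on the executors can use it.
  base_rate <- 0.05

  # spark.lapply() runs the function in R worker processes, one list element per task.
  results <- spark.lapply(1:4, function(x) x * (1 + base_rate))
  print(results)

  # dapply() applies an R function to each partition of a DataFrame;
  # the output schema must be declared explicitly.
  df <- createDataFrame(data.frame(amount = c(100, 250, 400)))
  schema <- structType(structField("amount", "double"),
                       structField("with_rate", "double"))
  out <- dapply(df, function(part) cbind(part, part$amount * (1 + base_rate)), schema)
  head(out)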

Use Cases of SparkR

  1. SparkR is used to perform data cleaning, aggregation, and sampling using cluster resources rented in the cloud.
  2. SparkR appeals to R users who are constrained by single-threaded execution or lack of memory on their local machines.
  3. SparkR can handle the preprocessing of data to generate training features and labels that are given as input to a machine learning algorithm (see the sketch after this list).
  4. Using SparkR to run partition-aggregate workflows can dramatically increase speed in such scenarios.
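
A sketch of such a workflow, with preprocessing on the cluster followed by a distributed MLlib model; the input path and the columns user_id and amount are hypothetical:

  library(SparkR)
  sparkR.session(appName = "SparkR-usecase")

  # Hypothetical input; any Data Source API format would work here.
  events <- read.df("data/events.parquet", source = "parquet")

  # Cleaning, sampling, and aggregation performed on the cluster.
  clean    <- dropna(events)
  sampled  <- sample(clean, withReplacement = FALSE, fraction = 0.1)
  features <- summarize(groupBy(sampled, sampled$user_id),
                        visits  = n(sampled$user_id),
                        revenue = sum(sampled$amount))

  # Feed the aggregated features to a distributed generalized linear model.
  model <- spark.glm(features, revenue ~ visits, family = "gaussian")
  summary(model)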

Figure: Use Cases Of SparkR

Conclusion

          From the above topic we can conclude that SparkR is an R package that combines the advantages of both Spark and R, bringing the scalability of Spark to R and making it suitable for large-scale analytics.
