Apache Spark Architecture

Overview of Apache Spark Architecture

Spark is a top-level project of the Apache Software Foundation. It supports multiple programming languages and runs on several types of cluster architecture. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those working with it for the first time. The project behind Spark’s ongoing development is one of Apache’s largest and most active, with over 500 contributors from more than 200 organizations responsible for code in the current software release.

Knowing the Apache Spark architecture is essential to understanding how Spark works. The following diagrams give a clear view of it.

Spark can be deployed in three ways, as explained below.

1. Standalone

Here Spark is deployed on top of HDFS (Hadoop Distributed File System), with space allocated explicitly for HDFS. Spark can run in parallel with MapReduce on the same cluster.

2. Hadoop YARN

Here Spark runs on top of YARN without any pre-installation. This helps integrate Spark into the Hadoop ecosystem and allows other components to run on top of the stack.

3. Spark in MapReduce (SIMR)

In the absence of YARN, Spark jobs can also be launched from within MapReduce. This reduces the deployment burden.

Figure: Spark architecture model.
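
The first two deployment modes above differ mainly in the master URL passed to spark-submit. The following is a minimal sketch of that difference, not a complete launcher. The function names, the host name "master-host", and the application file name are hypothetical; 7077 is the default port of the standalone master. SIMR uses its own launcher and is not covered here.

```python
def master_url(mode, host="master-host", port=7077):
    """Return the value of spark-submit's --master option for a deployment mode.

    `host` is a placeholder; 7077 is the default standalone master port.
    """
    if mode == "standalone":
        # Spark's own cluster manager, addressed directly by host and port.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration
        # (HADOOP_CONF_DIR or YARN_CONF_DIR), so no host appears here.
        return "yarn"
    raise ValueError(f"unsupported deployment mode: {mode}")


def spark_submit_command(mode, app, deploy_mode="client", host="master-host"):
    """Build the spark-submit argument list for an application file."""
    return [
        "spark-submit",
        "--master", master_url(mode, host=host),
        "--deploy-mode", deploy_mode,
        app,
    ]


# Print the two command lines side by side for comparison.
for mode in ("standalone", "yarn"):
    print(" ".join(spark_submit_command(mode, "my_app.py")))
```

The returned list can be passed to subprocess.run on a machine where spark-submit is on the PATH.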

Spark Architecture

It consists of three main components:

  1. Data Storage
  2. API
  3. Resource Management

Data Storage:

  • Spark uses HDFS (Hadoop Distributed File System) for data storage.
  • Spark can work with any Hadoop-compatible data source, including HDFS, HBase, and Cassandra.
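
As a rough illustration of the storage layer, the sketch below reads data through PySpark's DataFrame reader. The helper names, the NameNode host, and port 8020 are assumptions made for the example. The Cassandra reader additionally assumes that the separate Spark Cassandra Connector package is available to the application; HBase access also needs a connector and is not shown.

```python
def hdfs_uri(namenode, path, port=8020):
    """Build an hdfs:// URI. Host and port are placeholders for a real NameNode address."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def read_text(spark, uri):
    """Read a text dataset as a DataFrame with one row per line.

    `spark` is an existing SparkSession; the URI scheme (hdfs://, file://, ...)
    selects the storage system.
    """
    return spark.read.text(uri)


def read_cassandra_table(spark, keyspace, table):
    """Read a Cassandra table as a DataFrame.

    Assumes the Spark Cassandra Connector is on the application's classpath.
    """
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )
```

For example, read_text(spark, hdfs_uri("namenode", "/data/events.txt")) would load a file from HDFS, given a running SparkSession named spark.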


API:

  • The Spark API lets developers create Spark-based applications, and is available for the Java, Scala, and Python programming languages.
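
To give a feel for the Python API, here is a minimal word-count sketch. The application name, the input file name, and the local[*] master (which runs Spark on a single machine, convenient for experimenting) are assumptions chosen for the example.

```python
def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def count_words(spark, path):
    """Return an RDD of (word, count) pairs for the text file at `path`."""
    lines = spark.sparkContext.textFile(path)
    return (
        lines.flatMap(tokenize)               # one record per word
        .map(lambda word: (word, 1))          # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)      # sum the counts per word
    )


def main(path="input.txt"):
    # Imported here so the helpers above can be read and tested
    # on a machine without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("WordCountSketch")
        .master("local[*]")
        .getOrCreate()
    )
    try:
        # Show up to ten (word, count) pairs.
        for word, count in count_words(spark, path).take(10):
            print(word, count)
    finally:
        spark.stop()
```

Calling main() with the path of a text file runs the job on a machine where PySpark is installed. The same program could be sent to a cluster by dropping the .master(...) call and supplying the master through spark-submit instead.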

Resource Management:

  • Spark can be deployed as a standalone server or on a distributed computing framework such as YARN.


From the above, we can conclude that Apache Spark is an open-source distributed data processing engine for clusters that provides a unified programming model across different types of data processing workloads and platforms.