Apache Spark Components – Spark Streaming

Introduction of Apache Spark Components – Spark Streaming

More than 50% of users consider Spark Streaming one of the most important components of Apache Spark. It is used to process real-time streaming data from different sources such as sensors, IoT devices, social networks, and online transactions, and it is based on a micro-batch style of computing.

Features of Apache Spark Components – Spark Streaming

Ease of use

Spark Streaming provides a set of APIs that make it easy to create streaming applications in much the same way as batch jobs, with support for Java, Scala, and Python.

Fault Tolerance

Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box.


Spark Integration

Spark Streaming combines streaming with batch and interactive queries, which allows the same code to be reused for batch processing.


Working Procedure of Apache Spark Components – Spark Streaming


Figure:  Working Process of Spark Streaming

  • Spark Streaming receivers receive streaming data from different sources such as sensors, IoT (Internet of Things) devices, social networks, and online transactions.
  • The received stream is broken down into small batches of input data.
  • The stream processing engine processes these batches in parallel on a cluster.
  • The batches of processed results are pushed to external systems such as HBase, Cassandra, Kafka, etc.
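The receive–batch–process–output cycle above can be sketched in plain Scala (no Spark required); `microBatches` and `process` are illustrative names, not Spark APIs:

```scala
// 1. A continuous stream of records arriving from a source.
val incoming: Iterator[String] = Iterator("a", "b", "a", "c", "b", "a", "d")

// 2. The receiver breaks the stream down into small batches.
def microBatches(stream: Iterator[String], batchSize: Int): Iterator[Seq[String]] =
  stream.grouped(batchSize)

// 3. The engine processes each batch (here: count records per key).
def process(batch: Seq[String]): Map[String, Int] =
  batch.groupBy(identity).map { case (k, vs) => k -> vs.size }

// 4. Processed batches would then be pushed to a sink (here: collected in a list).
val results: List[Map[String, Int]] = microBatches(incoming, 3).map(process).toList
```

In real Spark Streaming the batching interval is a time window rather than a fixed record count, and each batch is an RDD processed across the cluster.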

Applications of Apache Spark Components – Spark Streaming

1. Streaming ETL

Here, data is continuously cleaned and aggregated before being pushed into data stores.
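A minimal sketch of one ETL micro-batch in plain Scala, assuming hypothetical `sensor,value` input lines: malformed records are dropped (clean) and values are summed per sensor (aggregate) before the result would be written to a store. (`toDoubleOption` requires Scala 2.13+.)

```scala
case class Reading(sensor: String, value: Double)

// Clean: parse a raw line, skipping anything malformed.
def parse(line: String): Option[Reading] = line.split(",") match {
  case Array(s, v) => v.toDoubleOption.map(Reading(s, _))
  case _           => None
}

// Aggregate: total value per sensor for this batch.
def etlBatch(lines: Seq[String]): Map[String, Double] =
  lines.flatMap(parse)
       .groupBy(_.sensor)
       .map { case (s, rs) => s -> rs.map(_.value).sum }

val out = etlBatch(Seq("s1,2.0", "bad-line", "s1,3.0", "s2,1.5"))
```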


2. Triggers

Anomalous behavior is detected in real time, and further downstream actions are triggered accordingly, e.g. unusual behavior of sensor devices generating alerts.
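As an illustration, a hypothetical threshold-based trigger over one batch of sensor readings might look like this (the `limit` and the alert format are made up for the sketch):

```scala
// Flag any reading in the batch that exceeds the expected limit.
def anomalies(batch: Seq[(String, Double)], limit: Double): Seq[String] =
  batch.collect { case (sensor, v) if v > limit => s"alert:$sensor" }

val alerts = anomalies(Seq("s1" -> 20.0, "s2" -> 95.0, "s3" -> 30.0), limit = 90.0)
```

In a real pipeline the alerts would be pushed to a downstream system instead of collected locally.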

3. Data enrichment

Real-time data is enriched with more information by joining it with a static dataset, allowing for a more complete real-time analysis.
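The join with a static dataset can be sketched in plain Scala; `deviceInfo` is a hypothetical lookup table kept in memory alongside the stream:

```scala
// Static dataset: device id -> location.
val deviceInfo: Map[String, String] = Map("d1" -> "warehouse", "d2" -> "office")

// Enrich each (device, reading) event with its location; unknown ids get a default.
def enrich(batch: Seq[(String, Double)]): Seq[(String, String, Double)] =
  batch.map { case (id, v) => (id, deviceInfo.getOrElse(id, "unknown"), v) }

val enriched = enrich(Seq("d1" -> 21.5, "d9" -> 19.0))
```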

4. Complex sessions and continuous learning

Events related to a live session (e.g. user activity after logging into a website or application) are grouped together and analyzed. In some cases, the session information is used to continuously update machine learning models.
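The grouping step can be sketched in plain Scala, assuming hypothetical `(user, action)` events within one batch:

```scala
// Group a batch of (user, action) events into per-user session activity,
// preserving the order in which each user's actions arrived.
def sessions(events: Seq[(String, String)]): Map[String, Seq[String]] =
  events.groupBy(_._1).map { case (user, es) => user -> es.map(_._2) }

val s = sessions(Seq("alice" -> "login", "bob" -> "login", "alice" -> "click"))
```

Real sessionization also tracks state across batches (e.g. session timeouts), which this single-batch sketch omits.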


Consider a word count example that counts each word appearing in a document. Take the following text as input, saved as an input.txt file in the home directory.

input.txt - input file:

A farmer lives in a village.

He grows crops and keeps animals.

He works very hard.

He gets up early morning and begins to work in the field.

He continues to work from morning to evening.

Follow the procedure given below to execute the word count example.

Open Spark-Shell

The following command is used to open the Spark shell. Spark is built using Scala, so a Spark program runs in a Scala environment.

$ spark-shell    

Create an RDD

The below command is used for reading a file from the home directory.

scala> val inputfile = sc.textFile("input.txt")

Execute Word count Transformation

In this step, we count the words in the file.

  • Create a flat map to split each line into words: flatMap(line => line.split(" ")).
  • Read every word as a key with a value of 1 (<key, value> = <word, 1>) using the map function: map(word => (word, 1)).
  • Finally, reduce those keys by adding the values of identical keys: reduceByKey(_+_).
  • The following command executes the word count logic. Since it is a transformation, not an action, it does not produce output yet.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
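To see what each stage produces, the same flatMap → map → reduce-by-key pipeline can be run on plain Scala collections (no Spark needed; reduceByKey is emulated here with groupBy plus a sum):

```scala
val lines = Seq("He works very hard", "He grows crops")

// flatMap: split every line into individual words.
val words = lines.flatMap(_.split(" "))

// map: pair each word with the value 1, i.e. (word, 1).
val pairs = words.map(word => (word, 1))

// reduceByKey equivalent: add up the values of identical keys.
val counts = pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }
```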

Caching the Transformations

The following command marks the RDD to be cached, so the intermediate result is kept in memory once an action computes it.

scala>  counts.cache()

Applying the Action

The following command is used to save the output to a text file:

scala> counts.saveAsTextFile("output")


From the above topic we can conclude that Spark Streaming is used to process real-time streaming data from different data sources.