Apache Spark Components – Spark Streaming
Introduction to Apache Spark Components – Spark Streaming
More than 50% of users consider Spark Streaming one of the most important components of Apache Spark. It processes real-time streaming data from sources such as sensors, IoT devices, social networks, and online transactions, and it is based on a micro-batch style of computing and processing.
Features of Apache Spark Components – Spark Streaming
* Ease of use
Spark Streaming provides a set of APIs for building streaming applications in much the same way as batch jobs, with support for Java, Scala, and Python.
* Fault Tolerance
Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box, and it allows the same code to be reused for batch processing.
* Spark Integration
Spark Streaming combines streaming with batch and interactive queries.
Working Procedure of Apache Spark Components – Spark Streaming
Figure: Working Process of Spark Streaming
- Spark Streaming receivers ingest streaming data from different sources such as sensors, IoT (Internet of Things) devices, social networks, and online transactions.
- The incoming stream is broken down into small batches of input data.
- The Spark engine processes these batches in parallel on a cluster.
- The batches of processed results are pushed to external systems such as HBase, Cassandra, Kafka, etc.
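The four steps above can be simulated on plain Scala collections, with no cluster required; the toy "stream" and the batch size below are illustrative assumptions, not Spark API:

```scala
// Toy micro-batch simulation: a "stream" of text lines is cut into small
// batches, each batch is processed (word count), and each result is
// handed to a downstream sink (here, println).
val stream = List("a b", "b c", "c a", "a a")   // incoming records (assumed data)
val batchSize = 2                               // assumed micro-batch size

val results = stream.grouped(batchSize).map { batch =>
  batch.flatMap(_.split(" "))                   // split records into words
       .groupBy(identity)                       // group equal words together
       .map { case (w, ws) => (w, ws.size) }    // count each word per batch
}.toList

results.foreach(println)                        // push each batch downstream
```

Each element of `results` is one processed micro-batch; in real Spark Streaming the batching and parallel processing are handled by the engine itself.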
Applications of Apache Spark Components – Spark Streaming
1. Streaming ETL
Data is continuously cleaned and aggregated before being pushed into data stores.
2. Triggers
Anomalous behavior is detected in real time and downstream actions are triggered accordingly, e.g. unusual readings from sensor devices generating alerts.
3. Data enrichment
Real-time data is enriched by joining it with a static dataset, allowing for a more complete real-time analysis.
4. Complex sessions and continuous learning
Events related to a live session (e.g. user activity after logging into a website or application) are grouped together and analyzed. In some cases, the session information is used to continuously update machine learning models.
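Application 4 (sessionization) amounts to grouping events by a session key; a minimal plain-Scala sketch with hypothetical event data (in Spark Streaming this role is played by stateful operators):

```scala
// Hypothetical event log: (user, action) pairs arriving from a live site.
case class Event(user: String, action: String)

val events = List(
  Event("alice", "login"), Event("bob", "login"),
  Event("alice", "click"), Event("alice", "logout")
)

// Group the live events by user to reconstruct per-session activity.
val sessions = events.groupBy(_.user).map { case (u, es) => (u, es.map(_.action)) }

println(sessions("alice"))   // alice's session: login, click, logout
```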
Consider a word count example that counts each word appearing in a document. Take the following text as input, saved as an input.txt file in the home directory.

input.txt – input file:

A farmer lives in a village.
He grows crops and keeps animals.
He works very hard.
He gets up early morning and begins to work in the field.
He continues to work from morning to evening.
Follow the procedure given below to execute the word count example.
The following command is used to open the Spark shell. Spark is generally built using Scala, so a Spark program runs in the Scala environment.

$ spark-shell
Create an RDD
The command below reads a file from the home directory and creates an RDD.

scala> val inputfile = sc.textFile("input.txt")
Execute Word count Transformation
In this step we count the words in the file.
- Create a flat map to split each line into words (flatMap(line => line.split(" "))).
- Map every word to a key with the value 1 (<key, value> = <word, 1>) using the map function (map(word => (word, 1))).
- Finally, reduce those keys by adding the values of identical keys (reduceByKey(_ + _)).
- The following command executes the word count logic. Because it is a transformation, not an action, it produces no output yet.

scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
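The same flatMap/map/reduce chain can be checked on plain Scala collections, independent of Spark (groupBy plus a sum stands in for reduceByKey here):

```scala
val lines = List("he works very hard", "he gets up early")

val counts = lines
  .flatMap(_.split(" "))                          // split each line into words
  .map(word => (word, 1))                         // pair each word with 1
  .groupBy(_._1)                                  // group the pairs by word
  .map { case (w, ps) => (w, ps.map(_._2).sum) }  // sum the 1s per word

println(counts("he"))   // "he" occurs twice in the sample lines
```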
Caching the Transformations
The command below marks the intermediate transformations to be stored in memory once an action computes them.

scala> counts.cache()
Applying the Action
The following command applies an action, saving the output to a text file (here, an output directory in the current path).

scala> counts.saveAsTextFile("output")
From the above we can conclude that Spark Streaming is used to process real-time streaming data from different data sources.