Apache Spark Features

1. Swift Processing

  • Spark can process data up to about 100 times faster than Hadoop MapReduce in memory and about 10 times faster on disk.
  • Speed matters because organizations need to process large volumes of data as quickly as possible.
  • Spark gains this speed by reducing the number of read and write operations to disk.

2. Dynamic in Nature

  • With Spark, it is easy to develop parallel applications because it provides more than 80 high-level operators.

3. Usability

  • Spark supports multiple languages, including Java, Scala, Python, and R, with high-level APIs for writing applications quickly.

4. In-Memory Computation

  • In-memory processing increases the processing speed. Spark caches data in memory, so it does not need to fetch the data from disk every time, which saves time.
  • Spark has a DAG (directed acyclic graph) execution engine that supports in-memory computation and acyclic data flow, which results in high speed.
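
The effect of caching described above can be illustrated with a minimal sketch in plain Python. This is not Spark's API: the class `CachedDataset` and the function `load_numbers` are hypothetical names invented for this illustration, and the "disk read" is only simulated.

```python
class CachedDataset:
    """Toy stand-in for a cached dataset; not Spark's API."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical function that "reads from disk"
        self._cache = None      # in-memory copy, filled on first access
        self.disk_reads = 0     # counts how often the loader actually ran

    def collect(self):
        if self._cache is None:
            # First access: pay the cost of the (simulated) disk read once.
            self.disk_reads += 1
            self._cache = list(self._loader())
        return self._cache


def load_numbers():
    # Pretend this is a slow read from disk.
    return range(1, 6)


dataset = CachedDataset(load_numbers)
total = sum(dataset.collect())     # first access loads the data
largest = max(dataset.collect())   # second access is served from memory
```

After both calls, `dataset.disk_reads` is still 1, because the second computation reuses the in-memory copy instead of going back to disk.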

5. Re-usability

  • Spark code can be reused for batch processing, for joining a stream against historical data, and for running ad hoc queries on stream state.
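
The idea of reusing one piece of logic for both batch and stream data can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions, not Spark code: `count_words` and `micro_batches` are hypothetical names, and the "stream" is simulated by a generator that yields small batches.

```python
from collections import Counter


def count_words(lines):
    """Shared logic: count words in any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Batch: process a complete, stored dataset in one call.
historical = ["spark is fast", "spark is general"]
batch_counts = count_words(historical)


# Streaming (simulated): small batches arrive one after another.
def micro_batches():
    yield ["spark streams"]
    yield ["streams are fast"]


# Apply the same function to each batch and merge into running state.
stream_counts = Counter()
for batch in micro_batches():
    stream_counts += count_words(batch)

# Combine the stream state with the historical result.
combined = batch_counts + stream_counts
```

The same `count_words` function serves both paths, so the logic is written and tested only once.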

6. Fault Tolerance

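  • Spark provides fault tolerance through its core abstraction, the RDD (Resilient Distributed Dataset).
  • Spark RDDs are designed to handle the failure of any worker node in the cluster. Lost partitions can be recomputed from the lineage of transformations that produced them, so the loss of data is reduced to zero.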
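
The recovery idea can be sketched in plain Python. This is a toy model, not Spark's implementation: `LineageDataset` and its methods are hypothetical names, and a "worker failure" is simulated by deleting one computed partition. The sketch assumes the source data is still available and the transformation is deterministic.

```python
class LineageDataset:
    """Toy dataset that remembers how to rebuild each partition; not Spark's API."""

    def __init__(self, source_partitions, transform):
        self._source = source_partitions   # durable input, one list per partition
        self._transform = transform        # deterministic function applied per element
        # Computed partitions, as if held in memory by worker nodes.
        self._computed = {
            i: [transform(x) for x in part]
            for i, part in enumerate(source_partitions)
        }

    def lose_partition(self, index):
        # Simulate a worker failure that wipes one computed partition.
        del self._computed[index]

    def get_partition(self, index):
        if index not in self._computed:
            # Recompute only the missing partition from its recorded lineage.
            self._computed[index] = [self._transform(x) for x in self._source[index]]
        return self._computed[index]


squares = LineageDataset([[1, 2], [3, 4]], lambda x: x * x)
before = list(squares.get_partition(1))   # copy of the partition before the failure
squares.lose_partition(1)                 # the "worker" holding partition 1 fails
after = squares.get_partition(1)          # rebuilt from source and transformation
```

Only the lost partition is rebuilt; partition 0 is untouched, and `after` holds the same values as `before`.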

7. Real-Time Stream Processing

  • Spark offers real-time stream processing, which overcomes a drawback of Hadoop MapReduce.
  • Hadoop MapReduce can handle and process data that is already stored, but it does not support real-time data processing. Spark Streaming solves this problem.

8. Lazy Evaluation

  • All transformations on Spark RDDs are lazy: they do not produce a result right away. Instead, each transformation defines a new RDD from an existing one, and the computation runs only when an action asks for a result. This increases the efficiency of the system.
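
Lazy evaluation can be illustrated with a minimal sketch in plain Python. It is not Spark's API: `LazyDataset` is a hypothetical class that only mimics the behavior described above, where transformations are recorded and nothing runs until an action (`collect` here) is called.

```python
class LazyDataset:
    """Toy dataset with lazy transformations; not Spark's API."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps run, in a single pass.
        result = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def double(x):
    calls.append(x)   # record when the function actually runs
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
pipeline = numbers.map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has been computed
result = pipeline.collect()        # the action triggers the computation
```

Here `calls_before_action` is 0 and `result` is `[6, 8]`. Because the steps are known before anything runs, they can be applied together in one pass over the data, which is one reason laziness helps efficiency.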

9. Spark Compatibility with Hadoop and Existing Hadoop Data

  • Spark is compatible with both versions of the Hadoop ecosystem: it can run on YARN (Yet Another Resource Negotiator) or through SIMR (Spark in MapReduce).
  • It can also run independently, and it can work alongside existing Hadoop MapReduce applications.

10. Active, Progressive and Expanding Community

  • Spark is used by a wide set of developers from over 100 companies. The project has active mailing lists and uses JIRA for issue tracking.
  • It is one of the most active projects in the Apache repository.

Figure: Features of Spark

Conclusion

From the features above, we can conclude that Spark is one of the most advanced and popular projects of the Apache community. It provides what is needed to work with streaming data, and it includes a machine learning library. It can also work on structured and unstructured data, process graphs, and more.
