Apache Spark Components – Spark GraphX

Objective

The main objective behind the creation of Spark GraphX, one of the Apache Spark components, is to simplify graph analysis tasks.

Introduction

GraphX is a distributed graph-processing framework built on top of Spark. It is the Spark component for graphs and graph-parallel computation, and its API is used to perform graph analysis. It simplifies graph analytics tasks by providing a collection of graph algorithms and builders. It also provides an optimized runtime.

Benefits

  • GraphX simplifies graph analytics tasks by reusing the Spark RDD concept, and it operates on a directed multigraph, that is, a directed graph that may contain several parallel edges between the same pair of vertices.
  • It provides an API for fast and robust development of applications that work with graphs.
  • Graphs are widely used in data analytics and computer science because they are a natural data structure for describing social networks. For this reason, companies like Facebook emphasize developing graph-processing software.
  • GraphX optimizes the representation of vertices and edges when their types are primitive data types.
  • GraphX supports fundamental operators such as subgraph, joinVertices, and aggregateMessages.
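
To make the directed multigraph and the three operators named above more concrete, here is a minimal sketch in plain Python. It is not GraphX code (GraphX exposes a Scala API on top of RDDs); the function names subgraph, aggregate_messages and join_vertices only mirror the idea of the corresponding operators, their signatures are simplified assumptions, and the vertex and edge data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int
    dst: int
    attr: str


# Hypothetical data. Vertex id -> vertex property.
vertices = {1: "alice", 2: "bob", 3: "carol"}

# A list rather than a set, so parallel edges (1 -> 2 appears twice) are allowed.
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "likes"),
    Edge(2, 3, "follows"),
    Edge(3, 1, "follows"),
]


def subgraph(vertices, edges, edge_pred):
    """Keep every vertex, but only the edges for which edge_pred is true."""
    return vertices, [e for e in edges if edge_pred(e)]


def aggregate_messages(edges, send, merge):
    """Each edge emits (vertex_id, message) pairs; messages per vertex are merged."""
    inbox = {}
    for edge in edges:
        for vertex_id, message in send(edge):
            if vertex_id in inbox:
                inbox[vertex_id] = merge(inbox[vertex_id], message)
            else:
                inbox[vertex_id] = message
    return inbox


def join_vertices(vertices, table, combine):
    """Update the property of each vertex that has an entry in table."""
    return {
        vid: combine(attr, table[vid]) if vid in table else attr
        for vid, attr in vertices.items()
    }


# In-degree: every edge sends the message 1 to its destination vertex.
in_degree = aggregate_messages(edges, lambda e: [(e.dst, 1)], lambda a, b: a + b)
labelled = join_vertices(vertices, in_degree, lambda name, deg: f"{name} ({deg} in)")
_, follows_only = subgraph(vertices, edges, lambda e: e.attr == "follows")

print(in_degree)
print(labelled)
print(len(follows_only))
```

Keeping the edges in a list is what makes this a multigraph: the two edges from vertex 1 to vertex 2 are both retained and both counted in the in-degree of vertex 2.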

Features

  1. Flexibility
  • Spark GraphX works with both graphs and collections.
  • GraphX unifies ETL (Extract, Transform and Load), exploratory analysis, and iterative graph computation within a single system.
  • Spark GraphX is an API designed to manipulate graphs.
  • We can therefore view the same data as both graphs and collections, and transform and join graphs with RDDs efficiently.
  2. Speed
  • Spark GraphX can combine transformations, machine learning, and graph computation in a single system at high speed, which makes Spark one of the most powerful frameworks.
  3. Growing Algorithm Library
  • Spark GraphX has a number of built-in graph algorithms, including PageRank, connected components, label propagation, SVD++, and triangle count.
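
As an illustration of what one of these library algorithms computes, the following is a small plain-Python sketch of connected components. It is not the GraphX implementation; it only assumes the common convention of labelling every vertex with the lowest vertex id in its component, and the vertex ids and edges are hypothetical.

```python
def connected_components(vertex_ids, edges):
    """Label each vertex with the smallest vertex id reachable from it,
    treating edges as undirected. Repeats until no label changes."""
    label = {v: v for v in vertex_ids}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            low = min(label[src], label[dst])
            for v in (src, dst):
                if label[v] != low:
                    label[v] = low
                    changed = True
    return label


# Hypothetical graph: {1, 2, 3} and {4, 5} are connected, 6 is isolated.
components = connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
print(components)
```

Vertices that end up with the same label belong to the same component, so grouping by label gives the components themselves.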

Understanding Spark GraphX with an Example

Figure: Flight Example with GraphX

In the diagram above, a flight travels between three different places, namely SFO, ORD and DFW, and the distances between these locations are labeled on the connecting routes.

GraphX is used to analyze the flight routes. In Spark GraphX, each location is called a vertex (V) and each connecting route is called an edge (E).
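
The vertex and edge view of the flight example can be sketched in plain Python as follows. This is not GraphX code. Since the figure is not reproduced here, the route directions and distance values below are placeholder assumptions, and the names Route and triplets are hypothetical helpers for the illustration.

```python
from typing import NamedTuple


class Route(NamedTuple):
    src: int       # id of the origin vertex
    dst: int       # id of the destination vertex
    distance: int  # edge property (placeholder value)


# Vertices: airport id -> airport code.
airports = {1: "SFO", 2: "ORD", 3: "DFW"}

# Edges: directed routes between airports (directions and distances assumed).
routes = [Route(1, 2, 1800), Route(2, 3, 800), Route(3, 1, 1400)]


def triplets(vertices, edges):
    """Join each edge with the properties of its two end vertices."""
    return [(vertices[e.src], e.distance, vertices[e.dst]) for e in edges]


for origin, distance, destination in triplets(airports, routes):
    print(f"{origin} -> {destination}: {distance}")
```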

Use Cases of Graph Computation

The following use cases of Spark GraphX give an idea of what graph computation can do and of the scope for implementing new solutions using graphs.

1. Disaster Detection System

Graphs can be used to detect disasters such as earthquakes, tsunamis, forest fires and volcanic eruptions, so that warnings can be issued to alert people.

2. PageRank

PageRank can be used to find the influencers in any network, such as a social media network.
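
A minimal sketch of the idea in plain Python is shown below. It uses the textbook power-iteration form of PageRank with ranks that sum to 1, which is an assumption made for the illustration and not necessarily how GraphX computes or scales its ranks. The account names and the follows list are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over (source, target) edges."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out_links[v]:
                new_rank[dst] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical "follows" edges: three accounts follow dee, dee follows ann.
follows = [("ann", "dee"), ("bob", "dee"), ("cat", "dee"), ("dee", "ann")]
ranks = pagerank(follows)
top_influencer = max(ranks, key=ranks.get)
print(top_influencer)
```

An account ranks highly when many accounts, or a few highly ranked accounts, point to it, which is why the most-followed account comes out on top here.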

3. Financial Fraud Detection

Graph analysis can be used to detect people involved in financial fraud and money laundering, and also to monitor financial transactions.

4. Business Analysis

Graphs are also used in business analysis to understand customers' purchase trends, e.g. at Uber, McDonald’s, etc.

5. Geographic Information Systems

Graphs are used to develop functionality in geographic information systems, such as watershed delineation and weather prediction.

Use Case – Flow Diagram

Figure: Use Case – Flow diagram of Flight Data Analysis using Spark GraphX

The steps involved in flight data analysis using Spark GraphX are as follows:

1. Collecting a huge amount of flight data.

2. Storing the real-time flight data in a database.

3. Creating a graph using GraphX.

4. Querying the data, for example:

4.1 Compute the longest flight routes.

4.2 Calculate the top busiest airports.

4.3 Calculate the routes with the lowest flight cost.

5. Visualizing the results using Google Data Studio.

6. The final step is getting the specific results.
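
The three queries in step 4 can be sketched in plain Python over a small in-memory list of flights. This is only an illustration of the queries, not GraphX code. The Flight record, the function names, and all distances and costs are hypothetical, and "busiest" is assumed here to mean the largest number of arriving plus departing flights.

```python
from collections import Counter
from typing import NamedTuple


class Flight(NamedTuple):
    origin: str
    destination: str
    distance: int
    cost: float


# Hypothetical flight records; each one is a directed edge with two properties.
flights = [
    Flight("SFO", "ORD", 1800, 320.0),
    Flight("ORD", "DFW", 800, 150.0),
    Flight("DFW", "SFO", 1400, 210.0),
    Flight("SFO", "DFW", 1400, 190.0),
    Flight("ORD", "SFO", 1800, 300.0),
]


def longest_routes(flights, n=1):
    """4.1: the n routes with the greatest distance."""
    return sorted(flights, key=lambda f: f.distance, reverse=True)[:n]


def busiest_airports(flights, n=1):
    """4.2: the n airports with the most arriving plus departing flights."""
    traffic = Counter()
    for f in flights:
        traffic[f.origin] += 1
        traffic[f.destination] += 1
    return traffic.most_common(n)


def cheapest_routes(flights, n=1):
    """4.3: the n routes with the lowest cost."""
    return sorted(flights, key=lambda f: f.cost)[:n]


print(longest_routes(flights)[0].distance)
print(busiest_airports(flights))
print(cheapest_routes(flights)[0])
```

In graph terms, 4.1 and 4.3 sort the edges by an edge property, while 4.2 ranks the vertices by their degree.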

Conclusion

From the above, we can conclude that Spark GraphX is a component for graphs and graph-parallel computation whose API simplifies graph analysis. It is a boon for data scientists who analyze real-time data.
