Apache Spark Components-Apache Spark SQL

1. Objective

The goal of Apache Spark SQL is to overcome the drawbacks of Apache Hive. The main drawback of Hive is that when a query gets stuck partway through processing, it cannot recover from the failure. Spark SQL came into existence to address this limitation.

2. Introduction

Spark SQL is an Apache Spark component that runs on top of Apache Spark. It is a programming module designed to process structured data. It supports a programming abstraction called DataFrame and acts as a distributed SQL query engine. Spark SQL provides interfaces that give Spark more information about the structure of the data and the computation being performed. There are two ways to interact with Spark SQL:

1. DataFrame API

2. Dataset API

Both APIs are explained in the Architecture section. The same Spark SQL execution engine is used irrespective of the API or language, such as Scala, Java, or Python. Spark SQL makes Apache Spark accessible to more users and improves optimization. Spark SQL introduces an optimizer called Catalyst and supports a wide range of data sources and algorithms in Big Data.

2.1 Apache Spark SQL uses structured and semi-structured data in three ways

2.1.1 DataFrames simplify working with structured data; Spark SQL provides the DataFrame abstraction in Python, Java, and Scala.

2.1.2 Data can be read and written in different structured formats such as JSON, Hive tables, and Parquet.

2.1.3 Spark SQL can query data both from within a Spark program and from external tools, i.e. through standard database connectors (JDBC or ODBC) to Spark SQL.

Note:  The best way to use Spark SQL is inside a Spark application.
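A minimal sketch of this recommended usage, assuming Spark 2.x or later with a local master; the application name, table, and column names below are illustrative, not taken from the article:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point to Spark SQL
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")        // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame from in-memory structured data
    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")

    // Register it as a temporary view and query it with plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```

Running SQL inside the application like this returns DataFrames, so the query results can be passed straight into further Spark transformations.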

3. Apache Spark SQL Architecture

Figure: Apache Spark SQL Architecture


3.1 Data Source API

  • Data Source API is a universal API for loading and storing structured data.
  • It has built-in support for Hive, Avro, JSON, JDBC, Parquet, etc.
  • Third-party integration is supported through Spark packages.
  • Examples of data sources are Parquet files, JSON documents, Hive tables, and Cassandra databases.
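As a sketch of how these sources are addressed through one uniform API (assuming an existing `SparkSession` named `spark`; the file paths are hypothetical):

```scala
// Read structured data through the uniform Data Source API
val fromJson    = spark.read.json("people.json")        // JSON document
val fromParquet = spark.read.parquet("events.parquet")  // Parquet file

// Write a DataFrame back out in a different format
fromJson.write.mode("overwrite").parquet("people.parquet")

// Third-party sources plug in by format name via Spark packages,
// e.g. Cassandra (requires the spark-cassandra-connector package):
// spark.read.format("org.apache.spark.sql.cassandra")
//      .options(Map("keyspace" -> "ks", "table" -> "users"))
//      .load()
```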

3.2 Data Frame API

  • The DataFrame API is a distributed collection of data organized into named columns and rows.
  • It provides data abstraction and a DSL (Domain Specific Language) for structured and semi-structured data.
  • It can process data ranging from kilobytes on a single-node cluster to petabytes on multi-node clusters.
  • It supports different data formats such as Avro, CSV, Elasticsearch, and Cassandra, and storage systems such as HDFS, Hive, and MySQL.
  • The DataFrame API is available for Python, Java, Scala, and R, as shown in the diagram below.
  • DataFrames integrate easily with all Big Data tools and frameworks via Spark Core.
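A short sketch of the DSL (assuming an existing `SparkSession` named `spark`; the data and column names are made up):

```scala
import org.apache.spark.sql.functions.col

// assumes: val spark: SparkSession, already created
import spark.implicits._

val employees = Seq(("Alice", "HR", 50000), ("Bob", "IT", 65000))
  .toDF("name", "dept", "salary")

// The DSL expresses queries as method calls instead of SQL text
employees
  .filter(col("salary") > 55000)   // keep rows with salary above 55000
  .select("name", "dept")
  .show()
```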

3.3 SQL Interpreter And Optimizer

  • It is a technically evolved component of Spark SQL that performs analysis, optimization, planning, and runtime code generation (supporting multiple processes and threads).
  • It allows queries to run much faster than equivalent RDD (Resilient Distributed Dataset) operations.
  • It supports cost-based optimization (based on run time and resource utilization) and rule-based optimization (a set of rules that determine how to execute the query).
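One way to observe the optimizer at work is `explain(true)`, which prints the plans Catalyst produces (a sketch, assuming a `SparkSession` named `spark` with its implicits imported; the data is illustrative):

```scala
val df = Seq((1, "a"), (2, "b")).toDF("id", "text")

// explain(true) prints the parsed, analyzed, and optimized logical
// plans, followed by the physical plan that Catalyst selected
df.filter($"id" > 1).select("text").explain(true)
```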

3.4 SQL Service

  • It provides a way to work with structured data in Spark.
  • It supports the creation of DataFrames and the execution of SQL queries.

4. Features of Apache Spark SQL

4.1 Integrated

  • Apache Spark SQL integrates SQL queries with Spark programs, allowing structured data to be queried as distributed datasets (RDDs).
  • It provides integrated APIs in Python, Scala, and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.

4.2 Unified Data Access

  • Apache Spark SQL can load and query data from different sources such as Hive, Avro, Parquet, ORC, JSON, and JDBC.
  • It can even join data across these different sources, which helps accommodate all existing users in Spark SQL.
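Such a cross-source join might look as follows (a sketch, assuming a `SparkSession` named `spark`; the file names and columns are hypothetical):

```scala
// assumes: val spark: SparkSession, already created
val users  = spark.read.json("users.json")        // e.g. columns: id, name
val orders = spark.read.parquet("orders.parquet") // e.g. columns: id, amount

// A single query can join data loaded from two different sources
users.join(orders, "id")
  .groupBy("name")
  .sum("amount")
  .show()
```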

4.3 Hive Compatibility

  • Apache Spark SQL supports running unmodified Hive queries on existing warehouses.
  • Spark SQL reuses the Hive metastore and frontend, giving full compatibility with existing Hive data, user-defined functions, and queries.

4.4 Standard Connectivity

  • Apache Spark SQL connects through JDBC or ODBC.
  • It follows the industry standards of JDBC and ODBC connectivity with server mode.
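For illustration, a plain JDBC client can connect to a running Spark Thrift Server (started with `sbin/start-thriftserver.sh`) over the standard HiveServer2 protocol; the host, port, credentials, and table name below are assumptions, and the Hive JDBC driver must be on the classpath:

```scala
import java.sql.DriverManager

// Connect to the Spark Thrift Server via the Hive JDBC driver
val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()

// Run an ordinary SQL query through the standard JDBC API
val rs = stmt.executeQuery("SELECT count(*) FROM people")
while (rs.next()) println(rs.getLong(1))
conn.close()
```

Because the server speaks standard JDBC/ODBC, existing BI tools can query Spark SQL the same way.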

4.5 Scalability

  • Spark SQL uses the same engine for both interactive and long-running queries.
  • By taking advantage of the RDD model, it supports mid-query fault tolerance and large jobs.

4.6 Performance Optimization

  • The optimization engine converts each SQL query into a logical plan, which is then converted into many candidate physical execution plans.
  • Among all these plans, it selects the most optimal physical plan for execution.

5. Functions defined by Apache Spark SQL

5.1 Built-in functions defined by Apache Spark SQL are used to process column values. The command below imports the built-in functions.

   import org.apache.spark.sql.functions._

5.2 Spark SQL supports User Defined Functions (UDFs), which wrap user-defined functions written in Scala.

The example below defines a user defined function that converts a given text to upper case.

  • Creating a "beyond corner" dataset:
val dataset = Seq((0, "beyond"), (1, "corner")).toDF("id", "text")
  • Defining a function 'upper' that converts a string to upper case:
val upper: String => String = _.toUpperCase
  • Importing the 'udf' helper into the program:
import org.apache.spark.sql.functions.udf
  • Creating the UDF 'upperUDF' from the function 'upper':
val upperUDF = udf(upper)
  • Displaying the results of the created user defined function in a new column 'upper':
dataset.withColumn("upper", upperUDF('text)).show()


5.3 Spark SQL defines aggregate functions that operate on a group of rows and calculate a single return value per group.
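A sketch of grouping with aggregate functions (assuming a `SparkSession` named `spark` with its implicits imported; the data is made up):

```scala
import org.apache.spark.sql.functions.{avg, count, max}

val sales = Seq(("books", 10.0), ("books", 20.0), ("toys", 5.0))
  .toDF("category", "price")

// Each aggregate function returns one value per group
sales.groupBy("category")
  .agg(count("*").as("n"), avg("price").as("avg_price"), max("price"))
  .show()
```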

6. Uses of Apache Spark SQL

  • It is used to execute SQL queries.
  • It supports reading data from an existing Hive installation.
  • When SQL is run from within another programming language, the result is returned as a Dataset or DataFrame.

7. Disadvantages of Apache Spark SQL

• Apache Spark SQL does not support creating or reading tables containing union fields.
• Spark SQL does not support Hive transactions.
• It does not support timestamps in Avro tables.
• It does not support the Char type.
• Spark SQL does not raise an error for oversized varchar values.



        From the above topics we conclude that Apache Spark SQL is a component of Apache Spark used to analyze structured data. Spark SQL makes Apache Spark accessible to more users. It provides scalability and ensures high compatibility of the system.