Sqoop 1 and Sqoop 2 Architecture and Working

Let’s look at the architecture of Sqoop 1 and Sqoop 2 and how each one works.

Sqoop 1 Architecture

 

* Sqoop provides a command-line interface to the end user; using its commands, the user performs both import and export of data.

* The Sqoop service can also be accessed through a Java API.

* The map phase alone is sufficient for Sqoop to perform both import and export. A reduce phase is not required because Sqoop does not aggregate data.

* Sqoop performs two main functions:

  1. Import
  2. Export
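As a rough illustration of the command-line interface, the sketch below assembles the argument lists for an import and an export run. The flags used (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--username`, `--num-mappers`) are standard Sqoop 1 options, but the helper function, the JDBC URL, the table names, and the HDFS paths are hypothetical and only there for the example.

```python
import shlex


def build_sqoop_command(tool, connect, table, hdfs_dir, username=None, num_mappers=None):
    """Assemble the argument list for a Sqoop 1 `import` or `export` run."""
    if tool not in ("import", "export"):
        raise ValueError("tool must be 'import' or 'export'")
    # import writes into --target-dir; export reads from --export-dir
    dir_flag = "--target-dir" if tool == "import" else "--export-dir"
    args = ["sqoop", tool, "--connect", connect, "--table", table, dir_flag, hdfs_dir]
    if username is not None:
        args += ["--username", username]
    if num_mappers is not None:
        args += ["--num-mappers", str(num_mappers)]
    return args


# Hypothetical database, tables, and paths (assumptions for illustration only)
import_cmd = build_sqoop_command(
    "import", "jdbc:mysql://dbhost/sales", "orders", "/data/orders",
    username="etl", num_mappers=4,
)
export_cmd = build_sqoop_command(
    "export", "jdbc:mysql://dbhost/sales", "orders_summary", "/data/orders_summary",
    username="etl",
)

print(shlex.join(import_cmd))
print(shlex.join(export_cmd))
```

The resulting lists can be passed to `subprocess.run` on a machine where the `sqoop` client is installed.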

Working

Import: When the end user submits a Sqoop import command, Sqoop first prepares a map-only job and then launches multiple mappers; the number of mappers depends on the value the user sets on the command line. Sqoop divides the input data among the mappers as evenly as it can to get high performance. Each mapper then opens a JDBC connection to the database, fetches the part of the data assigned to it by Sqoop, and writes it into HDFS, Hive, or HBase, depending on the option given on the command line.
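The way input data is divided among mappers can be sketched as follows. Sqoop 1 typically looks up the minimum and maximum values of a split column and gives each mapper one slice of that range. The code below is a simplified sketch for an integer column, not Sqoop's actual splitter; the function names, the table `orders`, and the column `id` are assumptions made for the example.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive range [min_val, max_val] into contiguous,
    near-equal inclusive ranges, one per mapper (fewer if the range is small)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")
    total = max_val - min_val + 1
    parts = min(num_mappers, total)
    base, extra = divmod(total, parts)
    ranges = []
    lo = min_val
    for i in range(parts):
        # the first `extra` ranges take one additional value each
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def mapper_queries(table, column, ranges):
    """Build the bounded SELECT each mapper would run over JDBC (illustrative only)."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"
        for lo, hi in ranges
    ]


# Hypothetical table and split column
for query in mapper_queries("orders", "id", split_ranges(1, 100, 4)):
    print(query)
```

With an `id` range of 1 to 100 and four mappers, each mapper receives 25 consecutive ids. A heavily skewed split column would make the slices uneven in row count even though the value ranges are equal.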

Export: Export works much like import. When the end user submits a Sqoop export command, Sqoop first prepares a map job, and each map task reads a chunk of data from HDFS. Combining all of these chunks, the destination, that is, the RDBMS (MySQL, Oracle, or SQL Server), receives the whole data set.
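The export flow can be mimicked in a few lines. In this sketch, a list of delimited strings stands in for the files in HDFS, an in-memory SQLite database stands in for the target RDBMS, and each "map task" inserts one chunk. All names here (`chunk`, `export_chunk`, the `users` table) are hypothetical, and a single shared connection is used for simplicity, whereas real map tasks would each open their own JDBC connection.

```python
import sqlite3


def chunk(lines, num_tasks):
    """Cut lines into num_tasks contiguous chunks of near-equal size."""
    base, extra = divmod(len(lines), num_tasks)
    chunks, start = [], 0
    for i in range(num_tasks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def export_chunk(conn, table, lines, delimiter=","):
    """Parse one chunk of delimited lines and insert the rows into the table."""
    rows = [tuple(line.split(delimiter)) for line in lines]
    if not rows:
        return 0
    placeholders = ", ".join("?" for _ in rows[0])
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)


# Stand-in for delimited records stored in HDFS
hdfs_lines = ["1,alice", "2,bob", "3,carol", "4,dave", "5,erin"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# Two simulated map tasks, each exporting its own chunk
written = [export_chunk(conn, "users", part) for part in chunk(hdfs_lines, 2)]
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(written, total)
```

Each chunk lands in the table independently, and the table holds the complete data set only once every task has finished.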

Sqoop 2 Architecture

* The main design goals of Sqoop 2 are:

  1. Ease of use
  2. Ease of extension
  3. Security

* Sqoop 2 supports both command-line interaction and a web-based GUI, so the end user can perform import and export through either one.

* The UI (user interface) is built on top of a REST API. A command-line client exposing similar functionality can use the same API, and the API makes it easy to integrate Sqoop with other systems.

* Sqoop connectors are the main extension points for Sqoop.

* In this architecture, connectors and drivers are managed centrally in one place, and connectors can be non-JDBC based.

* Sqoop 2 is built with a security mechanism: administrators create connections with the necessary resources in advance, so that end users can use these predefined connection objects without requiring access to sensitive connection information.

* It is integrated with Oozie for interoperability and management.

* Users can operate Sqoop from a remote host using a web browser or the command line.

* In Sqoop 2, a reduce phase is used along with the map phase for:

  1. Checking connectivity through connectors (that is, Sqoop functionality is uniformly available for all connectors)
  2. Handling data formats during transport
  3. Performing Hive/HBase integration
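The idea of predefined connection objects from the list above can be sketched as a small server-side repository: an administrator registers a connection with its credentials, and end users submit jobs by connection name only. This is a conceptual sketch under stated assumptions, not the Sqoop 2 API; the classes `Connection` and `ConnectionRepository`, the role strings, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """A predefined connection object; credentials stay on the server side."""
    name: str
    jdbc_url: str
    username: str
    password: str


class ConnectionRepository:
    """Hypothetical central store of connections, managed by administrators."""

    def __init__(self):
        self._connections = {}

    def create(self, role, connection):
        # Only administrators may register connections
        if role != "admin":
            raise PermissionError("only administrators can create connections")
        self._connections[connection.name] = connection

    def names(self):
        """What an end user can see: connection names, never credentials."""
        return sorted(self._connections)

    def submit_import(self, connection_name, table):
        """Submit an import job by connection name only."""
        connection = self._connections[connection_name]
        # A real server would open connection.jdbc_url here using the stored
        # credentials; the job description returned to the user omits them.
        return {"connection": connection.name, "table": table, "status": "SUBMITTED"}


repo = ConnectionRepository()
repo.create("admin", Connection("sales-db", "jdbc:mysql://dbhost/sales", "etl", "s3cret"))

job = repo.submit_import("sales-db", "orders")
print(repo.names(), job)
```

The end user works with the name `sales-db` alone, while the JDBC URL and password never leave the repository.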

Conclusion

From this article we can conclude that some of the drawbacks of Sqoop 1 are resolved in Sqoop 2. Beginners can easily interact with Sqoop 2 through its GUI.

References

https://en.wikipedia.org/wiki/Sqoop

https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

 

That’s all about the Sqoop 1 and Sqoop 2 architecture. To master Sqoop, follow our next article on Sqoop commands.