Sqoop Data Compression Techniques

Let's study Sqoop data compression techniques.

Introduction      

Big data Hadoop is mainly used to store and process huge data sets, in the range of terabytes and petabytes. To reduce the storage footprint of such data sets, we use data compression techniques.

Data compression is the method of modifying and converting the bit structure of data so that it consumes less space on disk. It reduces the storage size of one or more data instances or elements.
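The basic idea can be illustrated with a short sketch using Python's standard-library zlib module. Note this is only an analogy: Sqoop uses the Snappy codec via Hadoop, not zlib, and Snappy is not in the Python standard library. The sample data below is made up for illustration.

```python
import zlib

# Repetitive data, similar in spirit to exported database rows.
rows = b"id,name,dept\n" + b"1001,alice,engineering\n" * 1000

compressed = zlib.compress(rows)

print(f"original:   {len(rows)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"saved:      {1 - len(compressed) / len(rows):.0%}")
```

Repetitive row data compresses very well, which is why database exports are good candidates for codec compression.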

Codec snappy

Snappy is a compression codec used in Sqoop to reduce the size of imported data on storage.

With the Snappy codec, Sqoop supports two file formats for compressed imports:

  1. avrodatafile
  2. sequencefile

1. Avrodatafile compression

In this method, Sqoop compresses the "emp" table data into Hadoop storage as an Avro data file. In the example below, it reduces the data size by roughly 30%.

Example

Let us consider a table "emp" in the "beyondcorner" database of size 6.7 GB. After applying Avrodatafile compression, it reduces to 4.7 GB.
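The saving quoted above can be checked with a little arithmetic. The sizes come from this example; actual compression ratios depend entirely on the data.

```python
original_gb = 6.7    # size of the "emp" table before import
compressed_gb = 4.7  # size after the Snappy-compressed Avro import

saving = 1 - compressed_gb / original_gb
print(f"storage saved: {saving:.0%}")  # roughly 30%
```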

$ sqoop import --connect jdbc:mysql://localhost/beyondcorner \
  --table emp \
  --username root \
  --password beyonduser \
  -m 1 \
  --as-avrodatafile \
  --target-dir /home/beyond-corner/empdata \
  --compression-codec snappy \
  --driver com.mysql.jdbc.Driver

Verification

The following command verifies the compressed data in Hadoop:

$ hadoop fs -cat /home/beyond-corner/empdata/part-m-00000.avro

2. Sequencefile compression

In this method, Sqoop compresses the "beyondemployee" table data into Hadoop storage as a SequenceFile. In the example below, it reduces the data size by roughly 27%.

Example

Let us consider a table "beyondemployee" in the "beyondcorner" database of size 6.7 GB. After applying SequenceFile compression, it reduces to 4.9 GB.
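Using the figures quoted in the two examples in this article (illustrative numbers, not benchmarks), the two formats can be compared side by side:

```python
original_gb = 6.7  # source table size before import

# Compressed sizes quoted in the two examples.
sizes = {"avrodatafile": 4.7, "sequencefile": 4.9}

for fmt, compressed_gb in sizes.items():
    saving = 1 - compressed_gb / original_gb
    print(f"{fmt}: {compressed_gb} GB, saves {saving:.0%}")
```

On these figures the Avro data file format saves slightly more space, but the better choice for a real workload also depends on how the data will be read downstream.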

$ sqoop import --connect jdbc:mysql://localhost/beyondcorner \
  --table beyondemployee \
  --username root \
  --password beyonduser \
  -m 1 \
  --as-sequencefile \
  --target-dir /home/beyond-corner/beyondempdata \
  --compression-codec snappy \
  --driver com.mysql.jdbc.Driver

Verification

The following command verifies the compressed data in Hadoop:

$ hadoop fs -cat /home/beyond-corner/beyondempdata/part-m-00000

Note: Whenever we need to archive a huge data set for future use (perhaps two or three years later), compression techniques help save storage space.

Conclusion

The Snappy codec is one of the best Sqoop data compression techniques used in big data Hadoop to reduce storage size.

That's all about Sqoop data compression techniques; we can easily adopt them in our projects.