Twitter Case Study with Apache Pig

Let's discuss the Twitter case study with Apache Pig.

Introduction

We all know that Twitter is an online news and social networking service where users post and interact with messages known as "tweets". Twitter generates many kinds of data: structured data (users, block notifications, phones, favorites, saved searches, re-tweets, authentications, SMS usage, user followings) as well as unstructured and semi-structured data (Twitter Apache logs, Twitter search logs, Twitter MySQL query logs, application logs). Twitter generates around 10 TB of data per day.

As the size of this data grew, Twitter initially started using Hadoop for data storage and MapReduce for data analysis.

Example

Let's take the example of analyzing how many tweets are stored per user in the tweet table; we can use MapReduce to solve this problem. The diagram below explains the different stages of MapReduce.

Working: The MapReduce program first takes the records of the tweet table as input and sends them to the mapper function. The mapper emits each user id as a key with a corresponding value. In the next stage, the shuffle brings all values with the same user id together. Finally, the reduce function adds up the number of tweets belonging to each user. The final output is the user id, combined with the user name and the number of tweets per user.
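
To make these stages concrete, here is a minimal Pig Latin sketch of the same per-user count (Pig compiles such a script down to MapReduce); the input path and field names such as user_id are assumptions made for illustration. The comments mark the MapReduce stage each statement roughly corresponds to.

    -- Assumed input: tweet table rows of (user_id, tweet_text)
    tweets  = LOAD '/data/tweets' USING PigStorage(',')
              AS (user_id:chararray, tweet_text:chararray);
    -- Map stage: project the key (user id) out of each record
    keyed   = FOREACH tweets GENERATE user_id;
    -- Shuffle stage: GROUP brings identical user ids together
    grouped = GROUP keyed BY user_id;
    -- Reduce stage: COUNT adds up the tweets belonging to each user
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(keyed) AS tweet_count;
    DUMP counts;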

* While analyzing the data using MapReduce, we came across limitations such as:

  • Joining data sets
  • Grouping data sets
  • Sorting data sets

* The above limitations of MapReduce are solved in Apache Pig.

* In Apache Pig, joining, grouping, and sorting data sets is very simple (see the sketch after this list). The image below clearly explains the analysis of Twitter data sets using Apache Pig.
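
As a rough sketch of how simple these operations are (the relation names, paths, and schemas below are assumptions), each one takes a single Pig Latin statement:

    users   = LOAD '/data/users'  USING PigStorage(',')
              AS (user_id:chararray, user_name:chararray);
    tweets  = LOAD '/data/tweets' USING PigStorage(',')
              AS (user_id:chararray, tweet_text:chararray);
    joined  = JOIN tweets BY user_id, users BY user_id;   -- joining data sets
    grouped = GROUP tweets BY user_id;                    -- grouping data sets
    sorted  = ORDER tweets BY user_id;                    -- sorting data sets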

Example

Problem statement

"Analyze the user table and the tweet table below, and find out how many tweets are stored per user."

Solution:

  1. First, we import the user table and the tweet table into HDFS using Flume (explained in the Flume article).
  2. Then we load these tables into Pig using the LOAD operator.
  3. Next, we join and group the two tables on user id using the COGROUP operator (explained in the Grouping in Pig Latin article).
  4. We count the number of tweets using the COUNT function (explained in the eval functions article), which gives the total number of tweets per user.
  5. We then join the resulting counts with the user table to attach the user name to each count.
  6. Finally, the results of the above steps are stored back into HDFS. A Pig Latin sketch of this pipeline follows below.
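
Putting steps 2 through 6 together, here is a minimal Pig Latin sketch of the pipeline; the HDFS paths, relation names, and field names are assumptions, and the Flume import from step 1 is assumed to have already populated the input directories.

    -- Step 2: load both tables from HDFS (paths and schemas are assumed)
    users  = LOAD '/twitter/users'  USING PigStorage(',')
             AS (user_id:chararray, user_name:chararray);
    tweets = LOAD '/twitter/tweets' USING PigStorage(',')
             AS (user_id:chararray, tweet_text:chararray);
    -- Step 3: join and group the two tables on user id with COGROUP
    cogrouped = COGROUP tweets BY user_id, users BY user_id;
    -- Step 4: count the tweets collected for each user id
    counts = FOREACH cogrouped GENERATE group AS user_id, COUNT(tweets) AS tweet_count;
    -- Step 5: join the counts with the user table to pick up the user name
    named  = JOIN counts BY user_id, users BY user_id;
    result = FOREACH named GENERATE users::user_name AS user_name,
                                    counts::tweet_count AS tweet_count;
    -- Step 6: store the final result back into HDFS
    STORE result INTO '/twitter/tweets_per_user' USING PigStorage(',');

Saved as, say, tweets_per_user.pig, the script can be run with the pig command and leaves one (user_name, tweet_count) row per user in the output directory.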

The diagram below summarizes the Twitter case study with Apache Pig. Using the above methods, we can also perform sentiment analysis.