Flume Hands-on Streaming Twitter Data

In this article, we will study how to send streaming data to HDFS using Apache Flume.

Introduction

Flume is designed to fetch streaming data from various web servers and transport it to centralized stores such as HDFS or HBase for analytical processing.

In the Flume Architecture article we studied how a web server generates streaming data, which is collected by the Flume agent's source. The channel buffers this data for a sink, which finally pushes it to a centralized store such as HDFS.

Example

In this hands-on example we will create a Twitter application to fetch streaming Twitter data, configure Flume to push that data into HDFS, and then verify the data.

Create a Twitter application

In order to get streaming data, we first need to create a Twitter application.

Step 1: Visit https://apps.twitter.com/ and sign in with your Twitter account, then click on Create New App to create an application.

Step 2: Clicking the Create New App button opens an application form in which we have to fill in our details to create the app. While filling in the website address, give the complete URL pattern, for example http://example.com.

Step 3: Fill in the details, accept the Developer Agreement, and click the Create your Twitter application button. The app will then be created with the given details.

Step 4: Next, click on the Keys and Access Tokens tab; at the bottom of that page you will find a button named Create my access token. Click it to generate the access token.

Step 5: Copy the consumer key, consumer secret, access token, and access token secret. We will pass these four values to the Flume configuration file so that the agent can connect to this application.

Configure the Flume

Now create a flume.conf file in the Flume root directory as shown below. As discussed in the Flume Configuration article, we configure a Source, a Channel, and a Sink. Here the source is Twitter, from which we stream the data, and the sink is HDFS, where we write the data.

$ cd $FLUME_HOME

$ sudo gedit flume.conf

In the source configuration, we set the source type to Twitter, i.e. “org.apache.flume.source.twitter.TwitterSource”, and pass all four tokens received from Twitter. Finally, the source configuration lists the keywords on which tweets will be fetched.

In the sink configuration, we set HDFS properties such as the HDFS path, write format, file type, and batch size, as in the sample configuration below.
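Below is a minimal sample flume.conf for this example. The agent name (TwitterAgent), the component names (Twitter, MemChannel, HDFS), the keyword list, and the namenode address in the HDFS path are illustrative assumptions; replace the placeholder keys and tokens with the values generated for your own application.

# Name the components of the agent (assumed agent name: TwitterAgent)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source: Twitter streaming source with the four tokens from the Twitter app
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
# Sample keywords; change to the topics you want to fetch
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

# Sink: write events to HDFS (namenode address below is an assumption)
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/beyond-corner/twitterdata/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Channel: in-memory buffer between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

A memory channel is used here for simplicity; a file channel can be substituted if events must survive an agent restart.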

Execution

The command below is used to start the Flume agent. The agent name passed with -n must match the agent name used in flume.conf (TwitterAgent in the sample configuration above).

$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f $FLUME_HOME/flume.conf -n TwitterAgent
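While testing, console logging can optionally be enabled so that you can watch events being collected and written (flume.root.logger is a standard Flume log4j property, not something specific to this example):

$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f $FLUME_HOME/flume.conf -n TwitterAgent -Dflume.root.logger=INFO,console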

Verification

The command below is used to verify the files created in HDFS.

$ hadoop fs -ls  /user/Hadoop/beyond-corner/twitterdata/
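Once files appear, their contents can be sampled as well. FlumeData is the default file prefix of the HDFS sink, so the pattern below is only illustrative; the actual file names carry a timestamp suffix:

$ hadoop fs -cat /user/Hadoop/beyond-corner/twitterdata/FlumeData.* | head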

Conclusion

In this article we created a Twitter app to get streaming data, then configured Flume with a source, channel, and sink to transfer the data into HDFS. The diagram below summarizes this example of streaming Twitter data.