Monday, December 16, 2013

MapReduce

My research requires processing huge capture files from CAIDA (hundreds of gigabytes), so I decided to set up a cluster and use the MapReduce programming model, specifically Hadoop.

Installing Hadoop
The first step is installing Hadoop (I found the following tutorials useful):
https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/


HDFS
Then we create two folders in HDFS, input and output, and upload the files to process into the input folder. (Keep in mind that Hadoop refuses to write to an output folder that already exists, so it has to be removed before each run.)
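For reference, the commands look roughly like this (the local path is just a placeholder):

someone@anynode:hadoop$ bin/hadoop dfs -mkdir input
someone@anynode:hadoop$ bin/hadoop dfs -put /path/to/preprocessed_capture.txt input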

someone@anynode:hadoop$ bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x   - username supergroup          0 2013-12-22 12:51 /user/username/input
drwxr-xr-x   - username supergroup          0 2013-12-22 12:50 /user/username/output

More info: http://developer.yahoo.com/hadoop/tutorial/module2.html

The input to our program, after preprocessing, is a text file with one packet per line (source IP, source port, destination IP, destination port, timestamp):
132.156.180.68 22 146.25.6.199 33980 1070631061.367672000

MapReduce (number of packets):
We start with a simple program that counts the number of packets each server sends and receives.

We develop a custom key class to pass data from the map phase to the reduce phase. This class has to implement WritableComparable (cf. [1, Example 4.7] or http://developer.yahoo.com/hadoop/tutorial/module5.html).
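
Here is a minimal sketch of what such a key class could look like, following the pattern of [1, Example 4.7]. The name ServerKey and its two fields (the server IP plus a sent/received direction) are illustrative placeholders, not the final code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: a server IP together with the direction of the traffic
// ("sent" or "received"), so the reducer gets one count per (IP, direction).
public class ServerKey implements WritableComparable<ServerKey> {

    private Text ip = new Text();
    private Text direction = new Text();

    public ServerKey() {}                       // no-arg constructor required by Hadoop

    public ServerKey(String ip, String direction) {
        this.ip.set(ip);
        this.direction.set(direction);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        ip.write(out);                          // serialize both fields
        direction.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        ip.readFields(in);                      // deserialize in the same order
        direction.readFields(in);
    }

    @Override
    public int compareTo(ServerKey other) {     // defines the sort order of keys
        int cmp = ip.compareTo(other.ip);
        return (cmp != 0) ? cmp : direction.compareTo(other.direction);
    }

    @Override
    public int hashCode() {                     // used by the default partitioner
        return ip.hashCode() * 163 + direction.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerKey)) return false;
        ServerKey that = (ServerKey) o;
        return ip.equals(that.ip) && direction.equals(that.direction);
    }

    @Override
    public String toString() {                  // format used by TextOutputFormat
        return ip + "\t" + direction;
    }
}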

The complete code is coming pretty soon.
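
In the meantime, this is roughly how the mapper, reducer, and driver could be wired together, assuming the ServerKey sketch above and the five-field input format; all class names are again placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PacketCount {

    public static class PacketMapper
            extends Mapper<LongWritable, Text, ServerKey, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Expected fields: srcIP srcPort dstIP dstPort timestamp
            String[] f = line.toString().trim().split("\\s+");
            if (f.length < 5) return;           // skip malformed lines
            // Each packet counts once for the sender and once for the receiver.
            context.write(new ServerKey(f[0], "sent"), ONE);
            context.write(new ServerKey(f[2], "received"), ONE);
        }
    }

    public static class PacketReducer
            extends Reducer<ServerKey, LongWritable, ServerKey, LongWritable> {

        @Override
        protected void reduce(ServerKey key, Iterable<LongWritable> counts,
                Context context) throws IOException, InterruptedException {
            long sum = 0;                       // total packets for this (IP, direction)
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "packet count");
        job.setJarByClass(PacketCount.class);
        job.setMapperClass(PacketMapper.class);
        job.setCombinerClass(PacketReducer.class);   // counts can be pre-summed locally
        job.setReducerClass(PacketReducer.class);
        job.setOutputKeyClass(ServerKey.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would then be launched with something like:

someone@anynode:hadoop$ bin/hadoop jar packetcount.jar PacketCount input output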



My reference guide:
[1] Tom White, "Hadoop: The Definitive Guide", O'Reilly.
