Tool Write Up – Hadoop

One exercise I do every week as part of Getting Dramatically Better as a Programmer is learn a new tool. This week, I took a look at Hadoop. I’m going to walk through what I learned and mention a few interesting items I got out of it. This is both for forcing me to explain what I learned and for others to see what’s so cool about Hadoop. Most of what I learned was from the Hadoop docs.

Overview

Hadoop is a framework designed for dealing with large amounts of data. There’s two main pieces to Hadoop. There’s HDFS, the Hadoop filesystem, and there is Hadoop MapReduce.

HDFS is a distributed file system based on the Google File System. It looks like a regular old file system, but stores the data across many different servers. HDFS handles problems like coordinating what data is stored where and replicating the files. HDFS is also able to handle files much larger than an ordinary filesystem can deal with. HDFS is the part of Hadoop that handles the raw data storage.

The other part of Hadoop is Hadoop MapReduce. Like HDFS, Hadoop MapReduce is based on a tool created by Google. As the name suggests, it is based on the MapReduce framework. Hadoop MapReduce is the part of Hadoop that handles processing data. It takes care of moving data between many different servers and performing aggregations over the data.

Together, the two parts of Hadoop solve one of the hardest problems that come up when processing large amounts of data: running code across many different servers. Hadoop takes care of this by:

  • Breaking down the input data into many different pieces spread across many different servers.
  • Handling the scheduling of each part of the data processing job across many different servers.
  • Rescheduling the necessary parts of the job if a machine fails.
  • Aggregating the data from each part of the job together.

The rest of this post will mainly discuss the data processing portion of Hadoop, Hadoop MapReduce.

Outline of Hadoop MapReduce

As intended, writing a job to process large amounts of data using Hadoop MapReduce is straightforward. The framework takes care of all the hard parts. As the programmer, you only have to do two things. You only need to specify a “map” function and a “reduce” function. That’s all there is to it!

The core idea of MapReduce is that the map function takes a bit of raw input and processes it. The reduce function then takes the output from multiple map calls and combines them together. Any calculation that can be specified in terms of these two functions is trivial to turn into a distributed and fault tolerant job with Hadoop.

To be more specific, a Hadoop job takes an HDFS file as input. Hadoop splits up the HDFS into many different chunks (this is often already done by HDFS). Hadoop then calls the map function on each chunk. For each chunk, the map function processes it and returns an intermediate result. The result is in the form of a list of key-value pairs.

For each unique key, Hadoop gathers the list of all pairs with that key together. Hadoop then passes the list of values associated with that key to the reduce function. The reduce function combines the list of values into a single value. Hadoop will then write the result of the reduce function for each key to an output file.

Many different types of calculations can be expressed in terms of map and reduce and that’s exactly what makes Hadoop MapReduce so powerful.

Before we dig into what’s going on under the hood, let’s take a look at an example MapReduce application.

Example: Word Count

The code for this section is taken from this tutorial from the Hadoop docs.

The classic example of a MapReduce job is the word count problem. We want to count the number of times each word occurs in a file. Doing this when the file is on the order of gigabytes is no problem. When the file gets on the order of terabytes or even petabytes, you are going to have to bring out the more powerful tools like Hadoop.

To write the Hadoop job, we need to write a map function and a reduce function. As mentioned above, the map function takes a chunk of raw data and produces an intermediate result for that chunk. For the word count problem, we can write a map function that takes the raw text and produces a list of key-value pairs of the form (<word>, 1) for each word in the text.

To write the function, we need to write a class that extends the Mapper class. Then we override the map function with the definition we want. This looks like the following:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

The map function takes the chunk of input, splits up the text into individual words, and outputs the key-value pair (<word>, 1) for each word. There is some MapReduce specific code, but the example should still be easy to understand.

Writing the reduce function is just as easy. We want to get the number of times each word appears in the file. Since the key of each tuple produced by the map function is the word seen, the reduce function is passed list of ones for each unique word. There is one entry in the list for each time the word appears.

The reduce function simply needs to take the list of values and count how many values there are. In order to allow for some optimizations by Hadoop, we will instead sum the list of values. Since each value in the list is a one, this winds up giving the same result as counting the number of values. We’ll get into what these optimizations are when we look at what’s going on under the hood. Here’s the code:

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

It simply iterates through the list of values and sums them.

Now that we have the map and reduce functions, we only need to write a small bit of code that initializes a job that uses them:

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

I bolded the important bits. The first two bolded lines specify the map function to use and the reduce function to use. The last bolded line kicks of the job and waits for it to complete. Most of the other lines are just boilerplate for initializing the job.

The above job can be ran across as many machines as you want. That makes it easy to scale the job up to much larger amounts of data than you would ordinarily be able to process without Hadoop.

Hadoop Under the Hood

In order to run a Hadoop job, Hadoop uses two kinds of processes. It has mappers, which are responsible for running the map function, and it has reducers, which are responsible for running the reduce function. Hadoop runs these processes across the different servers in the Hadoop cluster.

When running a Hadoop job, Hadoop first starts many different mapper processes. Hadoop sends the raw chunks of the file being processed to the mapper processes. The mapper processes then run the map function you specified. In order to minimize the amount of data sent over the network to the mappers, Hadoop will try to run the mapper functions on the same machines as where HDFS stores the file.

Once a mapper process finishes generating its output, it sends the output to the reducers. The mapper spreads its output across all the different reducers. To make sure all values with the same key are sent to the same reducer, Hadoop decides which reducer to send a particular key-value pair to based on the key. The process of mappers sending different chunks of their data to the reducers is known as the shuffle step.

As an optimization, the mapper will use the “combiner” function to combine values from the same mapper before they are sent to the reducer. The combiner function is an optional function you can specify specifically for this purpose. Before sending its output to the reducer, the map function will pass all values with the same key to the combiner function and then send the output of the combiner function to the reducer. This reduces the total amount of data sent over the network. Often the combiner function can be the same as the reduce function.

This is why the reducer we defined in the word count problem takes the sum of the values instead of the count of them. The reducer is really taking the sum of the values from each mapper. Each mapper had already passed the key-value pairs to the local combiner which counted how many times each word occured and handed that information off to the reducer.

As each reducer completes, they write their data to the output HDFS file. And that’s how a Hadoop MapReduce job works.

Takeaways

  • HDFS is a distributed file system that makes it easy to store lots of data.
  • Hadoop MapReduce is a framework that makes it easy to write distributed and fault tolerant jobs.

I find MapReduce pretty interesting. I do find the boilerplate a bit off putting, but that would be missing the forest for the trees. MapReduce saves way more code than the extra code needed for the boilerplate. I find it impressive how great of an abstraction MapReduce is. By dramatically limiting the programming model, it makes it trivial to write scalable software.