Introduction to MapReduce with Hadoop on Linux

MapReduce with Hadoop

Hadoop is mostly a Java framework, but the magically awesome Streaming utility allows us to use programs written in other languages. The programs need only obey certain conventions for standard input and output (which ours already do): read records line by line on stdin and write tab-separated key/value pairs to stdout.
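
As a reminder of what that contract looks like, here's a minimal sketch of a Streaming-style mapper for our SSH log example (a sketch only; it isn't necessarily line-for-line identical to the map.py shown earlier in the article):


#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming mapper: read log lines on stdin
# and emit "hostname<TAB>1" on stdout for every failed-login line.
import sys

for line in sys.stdin:
    if 'invalid user' in line:
        print('%s\t%d' % (line.split(':')[0], 1))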

You'll need Java 1.6.x or later (I used OpenJDK 7). The rest can and should be performed as a nonroot user.
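
To double-check which Java you have, run:


java -version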

Download the latest stable Hadoop tarball (see Resources); don't use a distro-specific (.rpm or .deb) package. I'm assuming you downloaded hadoop-1.0.4.tar.gz. Unpack it in /tmp, so the hadoop-1.0.4 directory sits alongside the map.py, reduce.py and log.txt files from earlier. If your paths differ, adjust the examples below as necessary.
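
On most systems, that amounts to something like the following (assuming the tarball was saved to /tmp):


cd /tmp
tar xzf hadoop-1.0.4.tar.gz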

Run the job on Hadoop like so:


cd /tmp/hadoop-1.0.4
bin/hadoop jar \
  contrib/streaming/hadoop-streaming-1.0.4.jar \
  -mapper /tmp/map.py -reducer /tmp/reduce.py \
  -input /tmp/log.txt -output /tmp/output
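
One caveat if you rerun the job later: Hadoop refuses to overwrite an existing output directory, so remove the old one first:


rm -rf /tmp/output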

Hadoop will log some stuff to the console. Look for the following:


...
INFO streaming.StreamJob:  map 0%  reduce 0%
INFO streaming.StreamJob:  map 100%  reduce 0%
INFO streaming.StreamJob:  map 100%  reduce 100%
INFO streaming.StreamJob: Output: /tmp/output

This means our job completed successfully. I see a file called /tmp/output/part-00000, which contains just what we expect:


dsl5.example.com        3

Now is a good time to pause, smile and reward yourself with a quad-shot grande iced caramel macchiato. You're a rockstar.

Figure 1. Here's what we did during the map and reduce steps. The transformations we performed allow us to run many mappers and reducers on as many machines as we want. Hadoop takes care of the gory details. It starts mappers and reducers, passes data between them and spits out the answer.

Clustered MapReduce

If you've got everything working so far, try starting your own cluster too! Running Hadoop on a single physical machine with multiple Java virtual machines is called pseudo-distributed operation.

Pseudo-distributed operation requires some configuration. The user you're running Hadoop as must also be able to make SSH passwordless connections to localhost. Installing and configuring this is beyond the scope of this article, but you'll find more information in the "Single Node Setup" tutorial mentioned in Resources. If you started with the 1.0.4 tarball release recommended above, the tutorial should work verbatim on any standard GNU/Linux distribution.
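
For reference, on most GNU/Linux systems the passwordless-SSH piece boils down to something like this (a sketch; the tutorial has the authoritative steps):


ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost   # should log you in without prompting for a password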

If you set up pseudo-distributed (or distributed) Hadoop, you'll gain the benefit of two spartan-but-useful Web interfaces. The NameNode Web interface allows you to browse logs and the Hadoop Distributed File System (HDFS). The JobTracker Web interface allows you to monitor MapReduce jobs and debug problems.
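
Assuming a pseudo-distributed setup on your own machine and the Hadoop 1.x default ports, they should be reachable at:


http://localhost:50070/   # NameNode Web interface
http://localhost:50030/   # JobTracker Web interface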

Figure 2. NameNode Web Interface

Figure 3. JobTracker Web Interface

Beautifully Simple Python MapReduce

You may wonder why reduce.py (above) is a convoluted mini-state machine. This is because Hadoop Streaming hands the reducer one long, sorted stream of key/value lines rather than neatly grouped values, so the reducer has to notice each time the hostname changes in order to tell where one group ends and the next begins. The Dumbo Python library (see Resources) hides this detail of Hadoop. Dumbo lets us focus even more tightly on our mapping and reducing.
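
If you don't have the original reduce.py in front of you, the pattern it has to implement looks roughly like this (a sketch, assuming tab-separated hostname/count lines arrive on stdin already sorted by hostname):


#!/usr/bin/env python
# Sketch of the key-change state machine a Streaming reducer needs:
# a change in hostname means the previous group is finished, so its
# total can be printed before counting starts for the new hostname.
import sys

last_host, total = None, 0
for line in sys.stdin:
    host, count = line.rstrip('\n').split('\t')
    if host != last_host:
        if last_host is not None:
            print('%s\t%d' % (last_host, total))
        last_host, total = host, 0
    total += int(count)
if last_host is not None:
    print('%s\t%d' % (last_host, total))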

In Dumbo, our MapReduce implementation becomes:


def mapper(key, value):
  if 'invalid user' in value:
    yield value.split(':')[0], 1

def reducer(key, values):
  yield key, sum(values)

if __name__ == '__main__':
  import dumbo
  dumbo.run(mapper, reducer)

The state machine is gone. Dumbo takes care of grouping by key (hostname).

Save the above code in a file called /tmp/smart.py, then install Dumbo (see Resources for a link; don't worry, it's easy). Once Dumbo is installed, run the code:


cd /tmp
dumbo start smart.py -hadoop hadoop-1.0.4 \
  -input log.txt -output totals \
  -outputformat text

Finally, examine the output:


cat totals/part-00000

The content should match our earlier result from Hadoop Streaming.

Non-Use Cases

Hadoop is great for one-time jobs and off-line batch processing, especially where the data is already in the Hadoop filesystem and will be read many times. My first example makes more sense if you assume this: perhaps the log analysis must run daily and finish within a few minutes.

Consider some cases when Hadoop is the wrong tool. Small dataset? Don't bother. In a one-meter race between a rocket and a scooter, the scooter is gone before the rocket's engines are started. Transactional data storage for a Web site? Try MySQL or MongoDB instead.

Hadoop also won't help you process data as it arrives, often referred to as "real-time" or "streaming" processing (not to be confused with the Hadoop Streaming utility used above). For that, consider Storm (see Resources for more information).

With practice, you'll quickly be able to discern when Hadoop is the right tool for the job.

Resources

You can download the latest stable Hadoop tarball from the Apache Hadoop site, http://hadoop.apache.org.

See http://hadoop.apache.org/docs/current/single_node_setup.html for information on how to run a pseudo-distributed Hadoop cluster.

Check out Dumbo at http://projects.dumbotics.com/dumbo if you want to do more with MapReduce in Python. See https://github.com/klbostee/dumbo/wiki/Building-and-installing for install instructions and https://github.com/klbostee/dumbo/wiki/Short-tutorial for an excellent tutorial.

See https://github.com/nathanmarz/storm for information on Storm, a real-time distributed computing system.

To run Storm and Hadoop and manage both centrally, check out the Mesos project at http://www.mesosproject.org.

______________________

Adam Monsen is a seasoned software engineer and FLOSS zealot. He lives with his wonderful wife and children in Seattle, Washington. He blogs semi-regularly at adammonsen.com.
