Categories: Big Data

Exploring Apache’s Hadoop MapReduce Tutorial

In the last post, I described the straightforward process of setting up and Ubuntu VM in which to run Hadoop. Once you can successfully run the Hadoop MapReduce example in the MapReduce Tutorial, you may be interested in examining the source code using an IDE like Eclipse. To do so, install eclipse:

sudo apt-get install eclipse-platform

Some common Eclipse settings to adjust:

  1. Show line numbers (Window > Preferences > General > Editors > Text Editors > Show Line Numbers
  2. To make Eclipse use spaces instead of tabs (or vice versa), see this StackOverflow question.
  3. To auto-remove trailing whitespace in Eclipse, see this StackOverflow question.

To generate an Eclipse project for the Hadoop source code, the src/BUILDING.txt file lists these steps (which we cannot yet run):

cd ~/hadoop-2.7.4/src/hadoop-maven-pluggins
mvn install
cd ..
mvn eclipse:eclipse -DskipTests

To be able to run these commands, we need to install the packages required for building Hadoop. They are also listed in the src/BUILDING.txt file. For the VM we set up, we do not need to install the packages listed under Oracle JDK 1.7. Instead, run these commands to install Maven, native libraries, and ProtocolBuffer:

sudo apt-get -y install maven
sudo apt-get -y install build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
sudo apt-get -y install libprotobuf-dev protobuf-compiler

Now here’s where things get interesting. The last command installs version 2.6.1 of the ProtocolBuffer. The src/BUILDING.txt file states that version 2.5.0 is required. Turns out they aren’t kidding – if you try generating the Eclipse project using version 2.6.1 (or some non 2.5.0 version), you’ll get an error similar to this one:

As suggested here and here, you can check the version by typing:

protoc --version

How do we install 2.5.0? Turns out we have to build ProtocolBuffer 2.5.0 from the source code ourselves but we need to grab the sources from Github now (unlike those now outdated instructions): https://github.com/google/protobuf/releases/tag/v2.5.0

mkdir ~/protobuf
cd ~/protobuf
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xvzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0

Now follow the instructions in the README.txt file to build the source code.

./configure --prefix=/usr
make
make check
sudo make install
protoc --version

The output from the last command should now be “libprotoc 2.5.0“. Note: you most likely need to pass the –prefix option to ./configure to avoid errors like the one below.

Now we can finally generate the Eclipse project files for the Hadoop sources.

cd ~/hadoop-2.7.4/src/hadoop-maven-plugins
mvn install
cd ..
mvn eclipse:eclipse -DskipTests

Once project-file generation is complete:

  1. Type eclipse to launch the IDE.
  2. Go to the File > Import… menu option.
  3. Select the Existing Projects into Workspace option under General.
  4. Browse to the ~/hadoop-2.7.4/src folder in the Select root directory: input. A list of the projects in the src folder should be displayed.
  5. Click Finish to import the projects.

You should now be able to navigate to the WordCount.java file and inspect the various Hadoop classes.