Setting up Apache Hadoop

2017-09-30 —Categories: Big Data

Setting up Apache Hadoop

As part of my Dynamic Big Data course, I have to set up a distributed file system to experiment with various mapreduce concepts. Let’s use Hadoop since it’s widely adopted. Thankfully, there are instructions on how to set up Apache Hadoop – we’re starting with a single cluster for now.

I guess we can include VM setup instructions for completeness (for those that don’t already have a Linux environment).

Download and install VirtualBox then grab a copy of the Ubuntu desktop ISO.
Create a new VM in VirtualBox. I elected to use the default 1024MB memory allocation and 10.00GB hard disk. While there may be various reasons for choosing various Hard disk file types, let’s proceed with the default VDI format and have the space be dynamically allocated. Let’s set the size of the virtual hard disk to 256GB as well (“the limit on the amount of file data that a VM will be able to store on the hard disk”).
Select the VM in the left pane, click on the “[Optical Drive] Empty” label in the Storage group on the right pane and select the “Choose Disk Image…” menu option. Select the downloaded Ubuntu ISO (ubuntu-16.04.3-desktop-amd64.iso in my case).
Start the virtual machine. Once Ubuntu boots up, you will be prompted to try it or install it. Let’s install it. Be sure to check the Download updates while installing Ubuntu option so that we don’t need to install updates later.
Since there are no OSs on the VM, proceed with the Erase disk and install Ubuntu option. You may need to pick a username and password before the installation will finally begin!

Once installation is complete, log onto the Ubuntu OS. Set up shared folders and enable the bidirectional clipboard as follows:

From the VirtualBox Devices menu, choose Insert Guest Additions CD image… A prompt will be displayed stating that “VBOXADDITIONS_5.1.26_117224” contains software intended to be automatically started. Just click on the Run button to continue and enter the root password. When the guest additions installer completes, press Return to close the window when prompted.
From the VirtualBox Devices menu, choose Shared Clipboard > Bidirectional. This enables two way clipboard functionality between the guest and host.
From the VirtualBox Devices menu, choose Shared Folders > Shared Folders Settings… Click on the add Shared Folder button and enter a path to a folder on the host that you would like to be shared. Optionally select Auto-mount and Make Permanent.
Open a terminal window. Enter these commands to mount the shared folder (assuming you named it vmshare in step 3 above):

mkdir ~/vmshare
sudo mount -t vboxsf -o uid=$UID,gid=$(id -g) vmshare ~/vmshare

To start installing the software we need, enter these commands:

sudo apt-get update
sudo apt install default-jdk

Next, get a copy of the Hadoop binaries from an Apache download mirror.

cd ~/Downloads
wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
mkdir ~/hadoop-2.7.4
tar xvzf hadoop-2.7.4.tar.gz -C ~/
cd ~/hadoop-2.7.4

The Apache Single Node Cluster Tutorial says to

export JAVA_HOME=/usr/java/latest

in the etc/hadoop/hadoop-env.sh script. On this Ubuntu setup, we end up needing to

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

If you skip setting up this export, running bin/hadoop will give this error:

Error: JAVA_HOME is not set and could not be found.

Note: I found that setting JAVA_HOME=/usr caused subsequent processes (like generating Eclipse projects from the source using mvn) to fail even though the steps in the tutorial worked just fine.

To verify that Hadoop is now configured and ready to run (in a non-distributed mode as a single Java process), execute the commands listed in the tutorial.

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep input output 'dfs[a-z.]+'
$ cat output/*

The bin/hadoop jar command runs the code in the .jar file, specifically the code in Grep.java, passing it the last 3 arguments. The output should resemble this summary:

If you’re interested in the details of this example (e.g. to inspect Grep.java), examine the src subfolder. If you don’t need the binaries and just want to look at the code, you can wget it from a download mirror, e.g.:

wget http://apache.cs.utah.edu/hadoop/common/stable/hadoop-2.7.4-src.tar.gz

Saint's Log

Setting up Apache Hadoop

Article info

Categories

Leave a Reply