Saturday, October 18, 2008

setting up hadoop

I have updated this post with the latest hadoop config changes, compatible with hadoop 1.1.2.

Hadoop is a distributed file system similar to the Google File System. It uses map-reduce to process large amounts of data on a large number of nodes. I will walk through a brief step-by-step process to set up hadoop on single and multiple nodes.

First, let's go with a single node:


  • Download hadoop.tar.gz from hadoop.apache.org.

  • You can set up hadoop to run under any user, but it is preferred that you create a separate user for running hadoop.
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop

  • Untar the hadoop.tar.gz file in the hadoop user's home directory
    [hadoop@linuxbox ~]$ tar -xvzf hadoop.tar.gz

  • Check the version of java - it should be at least java 1.5; java 1.6 is preferred
    $ java -version
    java version "1.6.0"
    Java(TM) SE Runtime Environment (build 1.6.0-b105)
    Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)

  • Hadoop needs to ssh to the local server, so you need to create keys on the local machine so that ssh does not require a password.
    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    fb:7a:cf:c5:c0:ec:30:a7:f9:eb:f0:a4:8b:da:6f:88 hadoop@linuxbox

    Now copy the public key to the authorized_keys file, so that ssh does not prompt for a password
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Now check
    $ ssh localhost
    Last login: Sat Oct 18 18:30:57 2008 from localhost
    $
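    If ssh localhost still prompts for a password, the usual culprit is file permissions: sshd silently ignores an authorized_keys file (or .ssh directory) that is writable by group or others. Tightening them is a safe step on any setup:

```shell
# sshd ignores keys kept in files with loose permissions
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```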

  • Change environment parameters in hadoop-env.sh
    export JAVA_HOME=/path/to/jdk_home_dir

  • Change configuration parameters in hadoop-site.xml.
In hadoop 1.x, hadoop-site.xml has been replaced by 3 files - core-site.xml, hdfs-site.xml and mapred-site.xml

In core-site.xml:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

In hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>

In mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

</configuration>


  • Format the name node
    $ cd /home/hadoop
    $ ./bin/hadoop namenode -format

    Check the output for errors.

  • Start single node cluster
    $ <HADOOP_INSTALL>/bin/start-all.sh
    This should start the namenode, datanode, jobtracker and tasktracker - all on one machine.

  • Check whether the nodes are up and running. The output should look similar to the following
    $ jps
    28982 JobTracker
    28737 DataNode
    28615 NameNode
    30570 Jps
    29109 TaskTracker
    28870 SecondaryNameNode
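    The same check can be scripted; this small loop (assuming jps, which ships with the JDK, is on the PATH) reports any expected daemon that is missing:

```shell
# report every expected daemon missing from the jps listing
for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    jps 2>/dev/null | grep -q "$daemon" || echo "$daemon is not running"
done
```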

  • In case of any error, please check the log files in the <HADOOP_INSTALL_DIR>/logs directory.

  • To stop the node, run
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh

We will skip running actual map-reduce tasks on the single-node setup and go ahead with a multi-node setup. Once we have 2 machines up and running, we will run some example map-reduce tasks on those nodes. So, let's proceed with the multi-node setup.

For the multi-node setup, you should have 2 machines up and running, both with the hadoop single-node setup on them. We will refer to the machines as master and slave, and assume that hadoop has been installed under the /home/hadoop directory on both nodes.


  • First, stop the single-node hadoop running on each of them.
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh

  • Edit the /etc/hosts file on both servers to set up the master and slave names, e.g.:
    aaa.bbb.ccc.ddd master
    www.xxx.yyy.zzz slave
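    With concrete (hypothetical, private-range) addresses, the two entries would look like:

```
192.168.0.1    master
192.168.0.2    slave
```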

  • Now, the master should be able to ssh to the slave server without a password, so copy the master's public key to the slave's authorized_keys file.
    master]$ ssh-keygen -t rsa
    master ~/.ssh]$ scp id_rsa.pub hadoop@slave:.ssh/
    slave ~/.ssh]$ cat id_rsa.pub >> authorized_keys

    Test the ssh setup
    master]$ ssh master
    master]$ ssh slave
    slave ]$ ssh slave

  • Change the <HADOOP_INSTALL_DIR>/conf/masters & <HADOOP_INSTALL_DIR>/conf/slaves files on the master server to add the master & slave hosts. The files should look like this.

    master ~/hadoop/conf]$ cat masters
    master


    The slaves file contains the hosts (one per line) where the hadoop slave daemons (datanodes and tasktrackers) will run. In our case we are running the datanode & tasktracker on both machines. In addition, the master server will also run the master-related services (namenode). Both master and slave will store data.

    master ~/hadoop/conf]$ cat slaves
    master
    slave

  • Now change the configuration (<HADOOP_INSTALL_DIR>/conf/hadoop-site.xml, or the corresponding core-site.xml, hdfs-site.xml and mapred-site.xml in hadoop 1.x) on all machines (master & slave). Set/change the following variables.

    Specify the host and port of the name node(master server).
    fs.default.name = hdfs://master:54310

    Specify the host and port of the job tracker (map reduce master).
    mapred.job.tracker = master:54311

    Specify the default number of machines each file block is replicated to; it should not exceed the number of datanodes. In our case it is 2 (master & slave - both act as slaves as well).
    dfs.replication = 2
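    In the hadoop 1.x split-file layout described earlier, the hdfs-side change is the replication factor in hdfs-site.xml on every node (a sketch; the fs.default.name and mapred.job.tracker settings go in core-site.xml and mapred-site.xml as before):

```xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
```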

  • You need to reformat the namenode and recreate the datanode storage directories. Do the following
    master ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred
    slave ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred

    Recreate/reformat the name node
    master ~/hadoop] $ ./bin/hadoop namenode -format

  • Start the cluster.
    [master ~/hadoop]$ ./bin/start-dfs.sh
    starting namenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-namenode-master.out
    master: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-master.out
    slave: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-slave.out
    master: starting secondarynamenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-master.out

    Check the processes running on the master node
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5319 Jps
    5117 DataNode
    4995 NameNode

    Check the processes running on slave node
    [slave ~/hadoop]$ jps
    22256 Jps
    22203 DataNode

    Check the logs on the slave for errors. <HADOOP_INSTALL_DIR>/logs/hadoop-hadoop-datanode-slave.log

  • Now start the mapreduce daemons:
    [master ~/hadoop]$ ./bin/start-mapred.sh
    starting jobtracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
    slave: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slave.out
    master: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-master.out

    Check the processes on master
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5117 DataNode
    5725 TaskTracker
    5598 JobTracker
    5853 Jps
    4995 NameNode

    And the processes on the slave
    [slave ~/hadoop]$ jps
    22735 TaskTracker
    22856 Jps
    22413 DataNode


To shut down the hadoop cluster, run the following on the master

[master ~/hadoop]$ ./bin/stop-mapred.sh # to stop mapreduce daemons
[master ~/hadoop]$ ./bin/stop-dfs.sh # to stop the hdfs daemons


Now, let's populate some files on hdfs and see if we can run some programs

Get the following files onto your local filesystem, in a test directory on the master

[master ~/test]$ wget http://www.gutenberg.org/files/20417/20417-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
[master ~/test]$ wget http://www.gutenberg.org/files/4300/4300-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext99/advsh12.txt


Populate the files in the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -copyFromLocal ../test/ test

Check the files on the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:37 /user/hadoop/test
[master ~/hadoop]$ ./bin/hadoop dfs -ls test
Found 4 items
-rw-r--r-- 2 hadoop supergroup 674425 2008-10-20 12:37 /user/hadoop/test/20417-8.txt
-rw-r--r-- 2 hadoop supergroup 1573048 2008-10-20 12:37 /user/hadoop/test/4300-8.txt
-rw-r--r-- 2 hadoop supergroup 1423808 2008-10-20 12:37 /user/hadoop/test/7ldvc10.txt
-rw-r--r-- 2 hadoop supergroup 590093 2008-10-20 12:37 /user/hadoop/test/advsh12.txt


Now let's run some test programs. Let's run the wordcount example and collect the output in the test-op directory.

[master ~/hadoop]$ ./bin/hadoop jar hadoop-0.18.1-examples.jar wordcount test test-op
08/10/20 12:49:45 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.JobClient: Running job: job_200810201146_0003
08/10/20 12:49:47 INFO mapred.JobClient: map 0% reduce 0%
08/10/20 12:49:52 INFO mapred.JobClient: map 50% reduce 0%
08/10/20 12:49:56 INFO mapred.JobClient: map 100% reduce 0%
08/10/20 12:50:02 INFO mapred.JobClient: map 100% reduce 16%
08/10/20 12:50:05 INFO mapred.JobClient: Job complete: job_200810201146_0003
08/10/20 12:50:05 INFO mapred.JobClient: Counters: 16
08/10/20 12:50:05 INFO mapred.JobClient: File Systems
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes read=4261374
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes written=949192
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes read=2044286
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes written=3757882
08/10/20 12:50:05 INFO mapred.JobClient: Job Counters
08/10/20 12:50:05 INFO mapred.JobClient: Launched reduce tasks=1
08/10/20 12:50:05 INFO mapred.JobClient: Launched map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Data-local map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Map-Reduce Framework
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input groups=88307
08/10/20 12:50:05 INFO mapred.JobClient: Combine output records=205890
08/10/20 12:50:05 INFO mapred.JobClient: Map input records=90949
08/10/20 12:50:05 INFO mapred.JobClient: Reduce output records=88307
08/10/20 12:50:05 INFO mapred.JobClient: Map output bytes=7077676
08/10/20 12:50:05 INFO mapred.JobClient: Map input bytes=4261374
08/10/20 12:50:05 INFO mapred.JobClient: Combine input records=853602
08/10/20 12:50:05 INFO mapred.JobClient: Map output records=736019
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input records=88307
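    As a sanity check on what the job is doing, wordcount's map (tokenize), shuffle (sort) and reduce (count per group) phases have a classic single-machine analogue in a Unix pipeline, shown here over the same downloaded texts (purely illustrative, run on the local copies, not part of the hadoop job):

```shell
# map = split into words, shuffle = sort, reduce = count each group
cat ~/test/*.txt | tr -s ' \t' '\n' | sort | uniq -c | sort -nr | head
```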

Now, let's check the output.

[master ~/hadoop]$ ./bin/hadoop dfs -ls test-op
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:45 /user/hadoop/test-op/_logs
-rw-r--r-- 2 hadoop supergroup 949192 2008-10-20 12:46 /user/hadoop/test-op/part-00000
[master ~/hadoop]$ ./bin/hadoop dfs -copyToLocal test-op/part-00000 test-op-part-00000
[master ~/hadoop]$ head test-op-part-00000
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
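The output is sorted by word, not by frequency. A quick follow-up with standard Unix tools ranks the most common words in the local copy made above:

```shell
# sort by the count column (2nd field), numerically, largest first
sort -k2,2nr test-op-part-00000 | head -n 10
```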


That's it... We have a live setup of hadoop running on two machines...

5 comments:

Mat said...

Good instructions. I just tried it on 2 different machines and it worked ! Thanks !

Anonymous said...

Jayant

Nice posting. Good Work. one question how to do i configure hadoop such so that i can see system.out.println statements in the log or std. Please help me out

thanks and regards
Joseph

Anonymous said...

nice..it helped me lot.thank u very munch

Anonymous said...

It really help alot..

Thanks yar..

himani said...

hey guys...can anybody tell me where hadoop-eng.sh file is located...could not find....please help