Saturday, October 18, 2008

setting up hadoop

I have updated this post with the latest hadoop config changes, compatible with hadoop 1.1.2.

Hadoop is a distributed file system similar to the Google File System. It uses map-reduce to process large amounts of data on a large number of nodes. I will walk through a brief step-by-step process to set up hadoop on single and multiple nodes.

First, let's go with a single node:


  • Download hadoop.tar.gz from hadoop.apache.org.

  • You can set up hadoop to run under any user, but it is preferred that you create a separate user for running hadoop.
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop

  • Untar the hadoop.tar.gz file in the hadoop user's home directory
    [hadoop@linuxbox ~]$ tar -xvzf hadoop.tar.gz

  • Check the version of java - it should be at least java 1.5; java 1.6 is preferred
    $ java -version
    java version "1.6.0"
    Java(TM) SE Runtime Environment (build 1.6.0-b105)
    Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)

  • Hadoop needs to ssh to the local server, so you need to create keys on the local machine so that ssh does not require a password.
    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    fb:7a:cf:c5:c0:ec:30:a7:f9:eb:f0:a4:8b:da:6f:88 hadoop@linuxbox

    Now copy the public key to the authorized_keys file, so that ssh does not prompt for a password
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Now check
    $ ssh localhost
    Last login: Sat Oct 18 18:30:57 2008 from localhost
    $
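    If ssh localhost still prompts for a password, the usual culprit is file permissions: sshd silently ignores an authorized_keys file (or .ssh directory) that is writable by group or others. Tightening them is a safe step on any setup:

```shell
# sshd ignores keys kept in files with loose permissions
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```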

  • Change environment parameters in hadoop-env.sh
    export JAVA_HOME=/path/to/jdk_home_dir

  • Change configuration parameters in hadoop-site.xml.
In hadoop 1.x, hadoop-site.xml has been replaced by 3 files - core-site.xml, hdfs-site.xml and mapred-site.xml

In core-site.xml:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

In hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>

In mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

</configuration>


  • Format the name node
    $ cd /home/hadoop
    $ ./bin/hadoop namenode -format

    Check the output for errors.

  • Start single node cluster
    $ <HADOOP_INSTALL>/bin/start-all.sh
    This should start the namenode, datanode, jobtracker and tasktracker - all on one machine.

  • Check whether the nodes are up and running. The output should look similar to the following
    $ jps
    28982 JobTracker
    28737 DataNode
    28615 NameNode
    30570 Jps
    29109 TaskTracker
    28870 SecondaryNameNode
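    The same check can be scripted; this small loop (assuming jps, which ships with the JDK, is on the PATH) reports any expected daemon that is missing:

```shell
# report every expected daemon missing from the jps listing
for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    jps 2>/dev/null | grep -q "$daemon" || echo "$daemon is not running"
done
```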

  • In case of any error, please check the log files in the <HADOOP_INSTALL_DIR>/logs directory.

  • To stop the node, run
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh

We will skip running actual map-reduce tasks on the single-node setup and go ahead with a multi-node setup. Once we have 2 machines up and running, we will run some example map-reduce tasks on those nodes. So, let's proceed with the multi-node setup.

For the multi-node setup, you should have 2 machines up and running, both with the hadoop single-node setup on them. We will refer to the machines as master and slave, and assume that hadoop has been installed under the /home/hadoop directory on both nodes.


  • First, stop the single-node hadoop running on each of them.
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh

  • Edit the /etc/hosts file on both servers to set up the master and slave names, e.g.:
    aaa.bbb.ccc.ddd master
    www.xxx.yyy.zzz slave
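    With concrete (hypothetical, private-range) addresses, the two entries would look like:

```
192.168.0.1    master
192.168.0.2    slave
```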

  • Now, the master should be able to ssh to the slave server without a password, so copy the master's public key to the slave's authorized_keys file.
    master]$ ssh-keygen -t rsa
    master ~/.ssh]$ scp id_rsa.pub hadoop@slave:.ssh/
    slave ~/.ssh]$ cat id_rsa.pub >> authorized_keys

    Test the ssh setup
    master]$ ssh master
    master]$ ssh slave
    slave ]$ ssh slave

  • Change the <HADOOP_INSTALL_DIR>/conf/masters & <HADOOP_INSTALL_DIR>/conf/slaves files on the master server to add the master & slave hosts. The files should look like this.

    master ~/hadoop/conf]$ cat masters
    master


    The slaves file contains the hosts (one per line) where the hadoop slave daemons (datanodes and tasktrackers) will run. In our case we are running the datanode & tasktracker on both machines. In addition, the master server will also run the master-related services (namenode). Both master and slave will store data.

    master ~/hadoop/conf]$ cat slaves
    master
    slave

  • Now change the configuration (<HADOOP_INSTALL_DIR>/conf/hadoop-site.xml, or the corresponding core-site.xml, hdfs-site.xml and mapred-site.xml in hadoop 1.x) on all machines (master & slave). Set/change the following variables.

    Specify the host and port of the name node(master server).
    fs.default.name = hdfs://master:54310

    Specify the host and port of the job tracker (map reduce master).
    mapred.job.tracker = master:54311

    Specify the default number of machines each file block is replicated to; it should not exceed the number of datanodes. In our case it is 2 (master & slave - both act as slaves as well).
    dfs.replication = 2
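    In the hadoop 1.x split-file layout described earlier, the hdfs-side change is the replication factor in hdfs-site.xml on every node (a sketch; the fs.default.name and mapred.job.tracker settings go in core-site.xml and mapred-site.xml as before):

```xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
```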

  • You need to reformat the namenode and recreate the datanode storage directories. Do the following
    master ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred
    slave ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred

    Recreate/reformat the name node
    master ~/hadoop] $ ./bin/hadoop namenode -format

  • Start the cluster.
    [master ~/hadoop]$ ./bin/start-dfs.sh
    starting namenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-namenode-master.out
    master: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-master.out
    slave: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-slave.out
    master: starting secondarynamenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-master.out

    Check the processes running on the master node
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5319 Jps
    5117 DataNode
    4995 NameNode

    Check the processes running on slave node
    [slave ~/hadoop]$ jps
    22256 Jps
    22203 DataNode

    Check the logs on the slave for errors. <HADOOP_INSTALL_DIR>/logs/hadoop-hadoop-datanode-slave.log

  • Now start the mapreduce daemons:
    [master ~/hadoop]$ ./bin/start-mapred.sh
    starting jobtracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
    slave: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slave.out
    master: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-master.out

    Check the processes on master
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5117 DataNode
    5725 TaskTracker
    5598 JobTracker
    5853 Jps
    4995 NameNode

    And the processes on the slave
    [slave ~/hadoop]$ jps
    22735 TaskTracker
    22856 Jps
    22413 DataNode


To shut down the hadoop cluster, run the following on the master

[master ~/hadoop]$ ./bin/stop-mapred.sh # to stop mapreduce daemons
[master ~/hadoop]$ ./bin/stop-dfs.sh # to stop the hdfs daemons


Now, let's populate some files on hdfs and see if we can run some programs

Get the following files onto your local filesystem, in a test directory on the master

[master ~/test]$ wget http://www.gutenberg.org/files/20417/20417-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
[master ~/test]$ wget http://www.gutenberg.org/files/4300/4300-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext99/advsh12.txt


Populate the files in the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -copyFromLocal ../test/ test

Check the files on the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:37 /user/hadoop/test
[master ~/hadoop]$ ./bin/hadoop dfs -ls test
Found 4 items
-rw-r--r-- 2 hadoop supergroup 674425 2008-10-20 12:37 /user/hadoop/test/20417-8.txt
-rw-r--r-- 2 hadoop supergroup 1573048 2008-10-20 12:37 /user/hadoop/test/4300-8.txt
-rw-r--r-- 2 hadoop supergroup 1423808 2008-10-20 12:37 /user/hadoop/test/7ldvc10.txt
-rw-r--r-- 2 hadoop supergroup 590093 2008-10-20 12:37 /user/hadoop/test/advsh12.txt


Now let's run some test programs. Let's run the wordcount example and collect the output in the test-op directory.

[master ~/hadoop]$ ./bin/hadoop jar hadoop-0.18.1-examples.jar wordcount test test-op
08/10/20 12:49:45 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.JobClient: Running job: job_200810201146_0003
08/10/20 12:49:47 INFO mapred.JobClient: map 0% reduce 0%
08/10/20 12:49:52 INFO mapred.JobClient: map 50% reduce 0%
08/10/20 12:49:56 INFO mapred.JobClient: map 100% reduce 0%
08/10/20 12:50:02 INFO mapred.JobClient: map 100% reduce 16%
08/10/20 12:50:05 INFO mapred.JobClient: Job complete: job_200810201146_0003
08/10/20 12:50:05 INFO mapred.JobClient: Counters: 16
08/10/20 12:50:05 INFO mapred.JobClient: File Systems
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes read=4261374
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes written=949192
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes read=2044286
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes written=3757882
08/10/20 12:50:05 INFO mapred.JobClient: Job Counters
08/10/20 12:50:05 INFO mapred.JobClient: Launched reduce tasks=1
08/10/20 12:50:05 INFO mapred.JobClient: Launched map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Data-local map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Map-Reduce Framework
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input groups=88307
08/10/20 12:50:05 INFO mapred.JobClient: Combine output records=205890
08/10/20 12:50:05 INFO mapred.JobClient: Map input records=90949
08/10/20 12:50:05 INFO mapred.JobClient: Reduce output records=88307
08/10/20 12:50:05 INFO mapred.JobClient: Map output bytes=7077676
08/10/20 12:50:05 INFO mapred.JobClient: Map input bytes=4261374
08/10/20 12:50:05 INFO mapred.JobClient: Combine input records=853602
08/10/20 12:50:05 INFO mapred.JobClient: Map output records=736019
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input records=88307
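    As a sanity check on what the job is doing, wordcount's map (tokenize), shuffle (sort) and reduce (count per group) phases have a classic single-machine analogue in a Unix pipeline, shown here over the same downloaded texts (purely illustrative, run on the local copies, not part of the hadoop job):

```shell
# map = split into words, shuffle = sort, reduce = count each group
cat ~/test/*.txt | tr -s ' \t' '\n' | sort | uniq -c | sort -nr | head
```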

Now, let's check the output.

[master ~/hadoop]$ ./bin/hadoop dfs -ls test-op
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:45 /user/hadoop/test-op/_logs
-rw-r--r-- 2 hadoop supergroup 949192 2008-10-20 12:46 /user/hadoop/test-op/part-00000
[master ~/hadoop]$ ./bin/hadoop dfs -copyToLocal test-op/part-00000 test-op-part-00000
[master ~/hadoop]$ head test-op-part-00000
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
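The output is sorted by word, not by frequency. A quick follow-up with standard Unix tools ranks the most common words in the local copy made above:

```shell
# sort by the count column (2nd field), numerically, largest first
sort -k2,2nr test-op-part-00000 | head -n 10
```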


That's it... We have a live setup of hadoop running on two machines...

5 comments:

Mat said...

Good instructions. I just tried it on 2 different machines and it worked ! Thanks !

Anonymous said...

Jayant

Nice posting. Good Work. one question how to do i configure hadoop such so that i can see system.out.println statements in the log or std. Please help me out

thanks and regards
Joseph

Anonymous said...

nice..it helped me lot.thank u very munch

Anonymous said...

It really help alot..

Thanks yar..

himani said...

hey guys...can anybody tell me where hadoop-eng.sh file is located...could not find....please help