I have updated the post with the latest hadoop config changes, compatible with hadoop 1.1.2.
Hadoop is a distributed file system similar to the Google File System. It uses MapReduce to process large amounts of data across a large number of nodes. I will give a brief step-by-step process to set up Hadoop on single and multiple nodes.
First, let's go with a single node:
- Download hadoop.tar.gz from hadoop.apache.org.
- You can set up hadoop under any user, but it is preferred that you create a separate user for running hadoop.
sudo addgroup hadoop
sudo adduser -g hadoop hadoop
- Untar the hadoop.tar.gz file in the "hadoop" user's home directory.
[hadoop@linuxbox ~]$ tar -xvzf hadoop.tar.gz
- Check the version of Java. It should be at least Java 1.5; Java 1.6 is preferred.
$ java -version
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
- Hadoop needs to ssh to the local server, so you need to create keys on the local machine so that ssh does not ask for a password.
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
fb:7a:cf:c5:c0:ec:30:a7:f9:eb:f0:a4:8b:da:6f:88 hadoop@linuxbox
Now copy the public key to the authorized_keys file so that ssh does not require a password:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
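If ssh still asks for a password after this, the usual culprit is permissions on the .ssh directory. Tightening them is a safe, optional step (assuming the default ~/.ssh location):
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys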
Now check that ssh works without a password:
$ ssh localhost
Last login: Sat Oct 18 18:30:57 2008 from localhost
$
- Change the environment parameters in conf/hadoop-env.sh:
export JAVA_HOME=/path/to/jdk_home_dir
- Change the configuration parameters in conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml (the single hadoop-site.xml of older releases is split into these three files).
in core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
in hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
in mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
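If xmllint happens to be installed, a quick optional sanity check confirms the edited files are still well-formed XML (not required by hadoop itself):
$ xmllint --noout conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml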
- Format the name node
$ cd /home/hadoop
$ ./bin/hadoop namenode -format
Check the output for errors.
- Start the single node cluster:
$ <HADOOP_INSTALL_DIR>/bin/start-all.sh
This should start the namenode, datanode, jobtracker and tasktracker, all on one machine.
- Check whether the daemons are up and running. The output should look approximately like this:
$ jps
28982 JobTracker
28737 DataNode
28615 NameNode
30570 Jps
29109 TaskTracker
28870 SecondaryNameNode
- In case of any error, check the log files in the <HADOOP_INSTALL_DIR>/logs directory.
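For example, to watch the namenode log while the daemons come up (the log files follow the hadoop-<user>-<daemon>-<host>.log naming pattern, so adjust the pattern to your machine):
$ tail -f <HADOOP_INSTALL_DIR>/logs/hadoop-hadoop-namenode-*.log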
- To stop the node, run
$ <HADOOP_INSTALL_DIR>/bin/stop-all.sh
We will skip running actual map-reduce jobs on the single node setup and go ahead with a multi-node setup. Once we have 2 machines up and running, we will run some example map-reduce jobs on those nodes. So, let's proceed with the multi-node setup.
For the multi-node setup, you should have 2 machines up and running, both with the hadoop single node setup on them. We will refer to the machines as master and slave, and assume that hadoop has been installed under the /home/hadoop directory on both nodes.
- First, stop the single node hadoop running on them.
$ <HADOOP_INSTALL_DIR>/bin/stop-all.sh
- Edit the /etc/hosts file on both servers to set up the master and slave names. For example:
aaa.bbb.ccc.ddd master
www.xxx.yyy.zzz slave
- Now, the master should be able to ssh to the slave server without a password, so copy the master's public key to the slave.
master]$ ssh-keygen -t rsa
master ~/.ssh]$ scp id_rsa.pub hadoop@slave:.ssh/
slave ~/.ssh]$ cat id_rsa.pub >> authorized_keys
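On systems that ship ssh-copy-id, the same key copy can be done in one step; this is just an optional shortcut for the scp/cat pair above:
master]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave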
Test the ssh setup
master]$ ssh master
master]$ ssh slave
slave ]$ ssh slave
- On the master server, change the <HADOOP_INSTALL_DIR>/conf/masters & <HADOOP_INSTALL_DIR>/conf/slaves files to add the master & slave hosts. The files should look like this:
master ~/hadoop/conf]$ cat masters
master
The slaves file contains the hosts (one per line) where the hadoop slave daemons (datanodes and tasktrackers) will run. In our case we are running the datanode & tasktracker on both machines. In addition, the master server will also run the master-related services (namenode). Both master and slave will store data.
master ~/hadoop/conf]$ cat slaves
master
slave
- Now change the configuration files (<HADOOP_INSTALL_DIR>/conf/core-site.xml, hdfs-site.xml & mapred-site.xml) on all machines (master & slave). Set/change the following variables.
In core-site.xml, specify the host and port of the name node (master server).
fs.default.name = hdfs://master:54310
In mapred-site.xml, specify the host and port of the job tracker (the map-reduce master).
mapred.job.tracker = master:54311
In hdfs-site.xml, specify the number of machines a block should be replicated to before it is considered available. It should not be more than the number of datanodes. In our case it is 2 (master & slave, since both also act as datanodes).
dfs.replication = 2
- You need to clean out the old datanode data and reformat the namenode. Do the following:
master ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred
slave ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred
Recreate/reformat the name node
master ~/hadoop]$ ./bin/hadoop namenode -format
- Start the cluster.
[master ~/hadoop]$ ./bin/start-dfs.sh
starting namenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-namenode-master.out
master: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-master.out
slave: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-slave.out
master: starting secondarynamenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-master.out
Check the processes running on the master node
[master ~/hadoop]$ jps
5249 SecondaryNameNode
5319 Jps
5117 DataNode
4995 NameNode
Check the processes running on slave node
[slave ~/hadoop]$ jps
22256 Jps
22203 DataNode
Check the logs on the slave for errors: <HADOOP_INSTALL_DIR>/logs/hadoop-hadoop-datanode-slave.log
- Now start the mapreduce daemons:
[master ~/hadoop]$ ./bin/start-mapred.sh
starting jobtracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
slave: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slave.out
master: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-master.out
Check the processes on master
[master ~/hadoop]$ jps
5249 SecondaryNameNode
5117 DataNode
5725 TaskTracker
5598 JobTracker
5853 Jps
4995 NameNode
And the processes on the slave
[slave ~/hadoop]$ jps
22735 TaskTracker
22856 Jps
22413 DataNode
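Optionally, you can also confirm that both datanodes have registered with the namenode; the dfsadmin report lists each live datanode along with its capacity:
[master ~/hadoop]$ ./bin/hadoop dfsadmin -report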
To shut down the hadoop cluster, run the following on the master:
[master ~/hadoop]$ ./bin/stop-mapred.sh # to stop mapreduce daemons
[master ~/hadoop]$ ./bin/stop-dfs.sh # to stop the hdfs daemons
Now, let's put some files on HDFS and see if we can run some programs.
Get the following files onto your local filesystem in a test directory on the master:
[master ~/test]$ wget http://www.gutenberg.org/files/20417/20417-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
[master ~/test]$ wget http://www.gutenberg.org/files/4300/4300-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext99/advsh12.txt
Copy the files into the HDFS file system:
[master ~/hadoop]$ ./bin/hadoop dfs -copyFromLocal ../test/ test
Check the files on the HDFS file system:
[master ~/hadoop]$ ./bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:37 /user/hadoop/test
[master ~/hadoop]$ ./bin/hadoop dfs -ls test
Found 4 items
-rw-r--r-- 2 hadoop supergroup 674425 2008-10-20 12:37 /user/hadoop/test/20417-8.txt
-rw-r--r-- 2 hadoop supergroup 1573048 2008-10-20 12:37 /user/hadoop/test/4300-8.txt
-rw-r--r-- 2 hadoop supergroup 1423808 2008-10-20 12:37 /user/hadoop/test/7ldvc10.txt
-rw-r--r-- 2 hadoop supergroup 590093 2008-10-20 12:37 /user/hadoop/test/advsh12.txt
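You can also peek at one of the files directly from HDFS to confirm the content made it across (optional; uses the same file names as above, and head closing the pipe early may print a harmless complaint):
[master ~/hadoop]$ ./bin/hadoop dfs -cat test/20417-8.txt | head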
Now let's run some test programs. Let's run the wordcount example and collect the output in the test-op directory (the examples jar is named after the hadoop version you installed):
[master ~/hadoop]$ ./bin/hadoop jar hadoop-0.18.1-examples.jar wordcount test test-op
08/10/20 12:49:45 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.JobClient: Running job: job_200810201146_0003
08/10/20 12:49:47 INFO mapred.JobClient: map 0% reduce 0%
08/10/20 12:49:52 INFO mapred.JobClient: map 50% reduce 0%
08/10/20 12:49:56 INFO mapred.JobClient: map 100% reduce 0%
08/10/20 12:50:02 INFO mapred.JobClient: map 100% reduce 16%
08/10/20 12:50:05 INFO mapred.JobClient: Job complete: job_200810201146_0003
08/10/20 12:50:05 INFO mapred.JobClient: Counters: 16
08/10/20 12:50:05 INFO mapred.JobClient: File Systems
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes read=4261374
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes written=949192
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes read=2044286
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes written=3757882
08/10/20 12:50:05 INFO mapred.JobClient: Job Counters
08/10/20 12:50:05 INFO mapred.JobClient: Launched reduce tasks=1
08/10/20 12:50:05 INFO mapred.JobClient: Launched map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Data-local map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Map-Reduce Framework
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input groups=88307
08/10/20 12:50:05 INFO mapred.JobClient: Combine output records=205890
08/10/20 12:50:05 INFO mapred.JobClient: Map input records=90949
08/10/20 12:50:05 INFO mapred.JobClient: Reduce output records=88307
08/10/20 12:50:05 INFO mapred.JobClient: Map output bytes=7077676
08/10/20 12:50:05 INFO mapred.JobClient: Map input bytes=4261374
08/10/20 12:50:05 INFO mapred.JobClient: Combine input records=853602
08/10/20 12:50:05 INFO mapred.JobClient: Map output records=736019
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input records=88307
Now, let's check the output.
[master ~/hadoop]$ ./bin/hadoop dfs -ls test-op
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:45 /user/hadoop/test-op/_logs
-rw-r--r-- 2 hadoop supergroup 949192 2008-10-20 12:46 /user/hadoop/test-op/part-00000
[master ~/hadoop]$ ./bin/hadoop dfs -copyToLocal test-op/part-00000 test-op-part-00000
[master ~/hadoop]$ head test-op-part-00000
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
That's it... We have a live setup of hadoop running on two machines...