Since I don't have that many machines available, I deployed a Hadoop cluster on my own virtual machine. This is known as a pseudo-distributed cluster. Either way, this post records the process of deploying Hadoop and the problems I ran into, and then tests the environment with a simple program.
1. Install Java, download the Hadoop release, and configure the environment variables.
Set JAVA_HOME to the Java installation directory, and add the directory containing the Hadoop binaries to the system PATH so that the hadoop commands can be run directly from the shell. Hadoop 2.6.0 is used here.
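As a concrete sketch, the variables can be appended to ~/.bashrc like this (both paths below are assumptions for illustration; substitute your actual Java and Hadoop install directories):

```shell
# NOTE: example paths only -- replace with your real installation directories.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
export HADOOP_HOME=$HOME/workplace/hadoop/hadoop-2.6.0
# Put the hadoop launcher scripts (bin) and daemon scripts (sbin) on PATH.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

After `source ~/.bashrc`, running `hadoop version` should print the version banner if the paths are right.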
2. Set up SSH
SSH is needed because Hadoop starts the daemon processes on each machine in the slaves list over ssh. Although we call this a pseudo-distributed installation, Hadoop still starts up exactly as a cluster would; it is just that every "machine" in the cluster happens to be the same host. sshd listens on port 22 by default, so checking that port tells you whether the ssh server is installed and running. For Hadoop to launch processes over ssh, passwordless ssh login must be configured. Without it, running
ssh user@127.0.0.1
directly (make sure both the ssh server and client are installed on this machine) looks like this:
linuxidc@linuxidc:~/workplace$ ssh linuxidc@127.0.0.1
linuxidc@127.0.0.1's password:
Welcome to Ubuntu 13.10 (GNU/Linux 3.11.0-12-generic i686)
* Documentation: https://help.ubuntu.com/
Last login: Mon Jan 19 15:03:01 2015 from localhost
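As mentioned above, looking at port 22 tells you whether an ssh server is running. A quick sketch using bash's built-in /dev/tcp redirection (`ss -tln` or `netstat -tln` work just as well):

```shell
# Probe local port 22; prints whether an ssh server appears to be listening.
if (exec 3<>/dev/tcp/127.0.0.1/22) 2>/dev/null; then
    echo "port 22 open: sshd appears to be running"
else
    echo "port 22 closed: install/start the ssh server first"
fi
```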
In other words, you are asked for the user's password on every connection. To configure passwordless login, run the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
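Once those two commands have been run, passwordless login can be verified non-interactively (a sketch: BatchMode makes ssh fail instead of prompting for a password, so the script never hangs):

```shell
# Exits the "if" branch only when key-based login works with no password.
if ssh -o BatchMode=yes -o StrictHostKeyChecking=no 127.0.0.1 true 2>/dev/null; then
    echo "passwordless ssh OK"
else
    echo "passwordless ssh NOT configured yet"
fi
```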
The first command generates a key pair: -t is the key type (dsa-based authentication here), -P is the passphrase (empty here), and -f is the path the generated key file is written to. The second command appends the generated public key to this host's authorized_keys file, after which connecting to the host over ssh no longer requires a password. You can verify this by running the ssh command above again.
3. Configure Hadoop's environment file etc/hadoop/hadoop-env.sh
This is Hadoop's environment configuration file; set JAVA_HOME in it, making sure it points to the Java installation directory.
4. Configure etc/hadoop/core-site.xml (HOST below stands for your machine's address)
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/linuxidc/workplace/hadoop/data</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://HOST:9000</value>
    </property>
</configuration>
5. Configure the MapReduce file etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>HOST:9001</value>
    </property>
</configuration>
6. Configure the HDFS file etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/linuxidc/workplace/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/linuxidc/workplace/hadoop/hdfs/data</value>
    </property>
</configuration>
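The name/data directories referenced in these configs can be created ahead of time. This is only a sketch: Hadoop normally creates them during formatting, but pre-creating them under the right user avoids permission surprises. The base path is the example one used throughout this post ($HOME here plays the role of /home/linuxidc):

```shell
# Example base directory from the configs above; adjust to your setup.
BASE=${BASE:-$HOME/workplace/hadoop}
# hadoop.tmp.dir, dfs.namenode.name.dir and dfs.datanode.data.dir respectively.
mkdir -p "$BASE/data" "$BASE/hdfs/name" "$BASE/hdfs/data"
chmod 755 "$BASE/hdfs/name" "$BASE/hdfs/data"
ls -ld "$BASE/hdfs/name" "$BASE/hdfs/data"
```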
7. Format the HDFS filesystem, then start all modules.
hadoop namenode -format
This command formats the HDFS filesystem. Then run ./sbin/start-all.sh, at which point a problem appears:
Starting namenodes on [Java HotSpot(TM) Client VM warning: You have loaded library /home/linuxidc/workplace/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Some digging shows this is a platform mismatch: the Hadoop build I downloaded ships 64-bit native libraries, while my machine is 32-bit, so Hadoop has to be compiled by hand for this platform. Once everything starts cleanly, the environment can be tested with the word-count example; the English passage below (the Hadoop 2.6.0 release overview) will serve as the test input.
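To see which side of the mismatch you are on, you can compare the machine's architecture with the native library's, and after a successful start-all.sh the Java daemons should show up in jps. A sketch (the library path is the example install location from the warning above):

```shell
# Machine architecture: i686 = 32-bit, x86_64 = 64-bit.
uname -m
# Architecture the bundled native library was compiled for.
NATIVE="$HOME/workplace/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0"
[ -f "$NATIVE" ] && file "$NATIVE" || echo "native library not found at $NATIVE"
# After a clean start, NameNode, DataNode and SecondaryNameNode appear here.
command -v jps >/dev/null && jps || echo "jps not on PATH (it ships with the JDK)"
```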
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1.
Here is a short overview of the major features and improvements.
Common Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server. A new Hadoop metrics sink that allows writing directly to Graphite. Specification work related to the Hadoop Compatible Filesystem (HCFS) effort. HDFS Support for POSIX-style filesystem extended attributes. See the user documentation for more details. Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API. The NFS gateway received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports. The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript. YARN YARN's REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs. The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos. The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.
First, create a new file and copy the English passage above into it:
cat > test
Then put the newly created test file onto HDFS as the MapReduce input:
./bin/hadoop fs -put ./test /wordCountInput
This HDFS command places the local file test at /wordCountInput under the HDFS root. Check that it worked with ls:
linuxidc@linuxidc-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -ls /
Found 1 items
-rw-r--r--   1 linuxidc supergroup       1400 2015-01-20 13:05 /wordCountInput
The MapReduce test programs are bundled in share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar, a jar built from several example programs; we use its wordcount program to count word occurrences.
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /wordCountInput /wordCountOutput
This runs the wordcount program from the jar as a MapReduce job. Its input is the HDFS file /wordCountInput (if that path were a directory, every file under it would be read), and its output goes to the HDFS directory /wordCountOutput. A lot of INFO messages are printed while it runs; here is part of the output:
15/01/20 13:09:29 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/01/20 13:09:29 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/20 13:09:29 INFO input.FileInputFormat: Total input paths to process : 1
15/01/20 13:09:30 INFO mapreduce.JobSubmitter: number of splits:1
15/01/20 13:09:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local810038734_0001
...
15/01/20 13:09:33 INFO mapred.MapTask: Starting flush of map output
15/01/20 13:09:33 INFO mapred.MapTask: Spilling map output
...
15/01/20 13:09:34 INFO mapreduce.Job:  map 100% reduce 0%
...
15/01/20 13:09:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local810038734_0001_r_000000_0
15/01/20 13:09:35 INFO mapred.LocalJobRunner: reduce task executor complete.
15/01/20 13:09:35 INFO mapreduce.Job:  map 100% reduce 100%
15/01/20 13:09:36 INFO mapreduce.Job: Job job_local810038734_0001 completed successfully
15/01/20 13:09:36 INFO mapreduce.Job: Counters: 38
...
    File Input Format Counters
        Bytes Read=1400
    File Output Format Counters
        Bytes Written=1416
Now look at the output directory:
linuxidc@linuxidc-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -ls /wordCountOutput
Found 2 items
-rw-r--r--   1 linuxidc supergroup          0 2015-01-20 13:09 /wordCountOutput/_SUCCESS
-rw-r--r--   1 linuxidc supergroup       1416 2015-01-20 13:09 /wordCountOutput/part-r-00000
There are two files in this directory; part-r-00000 holds the actual result:
linuxidc@linuxidc-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -cat /wordCountOutput/part-r-00000
Hadoop	5
The	5
a	4
and	7
for	4
is	5
now	3
proxy	2
release	3
the	9
to	4
user	3
Only the words that occur more than twice, together with their counts in the file above, are shown here. The job counted the word occurrences in the file correctly; from here we can write MapReduce programs of our own to implement all kinds of computations.
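As a local sanity check on the result, the same word count can be reproduced with a plain shell pipeline. This is only a sketch of what wordcount computes, not Hadoop itself, and it uses a tiny made-up input file rather than the release-overview text:

```shell
# A tiny stand-in input; in the walkthrough above this would be the
# release-overview text saved in the file named "test".
printf 'the cat and the dog and the bird\n' > /tmp/wc-demo.txt
# Map-phase analogue: emit one word per line.
# Reduce-phase analogue: count identical adjacent words, sort by count.
tr -s '[:space:]' '\n' < /tmp/wc-demo.txt | sort | uniq -c | sort -rn
```

The pipeline prints each distinct word with its count, highest count first, which is exactly the shape of the part-r-00000 output above.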