Installing the JDK
Java is the main prerequisite for Hadoop. First, verify that Java is present on the system by running the "java -version" command, shown below.
java -version
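The same check can be wrapped in a small guard so that later steps stop early with a clear message. This is a minimal sketch; the helper name require_cmd is made up for illustration and is not part of Java or Hadoop.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper: it succeeds only if the
# named command can be found on the current PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if require_cmd java; then
  # java -version writes to stderr; merge it into stdout so it can be piped.
  java -version 2>&1
else
  echo "java not found on PATH; install a JDK first"
fi
```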
If Java is not yet installed, follow the steps below to install it.
1. Download Java (JDK <latest version> - X64.tar.gz) from the following link:
https://www.oracle.com/java/technologies/downloads/
2. Extract the archive:
tar zxf jdk-version-linux-x64.tar.gz
3. Set the environment variables:
export JAVA_HOME=/usr/local/jdk_path
export PATH=$PATH:$JAVA_HOME/bin
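Exports typed into a terminal last only for that shell session. One way to keep the setup repeatable is a small function that appends the JDK's bin directory to PATH only when it is missing, sketched here. The function name path_append is hypothetical, and /usr/local/jdk_path is the same placeholder used in the step above.

```shell
#!/bin/sh
# path_append (hypothetical helper): add a directory to PATH only if it
# is not already listed, so sourcing this file repeatedly is harmless.
path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

JAVA_HOME=/usr/local/jdk_path    # placeholder path, replace with the real JDK directory
export JAVA_HOME
path_append "$JAVA_HOME/bin"
export PATH
```

Placing these lines in a shell startup file such as ~/.bashrc (an assumption about the shell in use) would apply them to every new session.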
Downloading Hadoop
Hadoop releases are available at:
https://hadoop.apache.org/releases.html
This article uses Hadoop 3.0.3 as the example for setting up Hadoop.
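Hadoop Operating Modes
Hadoop supports the following three modes:
1. Local/standalone mode: once Hadoop has been downloaded, it is configured in standalone mode by default and runs as a single Java process.
2. Pseudo-distributed mode: a distributed simulation on a single machine. Each Hadoop daemon, such as those of HDFS, YARN, and MapReduce, runs as a separate Java process. This mode is very useful for development.
3. Fully distributed mode: a fully distributed cluster of at least two machines.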
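One setting that differs between these modes is fs.defaultFS: its built-in default is file:///, the local file system, while pseudo-distributed and fully distributed setups point it at an hdfs:// URI. The sketch below classifies a value accordingly. The function name fs_kind is hypothetical, the labels are only rough hints, and the hdfs getconf call is skipped when the hdfs command is not installed.

```shell
#!/bin/sh
# fs_kind (hypothetical helper): map an fs.defaultFS value to a rough label.
fs_kind() {
  case "$1" in
    file:*) echo "local file system (standalone mode)" ;;
    hdfs:*) echo "HDFS (pseudo-distributed or fully distributed mode)" ;;
    *)      echo "other file system" ;;
  esac
}

if command -v hdfs >/dev/null 2>&1; then
  # Ask the installed Hadoop for its effective setting.
  fs_kind "$(hdfs getconf -confKey fs.defaultFS)"
else
  # No Hadoop on PATH: show the label for the built-in default value.
  fs_kind "file:///"
fi
```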
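Standalone Mode
In standalone mode everything runs in a single JVM, with no separate daemon processes. This mode is well suited to running MapReduce programs during development, because it is easy to test and debug.
# Change to the directory where Hadoop was extracted
cd hadoop
# Show the Hadoop version
bin/hadoop version
# Version used in this article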
Hadoop 3.0.3
Source code repository https://yjzhangal@git-wip-us.apache.org/repos/asf/hadoop.git -r 37fd7d752db73d984dc31e0cdfd590d252f5e075
Compiled by yzhang on 2018-05-31T17:12Z
Compiled with protoc 2.5.0
From source with checksum 736cdcefa911261ad56d2d120bf1fa
This command was run using /Users/michael/Downloads/install/hadoop-3.0.3/share/hadoop/common/hadoop-common-3.0.3.jar
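By default Hadoop runs in non-distributed (local) mode and needs no further configuration. Hadoop also ships with a rich set of examples.
Let's look at a simple one. The Hadoop installation includes an example MapReduce jar that demonstrates basic MapReduce functionality and can be used for computations such as estimating the value of π or counting the words in files.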
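The bundled examples can be tried directly from the command line. The sketch below assumes that HADOOP_HOME points at the extracted Hadoop 3.0.3 directory and that the hadoop command is on PATH; when either is missing it only prints the command it would have run. The arguments passed to pi (2 map tasks, 5 samples per map) are arbitrary small values chosen for a quick run.

```shell
#!/bin/sh
# Path to the examples jar, assuming the Hadoop 3.0.3 layout used in this article.
EXAMPLES_JAR="${HADOOP_HOME:-.}/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar"

if command -v hadoop >/dev/null 2>&1 && [ -f "$EXAMPLES_JAR" ]; then
  # Estimate pi with 2 map tasks and 5 samples per map.
  hadoop jar "$EXAMPLES_JAR" pi 2 5
else
  echo "would run: hadoop jar $EXAMPLES_JAR pi 2 5"
fi
```

Running the jar without a program name should print the list of available example programs, which is a quick way to see what else is included.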
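1. Create the input directory and the input files. The input directory can be created anywhere.
# Copy the .txt files from the extracted Hadoop directory into the input folder.
# These files come from the Hadoop installation's home directory. For experiments, a different or larger set of files can be used.
mkdir input
cp $HADOOP_HOME/*.txt input
ls -l input
# Output: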
-rw-r--r--@ 1 michael staff 147066 4 4 19:35 LICENSE.txt
-rw-r--r--@ 1 michael staff 20891 4 4 19:35 NOTICE.txt
-rw-r--r--@ 1 michael staff 1366 4 4 19:35 README.txt
2. Start a Hadoop job that counts the words in all files in the input directory:
# Run the MapReduce program:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount input output
# Output (excerpt):
2022-04-04 19:38:21,711 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2022-04-04 19:38:21,711 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2022-04-04 19:38:21,711 INFO mapred.MapTask: soft limit at 83886080
2022-04-04 19:38:21,711 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2022-04-04 19:38:21,711 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2022-04-04 19:38:21,716 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2022-04-04 19:38:21,771 INFO mapred.LocalJobRunner:
2022-04-04 19:38:21,771 INFO mapred.MapTask: Starting flush of map output
2022-04-04 19:38:21,771 INFO mapred.MapTask: Spilling map output
2022-04-04 19:38:21,771 INFO mapred.MapTask: bufstart = 0; bufend = 228751; bufvoid = 104857600
2022-04-04 19:38:21,771 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26128588(104514352); length = 85809/6553600
2022-04-04 19:38:21,865 INFO mapred.MapTask: Finished spill 0
2022-04-04 19:38:21,882 INFO mapred.Task: Task:attempt_local1039131726_0001_m_000000_0 is done. And is in the process of committing
2022-04-04 19:38:21,885 INFO mapred.LocalJobRunner: map
2022-04-04 19:38:21,885 INFO mapred.Task: Task 'attempt_local1039131726_0001_m_000000_0' done.
2022-04-04 19:38:21,891 INFO mapred.Task: Final Counters for attempt_local1039131726_0001_m_000000_0: Counters: 18
File System Counters
FILE: Number of bytes read=463549
FILE: Number of bytes written=832511
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2745
Map output records=21453
Map output bytes=228751
Map output materialized bytes=46198
Input split bytes=133
Combine input records=21453
Combine output records=2960
Spilled Records=2960
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
Total committed heap usage (bytes)=196608000
File Input Format Counters
Bytes Read=147066
2022-04-04 19:38:21,891 INFO mapred.LocalJobRunner: Finishing task: attempt_local1039131726_0001_m_000000_0
2022-04-04 19:38:21,892 INFO mapred.LocalJobRunner: Starting task: attempt_local1039131726_0001_m_000001_0
2022-04-04 19:38:21,893 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2022-04-04 19:38:21,893 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2022-04-04 19:38:21,894 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
2022-04-04 19:38:21,894 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
2022-04-04 19:38:21,895 INFO mapred.MapTask: Processing split: file:/Users/michael/Downloads/install/hadoop-3.0.3/input/NOTICE.txt:0 20891
2022-04-04 19:38:21,955 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2022-04-04 19:38:21,955 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2022-04-04 19:38:21,955 INFO mapred.MapTask: soft limit at 83886080
2022-04-04 19:38:21,955 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2022-04-04 19:38:21,955 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2022-04-04 19:38:21,957 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2022-04-04 19:38:21,963 INFO mapred.LocalJobRunner:
2022-04-04 19:38:21,963 INFO mapred.MapTask: Starting flush of map output
2022-04-04 19:38:21,963 INFO mapred.MapTask: Spilling map output
2022-04-04 19:38:21,963 INFO mapred.MapTask: bufstart = 0; bufend = 29431; bufvoid = 104857600
2022-04-04 19:38:21,963 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26204732(104818928); length = 9665/6553600
2022-04-04 19:38:21,981 INFO mapred.MapTask: Finished spill 0
2022-04-04 19:38:22,038 INFO mapred.Task: Task:attempt_local1039131726_0001_m_000001_0 is done. And is in the process of committing
2022-04-04 19:38:22,039 INFO mapred.LocalJobRunner: map
2022-04-04 19:38:22,039 INFO mapred.Task: Task 'attempt_local1039131726_0001_m_000001_0' done.
2022-04-04 19:38:22,039 INFO mapred.Task: Final Counters for attempt_local1039131726_0001_m_000001_0: Counters: 18
File System Counters
FILE: Number of bytes read=484856
FILE: Number of bytes written=847775
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=583
Map output records=2417
Map output bytes=29431
Map output materialized bytes=15232
Input split bytes=132
Combine input records=2417
Combine output records=844
Spilled Records=844
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=48
Total committed heap usage (bytes)=400556032
File Input Format Counters
Bytes Read=20891
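3. The job does the necessary processing and saves its output in the file output/part-r-00000, which can be viewed with:
cat output/*
This lists every word together with its total count across all files in the input directory.
""AS 2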
"AS 23
"AS-IS" 1
"Adaptation" 1
"COPYRIGHTS 1
"Collection" 1
"Collective 1
"Contribution" 2
"Contributor" 2
"Creative 1
"Derivative 2
"Distribute" 1
"French 2
"GCC 1
"JDOM" 2
"JDOM", 1
"Java 2
"LICENSE"). 2
"Legal 1
"License" 1
"License"); 3
"Licensed 1
"Licensor" 3
"Losses") 1
"NOTICE" 1
"Not 1
"Object" 1
"Original 2
"Program" 1
"Publicly 1
"Recipient" 1
"Reproduce" 1