Hadoop Environment
Environment information
Setup type: pseudo-distributed
JDK: Java 1.8, located at /Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home
Hadoop version: hadoop-3.2.3
Configure passwordless SSH login
1. Enable remote login on the machine.
2. Generate an SSH key pair:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
3. Append the public key to the authorized keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
4. Verify with ssh localhost; you should be able to log in without being prompted for a password.
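A quick non-interactive check; if the key setup is correct, this prints ok without asking for a password:
ssh localhost 'echo ok'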
Download Hadoop
1. Download URL: https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
2. Extract hadoop-3.2.3.tar.gz; my local path is ~/Documents/java/hadoop-3.2.3
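The equivalent commands, assuming ~/Documents/java as the install directory (any directory works, but later paths in this article reference this one):
cd ~/Documents/java
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar -xzf hadoop-3.2.3.tar.gz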
Pseudo-distributed setup
This article uses S3 as the filesystem store; the HDFS-backed approach is not covered here.
1. Edit hadoop-env.sh and add the JAVA_HOME setting below:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home
2. Edit core-site.xml and add the following:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
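With core-site.xml in place, S3 connectivity can be sanity-checked from the Hadoop CLI. Note that the S3A jars under share/hadoop/tools/lib must be on the shell classpath; one way is to set HADOOP_OPTIONAL_TOOLS="hadoop-aws" in hadoop-env.sh. A quick check, assuming the bucket already exists:
~/Documents/java/hadoop-3.2.3/bin/hadoop fs -ls s3a://mybucket/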
3. Edit hdfs-site.xml and add the following:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.https-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
</configuration>
4. Edit mapred-site.xml and add the following:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. Edit yarn-site.xml and add the following:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.resourcemanager.bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>yarn.nodemanager.bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>yarn.timeline-service.bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>
      /Users/sheen/Documents/java/hadoop-3.2.3/etc/hadoop,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/lib/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/lib/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/lib/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/*,
      /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>
Pitfalls and fixes
1. When Hadoop YARN uses S3 as its filesystem, submitting a Hive job fails with the following error:
java.io.IOException: Resource s3a://yarn/user/root/DistributedShell/application_1641533299713_0002/ExecScript.sh changed on src filesystem (expected 1641534006000, was 1641534011000)
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Cause: the exception is thrown from the org.apache.hadoop.yarn.util.FSDownload class in the hadoop-yarn-common module. Copying a file to S3 changes its modification timestamp (HDFS preserves it), so the timestamp check in verifyAndCopy fails:
private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    throw new IOException("Resource " + sCopy + " changed on src filesystem"
        + " - expected: "
        + "\"" + Times.formatISO8601(resource.getTimestamp()) + "\""
        + ", was: "
        + "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\""
        + ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy
          + " is not publicly accessible and as such cannot be part of the"
          + " public cache.");
    }
  }
  downloadAndUnpack(sCopy, destination);
}
Solution:
1. Download the Hadoop source from GitHub: https://github.com/apache/hadoop
2. Switch to the branch-3.2.3 branch and modify the org.apache.hadoop.yarn.util.FSDownload class in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common so the timestamp mismatch is logged instead of thrown:
private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    /*
    throw new IOException("Resource " + sCopy + " changed on src filesystem"
        + " - expected: "
        + "\"" + Times.formatISO8601(resource.getTimestamp()) + "\""
        + ", was: "
        + "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\""
        + ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
    */
    LOG.warn("Resource " + sCopy + " changed on src filesystem"
        + " - expected: "
        + "\"" + Times.formatISO8601(resource.getTimestamp()) + "\""
        + ", was: "
        + "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\""
        + ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\""
        + ". Stop showing exception here, use a warning instead.");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy
          + " is not publicly accessible and as such cannot be part of the"
          + " public cache.");
    }
  }
  downloadAndUnpack(sCopy, destination);
}
3. Rebuild and repackage hadoop-yarn-common (a build sketch follows below).
4. Copy the resulting hadoop-yarn-common-3.2.3.jar into hadoop-3.2.3/share/hadoop/yarn, replacing the original jar.
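A possible command sequence for steps 1-4, assuming Git and Maven are installed; the module path matches the layout of the Apache Hadoop repository:
# Clone the source and switch to the 3.2.3 release branch
git clone https://github.com/apache/hadoop.git
cd hadoop
git checkout branch-3.2.3
# (apply the FSDownload change described above)
# Build hadoop-yarn-common and the modules it depends on, skipping tests
mvn package -DskipTests -pl hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common -am
# Replace the jar shipped with the binary distribution
cp hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/hadoop-yarn-common-3.2.3.jar \
   ~/Documents/java/hadoop-3.2.3/share/hadoop/yarn/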
Hive Environment
Download Hive
1. Download URL: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
2. Extract apache-hive-3.1.2-bin.tar.gz; my local path is ~/Documents/java/apache-hive-3.1.2-bin
Hive configuration
1. Download the MySQL JDBC driver and place it in Hive's lib directory:
cd ~/Documents/java/apache-hive-3.1.2-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.16/mysql-connector-java-8.0.16.jar
2. Add the S3-enabled jars from Hadoop; symbolic links are used here:
mkdir ~/Documents/java/apache-hive-3.1.2-bin/auxlib
ln -s ~/Documents/java/hadoop-3.2.3/share/hadoop/tools/lib/*aws* ~/Documents/java/apache-hive-3.1.2-bin/auxlib/
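You can confirm the links resolve; in the 3.2.3 distribution the *aws* glob should match hadoop-aws-3.2.3.jar and the aws-java-sdk-bundle jar:
ls -l ~/Documents/java/apache-hive-3.1.2-bin/auxlib/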
3. Edit hive-env.sh (create it from hive-env.sh.template if it does not exist) and add the following:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home
export HADOOP_HOME=/Users/sheen/Documents/java/hadoop-3.2.3
export HIVE_HOME=/Users/sheen/Documents/java/apache-hive-3.1.2-bin
export HIVE_AUX_JARS_PATH=$HIVE_HOME/auxlib
4. Create a new hive-site.xml file and configure the following (note that & must be escaped as &amp; inside XML):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/hive/tmp</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/hive/tmp</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/hive/tmp</value>
  </property>
</configuration>
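Before initializing the metastore in the next step, it helps to confirm MySQL is running and reachable with the configured credentials. A quick check (the host, user, and password here must match hive-site.xml):
mysql -h 127.0.0.1 -P 3306 -u root -p123456 -e "SELECT VERSION();"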
5. Initialize the Hive metastore (the user name and password must match the values in hive-site.xml):
./bin/schematool -initSchema -dbType mysql -userName root -passWord 123456
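If initialization succeeded, schematool can report the installed schema version as a sanity check:
./bin/schematool -info -dbType mysql -userName root -passWord 123456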
6. Create a new core-site.xml under hive/conf and add the following:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <description>The credential provider type.</description>
    <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.access.key</name>
    <value>*****</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.secret.key</name>
    <value>******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>
One small but important detail:
Hadoop and Hive should use the same fs.defaultFS. And if fs.defaultFS names a bucket, e.g. s3a://mybucket, then the credential keys must use the per-bucket form: fs.s3a.secret.key must be configured as fs.s3a.bucket.mybucket.secret.key, and the access key likewise; otherwise the secret key (and access key) will not be found.
Start Hadoop and Hive
1. Start Hadoop. Any errors printed here come from HDFS and have no impact on this S3-backed setup; they can be ignored:
~/Documents/java/hadoop-3.2.3/sbin/start-all.sh
Then open localhost:8088 to check the YARN ResourceManager UI.
2. Start Hive:
~/Documents/java/apache-hive-3.1.2-bin/bin/hive
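As a final smoke test, a query that launches a YARN job exercises both the S3A filesystem and the patched hadoop-yarn-common; the table name below is just an example:
~/Documents/java/apache-hive-3.1.2-bin/bin/hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); INSERT INTO smoke_test VALUES (1); SELECT * FROM smoke_test;"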