Integrating Hadoop and Hive with S3 on macOS: a pseudo-distributed setup

2022-04-26 16:59:01

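Hadoop environment

Environment details

Deployment mode: pseudo-distributed

JDK: Java 1.8, installed at /Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home

Hadoop version: hadoop-3.2.3

Configure passwordless SSH login

1. Enable Remote Login so that the machine accepts SSH connections (on macOS: System Preferences > Sharing > Remote Login).

2. Create an SSH key pair: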

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

3. Append the public key to the authorized keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

4. Run ssh localhost to verify. The setup is correct if you can log in without being asked for a password.
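
If ssh localhost still asks for a password, file permissions are a common culprit, because sshd ignores keys kept in locations that other users can write to. The helper below is a minimal sketch and not part of the original steps: check_passwordless_ssh is a hypothetical name, and it assumes the key was installed under ~/.ssh as described above. It tightens the permissions and then attempts a non-interactive login.

```shell
# Hypothetical helper: verify key-based SSH login to a host (default: localhost).
check_passwordless_ssh() {
  host="${1:-localhost}"
  # sshd (StrictModes) rejects keys when these paths are writable by others.
  chmod 700 "$HOME/.ssh" || return 1
  chmod 600 "$HOME/.ssh/authorized_keys" || return 1
  # BatchMode=yes makes ssh fail instead of prompting for a password.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "passwordless SSH to $host works"
  else
    echo "passwordless SSH to $host failed" >&2
    return 1
  fi
}
```

Calling check_passwordless_ssh localhost after step 3 should print the success message. A failure usually means Remote Login is disabled or the public key never reached authorized_keys.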

Download Hadoop

1. Download URL: https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz

2. Extract hadoop-3.2.3.tar.gz. On my machine it is stored at ~/Documents/java/hadoop-3.2.3

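Pseudo-distributed setup

This article uses S3 as the file system storage; the HDFS-based setup is not covered here.

1. Edit hadoop-env.sh and add the following JAVA_HOME setting: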

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home

2. Edit core-site.xml and add the following:

<configuration>
   <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
   <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property> 
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
      <name>hadoop.tmp.dir</name>
      <value>/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
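
Before moving on, it can help to confirm that Hadoop actually resolves the bucket as its default file system. The following is a rough sketch under a few assumptions: verify_default_fs is a hypothetical helper name, and the hdfs and hadoop commands from hadoop-3.2.3/bin are assumed to be on the PATH. If the listing fails with a ClassNotFoundException for S3AFileSystem, the hadoop-aws JAR from share/hadoop/tools/lib is probably not on the classpath; one way to add it in Hadoop 3 is to set HADOOP_OPTIONAL_TOOLS="hadoop-aws" in hadoop-env.sh.

```shell
# Hypothetical helper: check that fs.defaultFS points at S3A and is reachable.
verify_default_fs() {
  # `hdfs getconf` only reads the configuration; no daemon has to be running.
  default_fs="$(hdfs getconf -confKey fs.defaultFS)" || return 1
  case "$default_fs" in
    s3a://*) echo "default filesystem: $default_fs" ;;
    *) echo "expected an s3a:// URI, got: $default_fs" >&2; return 1 ;;
  esac
  # Listing the bucket root exercises credentials, endpoint and the S3A classes.
  hadoop fs -ls "$default_fs/"
}
```

Running verify_default_fs with the configuration above should report s3a://mybucket and then list the bucket root.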

3. Edit hdfs-site.xml and add the following:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.servicerpc-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.http-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.https-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>false</value>
    </property>
</configuration>

4. Edit mapred-site.xml and add the following:

<configuration>
    <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
     </property>
</configuration>

5. Edit yarn-site.xml and add the following:

<configuration>
 
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
  
     
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property> 
    <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.resourcemanager.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.nodemanager.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.timeline-service.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>
            /Users/sheen/Documents/java/hadoop-3.2.3/etc/hadoop,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/lib/*
        </value>
    </property>
</configuration>

Pitfalls

1. When YARN uses S3 as its file system, submitting a Hive job fails with the following error:

java.io.IOException: Resource s3a://yarn/user/root/DistributedShell/application_1641533299713_0002/ExecScript.sh changed on src filesystem (expected 1641534006000, was 1641534011000
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

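Cause: the exception is thrown by the org.apache.hadoop.yarn.util.FSDownload class in the hadoop-yarn-common module. On S3 the modification timestamp of a file can change while it is being copied (this does not happen on HDFS), so the timestamp check in verifyAndCopy fails: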

private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    throw new IOException("Resource " + sCopy + " changed on src filesystem" +
        " - expected: " +
        "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
        ", was: " +
        "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
        ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy +
          " is not publicly accessible and as such cannot be part of the" +
          " public cache.");
    }
  }
 
  downloadAndUnpack(sCopy, destination);
}

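Solution:

1. Download the Hadoop source code from GitHub: https://github.com/apache/hadoop

2. Check out the branch-3.2.3 branch and modify the org.apache.hadoop.yarn.util.FSDownload class in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common so that a timestamp mismatch is logged as a warning instead of throwing an exception: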

private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    /*
    throw new IOException("Resource " + sCopy + " changed on src filesystem" +
        " - expected: " +
        "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
        ", was: " +
        "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
        ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
    */
    LOG.warn("Resource " + sCopy + " changed on src filesystem" +
            " - expected: " +
            "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
            ", was: " +
            "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
            ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"" +
            ". Stop showing exception here, use a warning instead.");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy +
          " is not publicly accessible and as such cannot be part of the" +
          " public cache.");
    }
  }
 
  downloadAndUnpack(sCopy, destination);
}

3. Rebuild and package hadoop-yarn-common.

4. Copy the resulting hadoop-yarn-common-3.2.3.jar into hadoop-3.2.3/share/hadoop/yarn, replacing the original JAR.
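
Since step 4 overwrites a JAR shipped with the distribution, keeping a copy of the original makes it easy to roll back. The function below is a sketch only: install_patched_jar is a hypothetical helper, and the build output path in the usage note assumes the standard Maven layout (a target directory inside the module), which the original steps do not spell out. The backup gets an .orig suffix, so the share/hadoop/yarn/* classpath wildcard, which only picks up .jar files, will not load it.

```shell
# Hypothetical helper: copy a rebuilt JAR into place, keeping the original as *.orig.
install_patched_jar() {
  new_jar="$1"
  target_dir="$2"
  target="$target_dir/$(basename "$new_jar")"
  [ -f "$new_jar" ] || { echo "not found: $new_jar" >&2; return 1; }
  # Keep the first original only; a rerun must not overwrite it with a patched copy.
  if [ -f "$target" ] && [ ! -f "$target.orig" ]; then
    cp "$target" "$target.orig" || return 1
  fi
  cp "$new_jar" "$target"
}
```

A typical call, with both paths being assumptions about the local layout, would be install_patched_jar hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/hadoop-yarn-common-3.2.3.jar ~/Documents/java/hadoop-3.2.3/share/hadoop/yarn. Restart YARN afterwards so the NodeManager loads the patched class.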

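Hive environment

Download Hive

1. Download Hive from https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

2. Extract apache-hive-3.1.2-bin.tar.gz. On my machine it is stored at ~/Documents/java/apache-hive-3.1.2-bin

Hive configuration

1. Download the MySQL JDBC connector and place it in Hive's lib directory: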

cd ~/Documents/java/apache-hive-3.1.2-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.16/mysql-connector-java-8.0.16.jar

2. Add the S3 support JARs from Hadoop; symbolic links are used here:

mkdir ~/Documents/java/apache-hive-3.1.2-bin/auxlib
ln -s ~/Documents/java/hadoop-3.2.3/share/hadoop/tools/lib/*aws* ~/Documents/java/apache-hive-3.1.2-bin/auxlib/
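
Symbolic links break silently if the Hadoop directory is later moved, so a quick check of auxlib can save some debugging. This is a sketch rather than a required step: check_s3_jars is a hypothetical helper, and it assumes the Hadoop distribution ships the S3A connector as hadoop-aws-*.jar together with aws-java-sdk-bundle-*.jar, which is what the *aws* pattern above is meant to pick up.

```shell
# Hypothetical helper: confirm that the S3A JARs in a directory exist and resolve.
check_s3_jars() {
  dir="$1"
  for prefix in hadoop-aws aws-java-sdk-bundle; do
    found=no
    for jar in "$dir/$prefix"-*.jar; do
      # -e follows symlinks, so a dangling link counts as missing.
      if [ -e "$jar" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "missing or broken: $prefix-*.jar in $dir" >&2
      return 1
    fi
  done
  echo "S3A jars present in $dir"
}
```

For the layout used in this article the call would be check_s3_jars ~/Documents/java/apache-hive-3.1.2-bin/auxlib.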

3. Edit hive-env.sh and add the following:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home
export HADOOP_HOME=/Users/sheen/Documents/java/hadoop-3.2.3
export HIVE_HOME=/Users/sheen/Documents/java/apache-hive-3.1.2-bin
export HIVE_AUX_JARS_PATH=$HIVE_HOME/auxlib

4. Create hive-site.xml with the following content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/hive/tmp</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/hive/tmp</value>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/hive/tmp</value>
</property>
</configuration>

5. Initialize the Hive metastore schema:

./bin/schematool -initSchema -dbType mysql -userName root -passWord 123456
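
Running -initSchema a second time against an existing schema fails, so it can be convenient to check first. The wrapper below is a minimal sketch under stated assumptions: init_metastore_if_needed is a hypothetical name, and it relies on schematool -info returning a non-zero exit status when no schema is present, which is worth confirming on your installation. The user name and password passed in should be the ones configured in hive-site.xml.

```shell
# Hypothetical helper: initialize the metastore schema only when it is absent.
init_metastore_if_needed() {
  schematool_bin="$1"   # e.g. ./bin/schematool
  db_user="$2"
  db_pass="$3"
  # -info prints the schema version; assumed to fail when no schema exists yet.
  if "$schematool_bin" -dbType mysql -info \
      -userName "$db_user" -passWord "$db_pass" >/dev/null 2>&1; then
    echo "metastore schema already initialized"
  else
    "$schematool_bin" -dbType mysql -initSchema \
      -userName "$db_user" -passWord "$db_pass"
  fi
}
```

From the Hive directory this would be called as init_metastore_if_needed ./bin/schematool followed by the database user and password. Keep in mind that a password given on the command line is visible in the process list while the command runs.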

6. Create core-site.xml under hive/conf with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <description>The credential provider type.</description>
    <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.access.key</name>
    <value>*****</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.secret.key</name>
    <value>******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
   <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property>  
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>

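Note:

It is best to set fs.defaultFS to the same value for Hadoop and Hive. In my setup, when fs.defaultFS includes a bucket (for example s3a://mybucket), Hive only found the secret key when it was configured per bucket, as fs.s3a.bucket.mybucket.secret.key instead of fs.s3a.secret.key. The same applies to the access key.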
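
The per-bucket property name follows a fixed pattern: the bucket name is inserted after fs.s3a.bucket. and the remainder of the option follows. As a small illustration, the hypothetical helper below (per_bucket_option is not part of Hadoop or Hive) derives that name from an s3a:// URI and a base option.

```shell
# Hypothetical helper: build the per-bucket form of an fs.s3a.* option name.
per_bucket_option() {
  fs_uri="$1"                 # e.g. s3a://mybucket or s3a://mybucket/warehouse
  option="$2"                 # e.g. fs.s3a.secret.key
  bucket="${fs_uri#s3a://}"   # strip the scheme
  bucket="${bucket%%/*}"      # strip any path after the bucket name
  echo "fs.s3a.bucket.$bucket.${option#fs.s3a.}"
}
```

For example, per_bucket_option s3a://mybucket fs.s3a.secret.key prints fs.s3a.bucket.mybucket.secret.key, the name used in the core-site.xml above.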

Start Hadoop and Hive

1. Start Hadoop. Any errors printed at this point come from HDFS, which this setup does not use, so they can be ignored:

~/Documents/java/hadoop-3.2.3/sbin/start-all.sh

Then open localhost:8088 to check that the YARN ResourceManager web UI is reachable.
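
The ResourceManager can take a few seconds to come up, so instead of refreshing the browser you can poll it from the shell. This is a sketch, not part of the original procedure: wait_for_resourcemanager is a hypothetical helper, it assumes curl is installed, and it queries the ResourceManager REST endpoint /ws/v1/cluster/info on the default port 8088.

```shell
# Hypothetical helper: poll the ResourceManager until it answers or attempts run out.
wait_for_resourcemanager() {
  url="${1:-http://localhost:8088/ws/v1/cluster/info}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit status; -s silences progress output.
    if curl -sf "$url" >/dev/null; then
      echo "ResourceManager is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "ResourceManager did not answer at $url" >&2
  return 1
}
```

Calling wait_for_resourcemanager right after start-all.sh waits up to roughly 30 seconds with the defaults. If it times out, the ResourceManager log under the Hadoop logs directory is the first place to look.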

2. Start Hive:

~/Documents/java/apache-hive-3.1.2-bin/bin/hive
