Atlas in practice: automatically capturing HiveSQL and SparkSQL lineage

This week has been insanely busy, so now that I'm off work I'm taking the chance to catch up on documentation. I'd been meaning to sort out the lineage topic for a while; I had already read through the source code, but had nowhere to show it.

Some research suggested Atlas was a good fit, so I decided to take it for a spin and see whether everything could be wired up end to end. After a build full of hiccups, it actually worked: with the help of various open-source components, Atlas automatically captures table-level and column-level lineage for both HiveSQL and SparkSQL. Fantastic!

An environment like this is a real gift for anyone who wants to study this area or build on top of it: read the Atlas and Kyuubi source code, study HiveSQL and SparkSQL execution plans, and you're ready to start hacking.

First, the environment:

OS: macOS

Atlas: 2.3.0

Hadoop: 3.3.6

MySQL: 5.7.25

Hive: 3.1.3

Spark: 3.3.3

Kyuubi: 1.8.1

Other:

Maven: 3.8.7

JDK: 1.8.0_201

Deployment

1. Atlas Installation

  • Download and build

Website: https://atlas.apache.org/#/

Download the 2.3.0 source: https://dlcdn.apache.org/atlas/2.3.0/apache-atlas-2.3.0-sources.tar.gz

Atlas does not ship a binary package, so we have to build it ourselves.

Build docs: https://atlas.apache.org/#/BuildInstallation

If your Maven and JDK setup is healthy, building along the official docs should just work. The docs describe several build variants; I chose the one with embedded HBase and Solr, which saves installing HBase and Solr separately.

Unpack:
tar xvfz apache-atlas-2.3.0-sources.tar.gz

Build:
cd apache-atlas-sources-2.3.0/
mvn clean -DskipTests package -Pdist,embedded-hbase-solr
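
If the build dies with out-of-memory errors, raise Maven's heap first; this is the setting the official build docs recommend:

export MAVEN_OPTS="-Xms2g -Xmx2g"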
  • Install Atlas

Install docs: https://atlas.apache.org/#/Installation

The tarball we need is under apache-atlas-sources-2.3.0/distro/target:

apache-atlas-2.3.0-server.tar.gz

Unpack:
tar -zxvf apache-atlas-2.3.0-server.tar.gz

Edit atlas-env.sh:
cd apache-atlas-2.3.0

vim conf/atlas-env.sh

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home
export MANAGE_LOCAL_HBASE=true
export MANAGE_LOCAL_SOLR=true

Configure environment variables:
vim /etc/profile

export ATLAS_HOME=/xx/apache-atlas-2.3.0
export PATH=.:$MAVEN_HOME/bin:$JAVA_HOME/bin:$ATLAS_HOME/bin:$ZOOKEEPER_HOME/bin:$PROTOBUF_HOME/bin:$MYSQL_HOME/bin:$SPARK_HOME/sbin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the environment variables:
source /etc/profile
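
With the embedded HBase/Solr profile, Atlas can already be brought up at this point to verify the build. A minimal smoke test, assuming the default port 21000 and the default admin/admin login (the first start can take a few minutes while the embedded HBase and Solr initialize):

cd apache-atlas-2.3.0

Start (also launches the embedded HBase and Solr):
bin/atlas_start.py

Poke the REST API; it should return the server version:
curl -u admin:admin http://localhost:21000/api/atlas/admin/version

Stop:
bin/atlas_stop.py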

  • Configure the hive-hook

This time the goal is to get both the Hive and SparkSQL hooks working so that lineage is imported into Atlas automatically. Let's configure the hive-hook first.

The hive-hook tarball is also under apache-atlas-sources-2.3.0/distro/target:

apache-atlas-2.3.0-hive-hook.tar.gz

Unpack:
tar -zxvf apache-atlas-2.3.0-hive-hook.tar.gz

Copy the hook and hook-bin directories from the hive-hook package into the Atlas install directory:
cp -r apache-atlas-hive-hook-2.3.0/hook  apache-atlas-2.3.0
cp -r apache-atlas-hive-hook-2.3.0/hook-bin  apache-atlas-2.3.0

Edit the config file atlas-application.properties:
cd apache-atlas-2.3.0

vim conf/atlas-application.properties

######### Hive Hook Configs #######
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary
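
The properties above only tune the hook's behavior; Hive itself also has to be told to load the hook. Per the Atlas hive-hook docs, three more steps are needed once Hive is installed (section 4 below); the paths assume this post's layout:

vim $HIVE_HOME/conf/hive-site.xml

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

vim $HIVE_HOME/conf/hive-env.sh

export HIVE_AUX_JARS_PATH=/xx/apache-atlas-2.3.0/hook/hive

And copy the hook's client config next to Hive's own:

cp /xx/apache-atlas-2.3.0/conf/atlas-application.properties $HIVE_HOME/conf/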

2. Hadoop Installation

Website: https://hadoop.apache.org/

Download: hadoop-3.3.6.tar.gz

  • Configuration
Unpack:
tar -zxvf hadoop-3.3.6.tar.gz

Configure environment variables:
vim /etc/profile

export HADOOP_HOME=/xx/hadoop-3.3.6
export PATH=.:$MAVEN_HOME/bin:$JAVA_HOME/bin:$ATLAS_HOME/bin:$ZOOKEEPER_HOME/bin:$PROTOBUF_HOME/bin:$MYSQL_HOME/bin:$SPARK_HOME/sbin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the environment variables:
source /etc/profile

Edit the Hadoop config files:
cd $HADOOP_HOME/etc/hadoop/

vim core-site.xml
<configuration>
 <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/xx/data/hadoop/tmp</value>
 </property>
<property>
        <name>fs.defaultFS</name>
         <value>hdfs://localhost:9000</value>
</property>
</configuration>

vim hdfs-site.xml
<configuration>
<property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/xx/data/hadoop/tmp/dfs/name</value>  <!-- namenode metadata -->
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/xx/data/hadoop/tmp/dfs/data</value>  <!-- datanode block data -->
    </property>
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>
</configuration>

vim mapred-site.xml
<configuration>
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
      <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
<property>
    <name>yarn.application.classpath</name>
    <!-- replace /xx/soft with your own install path -->
    <value>
        /xx/soft/hadoop-3.3.6/etc/hadoop,
        /xx/soft/hadoop-3.3.6/share/hadoop/common/lib/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/common/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/hdfs,
        /xx/soft/hadoop-3.3.6/share/hadoop/hdfs/lib/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/hdfs/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/mapreduce/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/yarn,
        /xx/soft/hadoop-3.3.6/share/hadoop/yarn/lib/*,
        /xx/soft/hadoop-3.3.6/share/hadoop/yarn/*
    </value>
  </property>
</configuration>


vim yarn-site.xml
<configuration>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.scheduler.minimum-allocation-mb</name>
   <value>2048</value>
   <description>default value is 1024</description>
</property>
</configuration>

Passwordless SSH:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
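
A quick check that the key is picked up; this should log you in without prompting for a password:

ssh localhost
exit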
  • Start and stop
Format the namenode:
hdfs namenode -format

Start:
start-dfs.sh  
start-yarn.sh

Check processes with jps:

xx@C02D83S2ML85 hadoop % jps
29808 SecondaryNameNode
29664 DataNode
70835 Jps
30004 ResourceManager
30100 NodeManager
30884 Atlas
32917 Launcher
30549 HMaster
38870 Master
38921 Worker
29565 NameNode
1869 

Web UI:
http://localhost:9870/dfshealth.html#tab-overview

Stop:
stop-dfs.sh
stop-yarn.sh

Other Hadoop commands to test:
hdfs dfs -mkdir /wordcount
hdfs dfs -put ~/testdata/wordcount /wordcount
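
To exercise YARN end to end as well, you can run the wordcount example that ships with Hadoop against the directory just uploaded (the /wordcount-out output path is just an illustration):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /wordcount /wordcount-out
hdfs dfs -cat /wordcount-out/part-r-00000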

3. MySQL Installation

Website: https://dev.mysql.com/

Downloads: https://downloads.mysql.com/archives/community/

Download mysql-5.7.25-macos10.14-x86_64.dmg

Double-click to install; afterwards, reset the password (the auto-generated one is too complex to remember).
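
A minimal way to do the reset from the command line: log in once with the generated password, then change it (123456 here matches the Hive config later in this post; MySQL 5.7 syntax):

mysql -u root -p

Inside the mysql shell:
ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';

There is no need to create the hive database by hand: the JDBC URL in hive-site.xml below uses createDatabaseIfNotExist=true.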

Login test:

mysql -u root -p123456

4. Hive Installation

Website: https://hive.apache.org/index.html

Download apache-hive-3.1.3-bin.tar.gz from https://dlcdn.apache.org/hive/

  • Configuration
Unpack:
tar -zxvf  apache-hive-3.1.3-bin.tar.gz

Configure environment variables:
vim /etc/profile

export HIVE_HOME=/xx/apache-hive-3.1.3-bin
export PATH=.:$MAVEN_HOME/bin:$JAVA_HOME/bin:$ATLAS_HOME/bin:$ZOOKEEPER_HOME/bin:$PROTOBUF_HOME/bin:$MYSQL_HOME/bin:$SPARK_HOME/sbin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the environment variables:
source /etc/profile


Configure hive-site.xml:
cd apache-hive-3.1.3-bin/conf

cp hive-default.xml.template hive-site.xml

Then set these properties in hive-site.xml:
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>

   <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>


Download the MySQL driver jar mysql-connector-java-5.1.49.jar (I couldn't find a 5.7.25 version; Connector/J is versioned independently of the server, and 5.1.49 runs fine against MySQL 5.7).
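
Two follow-up steps before Hive will start cleanly: put the driver on Hive's classpath, then initialize the metastore schema with the schematool shipped with Hive. A sketch, assuming the jar sits in the current directory:

cp mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/

Create the metastore tables in the MySQL database configured above:
schematool -dbType mysql -initSchema

Smoke test:
hive -e "show databases;"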


	
