Environment:
- hadoop 3.2.0
- flink 1.11.4-bin-scala_2.11
- hudi 0.8.0
This article uses the component versions above to insert data into a Hudi data lake with Flink. Before following the steps below, make sure the Hadoop cluster is up and running.
Make sure the HADOOP_CLASSPATH environment variable is set.
For open-source (Apache) Hadoop, set HADOOP_CLASSPATH as follows:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/client/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/etc/hadoop/*
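Before going further, it can help to confirm that the variable is visible in the shell that will later start Flink. The following is a minimal sketch, not part of the original setup: `check_hadoop_classpath` is a hypothetical helper name, and the second half assumes the `hadoop` command is on `PATH`, in which case `hadoop classpath` can generate the value instead of listing directories by hand.

```shell
# Hypothetical helper (not part of Hadoop or Flink): fails if
# HADOOP_CLASSPATH is unset or empty in the current shell.
check_hadoop_classpath() {
  if [ -z "${HADOOP_CLASSPATH:-}" ]; then
    echo "HADOOP_CLASSPATH is not set" >&2
    return 1
  fi
  echo "HADOOP_CLASSPATH is set"
}

# Alternative to listing directories by hand: let the hadoop CLI
# print its own classpath (only runs if hadoop is on PATH).
if command -v hadoop >/dev/null 2>&1; then
  HADOOP_CLASSPATH="$(hadoop classpath)"
  export HADOOP_CLASSPATH
fi
```

Calling `check_hadoop_classpath` right before `bin/start-cluster.sh` catches the common case where the export was done in a different shell session.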
The HDFS used in this article is a high-availability cluster, reachable at hdfs://mycluster
Install a local Flink cluster
Download Flink
wget https://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-1.11.4/flink-1.11.4-bin-scala_2.11.tgz
tar zxvf flink-1.11.4-bin-scala_2.11.tgz
Download the Hudi-related jars. Four jars need to go into Flink's lib directory: hudi-flink-bundle_2.11-0.8.0.jar, commons-logging-1.2.jar, htrace-core-3.1.0-incubating.jar and htrace-core4-4.1.0-incubating.jar.
cd flink-1.11.4/lib
wget https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-flink-bundle_2.11/0.8.0/hudi-flink-bundle_2.11-0.8.0.jar
wget https://repo1.maven.org/maven2/commons-logging/commons-logging/1.2/commons-logging-1.2.jar
wget https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
wget https://repo1.maven.org/maven2/org/apache/htrace/htrace-core4/4.1.0-incubating/htrace-core4-4.1.0-incubating.jar
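A failed or interrupted download typically shows up only later as a class-loading error, so a quick presence check may save time. This is a sketch under assumptions: `check_hudi_jars` is a hypothetical helper, and it only tests that each of the four files named above exists and is non-empty; it does not verify checksums.

```shell
# Hypothetical helper: checks that the four jars named above exist
# and are non-empty in the given lib directory (default: ./lib).
check_hudi_jars() {
  lib_dir="${1:-lib}"
  missing=0
  for jar in \
    hudi-flink-bundle_2.11-0.8.0.jar \
    commons-logging-1.2.jar \
    htrace-core-3.1.0-incubating.jar \
    htrace-core4-4.1.0-incubating.jar
  do
    if [ ! -s "$lib_dir/$jar" ]; then
      echo "missing or empty: $lib_dir/$jar" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

From inside the lib directory this would be called as `check_hudi_jars .`; from the Flink home directory, `check_hudi_jars lib`.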
Edit the configuration files
From the Flink home directory (flink-1.11.4), run vi conf/workers and enter localhost four times:
localhost
localhost
localhost
localhost
Run vi conf/flink-conf.yaml and set taskmanager.numberOfTaskSlots to 4:
taskmanager.numberOfTaskSlots: 4
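The two manual edits can also be scripted, which is convenient when the setup has to be repeated. The sketch below assumes a stock Flink conf directory; `configure_flink` is a hypothetical helper name. It overwrites conf/workers with four localhost entries and rewrites (or appends) the taskmanager.numberOfTaskSlots line.

```shell
# Hypothetical helper: applies the two configuration changes to a
# Flink conf directory (default: ./conf).
configure_flink() {
  conf_dir="${1:-conf}"
  conf="$conf_dir/flink-conf.yaml"

  # One TaskManager is started per line in the workers file.
  printf '%s\n' localhost localhost localhost localhost > "$conf_dir/workers"

  # Replace the existing setting if present, otherwise append it.
  if grep -q '^taskmanager\.numberOfTaskSlots:' "$conf" 2>/dev/null; then
    sed 's/^taskmanager\.numberOfTaskSlots:.*/taskmanager.numberOfTaskSlots: 4/' \
      "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
  else
    echo 'taskmanager.numberOfTaskSlots: 4' >> "$conf"
  fi
}
```

With four entries in workers and four slots per TaskManager, the local cluster offers 4 × 4 = 16 task slots in total.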
Start the Flink cluster
bin/start-cluster.sh
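To confirm the cluster actually came up, one option is to query Flink's REST API. This is a sketch, assuming `curl` is installed and the REST endpoint listens on the default port 8081 (the original does not change `rest.port`); `flink_is_up` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the Flink REST endpoint answers
# on /overview within 5 seconds. Base URL defaults to localhost:8081.
flink_is_up() {
  base_url="${1:-http://localhost:8081}"
  curl -sf --max-time 5 "$base_url/overview" >/dev/null
}
```

A typical use is `flink_is_up && echo "cluster is running"`; opening the same address in a browser shows the Flink web UI.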
Start the Flink SQL client
Run the following command to start the Flink SQL client:
./bin/sql-client.sh embedded -j ./lib/hudi-flink-bundle_2.11-0.8.0.jar shell
Create table t1
create table t1(
uuid VARCHAR(20),
name VARCHAR(20),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector'='hudi',
'path' = 'hdfs://mycluster/tmp/t1',
'table.type' = 'MERGE_ON_READ'
);
Insert data into table t1
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
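Update data
Inserting a row whose uuid already exists (id1) updates that record; here the age for id1 changes from 23 to 27:
insert into t1 values ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');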
Query data
select * from t1 limit 10;
The query result was shown as a screenshot, which is not preserved here.
Check the table's partitions on HDFS
Run:
hdfs dfs -ls /tmp/t1
The listing was shown as a screenshot, which is not preserved here.
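To pull just the partition names out of that listing, the output can be filtered. The sketch below is an assumption-laden convenience, not something from the original: `list_partitions` is a hypothetical helper that reads `hdfs dfs -ls` output on stdin, keeps directory entries, and drops `.hoodie`, the directory where Hudi keeps table metadata. It assumes partition directories sit directly under the table path.

```shell
# Hypothetical helper: reads `hdfs dfs -ls` output on stdin and prints
# the names of sub-directories, skipping Hudi's .hoodie metadata directory.
list_partitions() {
  awk '$1 ~ /^d/ { n = split($NF, parts, "/"); print parts[n] }' \
    | grep -v '^\.hoodie$'
}

# Usage, assuming the hdfs CLI is available:
#   hdfs dfs -ls /tmp/t1 | list_partitions
```

For the data inserted above, the expectation is one directory per distinct value of the partition column.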
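This is an original article by xiaozhch5 of the blog 从大数据到人工智能 (From Big Data to AI), licensed under CC 4.0 BY-SA. When reposting, please include the link to the original and this notice.
Original link: https://cloud.tencent.com/developer/article/1936504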