Learning notes from 《Hive性能调优实战》 (Hive Performance Tuning in Action).
Hive version used in this article: 2.3.2
1. Tuning Approach
- Analyze and locate performance bottlenecks: optimize storage, optimize the execution process, optimize job scheduling
- Set up performance monitoring and alerting, using open-source tools such as Grafana and Ambari
2. Environment Setup
- Build a pseudo-distributed environment with the Cloudera quickstart Docker image. I installed Docker Desktop on Windows and work from inside WSL.
docker pull cloudera/quickstart:latest
docker run --hostname=quickstart.cloudera --privileged=true -it -p 80:80 -p 8888:8888 -p 8088:8088 cloudera/quickstart /usr/bin/docker-quickstart
hdfs dfsadmin -report
- Alternatively, deploy the Hive and Kafka services with docker-compose
3. Viewing Execution Plans
3.1 explain
Basic syntax: explain <sql statement>
explain
select s_age, count(1) num from student_tb_orc
where s_age<30 and s_name like '%红%'
group by s_age
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce                  # execution engine
      Map Operator Tree:        # map-side operations
          TableScan
            alias: student_tb_orc   # the table being scanned
            Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator     # filter operation
              predicate: ((s_age < 30) and (s_name like '%红%')) (type: boolean)
              # the predicate used for filtering
              Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
              # estimated amount of data to be processed
              Select Operator   # column projection on the previous result set
                expressions: s_age (type: bigint)   # the selected column
                outputColumnNames: _col0            # output column name
                Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                Group By Operator                   # group-by aggregation
                  aggregations: count(1)            # the aggregation function used
                  keys: _col0 (type: bigint)        # the grouping column
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator            # output of the aggregated result sent to the reducers
                    key expressions: _col0 (type: bigint)   # the aggregation key, s_age
                    sort order: +                   # whether the output is sorted: + ascending, - descending
                    Map-reduce partition columns: _col0 (type: bigint)  # partition column
                    Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint) # the aggregation value, count(1)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1666666 Data size: 2535932318 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false   # whether the output file is compressed
            Statistics: Num rows: 1666666 Data size: 2535932318 Basic stats: COMPLETE Column stats: NONE
            table:              # info of the table being written
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                # input/output file formats
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                # the SerDe, i.e. how data is serialized and deserialized when read
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
The plan above describes the execution logic of the MapReduce engine.
To see the execution plan under the Spark engine instead:
set hive.execution.engine=spark;
explain
select s_age, count(1)
from student_tb_txt
group by s_age;
3.2 explain extended
Shows more detailed information than plain explain.
- It also used to print the abstract syntax tree; from Hive 3.0 on, use explain ast for that
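For example (a sketch reusing the earlier query; the extended output adds table locations and other low-level properties to the plan):
explain extended
select s_age, count(1) num from student_tb_orc
where s_age < 30 and s_name like '%红%'
group by s_age;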
3.3 explain dependency
Shows the data sources a SQL statement needs.
hive> explain dependency
> select s_age, count(1) num from student_tb_orc
> where s_age < 30 and s_name like '%red%' group by s_age;
OK
{"input_tables":[{"tablename":"default@student_tb_orc","tabletype":"MANAGED_TABLE"}],"input_partitions":[]}
# database@table; table type: managed table; input partitions (empty here)
Time taken: 2.01 seconds, Fetched: 1 row(s)
hive> explain dependency
> select s_age, count(1) num from student_orc_partition
> where s_age < 30 and s_name like '%red%' group by s_age;
OK
{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"}]
# # 库 @ 表 @ 分区列的值}
Time taken: 0.396 seconds, Fetched: 1 row(s)
Use cases:
- Ruling out abnormal partition data
- Understanding exactly what data a query reads
explain dependency
select a.s_no from student_orc_partition a
left join student_orc_partition_1 b
on a.s_no=b.s_no
and a.s_age=b.s_age
and a.s_age>=22
and a.s_age<=23;
代码语言:javascript复制{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=20"},
{"partitionName":"default@student_orc_partition_1@s_age=21"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"},
{"partitionName":"default@student_orc_partition_1@s_age=24"},
{"partitionName":"default@student_orc_partition_1@s_age=25"},
{"partitionName":"default@student_orc_partition_1@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=27"}]}
With a LEFT JOIN b, a filter in the ON clause that targets the left table a has no pruning effect at all: every partition of both tables is scanned.
explain dependency
select a.s_no from student_orc_partition a
left join student_orc_partition_1 b
on a.s_no=b.s_no
and a.s_age=b.s_age
and b.s_age>=22
and b.s_age<=23;
代码语言:javascript复制{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"}]}
With a LEFT JOIN b where the ON clause filters the right table b, only the specified partitions of b are scanned, but a still requires a full table scan.
Recommendation: filter out unneeded data and columns as early as possible.
explain dependency
select a.s_no
from (select s_no, s_age
from student_orc_partition
where s_age >= 22
and s_age <= 23) a
left join student_orc_partition_1 b
on a.s_no = b.s_no and a.s_age = b.s_age;
代码语言:javascript复制{"input_tables":[{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"}]}
Now both tables scan only two partitions each.
3.4 explain authorization
Shows authorization-related information for a query.
explain authorization
select variance(s_score) from student_tb_orc;
INPUTS:
default@student_tb_orc # input database@table
OUTPUTS: # path where the output data is written
file:/tmp/root/3cb0f818-35ba-44a0-9fb4-f71e4ce56b4d/hive_2023-03-11_06-00-29_288_4051189352612619596-1/-mr-10001
CURRENT_USER: # current user
root
OPERATION: # the type of data operation, here a query
QUERY
3.5 explain vectorization
Available since Hive 2.3; vectorized execution processes batches of rows at a time to improve data-processing performance.
set hive.vectorized.execution.enabled=true;
explain vectorization only
select count(1) from student_tb_orc;
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
If it is set to false instead:
PLAN VECTORIZATION:
enabled: false
enabledConditionsNotMet: [hive.vectorized.execution.enabled IS false]
There are other variants as well: explain vectorization operator and explain vectorization expression.
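For instance (a sketch; the operator variant annotates each operator in the plan with its vectorization status):
explain vectorization operator
select count(1) from student_tb_orc;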
4. Interpreting SQL Execution Plans
- select ... from ... where ...: this kind of SQL has no reduce stage
- select func(col) from ... where func(col), or select operation from ... where operation: SQL with functions or operators likewise has no reduce stage, as the sketch below shows
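A minimal sketch (reusing student_tb_orc); the resulting plan contains only a map-side operator tree, with no Reduce Operator Tree:
explain
select s_no, s_age + 1 from student_tb_orc
where s_age < 20;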
- SQL with aggregate functions:
set hive.map.aggr=true;
explain select s_age, avg(s_score) from student_tb_orc
where s_age < 20
group by s_age;
Run it with hive.map.aggr enabled and disabled, and compare the plans and logs.
When it is enabled, an aggregation (a Group By Operator) already appears in the map stage.
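For comparison, a sketch of the disabled case; with hive.map.aggr=false the map stage only filters and projects, and all aggregation happens in the reducer:
set hive.map.aggr=false;
explain select s_age, avg(s_score) from student_tb_orc
where s_age < 20
group by s_age;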
- JOIN SQL
explain
select a.s_no from student_tb_orc a inner join student_orc_partition b
on a.s_no=b.s_no;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: s_no (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: s_no (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Join Operator           # the join operation
          condition map:
               Inner Join 0 to 1
          # 0 is dataset a, output by the map stage
          # 1 is dataset b, output by the map stage
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0
          Statistics: Num rows: 22000000 Data size: 33474320725 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 22000000 Data size: 33474320725 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
select a.s_no from student_tb_orc a
left semi join student_orc_partition b
on a.s_no=b.s_no;
select a.s_no from student_tb_orc a where a.s_no
in (select s_no from student_orc_partition)
The two SQL statements above are equivalent and produce the same execution plan.
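To verify, prefix both statements with explain and compare the output, e.g. (a sketch):
explain
select a.s_no from student_tb_orc a
left semi join student_orc_partition b
on a.s_no=b.s_no;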
select a.s_no from student_tb_orc a where a.s_age in (21,22,23)
# fewer conditions inside the IN list mean less map and shuffle work
select a.s_no from student_tb_orc a where a.s_age in (21,21,22,23)
5. Data Processing Patterns
5.1 Filter Patterns
- WHERE filtering happens in the map stage, which reduces the amount of data later shipped across machines
select count(s_age) from (
select s_age, count(1) num
from student_tb_seq
group by s_age
) a
where s_age <30 and num>20
Execution time: 268s
select count(s_age) from (
select s_age, count(1) num
from student_tb_seq
where s_age<30
group by s_age
having count(1)>20
) a
Execution time: 223s
- HAVING filtering happens after aggregation
- DISTINCT filtering removes duplicate data; it happens in the reduce stage
- Table filtering: read a table once, use it multiple times
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
explain
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
from student_tb_seq
group by s_age
union all
select s_age,min(s_birth) stat, 'min' tp
from student_tb_seq
group by s_age;
The SQL above has to read the source table twice.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
explain
from student_tb_seq
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
group by s_age
insert into table student_stat partition(tp)
select s_age,min(s_birth) stat,'min' tp
group by s_age;
The from ... insert ... select multi-insert form reads the table only once.
- Partition filtering actually happens in the stage before map: path filtering at the input stage
- Bucket filtering: filtering at the file level
- Column filtering: with ORC or Parquet data, the needed columns can be read directly; there is no need to fetch the whole row and then split it by delimiter to pick out fields
5.2 Aggregation Patterns
- DISTINCT mode: when hive.map.aggr=true is enabled, a query with a DISTINCT clause starts partial aggregation already in the map stage, reducing the amount of data flowing downstream; see the sketch below
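A sketch of the DISTINCT case (reusing student_tb_orc); with the setting on, a Group By Operator appears already in the map stage of the plan:
set hive.map.aggr=true;
explain
select count(distinct s_age) from student_tb_orc;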
- Counting modes:
  - count(col): rows whose value is NULL are not counted, so the concrete column data must be read
  - count(*), count(1): no concrete column data needs to be read; only each row's offset is read
explain
select count(s_score) from student_tb_orc;
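For comparison, the count(1) form (a sketch) needs no concrete column data at all:
explain
select count(1) from student_tb_orc;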
5.3 Aggregations with Computable Intermediate Results
The data can be merged and summarized locally first, reducing the amount of data in transit, e.g. sum, max, min.
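A sketch (reusing student_tb_orc): in the resulting plan, the map-side Group By Operator runs in hash mode and the reduce side in mergepartial mode, i.e. locally merged partial results are combined:
explain
select s_age, sum(s_score), max(s_score), min(s_score)
from student_tb_orc
group by s_age;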
5.4 Aggregations without Computable Intermediate Results
These aggregate multiple rows into a single row, e.g. collect_list(), collect_set().
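For example (a sketch): partial aggregation cannot shrink this data, since every collected element has to be kept and shipped to the reducers:
select s_age, collect_set(s_name)
from student_tb_orc
group by s_age;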
5.5 连接模式
-
repartition
连接,发生在shuffle
和reduce
阶段,map 任务读取 AB两个表的数据,按连接条件发往相同的 reduce,在 reduce 中计算合并的结果 -
map join
先读取小表A,构建其 hash 表,在读取B表,与之前的哈希表进行数据匹配
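A sketch of enabling the automatic map-join conversion (hive.auto.convert.join and hive.mapjoin.smalltable.filesize are standard Hive properties; whether conversion actually happens here depends on the small-table size threshold):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
explain
select a.s_no
from student_tb_orc a
inner join student_orc_partition b
on a.s_no=b.s_no;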
5.6 Optimizers
Rule-based optimization (RBO) and cost-based optimization (CBO, used since Hive 0.14).
- Under CBO, the join order of multi-table joins is optimized automatically
- It recognizes identical join predicates and, depending on the situation, avoids expensive join operations
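A sketch of turning CBO on and collecting the statistics it relies on (standard Hive properties and ANALYZE syntax; defaults and availability vary by version):
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
-- collect table- and column-level statistics for the optimizer
analyze table student_tb_orc compute statistics;
analyze table student_tb_orc compute statistics for columns;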