hive性能调优 读书笔记 - 问题排查、调优、数据处理模式

2023-03-12 13:39:31 浏览数 (2)

learn from 《Hive性能调优实战》

本文使用的 hive 版本 2.3.2

1. 调优思路

  • 分析定位性能瓶颈 优化存储、优化执行过程、优化作业调度
  • 性能监控和告警,开源工具 Grafana、Ambari等

2. 环境搭建

  • Cloudera Docker 搭建伪分布式环境 本人是 win 安装的 docker desktop,在 wsl 中操作
代码语言:javascript复制
docker pull cloudera/quickstart:latest
docker run --hostname=quickstart.Cloudera --privileged=true -it -p 80:80 -p 8888:8888 -p 8088:8088 Cloudera/quickstart /usr/bin/Docker-quickstart
代码语言:javascript复制
hdfs dfsadmin -report
  • docker-compose部署hive、kafka服务

3. 查看执行计划

3.1 explain

  • explain sql
代码语言:javascript复制
explain
select s_age, count(1) num from student_tb_orc
where s_age<30 and s_name like '%红%'
group by s_age
代码语言:javascript复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce  # 执行引擎
      Map Operator Tree: # map 操作
          TableScan
            alias: student_tb_orc  # 扫描的表
            Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator  # 过滤操作
              predicate: ((s_age < 30) and (s_name like '%???%')) (type: boolean)
              # 过滤时使用的谓词
              Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
              # 处理的数据量,预估值
              Select Operator  # 在之前的结果集上列投影,筛选列
                expressions: s_age (type: bigint)  # 筛选的列
                outputColumnNames: _col0 # 输出的列名字
                Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                Group By Operator # 分组聚合
                  aggregations: count(1)  # 分组聚合使用的算法
                  keys: _col0 (type: bigint)  # 分组所在的列
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator # 对之前结果聚合后的输出信息
                    key expressions: _col0 (type: bigint)  # 聚合后的 key, s_age
                    sort order:    # 输出是否排序,  正序, - 倒序
                    Map-reduce partition columns: _col0 (type: bigint) # 分区列
                    Statistics: Num rows: 3333333 Data size: 5071866159 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint) # 聚合后的 value, count(1)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1666666 Data size: 2535932318 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false  # 文件输出的结果是否进行压缩
            Statistics: Num rows: 1666666 Data size: 2535932318 Basic stats: COMPLETE Column stats: NONE
            table:  # 当前操作表的信息
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                # 输入、输出文件的类型
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                # serde 读取数据的序列化和反序列化的方式

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

上面描述的是计算引擎的执行逻辑

如果想看 spark 引擎的执行计划

代码语言:javascript复制
set hive.execution.engine=spark;
explain
select s_age, count(1)
from student_tb_txt
group by s_age;

3.2 explain extended

更具体的信息

  • 抽象语法树, hive 3.0 以后用 explain ast

3.3 explain dependency

sql需要的数据来源

代码语言:javascript复制
hive> explain dependency
    > select s_age, count(1) num from student_tb_orc
    > where s_age < 30 and s_name like '%red%' group by s_age;
OK
{"input_tables":[{"tablename":"default@student_tb_orc","tabletype":"MANAGED_TABLE"}],"input_partitions":[]}
# 库 @ 表, 表类型:内部表,分区
Time taken: 2.01 seconds, Fetched: 1 row(s)
代码语言:javascript复制
hive> explain dependency
    > select s_age, count(1) num from student_orc_partition
    > where s_age < 30 and s_name like '%red%' group by s_age;
OK
{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"}]
# # 库 @ 表 @ 分区列的值}
Time taken: 0.396 seconds, Fetched: 1 row(s)

使用场景:

  • 排除分区数据异常
  • 弄清楚数据输入
代码语言:javascript复制
explain dependency
select a.s_no from student_orc_partition a 
left join student_orc_partition_1 b
on a.s_no=b.s_no 
and a.s_age=b.s_age 
and a.s_age>=22 
and a.s_age<=23;
代码语言:javascript复制
{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=20"},
{"partitionName":"default@student_orc_partition_1@s_age=21"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"},
{"partitionName":"default@student_orc_partition_1@s_age=24"},
{"partitionName":"default@student_orc_partition_1@s_age=25"},
{"partitionName":"default@student_orc_partition_1@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=27"}]}

a 左连接 b,过滤条件是针对左表 a,完全没有起到作用,所有分区都要扫描

代码语言:javascript复制
explain dependency
select a.s_no from student_orc_partition a 
left join student_orc_partition_1 b
on a.s_no=b.s_no 
and a.s_age=b.s_age 
and b.s_age>=22 
and b.s_age<=23;
代码语言:javascript复制
{"input_tables":[{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=20"},
{"partitionName":"default@student_orc_partition@s_age=21"},
{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition@s_age=24"},
{"partitionName":"default@student_orc_partition@s_age=25"},
{"partitionName":"default@student_orc_partition@s_age=26"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"}]}

a 左连接 b,过滤条件是针对右表 b,b表只扫描了指定分区,a 表需要全表扫描

建议: 尽早过滤掉不需要的数据和列

代码语言:javascript复制
explain dependency
select a.s_no
from (select s_no, s_age
      from student_orc_partition
      where s_age >= 22
        and s_age <= 23) a
left join student_orc_partition_1 b
on a.s_no = b.s_no and a.s_age = b.s_age;
代码语言:javascript复制
{"input_tables":[{"tablename":"default@student_orc_partition_1","tabletype":"MANAGED_TABLE"},
{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}],"input_partitions":
[{"partitionName":"default@student_orc_partition@s_age=22"},
{"partitionName":"default@student_orc_partition@s_age=23"},
{"partitionName":"default@student_orc_partition_1@s_age=22"},
{"partitionName":"default@student_orc_partition_1@s_age=23"}]}

两个表都只扫描了 2个分区的数据

3.4 explain authorization

查询权限相关的信息

代码语言:javascript复制
explain authorization
select variance(s_score) from student_tb_orc;
代码语言:javascript复制
INPUTS: 
  default@student_tb_orc # 输入库@表
OUTPUTS:  # 输出数据的路径
  file:/tmp/root/3cb0f818-35ba-44a0-9fb4-f71e4ce56b4d/hive_2023-03-11_06-00-29_288_4051189352612619596-1/-mr-10001
CURRENT_USER: # 当前用户
  root
OPERATION: # 数据操作是查询
  QUERY

3.5 explain vectorization

hive 2.3 以后可用,向量化执行,提高数据处理性能

代码语言:javascript复制
set hive.vectorized.execution.enabled=true;
explain vectorization only
select count(1) from student_tb_orc;
代码语言:javascript复制
PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator

如果改为 false

代码语言:javascript复制
PLAN VECTORIZATION:
  enabled: false
  enabledConditionsNotMet: [hive.vectorized.execution.enabled IS false]

还有其他命令 explain vectorization operatorexplain vectorization expression

4. sql执行计划解读

  • select from where 这类sql,没有reduce阶段
  • select func(col) from where func(col), select operation from where operation 带函数、操作符的sql,跟上面一样也没有 reduce 阶段
  • 带聚合函数的sql
代码语言:javascript复制
set hive.map.aggr=true;
explain select s_age, avg(s_score) from student_tb_orc
where s_age < 20
group by s_age;

分别设置 hive.map.aggr 开启、不开启 查看日志

开启的话,在 map 阶段就有 reduce 操作

  • join sql
代码语言:javascript复制
explain
select a.s_no from student_tb_orc a inner join student_orc_partition b
on a.s_no=b.s_no;
代码语言:javascript复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: s_no (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order:  
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 20000000 Data size: 30431200000 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: s_no (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order:  
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 20000000 Data size: 30274000639 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Join Operator  # 连接操作
          condition map:
               Inner Join 0 to 1 
               # 0 表示map阶段输出的a数据集
               # 1 是b输出的数据集
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0
          Statistics: Num rows: 22000000 Data size: 33474320725 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 22000000 Data size: 33474320725 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
代码语言:javascript复制
select a.s_no from student_tb_orc a 
left semi join student_orc_partition b
on a.s_no=b.s_no;

select a.s_no from student_tb_orc a where a.s_no 
in (select s_no from student_orc_partition)

以上两个 sql 的效果是等效的,执行计划一致

代码语言:javascript复制
select a.s_no from student_tb_orc a where a.s_age in (21,22,23)
# in 里面的条件少,可以减少 map 和 shuffle 的数量
select a.s_no from student_tb_orc a where a.s_age in (21,21,22,23)

5. 数据处理模式

5.1 过滤模式

  • where 过滤,发生在 map 阶段,可以减少后序数据跨机器传输
代码语言:javascript复制
select count(s_age) from (
	select s_age, count(1) num
	from student_tb_seq
	group by s_age
) a
where s_age <30 and num>20

执行时间 268s

代码语言:javascript复制
select count(s_age) from (
	select s_age, count(1) num
	from student_tb_seq
	where s_age<30
	group by s_age
		having count(1)>20
) a

执行时间 223s

  • having 过滤,在聚合后
  • distinct 过滤,用于过滤重复数据,发生在 reduce 阶段
  • 表过滤,一次读取,多次使用
代码语言:javascript复制
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
explain
insert into table student_stat partition(tp)
select s_age,max(s_birth) stat,'max' tp
from student_tb_seq
group by s_age
union all
select s_age,min(s_birth) stat, 'min' tp
from student_tb_seq
group by s_age;

以上sql需要读取两次原始表

代码语言:javascript复制
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
explain
from student_tb_seq
insert into table student_stat partition(tp)
select s_age,min(s_birth) stat,'max' stat
group by s_age
insert into table  student_stat partition(tp)
select s_age,max(s_birth) stat,'min' stat
group by s_age;

from ... select 形式,只读取了表一次

  • 分区过滤,其实是在 map的上一个阶段(输入阶段进行路径过滤)
  • 分桶过滤,对文件的过滤
  • 列过滤,orc 、parquet 格式的数据,可以直接读取字段,不需要取出整行,再按分隔符切割后选取字段

5.2 聚合模式

  • distinct 模式,开启了 hive.map.aggr=true 时,使用 distinct 子句时,在 map 阶段就会开始局部的聚合,减少流转到下游的数据量
  • 计数聚合模式 –count(col),值为 null 的不会被计数,需要读取具体的数据 –count(*),count(1),不需要读取具体数据,是读取每行数据的偏移量
代码语言:javascript复制
explain
Select count(s_score) from student_tb_orc;

5.3 可计算中间结果的聚合

数据可以局部归并汇总,减少数据流转量 例如,求 sum,max, min

5.4 不可计算中间结果的聚合

将多行数据聚合到一行当中,如 collect_list(),collect_set()

5.5 连接模式

  • repartition 连接,发生在 shufflereduce 阶段,map 任务读取 AB两个表的数据,按连接条件发往相同的 reduce,在 reduce 中计算合并的结果
  • map join 先读取小表A,构建其 hash 表,在读取B表,与之前的哈希表进行数据匹配

5.6 优化器

基于规则的RBO、基于成本的CBO(hive 0.14以后用的)

  • CBO下, 多表连接,自动优化连接顺序
  • 识别相同的连接谓词,根据情况避免高昂的表连接操作

0 人点赞