Hive语法内关于With as 的数据是否会缓存到内存分析

2021-12-07 17:42:58 浏览数 (1)

  WITH AS短语,也叫做子查询部分(subquery factoring),可以定义一个SQL片断,该SQL片断可以被后面的SQL语句引用,从而使SQL语句的可读性更高。

  在关系型数据库中,对于UNION ALL,使用WITH AS定义了一个UNION ALL语句,当该片断被调用2次以上,优化器会自动将该WITH AS短语所获取的数据放入一个Temp表中。

  Hive在后面的版本也引入了WITH AS 这个公用表表达式(CTE)语法糖,但是对于后面语句的多次引用是否会继续将该WITH AS短语所获取的数据放入一个Temp表中呢?下面将通过对SQL的执行计划进行分析得到结论。

代码语言:javascript复制
explain 
with top10 as 
(
select
	a.member_id,
	a.store_id,
	a.store_name,
	count(a.order_id) freq
from
	ads_member_label_p_ordertop10 a 
where
	dt = '20210614'
group by
	a.member_id,
	a.store_id,
	a.store_name
)
select
	*
from
	top10
union all
select
	*
from
	top10

  通过执行上述HQL可以得到如下执行计划:

代码语言:javascript复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order:    
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            Union
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210614') (type: boolean)
            Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order:    
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
代码语言:javascript复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-2

  通过对执行计划进行分析可以得知Stage-1和Stage-3对表ads_member_label_p_ordertop10分别执行了两次WITH AS内逻辑,由STAGE DEPENDENCIES得知而且在Stage-2依赖于Stage-1和Stage-3,且Stage-2中UNOIN操作对表的扫描操作(TableScan)并未进行数据过滤,Stage1-3影响的数据行数均为24642460行。

  所以在Hive内 WITH AS语法默认并不会将执行结果作为临时表存入内存,而是每一次引用都会执行一次WITH AS内的计算逻辑,而MySQL和SqlServer这种关系型数据库执行WITH AS时会将计算结果作为临时表。

  在hive中有一个参数

代码语言:javascript复制
	hive.optimize.cte.materialize.threshold

  这个参数在默认情况下是-1(关闭的);当开启(大于0),比如设置为2,则如果with…as语句被引用2次及以上时,会把with…as语句生成的table物化,从而做到with…as语句只执行一次,来提高效率。

  将hive.optimize.cte.materialize.threshold设置为1,可以得到如下执行计划

代码语言:javascript复制
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            filterExpr: (dt = '20210620') (type: boolean)
            Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
              outputColumnNames: member_id, order_id, store_id, store_name
              Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(order_id)
                keys: member_id (type: string), store_id (type: string), store_name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  sort order:    
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                  Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: dv_report.top10

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            alias: top10
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Union
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

  由执行计划可知,在设置参数之后计算结果作为一张临时表存储在hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0路径下,即通过Stage-0阶段将计算结果存入一张临时表内。

0 人点赞