WITH AS短语,也叫做子查询部分(subquery factoring),可以定义一个SQL片断,该SQL片断可以被后面的SQL语句引用,从而使SQL语句的可读性更高。
在关系型数据库
中,对于UNION ALL,使用WITH AS定义了一个UNION ALL语句,当该片断被调用2次以上,优化器会自动将该WITH AS短语所获取的数据放入一个Temp表中。
Hive在后面的版本也引入了WITH AS 这个公用表表达式(CTE)语法糖,但是对于后面语句的多次引用是否会继续将该WITH AS短语所获取的数据放入一个Temp表中呢?下面将通过对SQL的执行计划进行分析得到结论。
代码语言:javascript复制explain
with top10 as
(
select
a.member_id,
a.store_id,
a.store_name,
count(a.order_id) freq
from
ads_member_label_p_ordertop10 a
where
dt = '20210614'
group by
a.member_id,
a.store_id,
a.store_name
)
select
*
from
top10
union all
select
*
from
top10
通过执行上述HQL可以得到如下执行计划:
代码语言:javascript复制STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3
Stage-3 is a root stage
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
filterExpr: (dt = '20210614') (type: boolean)
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
outputColumnNames: member_id, order_id, store_id, store_name
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(order_id)
keys: member_id (type: string), store_id (type: string), store_name (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
sort order:
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Union
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: a
filterExpr: (dt = '20210614') (type: boolean)
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
outputColumnNames: member_id, order_id, store_id, store_name
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(order_id)
keys: member_id (type: string), store_id (type: string), store_name (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
sort order:
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 24642460 Data size: 16682132396 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 12321230 Data size: 8341066198 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
代码语言:javascript复制STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3
Stage-3 is a root stage
Stage-0 depends on stages: Stage-2
通过对执行计划进行分析可以得知Stage-1和Stage-3对表ads_member_label_p_ordertop10分别执行了两次WITH AS内逻辑,由STAGE DEPENDENCIES得知而且在Stage-2依赖于Stage-1和Stage-3,且Stage-2中UNOIN操作对表的扫描操作(TableScan)并未进行数据过滤,Stage1-3影响的数据行数均为24642460行。
所以在Hive内 WITH AS语法默认并不会将执行结果作为临时表存入内存,而是每一次引用都会执行一次WITH AS内的计算逻辑,而MySQL和SqlServer这种关系型数据库执行WITH AS时会将计算结果作为临时表。
在hive中有一个参数
代码语言:javascript复制 hive.optimize.cte.materialize.threshold
这个参数在默认情况下是-1(关闭的);当开启(大于0),比如设置为2,则如果with…as语句被引用2次及以上时,会把with…as语句生成的table物化,从而做到with…as语句只执行一次,来提高效率。
将hive.optimize.cte.materialize.threshold设置为1,可以得到如下执行计划
代码语言:javascript复制STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-3 depends on stages: Stage-0
Stage-2 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
filterExpr: (dt = '20210620') (type: boolean)
Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: member_id (type: string), order_id (type: string), store_id (type: string), store_name (type: string)
outputColumnNames: member_id, order_id, store_id, store_name
Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(order_id)
keys: member_id (type: string), store_id (type: string), store_name (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
sort order:
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 32891415 Data size: 22266423831 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 16445707 Data size: 11133211577 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: dv_report.top10
Stage: Stage-0
Move Operator
files:
hdfs directory: true
destination: hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: top10
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
alias: top10
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: member_id (type: string), store_id (type: string), store_name (type: string), freq (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Fetch Operator
limit: -1
Processor Tree:
ListSink
由执行计划可知,在设置参数之后计算结果作为一张临时表存储在hdfs://iphost/tmp/hive/hive/16166a7f-d924-442a-9763-24b59ce050cb/_tmp_space.db/249708d7-bbc6-491b-b871-6b11385798d0
路径下,即通过Stage-0阶段将计算结果存入一张临时表内。