Spark SQL Source Code Series | The ResolveReferences Rule and count(*) in Detail

2022-06-09 21:30:14

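This article is based on Spark 3.2.

It works through a homework question left over from the previous source-code debugging session:

1. For select * from TESTDATA2, analyze how the * is handled and see how it is converted into the corresponding columns. This matches the following code in ResolveReferences: case p: Project if containsStar(p.projectList) => p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))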

SQL:

select * from testdata2

The corresponding AST:

The unresolved logical plan, the resolved logical plan, and the rules applied in between:

An overview of all the rules used to produce the resolved logical plan


==  Parsed Logical Plan  ==
'Project [*]
 - 'UnresolvedRelation [testdata2], [], false


//*********************** Rule 1 ************************
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
'Project [*]
 - SubqueryAlias testdata2
    - View (`testData2`, [a#3,b#4])
       - SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
          - ExternalRDD [obj#2]

//*********************** Rule 2 ************************
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
Project [a#3, b#4]
 - SubqueryAlias testdata2
    - View (`testData2`, [a#3,b#4])
       - SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
          - ExternalRDD [obj#2]

== Analyzed Logical Plan ==
Project [a#3, b#4]
 - SubqueryAlias testdata2
    - View (`testData2`, [a#3,b#4])
       - SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
          - ExternalRDD [obj#2]

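Source code walkthrough

The main question is how Project [*] becomes Project [a#3, b#4]. The role of the ResolveReferences rule was already covered in the source-code reading session:

Its main job is to replace UnresolvedAttribute with AttributeReference.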
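To make that replacement concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's code: the types only borrow the names UnresolvedAttribute and AttributeReference, and the resolve helper and its case-insensitive name lookup are assumptions made for illustration.

```scala
// Toy model only: these types mimic Catalyst's names but are NOT Spark's real classes.
object ResolveAttributeSketch {
  sealed trait Expr
  // A column referenced by name only, before analysis.
  final case class UnresolvedAttribute(name: String) extends Expr
  // A resolved column: a name plus a unique id, like a#3 in the plan output.
  final case class AttributeReference(name: String, exprId: Long) extends Expr

  // Hypothetical helper: look the name up in the child's output.
  // The case-insensitive match is an assumption for this sketch.
  // If nothing matches, the expression is left unresolved.
  def resolve(expr: Expr, childOutput: Seq[AttributeReference]): Expr = expr match {
    case UnresolvedAttribute(n) =>
      childOutput.find(_.name.equalsIgnoreCase(n)).getOrElse(expr)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val childOutput = Seq(AttributeReference("a", 3L), AttributeReference("b", 4L))
    println(resolve(UnresolvedAttribute("A"), childOutput))
    println(resolve(UnresolvedAttribute("c"), childOutput))
  }
}
```

The point of the sketch is only that resolution reuses the attribute the child already exposes instead of creating a new one.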

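From the code, we can see that the * is expanded by the buildExpandedProjectList method:

The * is an UnresolvedStar, and UnresolvedStar is a subclass of Star:

So the first case is taken, which calls expand, and that ultimately invokes the expand method of UnresolvedStar: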
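The flow described above can be sketched as a small self-contained Scala model. This is a simplification, not Spark's implementation. The names containsStar, buildExpandedProjectList, Star and UnresolvedStar come from the article, but their bodies here are assumptions. Relation and resolveStar are hypothetical, and the toy star ignores qualifiers such as t.*.

```scala
// Toy model only: it mirrors the shape of the rule but is NOT Spark's real code.
object StarExpansionSketch {
  sealed trait Expr
  // A resolved column, printed like a#3.
  final case class AttributeReference(name: String, exprId: Long) extends Expr {
    override def toString: String = s"$name#$exprId"
  }
  // Base type for star expressions; each subclass decides how to expand itself.
  sealed trait Star extends Expr {
    def expand(input: LogicalPlan): Seq[Expr]
  }
  // Unqualified `*`: expands to every attribute the input plan exposes.
  case object UnresolvedStar extends Star {
    def expand(input: LogicalPlan): Seq[Expr] = input.output
  }

  sealed trait LogicalPlan { def output: Seq[AttributeReference] }
  // Hypothetical leaf node that already knows its resolved output.
  final case class Relation(output: Seq[AttributeReference]) extends LogicalPlan
  final case class Project(projectList: Seq[Expr], child: LogicalPlan) extends LogicalPlan {
    def output: Seq[AttributeReference] =
      projectList.collect { case a: AttributeReference => a }
  }

  def containsStar(exprs: Seq[Expr]): Boolean = exprs.exists(_.isInstanceOf[Star])

  // Replace each star with the child's attributes; keep other expressions as-is.
  def buildExpandedProjectList(exprs: Seq[Expr], child: LogicalPlan): Seq[Expr] =
    exprs.flatMap {
      case s: Star => s.expand(child)
      case other   => Seq(other)
    }

  // The single case quoted from ResolveReferences, applied to the toy plan.
  def resolveStar(plan: LogicalPlan): LogicalPlan = plan match {
    case p: Project if containsStar(p.projectList) =>
      p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val child = Relation(Seq(AttributeReference("a", 3L), AttributeReference("b", 4L)))
    println(resolveStar(Project(Seq(UnresolvedStar), child)))
  }
}
```

In this model the project list ends up holding the very same a#3 and b#4 that the child exposes, which matches the change from Project [*] to Project [a#3, b#4] in the plan dump.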

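Let's debug and see what input.output contains:

Here input is the SubqueryAlias node, and its output method simply follows the logical plan's output method, which was explained in detail in the previous source-code reading session:

To summarize: at each step, output assigns the upper node's Attribute from the already resolved Attribute below it, which guarantees that the two Attributes refer to the same attribute.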
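The summary can be illustrated with another toy Scala model of the plan in this example. It is a sketch under stated assumptions, not Spark's classes: the node names follow the plan dump, but their fields are simplified. The qualifier that the toy SubqueryAlias attaches is an illustrative assumption, added to show that a node may decorate an attribute while keeping its id.

```scala
// Toy model only: node names follow the plan dump, but this is NOT Spark's real code.
object OutputPropagationSketch {
  // exprId identifies the column; qualifier records which alias exposes it.
  final case class Attribute(name: String, exprId: Long, qualifier: Seq[String] = Nil)

  sealed trait LogicalPlan { def output: Seq[Attribute] }

  // Bottom of the tree: the place where a#3 and b#4 are introduced.
  final case class SerializeFromObject(output: Seq[Attribute]) extends LogicalPlan

  // Passes the child's attributes through unchanged.
  final case class View(name: String, child: LogicalPlan) extends LogicalPlan {
    def output: Seq[Attribute] = child.output
  }

  // Re-exposes the child's attributes under the alias; exprId is preserved.
  final case class SubqueryAlias(alias: String, child: LogicalPlan) extends LogicalPlan {
    def output: Seq[Attribute] = child.output.map(_.copy(qualifier = Seq(alias)))
  }

  def main(args: Array[String]): Unit = {
    val leaf = SerializeFromObject(Seq(Attribute("a", 3L), Attribute("b", 4L)))
    val plan = SubqueryAlias("testdata2", View("testData2", leaf))
    // Every level reports the same exprIds as the leaf.
    println(plan.output.map(a => s"${a.name}#${a.exprId}"))
  }
}
```

Because each node derives its output from its child, the ids seen at the top of the tree are the ones created at the bottom, which is why the expanded project list can refer to a#3 and b#4 directly.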
