This article is based on Spark 3.2.
It works through a homework exercise left over from the last source-code debugging session.
1. For select * from TESTDATA2, analyze the * case and see how * is converted into the corresponding columns. It matches this code in ResolveReferences: case p: Project if containsStar(p.projectList) => p.copy(projectList = buildExpandedProjectList(p.projectList, p.child))
sql:
select * from testdata2
The corresponding AST:
The unresolved logical plan, the resolved logical plan, and the rules applied in between:
Overview of all the rules used to produce the resolved logical plan
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation [testdata2], [], false
//*********************** Rule 1 ************************
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
'Project [*]
+- SubqueryAlias testdata2
   +- View (`testData2`, [a#3,b#4])
      +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
         +- ExternalRDD [obj#2]
//*********************** Rule 2 ************************
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
Project [a#3, b#4]
+- SubqueryAlias testdata2
   +- View (`testData2`, [a#3,b#4])
      +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
         +- ExternalRDD [obj#2]
== Analyzed Logical Plan ==
Project [a#3, b#4]
+- SubqueryAlias testdata2
   +- View (`testData2`, [a#3,b#4])
      +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
         +- ExternalRDD [obj#2]
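Source code walkthrough
The main question is how Project [*] becomes Project [a#3, b#4]. The role of the ResolveReferences rule was covered in the source-code reading session:
it mainly replaces UnresolvedAttribute with AttributeReference.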
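To make the idea of replacing UnresolvedAttribute with AttributeReference concrete, here is a minimal sketch in plain Scala. The classes below are hypothetical, heavily simplified stand-ins and not the Catalyst classes themselves; the case-insensitive name lookup is an assumption made for illustration, whereas the real rule uses a configurable resolver and also handles qualifiers and nested fields.

```scala
object ResolveAttributeSketch {
  // Hypothetical, simplified stand-ins for Catalyst's expression classes.
  sealed trait Expr
  final case class UnresolvedAttribute(name: String) extends Expr
  final case class AttributeReference(name: String, exprId: Long) extends Expr

  // Replace each unresolved name with the child attribute of the same name.
  // Assumption: case-insensitive matching; names without a match stay unresolved.
  def resolve(projectList: Seq[Expr], childOutput: Seq[AttributeReference]): Seq[Expr] =
    projectList.map {
      case u @ UnresolvedAttribute(name) =>
        val resolved: Option[Expr] = childOutput.find(_.name.equalsIgnoreCase(name))
        resolved.getOrElse(u)
      case other => other
    }

  def main(args: Array[String]): Unit = {
    val childOutput = Seq(AttributeReference("a", 3L), AttributeReference("b", 4L))
    // "a" resolves to the child's attribute with exprId 3; "c" has no match and is left as-is.
    println(resolve(Seq(UnresolvedAttribute("a"), UnresolvedAttribute("c")), childOutput))
  }
}
```

The resolved expression reuses the child's exprId rather than minting a new one, which is what lets the upper node and the lower node refer to the same column.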
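From the code, we can see that * is expanded by the buildExpandedProjectList method:
* is an UnresolvedStar, and UnresolvedStar is a subclass of Star:
So the first case matches and the rule's expand method is called, which in turn calls the expand method of UnresolvedStar: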
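The call chain described above can be sketched as a small, self-contained model. The names mirror those mentioned in the text (Star, UnresolvedStar, containsStar, buildExpandedProjectList), but every definition here is a hypothetical simplification and not the Spark source; in particular, the exact-match qualifier filter for the t.* case is an assumption made to keep the sketch short.

```scala
object StarExpansionSketch {
  // Hypothetical, simplified stand-ins for Catalyst classes.
  sealed trait NamedExpr
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil) extends NamedExpr

  // A star knows how to expand itself against the output attributes of the input plan.
  sealed trait Star extends NamedExpr {
    def expand(inputOutput: Seq[Attr]): Seq[NamedExpr]
  }

  // `*` has no target; `t.*` has target Some(Seq("t")).
  final case class UnresolvedStar(target: Option[Seq[String]]) extends Star {
    def expand(inputOutput: Seq[Attr]): Seq[NamedExpr] = target match {
      case None    => inputOutput                             // plain `*`: every input attribute
      case Some(q) => inputOutput.filter(_.qualifier == q)    // assumption: exact qualifier match
    }
  }

  def containsStar(exprs: Seq[NamedExpr]): Boolean = exprs.exists(_.isInstanceOf[Star])

  // Expand each star in place; leave every other expression untouched.
  def buildExpandedProjectList(exprs: Seq[NamedExpr], childOutput: Seq[Attr]): Seq[NamedExpr] =
    exprs.flatMap {
      case s: Star => s.expand(childOutput)
      case other   => Seq(other)
    }

  def main(args: Array[String]): Unit = {
    val childOutput = Seq(Attr("a", 3L, Seq("testdata2")), Attr("b", 4L, Seq("testdata2")))
    val projectList: Seq[NamedExpr] = Seq(UnresolvedStar(None))
    val expanded =
      if (containsStar(projectList)) buildExpandedProjectList(projectList, childOutput)
      else projectList
    val rendered = expanded.map {
      case Attr(n, id, _) => s"${n}#${id}"
      case other          => other.toString
    }
    println(rendered.mkString("Project [", ", ", "]")) // prints: Project [a#3, b#4]
  }
}
```

Because the expansion is a flatMap, a single * entry in the project list can grow into any number of attributes while the other entries keep their positions.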
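Let's debug and see what input.output contains:
Here input is the SubqueryAlias node, and its output method simply follows the output method of the logical plan, which was covered in detail in the previous source-code reading session: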
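Finally, to summarize: at every step, output assigns the attributes of the upper node from the already resolved attributes of the node beneath it, which guarantees that both refer to the same attribute.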
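The summary can be illustrated with a toy plan whose nodes each derive output from their child. This is only a sketch under stated assumptions: Source, View and SubqueryAlias below are hypothetical simplifications (Source stands in for the bottom node that first produces the attributes), and the qualifier handling is reduced to a single alias.

```scala
object OutputPropagationSketch {
  // Hypothetical, simplified model of attributes and logical plan nodes.
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil)

  sealed trait Plan { def output: Seq[Attr] }

  // Bottom node that first produces the attributes and their exprIds.
  final case class Source(attrs: Seq[Attr]) extends Plan {
    def output: Seq[Attr] = attrs
  }

  // Passes the child's attributes through unchanged.
  final case class View(child: Plan) extends Plan {
    def output: Seq[Attr] = child.output
  }

  // Keeps name and exprId, and only rewrites the qualifier to the alias.
  final case class SubqueryAlias(alias: String, child: Plan) extends Plan {
    def output: Seq[Attr] = child.output.map(_.copy(qualifier = Seq(alias)))
  }

  def main(args: Array[String]): Unit = {
    val source = Source(Seq(Attr("a", 3L), Attr("b", 4L)))
    val plan = SubqueryAlias("testdata2", View(source))
    // The exprIds seen at the top are exactly those produced at the bottom.
    println(plan.output.map(_.exprId) == source.output.map(_.exprId))
  }
}
```

Since a star expands to input.output, the attributes in the resulting project list carry the same exprIds as the node at the bottom of the tree, which matches the a#3 and b#4 seen in every layer of the analyzed plan.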