Semantic analysis is mainly about turning the AST into a QueryBlock. Why do that at all? As the earlier analysis showed, the AST is still very abstract and carries no information about tables or columns. Semantic analysis breaks the AST into modules, stores them in a QueryBlock together with the corresponding metadata, and thereby prepares the ground for generating the logical execution plan.
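To make that concrete, here is a heavily abridged sketch of what a QueryBlock holds. It is paraphrased from Hive 2.x's org.apache.hadoop.hive.ql.parse.QB; the field names are approximate and most members are omitted:

```java
// Abridged, approximate sketch of org.apache.hadoop.hive.ql.parse.QB (Hive 2.x).
// Most fields and all methods are omitted; treat the names as illustrative.
public class QB {
  private HashMap<String, String> aliasToTabs; // table alias -> table name
  private HashMap<String, QBExpr> aliasToSubq; // subquery alias -> nested query block
  private QBParseInfo qbp; // the carved-up AST: select/from/where/group-by subtrees per clause
  private QBMetaData qbm;  // metastore info: resolved Table/Partition objects per alias
}
```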
Let's quickly walk through semantic analysis.
The entry point of the SQL compiler:
```java
BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, tree);
List<HiveSemanticAnalyzerHook> saHooks =
getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
HiveSemanticAnalyzerHook.class);
// Flush the metastore cache. This assures that we don't pick up objects from a previous
// query running in this same thread. This has to be done after we get our semantic
// analyzer (this is when the connection to the metastore is made) but before we analyze,
// because at that point we need access to the objects.
Hive.get().getMSC().flushCache();
// Do semantic analysis and plan generation
if (saHooks != null && !saHooks.isEmpty()) { // Hive's hook mechanism: hooks can pre-screen the statement
HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
hookCtx.setUserName(userName);
hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
hookCtx.setCommand(command);
for (HiveSemanticAnalyzerHook hook : saHooks) {
tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (HiveSemanticAnalyzerHook hook : saHooks) {
hook.postAnalyze(hookCtx, sem.getAllRootTasks());
}
} else {
sem.analyze(tree, ctx); // no hooks configured: go straight into compilation
}
```
Before entering SQL compilation, Hive first checks whether the hive.semantic.analyzer.hook parameter is set. This is Hive's hook mechanism: by implementing the HiveSemanticAnalyzerHook interface you can pre-screen statements, with preAnalyze executing before compilation and postAnalyze after it. Here, though, what we care about most is the compilation step itself: sem.analyze(tree, ctx).
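As a concrete illustration of the hook mechanism, here is a minimal sketch of a custom hook. The package and class name are invented for illustration; the method signatures mirror the preAnalyze/postAnalyze calls in the snippet above:

```java
package com.example.hooks; // hypothetical package and class, for illustration only

import java.io.Serializable;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

// Enabled with: set hive.semantic.analyzer.hook=com.example.hooks.DropTableGuardHook;
public class DropTableGuardHook extends AbstractSemanticAnalyzerHook {

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    // Runs before sem.analyze(): veto DROP TABLE statements up front.
    if (ast.getType() == HiveParser.TOK_DROPTABLE) {
      throw new SemanticException("DROP TABLE is disabled by policy");
    }
    return ast; // whatever tree is returned here is what the analyzer compiles
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    // Runs after analysis; rootTasks is the generated task DAG, handy for auditing.
  }
}
```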
sem comes from BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, tree), an application of the classic factory pattern:
```java
public static BaseSemanticAnalyzer get(QueryState queryState, ASTNode tree)
throws SemanticException {
if (tree.getToken() == null) {
throw new RuntimeException("Empty Syntax Tree");
} else {
HiveOperation opType = commandType.get(tree.getType());
queryState.setCommandType(opType);
switch (tree.getType()) {
case HiveParser.TOK_EXPLAIN:
return new ExplainSemanticAnalyzer(queryState);
case HiveParser.TOK_EXPLAIN_SQ_REWRITE:
return new ExplainSQRewriteSemanticAnalyzer(queryState);
case HiveParser.TOK_LOAD:
return new LoadSemanticAnalyzer(queryState);
case HiveParser.TOK_EXPORT:
return new ExportSemanticAnalyzer(queryState);
case HiveParser.TOK_IMPORT:
return new ImportSemanticAnalyzer(queryState);
case HiveParser.TOK_ALTERTABLE: {
Tree child = tree.getChild(1);
switch (child.getType()) {
case HiveParser.TOK_ALTERTABLE_RENAME:
case HiveParser.TOK_ALTERTABLE_TOUCH:
case HiveParser.TOK_ALTERTABLE_ARCHIVE:
case HiveParser.TOK_ALTERTABLE_UNARCHIVE:
case HiveParser.TOK_ALTERTABLE_ADDCOLS:
case HiveParser.TOK_ALTERTABLE_RENAMECOL:
case HiveParser.TOK_ALTERTABLE_REPLACECOLS:
case HiveParser.TOK_ALTERTABLE_DROPPARTS:
case HiveParser.TOK_ALTERTABLE_ADDPARTS:
case HiveParser.TOK_ALTERTABLE_PARTCOLTYPE:
case HiveParser.TOK_ALTERTABLE_PROPERTIES:
case HiveParser.TOK_ALTERTABLE_DROPPROPERTIES:
case HiveParser.TOK_ALTERTABLE_EXCHANGEPARTITION:
case HiveParser.TOK_ALTERTABLE_SKEWED:
case HiveParser.TOK_ALTERTABLE_DROPCONSTRAINT:
case HiveParser.TOK_ALTERTABLE_ADDCONSTRAINT:
queryState.setCommandType(commandType.get(child.getType()));
return new DDLSemanticAnalyzer(queryState);
}
opType =
tablePartitionCommandType.get(child.getType())[tree.getChildCount() > 2 ? 1 : 0];
queryState.setCommandType(opType);
return new DDLSemanticAnalyzer(queryState);
}
case HiveParser.TOK_ALTERVIEW: {
Tree child = tree.getChild(1);
switch (child.getType()) {
case HiveParser.TOK_ALTERVIEW_PROPERTIES:
case HiveParser.TOK_ALTERVIEW_DROPPROPERTIES:
case HiveParser.TOK_ALTERVIEW_ADDPARTS:
case HiveParser.TOK_ALTERVIEW_DROPPARTS:
case HiveParser.TOK_ALTERVIEW_RENAME:
opType = commandType.get(child.getType());
queryState.setCommandType(opType);
return new DDLSemanticAnalyzer(queryState);
}
// TOK_ALTERVIEW_AS
assert child.getType() == HiveParser.TOK_QUERY;
queryState.setCommandType(HiveOperation.ALTERVIEW_AS);
return new SemanticAnalyzer(queryState);
}
case HiveParser.TOK_CREATEDATABASE:
case HiveParser.TOK_DROPDATABASE:
case HiveParser.TOK_SWITCHDATABASE:
case HiveParser.TOK_DROPTABLE:
case HiveParser.TOK_DROPVIEW:
case HiveParser.TOK_DESCDATABASE:
case HiveParser.TOK_DESCTABLE:
case HiveParser.TOK_DESCFUNCTION:
case HiveParser.TOK_MSCK:
case HiveParser.TOK_ALTERINDEX_REBUILD:
case HiveParser.TOK_ALTERINDEX_PROPERTIES:
case HiveParser.TOK_SHOWDATABASES:
case HiveParser.TOK_SHOWTABLES:
case HiveParser.TOK_SHOWCOLUMNS:
case HiveParser.TOK_SHOW_TABLESTATUS:
case HiveParser.TOK_SHOW_TBLPROPERTIES:
case HiveParser.TOK_SHOW_CREATEDATABASE:
case HiveParser.TOK_SHOW_CREATETABLE:
case HiveParser.TOK_SHOWFUNCTIONS:
case HiveParser.TOK_SHOWPARTITIONS:
case HiveParser.TOK_SHOWINDEXES:
case HiveParser.TOK_SHOWLOCKS:
case HiveParser.TOK_SHOWDBLOCKS:
case HiveParser.TOK_SHOW_COMPACTIONS:
case HiveParser.TOK_SHOW_TRANSACTIONS:
case HiveParser.TOK_ABORT_TRANSACTIONS:
case HiveParser.TOK_SHOWCONF:
case HiveParser.TOK_CREATEINDEX:
case HiveParser.TOK_DROPINDEX:
case HiveParser.TOK_ALTERTABLE_CLUSTER_SORT:
case HiveParser.TOK_LOCKTABLE:
case HiveParser.TOK_UNLOCKTABLE:
case HiveParser.TOK_LOCKDB:
case HiveParser.TOK_UNLOCKDB:
case HiveParser.TOK_CREATEROLE:
case HiveParser.TOK_DROPROLE:
case HiveParser.TOK_GRANT:
case HiveParser.TOK_REVOKE:
case HiveParser.TOK_SHOW_GRANT:
case HiveParser.TOK_GRANT_ROLE:
case HiveParser.TOK_REVOKE_ROLE:
case HiveParser.TOK_SHOW_ROLE_GRANT:
case HiveParser.TOK_SHOW_ROLE_PRINCIPALS:
case HiveParser.TOK_SHOW_ROLES:
case HiveParser.TOK_ALTERDATABASE_PROPERTIES:
case HiveParser.TOK_ALTERDATABASE_OWNER:
case HiveParser.TOK_TRUNCATETABLE:
case HiveParser.TOK_SHOW_SET_ROLE:
case HiveParser.TOK_CACHE_METADATA:
return new DDLSemanticAnalyzer(queryState);
case HiveParser.TOK_CREATEFUNCTION:
case HiveParser.TOK_DROPFUNCTION:
case HiveParser.TOK_RELOADFUNCTION:
return new FunctionSemanticAnalyzer(queryState);
case HiveParser.TOK_ANALYZE:
return new ColumnStatsSemanticAnalyzer(queryState);
case HiveParser.TOK_CREATEMACRO:
case HiveParser.TOK_DROPMACRO:
return new MacroSemanticAnalyzer(queryState);
case HiveParser.TOK_UPDATE_TABLE:
case HiveParser.TOK_DELETE_FROM:
return new UpdateDeleteSemanticAnalyzer(queryState);
case HiveParser.TOK_START_TRANSACTION:
case HiveParser.TOK_COMMIT:
case HiveParser.TOK_ROLLBACK:
case HiveParser.TOK_SET_AUTOCOMMIT:
default: {
SemanticAnalyzer semAnalyzer = HiveConf // if CBO is enabled, use CalcitePlanner; otherwise plain SemanticAnalyzer
.getBoolVar(queryState.getConf(), HiveConf.ConfVars.HIVE_CBO_ENABLED) ?
new CalcitePlanner(queryState) : new SemanticAnalyzer(queryState);
return semAnalyzer;
}
}
}
}
```
Hive has a dedicated analyzer for each kind of SQL: EXPLAIN goes through ExplainSemanticAnalyzer, DDL through DDLSemanticAnalyzer, LOAD through LoadSemanticAnalyzer, and so on. The factory pattern keeps these features isolated from one another, decouples them to a degree, and improves extensibility: if some day a new statement type needs its own compilation path, you simply develop a new analyzer class and register it in SemanticAnalyzerFactory.
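As a toy, self-contained illustration of that extensibility argument (all names below are invented; this is not Hive code):

```java
// A miniature factory mirroring SemanticAnalyzerFactory's structure.
interface Analyzer {
  void analyze(String tree);
}

class QueryAnalyzer implements Analyzer {
  public void analyze(String tree) { System.out.println("compile query: " + tree); }
}

class DdlAnalyzer implements Analyzer {
  public void analyze(String tree) { System.out.println("compile ddl: " + tree); }
}

public final class AnalyzerFactory {
  // Supporting a new statement type means one new class plus one new case here;
  // the existing analyzers stay untouched.
  static Analyzer get(String tokenType) {
    switch (tokenType) {
      case "TOK_QUERY":     return new QueryAnalyzer();
      case "TOK_DROPTABLE": return new DdlAnalyzer();
      default: throw new IllegalArgumentException("unknown token: " + tokenType);
    }
  }

  public static void main(String[] args) {
    get("TOK_QUERY").analyze("(TOK_QUERY ...)"); // prints: compile query: (TOK_QUERY ...)
  }
}
```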
Most of the time, however, we are running queries, and this source walk-through is organized around queries, so we fall into the default branch. There we find a check on hive.cbo.enable: if it is true, compilation goes through the CalcitePlanner class, otherwise through the SemanticAnalyzer class. CBO is cost-based optimization, a powerful feature that Hive 2.x invested heavily in, and it is enabled by default. While studying the source we will first switch CBO off and come back to it in a dedicated discussion later.
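If you want to verify which branch you will land in while tracing, you can read the flag the factory consults; HIVE_CBO_ENABLED is the HiveConf constant backing hive.cbo.enable. A minimal sketch (it only reads and writes the conf; no query is run):

```java
import org.apache.hadoop.hive.conf.HiveConf;

public class CboFlagDemo {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Force the non-CBO path, i.e. SemanticAnalyzer instead of CalcitePlanner.
    conf.setBoolVar(HiveConf.ConfVars.HIVE_CBO_ENABLED, false);
    boolean cbo = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_CBO_ENABLED);
    System.out.println("CBO enabled: " + cbo); // prints: CBO enabled: false
  }
}
```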
HQL compilation boils down to the SemanticAnalyzer.analyzeInternal method:
```java
void analyzeInternal(ASTNode ast, PlannerContext plannerCtx) throws SemanticException {
// 1. Generate Resolved Parse tree from syntax tree
LOG.info("Starting Semantic Analysis");
if (!genResolvedParseTree(ast, plannerCtx)) { // semantic analysis
return;
}
// 2. Gen OP Tree from resolved Parse Tree
Operator sinkOp = genOPTree(ast, plannerCtx); // generate the logical plan (operator tree)
...
...
// 7. Perform Logical optimization
if (LOG.isDebugEnabled()) {
LOG.debug("Before logical optimizationn" Operator.toString(pCtx.getTopOps().values()));
}
Optimizer optm = new Optimizer();
optm.setPctx(pCtx);
optm.initialize(conf);
pCtx = optm.optimize(); // optimize the logical plan
if (pCtx.getColumnAccessInfo() != null) {
// set ColumnAccessInfo for view column authorization
setColumnAccessInfo(pCtx.getColumnAccessInfo());
}
FetchTask origFetchTask = pCtx.getFetchTask();
if (LOG.isDebugEnabled()) {
LOG.debug("After logical optimizationn" Operator.toString(pCtx.getTopOps().values()));
}
// 8. Generate column access stats if required - wait until column pruning
// takes place during optimization
boolean isColumnInfoNeedForAuth = SessionState.get().isAuthorizationModeV2()
&& HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_AUTHORIZATION_ENABLED);
if (isColumnInfoNeedForAuth
|| HiveConf.getBoolVar(this.conf, HiveConf.ConfVars.HIVE_STATS_COLLECT_SCANCOLS)) {
ColumnAccessAnalyzer columnAccessAnalyzer = new ColumnAccessAnalyzer(pCtx);
// view column access info is carried by this.getColumnAccessInfo().
setColumnAccessInfo(columnAccessAnalyzer.analyzeColumnAccess(this.getColumnAccessInfo()));
}
// 9. Optimize Physical op tree & Translate to target execution engine (MR,
// TEZ..)
if (!ctx.getExplainLogical()) {
TaskCompiler compiler = TaskCompilerFactory.getCompiler(conf, pCtx);
compiler.init(queryState, console, db);
compiler.compile(pCtx, rootTasks, inputs, outputs); // generate and optimize the physical plan
fetchTask = pCtx.getFetchTask();
}
LOG.info("Completed plan generation");
// 10. put accessed columns to readEntity
if (HiveConf.getBoolVar(this.conf, HiveConf.ConfVars.HIVE_STATS_COLLECT_SCANCOLS)) {
putAccessedColumnsToReadEntity(inputs, columnAccessInfo);
}
// 11. if desired check we're not going over partition scan limits
if (!ctx.getExplain()) {
enforceScanLimits(pCtx, origFetchTask);
}
return;
}
```
Semantic analysis is step 1 in this code, the genResolvedParseTree method:
```java
boolean genResolvedParseTree(ASTNode ast, PlannerContext plannerCtx) throws SemanticException {
ASTNode child = ast;
this.ast = ast;
viewsExpanded = new ArrayList<String>();
ctesExpanded = new ArrayList<String>();
.....
// 4. continue analyzing from the child ASTNode.
Phase1Ctx ctx_1 = initPhase1Ctx();
preProcessForInsert(child, qb);
if (!doPhase1(child, qb, ctx_1, plannerCtx)) { // core step 1
// if phase1Result false return
return false;
}
LOG.info("Completed phase 1 of Semantic Analysis");
// 5. Resolve Parse Tree
// Materialization is allowed if it is not a view definition
getMetaData(qb, createVwDesc == null); // core step 2
LOG.info("Completed getting MetaData in Semantic Analysis");
plannerCtx.setParseTreeAttr(child, ctx_1);
return true;
}
```
genResolvedParseTree is a fairly long method; its two core calls are doPhase1 and getMetaData.
doPhase1 carves the AST into pieces and files them into the corresponding QB, while getMetaData fills the QB with table and column metadata; both are sketched below.
Once the two have run, the semantic-analysis stage is complete.
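For a feel of what doPhase1 actually does, here is a heavily abridged, paraphrased sketch of its central dispatch (not verbatim Hive code): it walks the AST recursively and files each clause subtree into the QB's QBParseInfo.

```java
// Abridged/paraphrased from SemanticAnalyzer.doPhase1 (Hive 2.x); not verbatim.
switch (ast.getToken().getType()) {
  case HiveParser.TOK_SELECT:
    qb.countSel();                                // note that this QB has a SELECT
    qbp.setSelExprForClause(ctx_1.dest, ast);     // stash the select-expression subtree
    break;
  case HiveParser.TOK_WHERE:
    qbp.setWhrExprForClause(ctx_1.dest, ast);     // stash the where-clause subtree
    break;
  case HiveParser.TOK_GROUPBY:
    qbp.setGroupByExprForClause(ctx_1.dest, ast); // stash the group-by subtree
    break;
  // ... TOK_FROM, TOK_INSERT_INTO, TOK_ORDERBY and many more cases
}
```

getMetaData, again sketched in abridged and paraphrased form, then resolves every alias doPhase1 recorded against the metastore and attaches the result to the QB:

```java
// Abridged/paraphrased from SemanticAnalyzer.getMetaData (Hive 2.x); not verbatim.
for (String alias : qb.getTabAliases()) {
  String tabName = qb.getTabNameForAlias(alias);
  Table tab = db.getTable(tabName);            // round-trip to the metastore
  qb.getMetaData().setSrcForAlias(alias, tab); // the QBMetaData now carries the schema
}
```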
To summarize the code path of semantic analysis:
Driver.compile
  --> SemanticAnalyzerFactory.get(queryState, tree)
  --> SemanticAnalyzer.analyzeInternal
      --> SemanticAnalyzer.genResolvedParseTree
          --> SemanticAnalyzer.doPhase1
          --> SemanticAnalyzer.getMetaData
Strictly speaking this linear path is a simplification, but the important thing is that the overall flow is clear.