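Summary
I stumbled upon a very interesting simple parser that does not depend on lex/yacc. This post analyzes and records the part that is harder to understand: expression parsing with operator precedence.
(Following this post requires stepping through the code in the last section with a debugger, have fun!)
Understanding the expression parsing part
This code parses a+b+(c+d)*e*f+g; and handles operator precedence while doing so.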
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
  // If this is a binop, find its precedence.
  while (1) {
    int TokPrec = GetTokPrecedence();
    // If this is a binop that binds at least as tightly as the current binop,
    // consume it, otherwise we are done.
    if (TokPrec < ExprPrec)
      return LHS;
    // Okay, we know this is a binop.
    int BinOp = CurTok;
    getNextToken(); // eat binop
    // Parse the primary expression after the binary operator.
    ExprAST *RHS = ParsePrimary();
    if (!RHS) return 0;
    // If BinOp binds less tightly with RHS than the operator after RHS, let
    // the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec + 1, RHS);
      if (RHS == 0) return 0;
    }
    // Merge LHS/RHS.
    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }
}
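Parsing flow:
- Input: a+b+(c+d)*e*f+g;
- On entry to the function, ExprPrec is 0 and LHS is a.
- Round 1: parse +b
  - TokPrec < ExprPrec, i.e. 20 < 0, is false: the function does not return
  - TokPrec < NextPrec, i.e. 20 < 20, is false: no recursion
  - The operator + and RHS=b are merged into LHS=a, so LHS becomes a+b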
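- Round 2: parse +(c+d)
  - TokPrec < ExprPrec, i.e. 20 < 0, is false: the function does not return
  - TokPrec < NextPrec, i.e. 20 < 40, is true: recurse, with current RHS=(c+d) and current operator +
    - Recursive ParseBinOpRHS, round 1: LHS is set to the outer RHS=(c+d), that is, (c+d) becomes the left operand of the following *. Parse *e
      - On entry ExprPrec=21. Because of the added 1, a later + (precedence 20) makes the recursion return, while any operator that binds more tightly than + does not; a neat trick. TokPrec < ExprPrec, i.e. 40 < 21, is false: the recursion does not return
      - TokPrec < NextPrec, i.e. 40 < 40, is false: no nested recursion
      - The operator * and RHS=e are merged into LHS=(c+d), so LHS becomes (c+d)*e
    - Recursive ParseBinOpRHS, round 2: LHS is now (c+d)*e and the operator is *
      - TokPrec < ExprPrec, i.e. 40 < 21, is false: the recursion does not return
      - TokPrec < NextPrec, i.e. 40 < 20, is false: no nested recursion
      - The operator * and RHS=f are merged into LHS=(c+d)*e, so LHS becomes (c+d)*e*f
    - Recursive ParseBinOpRHS, round 3: LHS is now (c+d)*e*f and the pending operator is +
      - TokPrec < ExprPrec, i.e. 20 < 21, is true: the recursion returns! (very important)
      - It returns (c+d)*e*f
  - The outer level is still handling the second +, and the recursion has produced RHS=(c+d)*e*f
  - Merging the operator +, LHS=a+b and RHS=(c+d)*e*f gives a+b+(c+d)*e*f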
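- Round 3: parse +g
  - TokPrec < ExprPrec, i.e. 20 < 0, is false: the function does not return
  - TokPrec < NextPrec, i.e. 20 < -1 (the next token is ;, which is not a binary operator, so GetTokPrecedence returns -1), is false: no recursion
  - The operator + and RHS=g are merged into LHS=a+b+(c+d)*e*f, so LHS becomes a+b+(c+d)*e*f+g
  - On the next iteration TokPrec is -1, so TokPrec < ExprPrec holds and the function returns LHS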
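The rounds above can be replayed with a small stand-alone sketch. It is not the parser from this post: it works on a plain string, treats every single letter as a primary, hard-codes the precedences 20 for + and - and 40 for *, and the names TraceParser and traceMerges are made up for illustration. What it keeps is the loop of ParseBinOpRHS, and it records LHS after every merge so the order of the merges can be inspected.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration type; only the control flow mirrors ParseBinOpRHS.
struct TraceParser {
  std::string src;                  // input without whitespace
  std::size_t pos = 0;              // index of the current character
  std::vector<std::string> merges;  // LHS after every merge, in order

  explicit TraceParser(const std::string &s) : src(s) {}

  // Precedence of the pending operator, or -1 if it is not a binary operator.
  int prec() const {
    if (pos >= src.size()) return -1;
    switch (src[pos]) {
    case '+': case '-': return 20;
    case '*': return 40;
    default: return -1;
    }
  }

  // primary ::= letter | '(' expression ')'
  std::string primary() {
    assert(pos < src.size());
    if (src[pos] == '(') {
      ++pos;  // eat '('
      std::string inner = expression();
      assert(pos < src.size() && src[pos] == ')');
      ++pos;  // eat ')'
      return "(" + inner + ")";
    }
    assert(std::isalpha(static_cast<unsigned char>(src[pos])));
    return std::string(1, src[pos++]);
  }

  std::string binOpRHS(int exprPrec, std::string lhs) {
    while (true) {
      int tokPrec = prec();
      if (tokPrec < exprPrec) return lhs;  // exit condition
      char op = src[pos++];                // eat the operator
      std::string rhs = primary();
      if (tokPrec < prec())                // entry condition
        rhs = binOpRHS(tokPrec + 1, rhs);
      lhs = lhs + op + rhs;                // merge LHS/RHS
      merges.push_back(lhs);
    }
  }

  std::string expression() { return binOpRHS(0, primary()); }
};

// Parses `input` and returns every intermediate LHS in merge order.
std::vector<std::string> traceMerges(const std::string &input) {
  TraceParser p(input);
  p.expression();
  return p.merges;
}
```

Under these assumptions, traceMerges("a+b+(c+d)*e*f+g") records, in order: a+b, c+d (from the nested expression parse inside the parentheses), (c+d)*e, (c+d)*e*f, a+b+(c+d)*e*f and finally a+b+(c+d)*e*f+g.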
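Summary of the parsing flow:
At the outer level, parsing a+b+(c+d)*e*f+g; is split into three parts, and each loop iteration parses one group. A group is either [operator number] or [operator (expression)], i.e. {+b}, {+(c+d)}, {*e}, {*f}, {+g}. Parsing a group always means appending RHS onto LHS; what RHS actually is depends on whether a recursive parse is needed. For example, given +b+(c+d)*e*f, when the second plus sign is handled RHS cannot be just (c+d): the multiplications that follow must be parsed recursively first, so RHS should be (c+d)*e*f.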
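To make the notion of a group concrete, here is a minimal sketch that splits an expression string into its leading primary and the following [operator, primary] pairs. It is only an illustration: it assumes single-character operands, single-character operators, balanced parentheses and no whitespace, and the names skipPrimary and splitPairs are hypothetical. It does not look at precedence at all; deciding which pairs get folded into a recursive call is exactly what ParseBinOpRHS adds on top.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the index one past the primary that starts at
// position i (a single character or a balanced parenthesized group).
static std::size_t skipPrimary(const std::string &s, std::size_t i) {
  assert(i < s.size());
  if (s[i] != '(') return i + 1;
  int depth = 0;
  do {
    if (s[i] == '(') ++depth;
    else if (s[i] == ')') --depth;
    ++i;
  } while (depth > 0 && i < s.size());
  return i;
}

// Stores the leading primary in `first` and returns the [operator, primary]
// pairs that follow it, stopping at ';' or at the end of the string.
std::vector<std::string> splitPairs(const std::string &expr,
                                    std::string &first) {
  std::size_t i = skipPrimary(expr, 0);
  first = expr.substr(0, i);
  std::vector<std::string> pairs;
  while (i < expr.size() && expr[i] != ';') {
    std::size_t end = skipPrimary(expr, i + 1);  // expr[i] is the operator
    pairs.push_back(expr.substr(i, end - i));
    i = end;
  }
  return pairs;
}
```

For the input a+b+(c+d)*e*f+g; this yields the leading primary a and the five pairs +b, +(c+d), *e, *f and +g.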
The three steps:
- (the outer function parses a)
- parse +b
- recursively parse +(c+d)*e*f
- parse +g
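The whole parsing flow keeps appending RHS onto LHS and finally returns LHS.
The important part is the relative precedence of * and +. In the code above, entering the recursion means: parse, as one unit, every following sub-expression whose operator binds more tightly than the current operator, and stop at the first operator that binds no more tightly than the current one. That gives the recursion an entry condition and an exit condition:
- Entry condition: the next operator has higher precedence than the current one: if (TokPrec < NextPrec)
- Exit condition: the pending operator's precedence is lower than ExprPrec, which is the precedence of the operator that started the recursion plus 1: if (TokPrec < ExprPrec)
Suppose the current operator is + and a * follows: TokPrec=20 and NextPrec=40, so the recursion is entered.
Suppose that, inside this recursion, the current operator is * and a + follows: TokPrec=20 and ExprPrec=21, so the recursion exits. If another * follows instead, TokPrec=40 is not below ExprPrec=21 and the recursion does not exit. The code is clever, but not easy to follow.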
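Why the recursive call passes the precedence plus 1 is easiest to see by changing that increment. The sketch below is a hypothetical mini parser, not the code from this post: operands are single letters, there are no parentheses, precedences are hard-coded (20 for + and -, 40 for *), and the names GroupingParser, group and bump are made up. The parameter bump is the value added to TokPrec when recursing, and every merge is wrapped in parentheses so the resulting grouping is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mini parser; `bump` replaces the constant 1 in TokPrec + 1.
struct GroupingParser {
  std::string src;
  std::size_t pos = 0;
  int bump;

  GroupingParser(const std::string &s, int b) : src(s), bump(b) {}

  // Precedence of the pending operator, or -1 if there is none.
  int prec() const {
    if (pos >= src.size()) return -1;
    if (src[pos] == '+' || src[pos] == '-') return 20;
    if (src[pos] == '*') return 40;
    return -1;
  }

  std::string binOpRHS(int exprPrec, std::string lhs) {
    while (true) {
      int tokPrec = prec();
      if (tokPrec < exprPrec) return lhs;  // exit condition
      char op = src[pos++];                // eat the operator
      assert(pos < src.size());
      std::string rhs(1, src[pos++]);      // single-letter primary
      if (tokPrec < prec())                // entry condition
        rhs = binOpRHS(tokPrec + bump, rhs);
      lhs = "(" + lhs + op + rhs + ")";    // merge, with explicit grouping
    }
  }
};

// Returns the fully parenthesized form of `expr` for a given bump value.
std::string group(const std::string &expr, int bump) {
  assert(!expr.empty());
  GroupingParser p(expr, bump);
  std::string first(1, p.src[p.pos++]);
  return p.binOpRHS(0, first);
}
```

With these assumptions, group("a-b*c-d", 1) gives ((a-(b*c))-d): the recursion started by the first - hands control back as soon as it sees the second -. With group("a-b*c-d", 0) the recursion runs with ExprPrec=20, the test 20 < 20 fails, and the second - is swallowed inside the recursion, giving (a-((b*c)-d)), which is the wrong grouping for subtraction.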
The parser
The code compiles with either gcc or clang; the Makefile below is for clang.
main.cpp
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <map>
#include <vector>
/*
 * def foo(x y) x+foo(y, 4.0);
 *
 * def foo(x y) x+y y;
 *
 * def foo(x y) x+y );
 *
 * extern sin(a);
 *
 * def foo(x y) a+b+(c+d)*e*f+g;
 */
//===----------------------------------------------------------------------===//
// Lexer
//===----------------------------------------------------------------------===//
// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
// of these for known things.
enum Token {
  tok_eof = -1,
  // commands
  tok_def = -2, tok_extern = -3,
  // primary
  tok_identifier = -4, tok_number = -5
};
static std::string IdentifierStr; // Filled in if tok_identifier
static double NumVal; // Filled in if tok_number
/// gettok - Return the next token from standard input.
static int gettok() {
  static int LastChar = ' ';
  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = getchar();
  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;
    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }
  if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');
    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }
  if (LastChar == '#') {
    // Comment until end of line.
    do LastChar = getchar();
    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
    if (LastChar != EOF)
      return gettok();
  }
  // Check for end of file. Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;
  // Otherwise, just return the character as its ascii value.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}
//===----------------------------------------------------------------------===//
// Abstract Syntax Tree (aka Parse Tree)
//===----------------------------------------------------------------------===//
/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
  virtual ~ExprAST() {}
};
/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
  double Val;
public:
  NumberExprAST(double val) : Val(val) {}
};
/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &name) : Name(name) {}
};
/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS;
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {}
};
/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<ExprAST*> Args;
public:
  CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    : Callee(callee), Args(args) {}
};
/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    : Name(name), Args(args) {}
};
/// FunctionAST - This class represents a function definition itself.
class FunctionAST {
  PrototypeAST *Proto;
  ExprAST *Body;
public:
  FunctionAST(PrototypeAST *proto, ExprAST *body)
    : Proto(proto), Body(body) {}
};
//===----------------------------------------------------------------------===//
// Parser
//===----------------------------------------------------------------------===//
/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
/// token the parser is looking at. getNextToken reads another token from the
/// lexer and updates CurTok with its results.
static int CurTok;
static int getNextToken() {
  return CurTok = gettok();
}
/// BinopPrecedence - This holds the precedence for each binary operator that is
/// defined.
static std::map<char, int> BinopPrecedence;
/// GetTokPrecedence - Get the precedence of the pending binary operator token.
static int GetTokPrecedence() {
  if (!isascii(CurTok))
    return -1;
  // Make sure it's a declared binop.
  int TokPrec = BinopPrecedence[CurTok];
  if (TokPrec <= 0) return -1;
  return TokPrec;
}
/// Error* - These are little helper functions for error handling.
ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str); return 0; }
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
static ExprAST *ParseExpression();
/// identifierexpr
///   ::= identifier
///   ::= identifier '(' expression* ')'
static ExprAST *ParseIdentifierExpr() {
  std::string IdName = IdentifierStr;
  getNextToken(); // eat identifier.
  if (CurTok != '(') // Simple variable ref.
    return new VariableExprAST(IdName);
  // Call.
  getNextToken(); // eat (
  std::vector<ExprAST*> Args;
  if (CurTok != ')') {
    while (1) {
      ExprAST *Arg = ParseExpression();
      if (!Arg) return 0;
      Args.push_back(Arg);
      if (CurTok == ')') break;
      if (CurTok != ',')
        return Error("Expected ')' or ',' in argument list");
      getNextToken();
    }
  }
  // Eat the ')'.
  getNextToken();
  return new CallExprAST(IdName, Args);
}
/// numberexpr ::= number
static ExprAST *ParseNumberExpr() {
  ExprAST *Result = new NumberExprAST(NumVal);
  getNextToken(); // consume the number
  return Result;
}
/// parenexpr ::= '(' expression ')'
static ExprAST *ParseParenExpr() {
  getNextToken(); // eat (.
  ExprAST *V = ParseExpression();
  if (!V) return 0;
  if (CurTok != ')')
    return Error("expected ')'");
  getNextToken(); // eat ).
  return V;
}
/// primary
///   ::= identifierexpr
///   ::= numberexpr
///   ::= parenexpr
static ExprAST *ParsePrimary() {
  switch (CurTok) {
  default: return Error("unknown token when expecting an expression");
  case tok_identifier: return ParseIdentifierExpr();
  case tok_number: return ParseNumberExpr();
  case '(': return ParseParenExpr();
  }
}
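/// binoprhs
///   ::= ('+' primary)*
// ParseBinOpRHS parses the sequence of [operator, primary] pairs (RHS is short for "right-hand side"; correspondingly, LHS means "left-hand side").
// It takes an integer and a pointer: the integer is an operator precedence, and the pointer refers to the part of the expression parsed so far. Note that a lone "x" is also a valid expression,
// so binoprhs may be empty; in that case the function returns the expression that was passed in. In the example above, the expression passed to ParseBinOpRHS is "a" and the current token is "+".
// The precedence passed to ParseBinOpRHS is the lowest operator precedence the function is allowed to consume. If the next pair in the token stream is [+, x] and the precedence passed in is 40,
// the function returns immediately (because the precedence of "+" is only 20). With this in mind, here is the definition of ParseBinOpRHS:
// a+b+(c+d)*e*f+g
// a [+, b], [+, (c+d)], [*, e], [*, f] and [+, g]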
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
  // If this is a binop, find its precedence.
  while (1) {
    int TokPrec = GetTokPrecedence();
    // If this is a binop that binds at least as tightly as the current binop,
    // consume it, otherwise we are done.
    if (TokPrec < ExprPrec)
      return LHS;
    // Okay, we know this is a binop.
    int BinOp = CurTok;
    getNextToken(); // eat binop
    // Parse the primary expression after the binary operator.
    ExprAST *RHS = ParsePrimary();
    if (!RHS) return 0;
    // If BinOp binds less tightly with RHS than the operator after RHS, let
    // the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec + 1, RHS);
      if (RHS == 0) return 0;
    }
    // Merge LHS/RHS.
    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }
}
/// expression
///   ::= primary binoprhs
///
// def foo(x y) x+y y;
// Parsing of the x+y part starts here:
static ExprAST *ParseExpression() {
  ExprAST *LHS = ParsePrimary();
  if (!LHS) return 0;
  return ParseBinOpRHS(0, LHS);
}
/// prototype
///   ::= id '(' id* ')'
static PrototypeAST *ParsePrototype() {
  if (CurTok != tok_identifier)
    return ErrorP("Expected function name in prototype");
  std::string FnName = IdentifierStr;
  getNextToken();
  if (CurTok != '(')
    return ErrorP("Expected '(' in prototype");
  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return ErrorP("Expected ')' in prototype");
  // success.
  getNextToken(); // eat ')'.
  return new PrototypeAST(FnName, ArgNames);
}
/// definition ::= 'def' prototype expression
static FunctionAST *ParseDefinition() {
  getNextToken(); // eat def.
  PrototypeAST *Proto = ParsePrototype();
  if (Proto == 0) return 0;
  if (ExprAST *E = ParseExpression())
    return new FunctionAST(Proto, E);
  return 0;
}
/// toplevelexpr ::= expression
static FunctionAST *ParseTopLevelExpr() {
  if (ExprAST *E = ParseExpression()) {
    // Make an anonymous proto.
    PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    return new FunctionAST(Proto, E);
  }
  return 0;
}
/// external ::= 'extern' prototype
static PrototypeAST *ParseExtern() {
  getNextToken(); // eat extern.
  return ParsePrototype();
}
//===----------------------------------------------------------------------===//
// Top-Level parsing
//===----------------------------------------------------------------------===//
static void HandleDefinition() {
  if (ParseDefinition()) {
    fprintf(stderr, "Parsed a function definition.\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}
static void HandleExtern() {
  if (ParseExtern()) {
    fprintf(stderr, "Parsed an extern\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}
static void HandleTopLevelExpression() {
  // Evaluate a top-level expression into an anonymous function.
  if (ParseTopLevelExpr()) {
    fprintf(stderr, "Parsed a top-level expr\n");
  } else {
    // Skip token for error recovery.
    getNextToken();
  }
}
/// top ::= definition | external | expression | ';'
static void MainLoop() {
  while (1) {
    fprintf(stderr, "ready> ");
    switch (CurTok) {
    case tok_eof: return;
    case ';': getNextToken(); break; // ignore top-level semicolons.
    case tok_def: HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default: HandleTopLevelExpression(); break;
    }
  }
}
//===----------------------------------------------------------------------===//
// Main driver code.
//===----------------------------------------------------------------------===//
int main() {
  // Install standard binary operators.
  // 1 is lowest precedence.
  BinopPrecedence['<'] = 10;
  BinopPrecedence['+'] = 20;
  BinopPrecedence['-'] = 20;
  BinopPrecedence['*'] = 40; // highest.
  // Prime the first token.
  fprintf(stderr, "ready> ");
  getNextToken();
  // Run the main "interpreter loop" now.
  MainLoop();
  return 0;
}
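One detail of the listing that matters for the walkthrough: a token such as ; has no entry in BinopPrecedence, so GetTokPrecedence reports -1 for it. The listing gets there through std::map::operator[], which inserts a zero-valued entry for every key it has not seen before. The sketch below shows a lookup with find() that gives the same answers without growing the table. The function name lookupPrecedence and the explicit range check used in place of isascii are assumptions made for this illustration, not part of the original code.

```cpp
#include <cassert>
#include <map>

// Same operator table as the one installed in main().
static std::map<char, int> BinopPrecedence = {
    {'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

// Returns the precedence of `tok`, or -1 if it is not a declared binary
// operator. Unlike operator[], find() never adds a new key to the map.
int lookupPrecedence(int tok) {
  if (tok < 0 || tok > 127) return -1;  // not an ASCII character
  auto it = BinopPrecedence.find(static_cast<char>(tok));
  if (it == BinopPrecedence.end() || it->second <= 0) return -1;
  return it->second;
}
```

Here lookupPrecedence('*') returns 40, lookupPrecedence(';') returns -1, negative token values such as tok_identifier also yield -1, and the table still holds exactly four entries afterwards.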
Makefile
CC = llvm-g++ -stdlib=libc++ -std=c++14
CFLAGS = -g -O0 -I llvm/include -I llvm/build/include -I ./
LLVMFLAGS = `llvm-config --cxxflags --ldflags --system-libs --libs all`
.PHONY: main
main: main.cpp
	${CC} ${CFLAGS} ${LLVMFLAGS} $< -o $@
clean:
	rm -r main main.o
%.o: %.cpp
	${CC} ${CFLAGS} ${LLVMFLAGS} -c $< -o $@