skaiuijing

前言

当年在github上有一个非常火的项目,名字叫做c4,有上万star。它是一个c语言的编译器,只用了五百行代码,就实现了编译器自举,可以说是非常厉害。

C4编译器实际上是一个非常小型的编译器,是一个学习编译原理的好素材,它实现了一个简单的 C 语言子集,包含前端和后端,最让人惊艳的就是它能够自举。

其实c4或许称之为解释器更合适。不过无所谓,打个比方,如果说编译器是做一大桌菜然后开吃,那么解释器就是边做菜边吃,这么一看,反正都是吃!(><)

好吧,开个玩笑而已。

还是让笔者简单介绍编译器与解释器的概念,然后再介绍一下c4的整体设计吧。

编译器与解释器

编译器是通过翻译的方式阅读程序并且生成对应平台上的程序,对应的平台根据生成的程序执行对应的操作。

解释器是通过解析源程序的每一个输入并转换为对应平台的指令来执行对应的操作。

不过两种方法并没有高低之分,编译器生成的程序的效率更高,解释器生成的程序更容易排查错误。

像java语言,就是结合了两者的处理。一个java程序先被编译为字节码,再由虚拟机逐个执行。

编译执行流程

词法分析:读入组成源程序的的字符流,例如temp = b +c,我们会将temp映射成词法单词<id,1>,b和c同样这么干,变换后如下:

<id,1> <=><id,2> <+> <id,3>

语法分析:通过创建树形结构(语法树)表示语句,使用递归下降等方式进行解析

语义分析:符号表、类型检查都发生在这一步,会附加对应的语义动作。

中间代码生成:构造中间形式,例如三地址代码

以上就是前端内容了。

机器无关优化:主要通过数据流分析消除冗余。

寄存器分配:将变量分配到机器的物理寄存器中,以减少对内存的访问

机器相关优化:这是考虑具体架构做出的优化,像《Computer Architecture》中就介绍了很多这种技术,笔者印象比较深刻的就是寄存器重命名和机器周期优化了,主要是流水线化算法

代码生成:最终生成目标机器代码或汇编代码,并写入目标文件

编译器的设计思想从整体来看,就是函数式的思想,笔者猜测可能是与编译理论直接源于可计算理论有关,同时,编译器领域是最接近计算机原理的领域,像正则表达式和自动机这些东西,很难想象它们的发源居然来自神经网络理论。

以上就是编译执行流程了,现在让我们来看看C4的代码。

C4源码

MAIN

C4只用了四个函数: next( ), expr( ), stmt( ), main( )

next( ): 词法分析器,解析标识符(由字母、数字和下划线组成),解析数字等并转化为标记。

expr( ): 语法分析器中解析表达式的部分,根据不同的文法,解析并生成表达式的中间代码。主要是运算符的解析为主,例如解引用,+,-,/,*,在这一步,被解析的程序将会被构造为中间形式,C4的中间形式是虚拟机可识别的指令。

stmt( ):解析和执行不同类型的语句,例如if、while等语句。

main( ): C4的编码风格比较简陋,排版不是非常美观,main函数不仅完成分配符号表、文本区、数据区和栈区的内存,虚拟机也被放在了main函数中,语法分析的一部分也被放在了main函数,对每个声明,确定基本类型、处理指针、检测重复定义,并添加到符号表,最后运行虚拟机。

框架如下:

1
2
3
graph LR
Aa(词法分析器next)-->Ab(表达式解析expr)-->Ac(语句解析stmt)-->Ad(中间代码)-->Ae(虚拟机执行)
Ba(main函数)-->Bs(开辟内存空间段)-->Bb(解析声明)-->Bc(词法分析)-->Bd(语法分析)-->Be(虚拟机执行)

词法分析

各个词法单元:

1
2
3
4
5
6
// tokens and classes (operators last and in precedence order)
enum {
Num = 128, Fun, Sys, Glo, Loc, Id,
Char, Else, Enum, If, Int, Return, Sizeof, While,
Assign, Cond, Lor, Lan, Or, Xor, And, Eq, Ne, Lt, Gt, Le, Ge, Shl, Shr, Add, Sub, Mul, Div, Mod, Inc, Dec, Brak
};

词法分析器的返回是tk,它是一个全局变量,那么也很容易想象语法分析器的构造过程了,先判断tk的值,获取类型后生成对应中间形式。

词法分析器入口如下,第一个解析的标识符是”\n”,while(le < e)是判断文本指针位置打印,之后是解析’#’,当然,由于C4不支持宏定义,因此也跳过,当获取词素与支持的类型没有匹配时,词法分析器会跳过,例如空格符:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
void next()
{
char *pp;

while (tk = *p) {
++p;
if (tk == '\n') {
if (src) {
printf("%d: %.*s", line, p - lp, lp);
lp = p;
while (le < e) {
printf("%8.4s", &"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
"OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
"OPEN,READ,CLOS,PRTF,MALC,FREE,MSET,MCMP,EXIT,"[*++le * 5]);
if (*le <= ADJ) printf(" %d\n", *++le); else printf("\n");
}
}
++line;
}
else if (tk == '#') {
while (*p != 0 && *p != '\n') ++p;
}

当词法分析器运行到这里,意味着能够与标识符匹配的词素出现了,于是计算并保存哈希值,然后把数组地址往后偏移一个单独符号表的大小,这样就可以继续存储新的词素:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
else if ((tk >= 'a' && tk <= 'z') || (tk >= 'A' && tk <= 'Z') || tk == '_') {
pp = p - 1;
while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') || (*p >= '0' && *p <= '9') || *p == '_')
tk = tk * 147 + *p++;
tk = (tk << 6) + (p - pp);
id = sym;
while (id[Tk]) {
if (tk == id[Hash] && !memcmp((char *)id[Name], pp, p - pp)) { tk = id[Tk]; return; }
id = id + Idsz;
}
id[Name] = (int)pp;
id[Hash] = tk;
tk = id[Tk] = Id;
return;
}

对这些数字、字符串和注释的解析如下,(tk & 15)这种写法,其实就是%16,也就是解析十六进制,这种写法很常见:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
else if (tk >= '0' && tk <= '9') {
if (ival = tk - '0') { while (*p >= '0' && *p <= '9') ival = ival * 10 + *p++ - '0'; }
else if (*p == 'x' || *p == 'X') {
while ((tk = *++p) && ((tk >= '0' && tk <= '9') || (tk >= 'a' && tk <= 'f') || (tk >= 'A' && tk <= 'F')))
ival = ival * 16 + (tk & 15) + (tk >= 'A' ? 9 : 0);
}
else { while (*p >= '0' && *p <= '7') ival = ival * 8 + *p++ - '0'; }
tk = Num;
return;
}
else if (tk == '/') {
if (*p == '/') {
++p;
while (*p != 0 && *p != '\n') ++p;
}
else {
tk = Div;
return;
}
}
else if (tk == '\'' || tk == '"') {
pp = data;
while (*p != 0 && *p != tk) {
if ((ival = *p++) == '\\') {
if ((ival = *p++) == 'n') ival = '\n';
}
if (tk == '"') *data++ = ival;
}
++p;
if (tk == '"') ival = (int)pp; else tk = Num;
return;
}

解析单目运算符和标点符号·:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
else if (tk == '=') { if (*p == '=') { ++p; tk = Eq; } else tk = Assign; return; }
else if (tk == '+') { if (*p == '+') { ++p; tk = Inc; } else tk = Add; return; }
else if (tk == '-') { if (*p == '-') { ++p; tk = Dec; } else tk = Sub; return; }
else if (tk == '!') { if (*p == '=') { ++p; tk = Ne; } return; }
else if (tk == '<') { if (*p == '=') { ++p; tk = Le; } else if (*p == '<') { ++p; tk = Shl; } else tk = Lt; return; }
else if (tk == '>') { if (*p == '=') { ++p; tk = Ge; } else if (*p == '>') { ++p; tk = Shr; } else tk = Gt; return; }
else if (tk == '|') { if (*p == '|') { ++p; tk = Lor; } else tk = Or; return; }
else if (tk == '&') { if (*p == '&') { ++p; tk = Lan; } else tk = And; return; }
else if (tk == '^') { tk = Xor; return; }
else if (tk == '%') { tk = Mod; return; }
else if (tk == '*') { tk = Mul; return; }
else if (tk == '[') { tk = Brak; return; }
else if (tk == '?') { tk = Cond; return; }
else if (tk == '~' || tk == ';' || tk == '{' || tk == '}' || tk == '(' || tk == ')' || tk == ']' || tk == ',' || tk == ':') return;
}

语法分析

C4的语法分析分为声明(declaration)、表达式(expr)和语句(stmt)的解析。

常见的解析手段有通用、自顶向下和自底向上三种,C4采用的是自顶向下解析。

语法分析在C4中的框架如下:

1
2
3
4
graph LR
Aa(源程序)--输入-->Ab(词法分析)--词素-->Ac(语法分析)--中间形式-->Af(虚拟机)-->At(执行)
Ac--获取下一个词素-->Ab

还有一个重要的部分,那就是符号表:

1
2
3
4
graph LR
Aa(词法分析)-->Ab(语法分析)
Aa-->Ba(符号表)
Ab---->Ba(符号表)

解析声明

词法分析已经帮助我们捕捉到了词素的类型,现在该语法分析登场了,首先是解析全局声明,这些类型有:int、char、enum、函数

C4采用自顶向下解析,但是本身并不支持递归,所以不能使用递归下降进行解析。

自顶向下解析:

假设我们有文法如下:

1
2
3
E->TE'
E'->+ TE' | 空
T->FT'

自顶向下解析如下:

1
2
3
graph TD

E

其实就是树的遍历展开:

1
2
3
4
5
6
7
8
9
10
11
12
13
graph TD
A((A))
B((B))
C((C))
D((D))
E((E))


A(E)-->B(T)
A(E)-->C(E')
B-->D(F)
B-->E(T')

推完左边再推右边就行了,就不多展示了。

文法

int和char的文法不需要过多讲述,重点讲一讲enum和函数的文法:

enum文法:

1
2
3
4
5
<EnumDecl> ::= "enum" <EnumName> "{" <EnumList> "}"
<EnumName> ::= "IDENTIFIER"
<EnumList> ::= <EnumItem> | <EnumItem> "," <EnumList>
<EnumItem> ::= "IDENTIFIER"

把C4中的文法写成一行,这样看起来就简洁多了:

1
2
enum的文法:enum_decl ::= 'enum' [identifier] '{' identifier ['=' number] {',' identifier ['=' number] '}'

函数文法:

1
2
3
4
5
6
<FuncDecl> ::= <ReturnType> <FuncName> "(" <ParamList> ")" "{" <FuncBody> "}"
<ReturnType> ::= "void" | "int" | "char" | "IDENTIFIER"
<FuncName> ::= "IDENTIFIER"
<ParamList> ::= <Param> | <Param> "," <ParamList> | ""
<FuncBody> ::= <StatementList>

以下就是C4中对应的文法解析:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
// parse declarations
line = 1;
next();
while (tk) {
bt = INT; // basetype
if (tk == Int) next();
else if (tk == Char) { next(); bt = CHAR; }
else if (tk == Enum) {//enum的解析开始
next();
if (tk != '{') next();
if (tk == '{') {
next();
i = 0;
while (tk != '}') {
if (tk != Id) { printf("%d: bad enum identifier %d\n", line, tk); return -1; }
next();
if (tk == Assign) {
next();
if (tk != Num) { printf("%d: bad enum initializer\n", line); return -1; }
i = ival;
next();
}
id[Class] = Num; id[Type] = INT; id[Val] = i++;
if (tk == ',') next();
}
next();
}
}
while (tk != ';' && tk != '}') {
ty = bt;
while (tk == Mul) { next(); ty = ty + PTR; }
if (tk != Id) { printf("%d: bad global declaration\n", line); return -1; }
if (id[Class]) { printf("%d: duplicate global definition\n", line); return -1; }
next();
id[Type] = ty;
if (tk == '(') { // 函数解析从这里开始
id[Class] = Fun;
id[Val] = (int)(e + 1);
next(); i = 0;
while (tk != ')') {
ty = INT;
if (tk == Int) next();
else if (tk == Char) { next(); ty = CHAR; }
while (tk == Mul) { next(); ty = ty + PTR; }
if (tk != Id) { printf("%d: bad parameter declaration\n", line); return -1; }
if (id[Class] == Loc) { printf("%d: duplicate parameter definition\n", line); return -1; }
id[HClass] = id[Class]; id[Class] = Loc;
id[HType] = id[Type]; id[Type] = ty;
id[HVal] = id[Val]; id[Val] = i++;
next();
if (tk == ',') next();
}
next();
if (tk != '{') { printf("%d: bad function definition\n", line); return -1; }
loc = ++i;
next();
while (tk == Int || tk == Char) {
bt = (tk == Int) ? INT : CHAR;
next();
while (tk != ';') {
ty = bt;
while (tk == Mul) { next(); ty = ty + PTR; }
if (tk != Id) { printf("%d: bad local declaration\n", line); return -1; }
if (id[Class] == Loc) { printf("%d: duplicate local definition\n", line); return -1; }
id[HClass] = id[Class]; id[Class] = Loc;
id[HType] = id[Type]; id[Type] = ty;
id[HVal] = id[Val]; id[Val] = ++i;
next();
if (tk == ',') next();
}
next();
}
*++e = ENT; *++e = i - loc;
while (tk != '}') stmt();
*++e = LEV;
id = sym; // unwind symbol table locals
while (id[Tk]) {
if (id[Class] == Loc) {
id[Class] = id[HClass];
id[Type] = id[HType];
id[Val] = id[HVal];
}
id = id + Idsz;
}
}
else {
id[Class] = Glo;
id[Val] = (int)data;
data = data + sizeof(int);
}
if (tk == ',') next();
}
next();
}

声明解析完,接下来就是解析表达式(expr)和语句(stmt),顺便一提,C4的声明解析是放在main函数中进行的,也就是说,C4的声明必须在文法解析之前,像在函数语句后面定义变量的行为,C4是不支持的。

解析表达式

我们从词法分析那里得到了词素的类型,当判断它是表达式后,我们就可以开始解析了,把它转换为虚拟机可识别的指令流,表达式的解析就是一元运算符、二元运算符和三元与算法的解析,例如加减乘除、与或非这些,这些东西的文法就不详细将了,详细地把每种运算符的解析文法都表示出来有点难受,毕竟笔者的分析只是以框架为主。

从框架中我们可以得知,语法分析就是在往内存模型中输出,之后就是虚拟机了,所以解析结果重点是虚拟机指令。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150

void expr(int lev)
{
int t, *d;

if (!tk) { printf("%d: unexpected eof in expression\n", line); exit(-1); }
else if (tk == Num) { *++e = IMM; *++e = ival; next(); ty = INT; }
else if (tk == '"') {
*++e = IMM; *++e = ival; next();
while (tk == '"') next();
data = (char *)((int)data + sizeof(int) & -sizeof(int)); ty = PTR;
}
else if (tk == Sizeof) {
next(); if (tk == '(') next(); else { printf("%d: open paren expected in sizeof\n", line); exit(-1); }
ty = INT; if (tk == Int) next(); else if (tk == Char) { next(); ty = CHAR; }
while (tk == Mul) { next(); ty = ty + PTR; }
if (tk == ')') next(); else { printf("%d: close paren expected in sizeof\n", line); exit(-1); }
*++e = IMM; *++e = (ty == CHAR) ? sizeof(char) : sizeof(int);
ty = INT;
}
else if (tk == Id) {
d = id; next();
if (tk == '(') {
next();
t = 0;
while (tk != ')') { expr(Assign); *++e = PSH; ++t; if (tk == ',') next(); }
next();
if (d[Class] == Sys) *++e = d[Val];
else if (d[Class] == Fun) { *++e = JSR; *++e = d[Val]; }
else { printf("%d: bad function call\n", line); exit(-1); }
if (t) { *++e = ADJ; *++e = t; }
ty = d[Type];
}
else if (d[Class] == Num) { *++e = IMM; *++e = d[Val]; ty = INT; }
else {
if (d[Class] == Loc) { *++e = LEA; *++e = loc - d[Val]; }
else if (d[Class] == Glo) { *++e = IMM; *++e = d[Val]; }
else { printf("%d: undefined variable\n", line); exit(-1); }
*++e = ((ty = d[Type]) == CHAR) ? LC : LI;
}
}
else if (tk == '(') {
next();
if (tk == Int || tk == Char) {
t = (tk == Int) ? INT : CHAR; next();
while (tk == Mul) { next(); t = t + PTR; }
if (tk == ')') next(); else { printf("%d: bad cast\n", line); exit(-1); }
expr(Inc);
ty = t;
}
else {
expr(Assign);
if (tk == ')') next(); else { printf("%d: close paren expected\n", line); exit(-1); }
}
}
else if (tk == Mul) {
next(); expr(Inc);
if (ty > INT) ty = ty - PTR; else { printf("%d: bad dereference\n", line); exit(-1); }
*++e = (ty == CHAR) ? LC : LI;
}
else if (tk == And) {
next(); expr(Inc);
if (*e == LC || *e == LI) --e; else { printf("%d: bad address-of\n", line); exit(-1); }
ty = ty + PTR;
}
else if (tk == '!') { next(); expr(Inc); *++e = PSH; *++e = IMM; *++e = 0; *++e = EQ; ty = INT; }
else if (tk == '~') { next(); expr(Inc); *++e = PSH; *++e = IMM; *++e = -1; *++e = XOR; ty = INT; }
else if (tk == Add) { next(); expr(Inc); ty = INT; }
else if (tk == Sub) {
next(); *++e = IMM;
if (tk == Num) { *++e = -ival; next(); } else { *++e = -1; *++e = PSH; expr(Inc); *++e = MUL; }
ty = INT;
}
else if (tk == Inc || tk == Dec) {
t = tk; next(); expr(Inc);
if (*e == LC) { *e = PSH; *++e = LC; }
else if (*e == LI) { *e = PSH; *++e = LI; }
else { printf("%d: bad lvalue in pre-increment\n", line); exit(-1); }
*++e = PSH;
*++e = IMM; *++e = (ty > PTR) ? sizeof(int) : sizeof(char);
*++e = (t == Inc) ? ADD : SUB;
*++e = (ty == CHAR) ? SC : SI;
}
else { printf("%d: bad expression\n", line); exit(-1); }

while (tk >= lev) { // "precedence climbing" or "Top Down Operator Precedence" method
t = ty;
if (tk == Assign) {
next();
if (*e == LC || *e == LI) *e = PSH; else { printf("%d: bad lvalue in assignment\n", line); exit(-1); }
expr(Assign); *++e = ((ty = t) == CHAR) ? SC : SI;
}
else if (tk == Cond) {
next();
*++e = BZ; d = ++e;
expr(Assign);
if (tk == ':') next(); else { printf("%d: conditional missing colon\n", line); exit(-1); }
*d = (int)(e + 3); *++e = JMP; d = ++e;
expr(Cond);
*d = (int)(e + 1);
}
else if (tk == Lor) { next(); *++e = BNZ; d = ++e; expr(Lan); *d = (int)(e + 1); ty = INT; }
else if (tk == Lan) { next(); *++e = BZ; d = ++e; expr(Or); *d = (int)(e + 1); ty = INT; }
else if (tk == Or) { next(); *++e = PSH; expr(Xor); *++e = OR; ty = INT; }
else if (tk == Xor) { next(); *++e = PSH; expr(And); *++e = XOR; ty = INT; }
else if (tk == And) { next(); *++e = PSH; expr(Eq); *++e = AND; ty = INT; }
else if (tk == Eq) { next(); *++e = PSH; expr(Lt); *++e = EQ; ty = INT; }
else if (tk == Ne) { next(); *++e = PSH; expr(Lt); *++e = NE; ty = INT; }
else if (tk == Lt) { next(); *++e = PSH; expr(Shl); *++e = LT; ty = INT; }
else if (tk == Gt) { next(); *++e = PSH; expr(Shl); *++e = GT; ty = INT; }
else if (tk == Le) { next(); *++e = PSH; expr(Shl); *++e = LE; ty = INT; }
else if (tk == Ge) { next(); *++e = PSH; expr(Shl); *++e = GE; ty = INT; }
else if (tk == Shl) { next(); *++e = PSH; expr(Add); *++e = SHL; ty = INT; }
else if (tk == Shr) { next(); *++e = PSH; expr(Add); *++e = SHR; ty = INT; }
else if (tk == Add) {
next(); *++e = PSH; expr(Mul);
if ((ty = t) > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL; }
*++e = ADD;
}
else if (tk == Sub) {
next(); *++e = PSH; expr(Mul);
if (t > PTR && t == ty) { *++e = SUB; *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = DIV; ty = INT; }
else if ((ty = t) > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL; *++e = SUB; }
else *++e = SUB;
}
else if (tk == Mul) { next(); *++e = PSH; expr(Inc); *++e = MUL; ty = INT; }
else if (tk == Div) { next(); *++e = PSH; expr(Inc); *++e = DIV; ty = INT; }
else if (tk == Mod) { next(); *++e = PSH; expr(Inc); *++e = MOD; ty = INT; }
else if (tk == Inc || tk == Dec) {
if (*e == LC) { *e = PSH; *++e = LC; }
else if (*e == LI) { *e = PSH; *++e = LI; }
else { printf("%d: bad lvalue in post-increment\n", line); exit(-1); }
*++e = PSH; *++e = IMM; *++e = (ty > PTR) ? sizeof(int) : sizeof(char);
*++e = (tk == Inc) ? ADD : SUB;
*++e = (ty == CHAR) ? SC : SI;
*++e = PSH; *++e = IMM; *++e = (ty > PTR) ? sizeof(int) : sizeof(char);
*++e = (tk == Inc) ? SUB : ADD;
next();
}
else if (tk == Brak) {
next(); *++e = PSH; expr(Assign);
if (tk == ']') next(); else { printf("%d: close bracket expected\n", line); exit(-1); }
if (t > PTR) { *++e = PSH; *++e = IMM; *++e = sizeof(int); *++e = MUL; }
else if (t < PTR) { printf("%d: pointer type expected\n", line); exit(-1); }
*++e = ADD;
*++e = ((ty = t - PTR) == CHAR) ? LC : LI;
}
else { printf("%d: compiler error tk=%d\n", line, tk); exit(-1); }
}
}

解析语句

C4解析的语句,包括 if 语句、while 语句、return 语句、块语句和表达式语句。

解析文法如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
<stmt> ::= 'if' '(' <expr> ')' <stmt> ['else' <stmt>]
| 'while' '(' <expr> ')' <stmt>
| 'return' [<expr>] ';'
| '{' {<stmt>} '}'
| ';'
| <expr> ';'

<expr> ::= <identifier> '=' <expr>
| <identifier> '+' <expr>
| <identifier> '-' <expr>
| <identifier> '*' <expr>
| <identifier> '/' <expr>
| <identifier>
| <number>

<identifier> ::= [A-Za-z_][A-Za-z0-9_]*
<number> ::= [0-9]+


对应程序如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

void stmt()
{
int *a, *b;

if (tk == If) {
next();
if (tk == '(') next(); else { printf("%d: open paren expected\n", line); exit(-1); }
expr(Assign);
if (tk == ')') next(); else { printf("%d: close paren expected\n", line); exit(-1); }
*++e = BZ; b = ++e;
stmt();
if (tk == Else) {
*b = (int)(e + 3); *++e = JMP; b = ++e;
next();
stmt();
}
*b = (int)(e + 1);
}
else if (tk == While) {
next();
a = e + 1;
if (tk == '(') next(); else { printf("%d: open paren expected\n", line); exit(-1); }
expr(Assign);
if (tk == ')') next(); else { printf("%d: close paren expected\n", line); exit(-1); }
*++e = BZ; b = ++e;
stmt();
*++e = JMP; *++e = (int)a;
*b = (int)(e + 1);
}
else if (tk == Return) {
next();
if (tk != ';') expr(Assign);
*++e = LEV;
if (tk == ';') next(); else { printf("%d: semicolon expected\n", line); exit(-1); }
}
else if (tk == '{') {
next();
while (tk != '}') stmt();
next();
}
else if (tk == ';') {
next();
}
else {
expr(Assign);
if (tk == ';') next(); else { printf("%d: semicolon expected\n", line); exit(-1); }
}
}

虚拟机

虚拟机是C4运行的核心, “程序的80%的运行时间花在20%的代码上”,理解了虚拟机,就理解了C4的框架。

现在让我们先看看虚拟机的实现,从虚拟机中我们可以看见C4工作原理,不过再此之前笔者要讲一讲内存模型。

1
2
graph LR
Aa(词法分析)-->Ab(语法分析)-->Ac(输出到内存模型)-->Ad(虚拟机根据内存信息执行)-->Ae(输出结果)

内存模型

c常见的elf文件模型如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
+-----------------------+
| 栈段 (Stack) |
| (局部变量、函数调用帧) |
+-----------------------+
| 堆段 (Heap) |
| (动态分配的内存) |
+-----------------------+
| 未初始化数据段 (BSS) |
| (未初始化的全局和静态变量)|
+-----------------------+
| 已初始化数据段 (Data) |
| (已初始化的全局和静态变量)|
+-----------------------+
| 只读数据段 |
| (常量字符串、只读变量)|
+-----------------------+
| 代码段 (Text) |
| (程序指令、常量等) |
+-----------------------+

这是编译器编译后的文件,可以看出不同的内存段存储不同信息。

同理,C4也实现了类似的内存模型,让我们看看main函数:

C4的内存模型并不是严格的elf文件格式,只有text段、数据段、栈。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
int main(int argc, char **argv)
{
int fd, bt, ty, poolsz, *idmain;
int *pc, *sp, *bp, a, cycle; // vm registers
int i, *t; // temps

--argc; ++argv;
if (argc > 0 && **argv == '-' && (*argv)[1] == 's') { src = 1; --argc; ++argv; }
if (argc > 0 && **argv == '-' && (*argv)[1] == 'd') { debug = 1; --argc; ++argv; }
if (argc < 1) { printf("usage: c4 [-s] [-d] file ...\n"); return -1; }

if ((fd = open(*argv, 0)) < 0) { printf("could not open(%s)\n", *argv); return -1; }

poolsz = 256*1024; // arbitrary size
if (!(sym = malloc(poolsz))) { printf("could not malloc(%d) symbol area\n", poolsz); return -1; }
if (!(le = e = malloc(poolsz))) { printf("could not malloc(%d) text area\n", poolsz); return -1; }
if (!(data = malloc(poolsz))) { printf("could not malloc(%d) data area\n", poolsz); return -1; }
if (!(sp = malloc(poolsz))) { printf("could not malloc(%d) stack area\n", poolsz); return -1; }

memset(sym, 0, poolsz);
memset(e, 0, poolsz);
memset(data, 0, poolsz);

p = "char else enum if int return sizeof while "
"open read close printf malloc free memset memcmp exit void main";
i = Char; while (i <= While) { next(); id[Tk] = i++; } // add keywords to symbol table
i = OPEN; while (i <= EXIT) { next(); id[Class] = Sys; id[Type] = INT; id[Val] = i++; } // add library to symbol table
next(); id[Tk] = Char; // handle void type
next(); idmain = id; // keep track of main

if (!(lp = p = malloc(poolsz))) { printf("could not malloc(%d) source area\n", poolsz); return -1; }
if ((i = read(fd, p, poolsz-1)) <= 0) { printf("read() returned %d\n", i); return -1; }
p[i] = 0;
close(fd);

虚拟机的执行

虚拟机会从text段开始,一个个指令读,然后实现对应的操作,

C4的虚拟机实现具体如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
// setup stack
bp = sp = (int *)((int)sp + poolsz);
*--sp = EXIT; // call exit if main returns
*--sp = PSH; t = sp;
*--sp = argc;
*--sp = (int)argv;
*--sp = (int)t;

// run...
cycle = 0;
while (1) {
i = *pc++; ++cycle;
if (debug) {
printf("%d> %.4s", cycle,
&"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
"OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
"OPEN,READ,CLOS,PRTF,MALC,FREE,MSET,MCMP,EXIT,"[i * 5]);
if (i <= ADJ) printf(" %d\n", *pc); else printf("\n");
}
if (i == LEA) a = (int)(bp + *pc++); // load local address
else if (i == IMM) a = *pc++; // load global address or immediate
else if (i == JMP) pc = (int *)*pc; // jump
else if (i == JSR) { *--sp = (int)(pc + 1); pc = (int *)*pc; } // jump to subroutine
else if (i == BZ) pc = a ? pc + 1 : (int *)*pc; // branch if zero
else if (i == BNZ) pc = a ? (int *)*pc : pc + 1; // branch if not zero
else if (i == ENT) { *--sp = (int)bp; bp = sp; sp = sp - *pc++; } // enter subroutine
else if (i == ADJ) sp = sp + *pc++; // stack adjust
else if (i == LEV) { sp = bp; bp = (int *)*sp++; pc = (int *)*sp++; } // leave subroutine
else if (i == LI) a = *(int *)a; // load int
else if (i == LC) a = *(char *)a; // load char
else if (i == SI) *(int *)*sp++ = a; // store int
else if (i == SC) a = *(char *)*sp++ = a; // store char
else if (i == PSH) *--sp = a; // push

else if (i == OR) a = *sp++ | a;
else if (i == XOR) a = *sp++ ^ a;
else if (i == AND) a = *sp++ & a;
else if (i == EQ) a = *sp++ == a;
else if (i == NE) a = *sp++ != a;
else if (i == LT) a = *sp++ < a;
else if (i == GT) a = *sp++ > a;
else if (i == LE) a = *sp++ <= a;
else if (i == GE) a = *sp++ >= a;
else if (i == SHL) a = *sp++ << a;
else if (i == SHR) a = *sp++ >> a;
else if (i == ADD) a = *sp++ + a;
else if (i == SUB) a = *sp++ - a;
else if (i == MUL) a = *sp++ * a;
else if (i == DIV) a = *sp++ / a;
else if (i == MOD) a = *sp++ % a;

else if (i == OPEN) a = open((char *)sp[1], *sp);
else if (i == READ) a = read(sp[2], (char *)sp[1], *sp);
else if (i == CLOS) a = close(*sp);
else if (i == PRTF) { t = sp + pc[1]; a = printf((char *)t[-1], t[-2], t[-3], t[-4], t[-5], t[-6]); }
else if (i == MALC) a = (int)malloc(*sp);
else if (i == FREE) free((void *)*sp);
else if (i == MSET) a = (int)memset((char *)sp[2], sp[1], *sp);
else if (i == MCMP) a = memcmp((char *)sp[2], (char *)sp[1], *sp);
else if (i == EXIT) { printf("exit(%d) cycle = %d\n", *sp, cycle); return *sp; }
else { printf("unknown instruction = %d! cycle = %d\n", i, cycle); return -1; }
}

举个例子,我们要计算1+1,要怎么写指令呢?虚拟机会怎么执行呢?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


int main() {
int text[1000];
int i = 0;

text[i++] = IMM; text[i++] = 1;
text[i++] = PSH;
text[i++] = IMM; text[i++] = 1;
text[i++] = PSH;
text[i++] = ADD;
text[i++] = SI; text[i++] = 0;
text[i++] = LI; text[i++] = 0;
text[i++] = PRTF; text[i++] = 0;
text[i++] = EXIT;

后面是虚拟机的代码:
// run...
cycle = 0;
while (1) {
虚拟机代码太长,懒得复制粘贴了。
}
}

如上所示,虚拟机执行步骤如下:

加载立即数 1并压入栈:

1
2
text[i++] = IMM;   text[i++] = 1;     
text[i++] = PSH;

再加载:

1
2
text[i++] = IMM;   text[i++] = 1;    
text[i++] = PSH;

栈顶两个数相加:

1
text[i++] = ADD;                 

将结果存储到位置 0:

1
text[i++] = SI;    text[i++] = 0;    

从位置 0 加载值到栈顶:

1
text[i++] = LI;    text[i++] = 0;

打印栈顶值:

1
text[i++] = PRTF;  text[i++] = 0;

退出程序:

1
text[i++] = EXIT;   

虚拟机的x86版本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
while (pc <= e) {
i = *pc;
if (src) {
while (line < srcmap[pc - text]) {
line++; printf("% 4d | %.*s", line, linemap[line + 1] - linemap[line], linemap[line]);
}
printf("0x%05x (%p):\t%8.4s", pc - text, je,
&"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
"OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
"OPEN,READ,CLOS,PRTF,MALC,MSET,MCMP,MCPY,MMAP,DOPN,DSYM,QSRT,EXIT,"[i * 5]);
if (i <= ADJ) printf(" 0x%x\n", *(pc + 1)); else printf("\n");
}
jitmap[pc - text] = je; // for later relocation of JMP/JSR/BZ/BNZ
pc++;
if (i == LEA) {
i = 4 * *pc++; if (i < -128 || i > 127) { printf("jit: LEA out of bounds\n"); return -1; }
*(int*)je = 0x458d; je = je + 2; *je++ = i; // leal $(4 * n)(%ebp), %eax
}
else if (i == ENT) {
i = 4 * *pc++; if (i < -128 || i > 127) { printf("jit: ENT out of bounds\n"); return -1; }
*(int *)je = 0xe58955; je = je + 3; // push %ebp; movl %esp, %ebp
if (i > 0) { *(int *)je = 0xec83; je = je + 2; *(int*)je++ = i; } // subl $(i*4), %esp
}
else if (i == IMM) { *je++ = 0xb8; *(int *)je = *pc++; je = je + 4; } // movl $imm, %eax
else if (i == ADJ) { i = 4 * *pc++; *(int *)je = 0xc483; je = je + 2; *(int *)je = i; je++; } // addl $(n * 4), %esp
else if (i == PSH) *(int *)je++ = 0x50; // push %eax
else if (i == LEV) { *(int *)je = 0xc35dec89; je = je + 4; } // mov %ebp, %esp; pop %ebp; ret
else if (i == LI) { *(int *)je = 0x008b; je = je + 2; } // movl (%eax), %eax
else if (i == LC) { *(int *)je = 0x00b60f; je = je + 3; } // movzbl (%eax), %eax
else if (i == SI) { *(int *)je = 0x018959; je = je + 3; } // pop %ecx; movl %eax, (%ecx)
else if (i == SC) { *(int *)je = 0x018859; je = je + 3; } // pop %ecx; movb %al, (%ecx)
else if (i == OR) { *(int *)je = 0xc80959; je = je + 3; } // pop %ecx; orl %ecx, %eax
else if (i == XOR) { *(int *)je = 0xc83159; je = je + 3; } // pop %ecx; xorl %ecx, %eax
else if (i == AND) { *(int *)je = 0xc82159; je = je + 3; } // pop %ecx; andl %ecx, %eax
else if (EQ <= i && i <= GE) {
*(int*)je=0x0fc13959; je = je + 4; *(int*)je=0x9866c094; // pop %ecx; cmp %ecx, %eax; sete %al; cbw; - EQ
if (i == NE) { *je = 0x95; } // setne %al
else if (i == LT) { *je = 0x9c; } // setl %al
else if (i == GT) { *je = 0x9f; } // setg %al
else if (i == LE) { *je = 0x9e; } // setle %al
else if (i == GE) { *je = 0x9d; } // setge %al
je=je+4; *je++=0x98; // cwde
}
else if (i == SHL) { *(int*)je = 0xe0d39159; je = je + 4; } // pop %ecx; xchg %eax, %ecx; shl %cl, %eax
else if (i == SHR) { *(int*)je = 0xe8d39159; je = je + 4; } // pop %ecx; xchg %eax, %ecx; shr %cl, %eax
else if (i == ADD) { *(int*)je = 0xc80159; je = je + 3; } // pop %ecx; addl %ecx, %eax
else if (i == SUB) { *(int*)je = 0xc8299159; je = je + 4; } // pop %ecx; xchg %eax, %ecx; subl %ecx, %eax
else if (i == MUL) { *(int*)je = 0xc1af0f59; je = je + 4; } // pop %ecx; imul %ecx, %eax
else if (i == DIV) { *(int*)je = 0xf9f79159; je = je + 4; } // pop %ecx; xchg %eax, %ecx; idiv %ecx, %eax
else if (i == MOD) { *(int*)je = 0xd2319159; je = je + 4; *(int *)je = 0x92f9f7; je = je + 3; }
else if (i == JMP) { ++pc; *je = 0xe9; je = je + 5; } // jmp <off32>
else if (i == JSR) { ++pc; *je = 0xe8; je = je + 5; } // call <off32>
else if (i == BZ) { ++pc; *(int*)je = 0x840fc085; je = je + 8; } // test %eax, %eax; jz <off32>
else if (i == BNZ) { ++pc; *(int*)je = 0x850fc085; je = je + 8; } // test %eax, %eax; jnz <off32>
else if (i >= OPEN) {
if (i == OPEN) tmp = (int)dlsym(dl, "open");
else if (i == READ) tmp = (int)dlsym(dl, "read");
else if (i == CLOS) tmp = (int)dlsym(dl, "close");
else if (i == PRTF) tmp = (int)dlsym(dl, "printf");
else if (i == MALC) tmp = (int)dlsym(dl, "malloc");
else if (i == MSET) tmp = (int)dlsym(dl, "memset");
else if (i == MCMP) tmp = (int)dlsym(dl, "memcmp");
else if (i == MCPY) tmp = (int)dlsym(dl, "memcpy");
else if (i == MMAP) tmp = (int)dlsym(dl, "mmap");
else if (i == DOPN) tmp = (int)dlsym(dl, "dlopen");
else if (i == DSYM) tmp = (int)dlsym(dl, "dlsym");
else if (i == QSRT) tmp = (int)dlsym(dl, "qsort");
else if (i == EXIT) tmp = (int)dlsym(dl, "exit");

if (*pc++ == ADJ) { i = *pc++; } else { printf("no ADJ after native proc!\n"); exit(2); }

*je++ = 0xb9; *(int*)je = i << 2; je = je + 4; // movl $(4 * n), %ecx;
*(int*)je = 0xce29e689; je = je + 4; // mov %esp, %esi; sub %ecx, %esi; -- %esi will adjust the stack
*(int*)je = 0x8302e9c1; je = je + 4; // shr $2, %ecx; and -- alignment of %esp for OS X
*(int*)je = 0x895af0e6; je = je + 4; // $0xfffffff0, %esi; pop %edx; mov..
*(int*)je = 0xe2fc8e54; je = je + 4; // ..%edx, -4(%esi,%ecx,4); loop.. -- reversing args order
*(int*)je = 0xe8f487f9; je = je + 4; // ..<'pop' offset>; xchg %esi, %esp; call -- saving old stack in %esi
*(int*)je = tmp - (int)(je + 4); je = je + 4; // <*tmp offset>;
*(int*)je = 0xf487; je = je + 2; // xchg %esi, %esp -- ADJ, back to old stack without arguments
}
else { printf("code generation failed for %d!\n", i); return -1; }
}

赋值给je的十六进制是x86架构下的操作码,这样C4就能解析x86下的部分c语言文件了。

让我们再回头看一下C4框架:

1
2
graph LR
Aa(词法分析)-->Ab(语法分析)-->Ac(输出到内存模型)-->Ad(虚拟机根据内存信息执行)-->Ae(输出结果)

词法分析会捕捉每一个词法单元,也就是词素,然后它会返回对应的结果,语法分析根据返回的结果,可以确定是哪一个类型的词素并执行对应的操作,解析的结果会被放在内存模型中,当解析完后,内存模型也就完成了编译的写入,然后虚拟机开始干活了,它根据内存模型内的信息,从text段开始执行,text段全部都是指令,当然,肯定有读者好奇,怎么从数据段访问那些变量?这个不必担心,全局和静态变量在编译时已经分配了特定的内存地址。在程序执行时,虚拟机通过这些地址访问变量。