acwj编译器从0开始到编译器自举
到网上找到一个acwj项目。这个项目的目标就是一步步开发一个可以“自举”的c编译器。很有意思的一个开源项目,从扫描器一点点实现一个完整编译器。好处就是从0开始一点点累积,不像成熟编译器突然就庞大到感觉无从下手。
官方是下面介绍的:
I’ve decided to go on a compiler writing journey. In the past I’ve written some
assemblers, and
I’ve written a simple compiler
for a typeless language. But I’ve never written a compiler that can compile
itself. So that’s where I’m headed on this journey.
As part of the process, I’m going to write up my work so that others can
follow along. This will also help me to clarify my thoughts and ideas.
Hopefully you, and I, will find this useful!
Here are my goals, and non-goals, for the journey:
The choice of a target language is difficult. If I choose a high-level
language like Python, Go etc., then I’ll have to implement a whole pile
of libraries and classes as they are built-in to the language.
I could write a compiler for a language like Lisp, but these can be
done easily.
Instead, I’ve fallen back on the old standby and I’m going to write a
compiler for a subset of C, enough to allow the compiler to compile
itself.
C is just a step up from assembly language (for some subset of C, not
C18), and this
will help make the task of compiling the C code down to assembly somewhat
easier. Oh, and I also like C.
The job of a compiler is to translate input in one language (usually
a high-level language) into a different output language (usually a
lower-level language than the input). The main steps are:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3USRO4sy-1662281843554)(Figs/parsing_steps.png)]
=
is different==
, so you can’t just read a single =
. We call these lexical if (x < 23) {
print("x is smaller than 23\n");
}
but in another language you might write:
if (x < 23):
print("x is smaller than 23\n")
This is also the place where the compiler can detect syntax errors, like if
the semicolon was missing on the end of the first print statement.
. David ate lovely bananas.
Jennifer hates green tomatoes.
There’s a lot of compiler resources out on the Internet. Here are the ones
I’ll be looking at.
If you want to start with some books, papers and tools on compilers,
I’d highly recommend this list:
While I’m going to build my own compiler, I plan on looking at other compilers
for ideas and probably also borrow some of their code. Here are the ones
I’m looking at:
In particular, I’ll be using a lot of the ideas, and some of the code,
from the SubC compiler.
Assuming that you want to come along on this journey, here’s what you’ll
need. I’m going to use a Linux development environment, so download and
set up your favourite Linux system: I’m using Lubuntu 18.04.
I’m going to target two hardware platforms: Intel x86-64 and 32-bit ARM.
I’ll use a PC running Lubuntu 18.04 as the Intel target, and a Raspberry
Pi running Raspbian as the ARM target.
On the Intel platform, we are going to need an existing C compiler.
So, install this package (I give the Ubuntu/Debian commands):
$ sudo apt-get install build-essential
If there are any more tools required for a vanilla Linux
system, let me know.
Finally, clone a copy of this Github repository.
In the next part of our compiler writing journey, we will start with
the code to scan our input file and find the tokens that are the
lexical elements of our language.
首先第一个实现一个简单的扫描器。把代码的token块扫描出来。
先定义全局变量记录扫描到的行数和回退字符及扫描文件。
data.h
#ifndef extern_
#define extern_ extern
#endif
// Global variables
// Copyright (c) 2019 Warren Toomey, GPL3
// Global variables
// Copyright (c) 2019 Warren Toomey, GPL3
//当前读的行数
extern_ int Line;
//记录最后一个回退的字符
extern_ int Putback;
//读的代码文件
extern_ FILE* Infile;
申明扫描方法
decl.h
// Function prototypes for all compiler files
// Copyright (c) 2019 Warren Toomey, GPL3
//扫描方法申明
int scan(struct token* t);
开始实现的扫描器只实现加、减、乘、除和整数扫描。为每个类别定义枚举。和扫描块的结构体,扫描每个元素块是一个token对象。为啥这里元素类型要用枚举不用直接字符串,是因为后面还有多字符的类型。枚举可以简化操作。
defs.h
#include
#include
#include
#include
// Structure and enum definitions
// Copyright (c) 2019 Warren Toomey, GPL3
// Structure and enum definitions
// Copyright (c) 2019 Warren Toomey, GPL3
// Tokens
// 扫描得到的令牌类别
// 加、减、乘、除、整数
enum {
T_PLUS, T_MINUS, T_STAR, T_SLASH, T_INTLIT
};
/*我们从一个简单的词法扫描器开始我们的编译器编写之旅。
正如我在上一部分中提到的,扫描仪的工作
是识别输入语言中的词汇元素或*tokens*。
我们将从只有五个词汇元素的语言开始:
+ 四个基本数学运算符:`*`、`/`、`+` 和 `-`
+ 具有 1 个或多个数字的十进制整数 `0` .. `9`
我们扫描的每个令牌都将存储在这个结构中*/
/*当标记是一个`T_INTLIT`(即一个整数文字)时,`intvalue`
字段将保存我们扫描的整数的值。*/
// 令牌结构体
// Token structure
struct token {
//令牌类型
int token;
//令牌值
int intvalue;
};
实现从代码文件扫描一个token的方法。
scan.c
#include "defs.h"
#include "data.h"
#include "decl.h"
// Lexical scanning
// Copyright (c) 2019 Warren Toomey, GPL3
// Lexical scanning
// Copyright (c) 2019 Warren Toomey, GPL3
/*包含我们的词法扫描器的功能。我们要去了
从我们的输入文件中一次读取一个字符。然而,会有
如果我们读得太远,有时我们需要“放回”一个字符
在输入流中领先。我们还想跟踪我们目前在哪条线路
以便我们可以在调试消息中打印行号。所有这些
由 `next()` 函数完成*/
// Return the position of character c
// in string s, or -1 if c not found
//找到字符c在字符串s的位置,没找到返回-1
static int chrpos(char* s, int c) {
char* p;
p = strchr(s, c);
return (p ? p - s : -1);
}
//从输入文件里得到下一个字符位置
// Get the next character from the input file.
static int next(void) {
int c;
//有回退字符就返回回退字符
if (Putback) { // Use the character put
c = Putback; // back if there is one
Putback = 0;
return c;
}
//读文件
c = fgetc(Infile); // Read from input file
//读到换行符就把行数加1
if ('\n' == c)
Line++; // Increment line count
//返回下一个字符位置
return c;
}
//回退一个不想要的字符
// Put back an unwanted character
static void putback(int c) {
Putback = c;
}
//尝试跳过不想要的字符
// Skip past input that we don't need to deal with,
// i.e. whitespace, newlines. Return the first
// character we do need to deal with.
static int skip(void) {
int c;
c = next();
//空格、TAB、换行、回车等跳过
while (' ' == c || '\t' == c || '\n' == c || '\r' == c || '\f' == c) {
c = next();
}
return (c);
}
//扫描数字
// Scan and return an integer literal
// value from the input file. Store
// the value as a string in Text.
static int scanint(int c) {
int k, val = 0;
//循环检查字符,直到不是数字
// Convert each character into an int value
while ((k = chrpos("0123456789", c)) >= 0) {
val = val * 10 + k;
c = next();
}
//命中了非数字字符就抛回去
// We hit a non-integer character, put it back.
putback(c);
return val;
}
// 扫描并返回在输入中找到的下一个标记。
// 如果令牌有效,则返回 1,如果没有令牌,则返回 0。
// Scan and return the next token found in the input.
// Return 1 if token valid, 0 if no tokens left.
int scan(struct token* t) {
int c;
//先跳过忽略字符到第一个有效字符
// Skip whitespace
c = skip();
// Determine the token based on
// the input character
switch (c) {
//结束
case EOF:
return (0);
//加
case '+':
t->token = T_PLUS;
break;
//减
case '-':
t->token = T_MINUS;
break;
//乘
case '*':
t->token = T_STAR;
break;
//除
case '/':
t->token = T_SLASH;
break;
default:
//判断是否是数字
// If it's a digit, scan the
// literal integer value in
if (isdigit(c)) {
t->intvalue = scanint(c);
t->token = T_INTLIT;
break;
}
//正常到default会退出,执行到这里就是无法识别的字符
printf("在第%d行无法识别的字符%c\n", Line, c);
exit(1);
}
//找到一个令牌
// We found a token
return (1);
}
循环调用扫描方法,把代码里所有token扫描出来并且打印。
main.c
#include "defs.h"
#define extern_
#include "data.h"
#undef extern_
#include "decl.h"
#include
// Compiler setup and top-level execution
// Copyright (c) 2019 Warren Toomey, GPL3
// Compiler setup and top-level execution
// Copyright (c) 2019 Warren Toomey, GPL3
//初始化全局变量
// Initialise global variables
static void init() {
//行数设置1
Line = 1;
//抛回变量设置为换行
Putback = '\n';
}
//打印使用示例
// Print out a usage if started incorrectly
//prog:程序名
static void usage(char* prog) {
fprintf(stderr, "使用示例: %s 输入文件名\n", prog);
exit(1);
}
//可打印令牌列表
// List of printable tokens
char* tokstr[] = { "+", "-", "*", "/", "intlit" };
//扫描文件,找到所有令牌并且打印
// Loop scanning in all the tokens in the input file.
// Print out details of each token found.
static void scanfile() {
struct token T;
//循环扫描得到令牌
while (scan(&T)) {
//打印类型
printf("Token %s", tokstr[T.token]);
//打印数组
if (T.token == T_INTLIT)
{
printf(", value %d", T.intvalue);
}
printf("\n");
}
}
// Main program: check arguments and print a usage
// if we don't have an argument. Open up the input
// file and call scanfile() to scan the tokens in it.
void main(int argc, char* argv[]) {
//参数个数小于2打印使用提示
if (argc != 2)
{
//把程序名字传给打印
usage(argv[0]);
}
//初始化
init();
//打开输入参数的文件
if ((Infile = fopen(argv[1], "r")) == NULL) {
fprintf(stderr, "不能打开 %s: %s\n", argv[1], strerror(errno));
exit(1);
}
//扫描文件
scanfile();
exit(0);
}
代码传到centos编译测试。
测试代码
input01
2 + 3 * 5 - 8 / 3
input02
13 -6+ 4*
5
+
08 / 3
input03
12 34 + -56 * / - - 8 + * 2
input04
23 +
18 -
45.6 * 2
/ 18
input05
23 * 456abcdefg
编译和测试
[root@zlzlinux zlz]#
[root@zlzlinux zlz]#
[root@zlzlinux zlz]# cd c
[root@zlzlinux c]# ls
00_Introduction 09_While_Loops 18_Lvalues_Revisited 27_Testing_Errors 36_Break_Continue 45_Globals_Again 54_Reg_Spills 63_QBE
01_Scanner 10_For_Loops 19_Arrays_pt1 28_Runtime_Flags 37_Switch 46_Void_Functions 55_Lazy_Evaluation LICENSE
02_Parser 11_Functions_pt1 20_Char_Str_Literals 29_Refactoring 38_Dangling_Else 47_Sizeof 56_Local_Arrays Readme.md
03_Precedence 12_Types_pt1 21_More_Operators 30_Design_Composites 39_Var_Initialisation_pt1 48_Static 57_Mop_up_pt3
04_Assembly 13_Functions_pt2 22_Design_Locals 31_Struct_Declarations 40_Var_Initialisation_pt2 49_Ternary 58_Ptr_Increments
05_Statements 14_ARM_Platform 23_Local_Variables 32_Struct_Access_pt1 41_Local_Var_Init 50_Mop_up_pt1 59_WDIW_pt1
06_Variables 15_Pointers_pt1 24_Function_Params 33_Unions 42_Casting 51_Arrays_pt2 60_TripleTest
07_Comparisons 16_Global_Vars 25_Function_Arguments 34_Enums_and_Typedefs 43_More_Operators 52_Pointers_pt2 61_What_Next
08_If_Statements 17_Scaling_Offsets 26_Prototypes 35_Preprocessor 44_Fold_Optimisation 53_Mop_up_pt2 62_Cleanup
[root@zlzlinux c]# cd 01_Scanner/
[root@zlzlinux 01_Scanner]# ls
data.h decl.h defs.h input01 input02 input03 input04 input05 main.c Makefile Readme.md scan.c
[root@zlzlinux 01_Scanner]# make
cc -o scanner -g main.c scan.c
[root@zlzlinux 01_Scanner]# ls
data.h decl.h defs.h input01 input02 input03 input04 input05 main.c Makefile Readme.md scan.c scanner
[root@zlzlinux 01_Scanner]# ./scanner input01
Token intlit, value 2
Token +
Token intlit, value 3
Token *
Token intlit, value 5
Token -
Token intlit, value 8
Token /
Token intlit, value 3
[root@zlzlinux 01_Scanner]# cat input01
2 + 3 * 5 - 8 / 3
[root@zlzlinux 01_Scanner]# ./scanner input02
Token intlit, value 13
Token -
Token intlit, value 6
Token +
Token intlit, value 4
Token *
Token intlit, value 5
Token +
Token intlit, value 8
Token /
Token intlit, value 3
[root@zlzlinux 01_Scanner]# ./scanner input03
Token intlit, value 12
Token intlit, value 34
Token +
Token -
Token intlit, value 56
Token *
Token /
Token -
Token -
Token intlit, value 8
Token +
Token *
Token intlit, value 2
[root@zlzlinux 01_Scanner]# ./scanner input04
Token intlit, value 23
Token +
Token intlit, value 18
Token -
Token intlit, value 45
՚µسѐϞ·¨ʶ±ַ
[root@zlzlinux 01_Scanner]# cat input04
23 +
18 -
45.6 * 2
/ 18
[root@zlzlinux 01_Scanner]# ./scanner input04
Token intlit, value 23
Token +
Token intlit, value 18
Token -
Token intlit, value 45
在第3行无法识别的字符.
[root@zlzlinux 01_Scanner]# cat input04
23 +
18 -
45.6 * 2
/ 18
[root@zlzlinux 01_Scanner]# ./scanner input05
Token intlit, value 23
Token *
Token intlit, value 456
在第1行无法识别的字符a
[root@zlzlinux 01_Scanner]#
第一篇简单的扫码器就是这样的,细读代码会发现扫描器的代码还是挺精妙的。