• C编译器01-扫描器


    acwj编译器从0开始到编译器自举

    到网上找到一个acwj项目。这个项目的目标就是一步步开发一个可以“自举”的c编译器。很有意思的一个开源项目,从扫描器一点点实现一个完整编译器。好处就是从0开始一点点累积,不像成熟编译器突然就庞大到感觉无从下手。

    官方是下面介绍的:

    Part 0: Introduction

    I’ve decided to go on a compiler writing journey. In the past I’ve written some
    assemblers, and
    I’ve written a simple compiler
    for a typeless language. But I’ve never written a compiler that can compile
    itself. So that’s where I’m headed on this journey.

    As part of the process, I’m going to write up my work so that others can
    follow along. This will also help me to clarify my thoughts and ideas.
    Hopefully you, and I, will find this useful!

    Goals of the Journey

    Here are my goals, and non-goals, for the journey:

    • To write a self-compiling compiler. I think that if the compiler can
      compile itself, it gets to call itself a real compiler.
    • To target at least one real hardware platform. I’ve seen a few compilers
      that generate code for hypothetical machines. I want my compiler to
      work on real hardware. Also, if possible, I want to write the compiler
      so that it can support multiple backends for different hardware platforms.
    • Practical before research. There’s a whole lot of research in the area of
      compilers. I want to start from absolute zero on this journey, so I’ll
      tend to go for a practical approach and not a theory-heavy approach. That
      said, there will be times when I’ll need to introduce (and implement) some
      theory-based stuff.
    • Follow the KISS principle: keep it simple, stupid! I’m definitely going to
      be using Ken Thompson’s principle here: “When in doubt, use brute force.”
    • Take a lot of small steps to reach the final goal. I’ll break the journey
      up into a lot of simple steps instead of taking large leaps. This will
      make each new addition to the compiler a bite-sized and easily digestible
      thing.

    Target Language

    The choice of a target language is difficult. If I choose a high-level
    language like Python, Go etc., then I’ll have to implement a whole pile
    of libraries and classes as they are built-in to the language.

    I could write a compiler for a language like Lisp, but these can be
    done easily.

    Instead, I’ve fallen back on the old standby and I’m going to write a
    compiler for a subset of C, enough to allow the compiler to compile
    itself.

    C is just a step up from assembly language (for some subset of C, not
    C18), and this
    will help make the task of compiling the C code down to assembly somewhat
    easier. Oh, and I also like C.

    The Basics of a Compiler’s Job

    The job of a compiler is to translate input in one language (usually
    a high-level language) into a different output language (usually a
    lower-level language than the input). The main steps are:

    [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3USRO4sy-1662281843554)(Figs/parsing_steps.png)]

    • Do lexical analysis
      to recognise the lexical elements. In several languages, = is different
      to ==, so you can’t just read a single =. We call these lexical
      elements tokens.
    • Parse the input, i.e. recognise
      the syntax and structural elements of the input and ensure that they
      conform to the grammar of the language. For example, your language
      might have this decision-making
      structure:
          if (x < 23) {
            print("x is smaller than 23\n");
          }
    
    • 1
    • 2
    • 3

    but in another language you might write:

          if (x < 23):
            print("x is smaller than 23\n")
    
    • 1
    • 2

    This is also the place where the compiler can detect syntax errors, like if
    the semicolon was missing on the end of the first print statement.

    • Do semantic analysis
      of the input, i.e. understand the meaning of the input. This is actually different
      from recognising the syntax and structure. For example, in English, a
      sentence might have the form .
      The following two sentences have the same structure, but completely
      different meaning:
                David ate lovely bananas.
                Jennifer hates green tomatoes.
      
      • 1
      • 2
      • Translate
        the meaning of the input into a different language. Here we
        convert the input, parts at a time, into a lower-level language.

      Resources

      There’s a lot of compiler resources out on the Internet. Here are the ones
      I’ll be looking at.

      Learning Resources

      If you want to start with some books, papers and tools on compilers,
      I’d highly recommend this list:

      Existing Compilers

      While I’m going to build my own compiler, I plan on looking at other compilers
      for ideas and probably also borrow some of their code. Here are the ones
      I’m looking at:

      In particular, I’ll be using a lot of the ideas, and some of the code,
      from the SubC compiler.

      Setting Up the Development Environment

      Assuming that you want to come along on this journey, here’s what you’ll
      need. I’m going to use a Linux development environment, so download and
      set up your favourite Linux system: I’m using Lubuntu 18.04.

      I’m going to target two hardware platforms: Intel x86-64 and 32-bit ARM.
      I’ll use a PC running Lubuntu 18.04 as the Intel target, and a Raspberry
      Pi running Raspbian as the ARM target.

      On the Intel platform, we are going to need an existing C compiler.
      So, install this package (I give the Ubuntu/Debian commands):

        $ sudo apt-get install build-essential
      
      • 1

      If there are any more tools required for a vanilla Linux
      system, let me know.

      Finally, clone a copy of this Github repository.

      The Next Step

      In the next part of our compiler writing journey, we will start with
      the code to scan our input file and find the tokens that are the
      lexical elements of our language.

      首先第一个实现一个简单的扫描器。把代码的token块扫描出来。

      先定义全局变量记录扫描到的行数和回退字符及扫描文件。
      data.h

      #ifndef extern_
      #define extern_ extern
      #endif
      
      // Global variables
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      // Global variables
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      //当前读的行数
      extern_ int Line;
      //记录最后一个回退的字符
      extern_ int	Putback;
      //读的代码文件
      extern_ FILE* Infile;
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17

      申明扫描方法
      decl.h

      
      // Function prototypes for all compiler files
      // Copyright (c) 2019 Warren Toomey, GPL3
      //扫描方法申明
      int scan(struct token* t);
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6

      开始实现的扫描器只实现加、减、乘、除和整数扫描。为每个类别定义枚举。和扫描块的结构体,扫描每个元素块是一个token对象。为啥这里元素类型要用枚举不用直接字符串,是因为后面还有多字符的类型。枚举可以简化操作。
      defs.h

      #include 
      #include 
      #include 
      #include 
      
      // Structure and enum definitions
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      // Structure and enum definitions
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      // Tokens
      // 扫描得到的令牌类别
      // 加、减、乘、除、整数
      enum {
      	T_PLUS, T_MINUS, T_STAR, T_SLASH, T_INTLIT
      };
      
      /*我们从一个简单的词法扫描器开始我们的编译器编写之旅。
      正如我在上一部分中提到的,扫描仪的工作
      是识别输入语言中的词汇元素或*tokens*。
      我们将从只有五个词汇元素的语言开始:
       + 四个基本数学运算符:`*`、`/`、`+` 和 `-`
       + 具有 1 个或多个数字的十进制整数 `0` .. `9`
      我们扫描的每个令牌都将存储在这个结构中*/
      
      /*当标记是一个`T_INTLIT`(即一个整数文字)时,`intvalue`
      字段将保存我们扫描的整数的值。*/
      
      // 令牌结构体
      // Token structure
      struct token {
      	//令牌类型
      	int token;
      	//令牌值
      	int intvalue;
      };
      
      
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40

      实现从代码文件扫描一个token的方法。
      scan.c

      #include "defs.h"
      #include "data.h"
      #include "decl.h"
      
      // Lexical scanning
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      // Lexical scanning
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      /*包含我们的词法扫描器的功能。我们要去了
      从我们的输入文件中一次读取一个字符。然而,会有
      如果我们读得太远,有时我们需要“放回”一个字符
      在输入流中领先。我们还想跟踪我们目前在哪条线路
      以便我们可以在调试消息中打印行号。所有这些
      由 `next()` 函数完成*/
      
      // Return the position of character c
      // in string s, or -1 if c not found
      //找到字符c在字符串s的位置,没找到返回-1
      static int chrpos(char* s, int c) {
      	char* p;
      	p = strchr(s, c);
      	return (p ? p - s : -1);
      }
      
      //从输入文件里得到下一个字符位置
      // Get the next character from the input file.
      static int next(void) {
      	int c;
      	//有回退字符就返回回退字符
      	if (Putback) {		// Use the character put
      		c = Putback;		// back if there is one
      		Putback = 0;
      		return c;
      	}
      	//读文件
      	c = fgetc(Infile);		// Read from input file
      	//读到换行符就把行数加1
      	if ('\n' == c)
      		Line++;			// Increment line count
      	  //返回下一个字符位置
      	return c;
      }
      
      //回退一个不想要的字符
      // Put back an unwanted character
      static void putback(int c) {
      	Putback = c;
      }
      
      //尝试跳过不想要的字符
      // Skip past input that we don't need to deal with, 
      // i.e. whitespace, newlines. Return the first
      // character we do need to deal with.
      static int skip(void) {
      	int c;
      	c = next();
      	//空格、TAB、换行、回车等跳过
      	while (' ' == c || '\t' == c || '\n' == c || '\r' == c || '\f' == c) {
      		c = next();
      	}
      	return (c);
      }
      
      //扫描数字
      // Scan and return an integer literal
      // value from the input file. Store
      // the value as a string in Text.
      static int scanint(int c) {
      	int k, val = 0;
      	//循环检查字符,直到不是数字
      	// Convert each character into an int value
      	while ((k = chrpos("0123456789", c)) >= 0) {
      		val = val * 10 + k;
      		c = next();
      	}
      	//命中了非数字字符就抛回去
      	// We hit a non-integer character, put it back.
      	putback(c);
      	return val;
      }
      
      // 扫描并返回在输入中找到的下一个标记。
      // 如果令牌有效,则返回 1,如果没有令牌,则返回 0。
      // Scan and return the next token found in the input.
      // Return 1 if token valid, 0 if no tokens left.
      int scan(struct token* t) {
      	int c;
      	//先跳过忽略字符到第一个有效字符
      	// Skip whitespace
      	c = skip();
      
      	// Determine the token based on
      	// the input character
      	switch (c) {
      		//结束
      	case EOF:
      		return (0);
      		//加
      	case '+':
      		t->token = T_PLUS;
      		break;
      		//减
      	case '-':
      		t->token = T_MINUS;
      		break;
      		//乘
      	case '*':
      		t->token = T_STAR;
      		break;
      		//除
      	case '/':
      		t->token = T_SLASH;
      		break;
      	default:
      	  //判断是否是数字
      	  // If it's a digit, scan the
      	  // literal integer value in
      		if (isdigit(c)) {
      			t->intvalue = scanint(c);
      			t->token = T_INTLIT;
      			break;
      		}
      		//正常到default会退出,执行到这里就是无法识别的字符
      		printf("在第%d行无法识别的字符%c\n", Line, c);
      		exit(1);
      	}
      	//找到一个令牌
      	// We found a token
      	return (1);
      }
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74
      • 75
      • 76
      • 77
      • 78
      • 79
      • 80
      • 81
      • 82
      • 83
      • 84
      • 85
      • 86
      • 87
      • 88
      • 89
      • 90
      • 91
      • 92
      • 93
      • 94
      • 95
      • 96
      • 97
      • 98
      • 99
      • 100
      • 101
      • 102
      • 103
      • 104
      • 105
      • 106
      • 107
      • 108
      • 109
      • 110
      • 111
      • 112
      • 113
      • 114
      • 115
      • 116
      • 117
      • 118
      • 119
      • 120
      • 121
      • 122
      • 123
      • 124
      • 125
      • 126
      • 127
      • 128
      • 129
      • 130
      • 131
      • 132
      • 133

      循环调用扫描方法,把代码里所有token扫描出来并且打印。
      main.c

      #include "defs.h"
      #define extern_
      #include "data.h"
      #undef extern_
      #include "decl.h"
      #include 
      
      // Compiler setup and top-level execution
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      // Compiler setup and top-level execution
      // Copyright (c) 2019 Warren Toomey, GPL3
      
      //初始化全局变量
      // Initialise global variables
      static void init() {
      	//行数设置1
      	Line = 1;
      	//抛回变量设置为换行
      	Putback = '\n';
      }
      
      //打印使用示例
      // Print out a usage if started incorrectly
      //prog:程序名
      static void usage(char* prog) {
      	fprintf(stderr, "使用示例: %s 输入文件名\n", prog);
      	exit(1);
      }
      
      //可打印令牌列表
      // List of printable tokens
      char* tokstr[] = { "+", "-", "*", "/", "intlit" };
      
      //扫描文件,找到所有令牌并且打印
      // Loop scanning in all the tokens in the input file.
      // Print out details of each token found.
      static void scanfile() {
      	struct token T;
      	//循环扫描得到令牌
      	while (scan(&T)) {
      		//打印类型
      		printf("Token %s", tokstr[T.token]);
      		//打印数组
      		if (T.token == T_INTLIT)
      		{
      			printf(", value %d", T.intvalue);
      		}
      		printf("\n");
      	}
      }
      
      // Main program: check arguments and print a usage
      // if we don't have an argument. Open up the input
      // file and call scanfile() to scan the tokens in it.
      void main(int argc, char* argv[]) {
      	//参数个数小于2打印使用提示
      	if (argc != 2)
      	{
      		//把程序名字传给打印
      		usage(argv[0]);
      	}
      	//初始化
      	init();
      	//打开输入参数的文件
      	if ((Infile = fopen(argv[1], "r")) == NULL) {
      		fprintf(stderr, "不能打开 %s: %s\n", argv[1], strerror(errno));
      		exit(1);
      	}
      	//扫描文件
      	scanfile();
      	exit(0);
      }
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74

      代码传到centos编译测试。

      测试代码
      input01

      2 + 3 * 5 - 8 / 3
      
      • 1

      input02

      13 -6+  4*
      5
             +
      08 / 3
      
      
      • 1
      • 2
      • 3
      • 4
      • 5

      input03

      12 34 + -56 * / - - 8 + * 2
      
      
      • 1
      • 2

      input04

      23 +
      18 -
      45.6 * 2
      / 18
      
      
      • 1
      • 2
      • 3
      • 4
      • 5

      input05

      23 * 456abcdefg
      
      
      • 1
      • 2

      编译和测试

      [root@zlzlinux zlz]# 
      [root@zlzlinux zlz]# 
      [root@zlzlinux zlz]# cd c
      [root@zlzlinux c]# ls
      00_Introduction   09_While_Loops      18_Lvalues_Revisited   27_Testing_Errors       36_Break_Continue          45_Globals_Again   54_Reg_Spills       63_QBE
      01_Scanner        10_For_Loops        19_Arrays_pt1          28_Runtime_Flags        37_Switch                  46_Void_Functions  55_Lazy_Evaluation  LICENSE
      02_Parser         11_Functions_pt1    20_Char_Str_Literals   29_Refactoring          38_Dangling_Else           47_Sizeof          56_Local_Arrays     Readme.md
      03_Precedence     12_Types_pt1        21_More_Operators      30_Design_Composites    39_Var_Initialisation_pt1  48_Static          57_Mop_up_pt3
      04_Assembly       13_Functions_pt2    22_Design_Locals       31_Struct_Declarations  40_Var_Initialisation_pt2  49_Ternary         58_Ptr_Increments
      05_Statements     14_ARM_Platform     23_Local_Variables     32_Struct_Access_pt1    41_Local_Var_Init          50_Mop_up_pt1      59_WDIW_pt1
      06_Variables      15_Pointers_pt1     24_Function_Params     33_Unions               42_Casting                 51_Arrays_pt2      60_TripleTest
      07_Comparisons    16_Global_Vars      25_Function_Arguments  34_Enums_and_Typedefs   43_More_Operators          52_Pointers_pt2    61_What_Next
      08_If_Statements  17_Scaling_Offsets  26_Prototypes          35_Preprocessor         44_Fold_Optimisation       53_Mop_up_pt2      62_Cleanup
      [root@zlzlinux c]# cd 01_Scanner/
      [root@zlzlinux 01_Scanner]# ls
      data.h  decl.h  defs.h  input01  input02  input03  input04  input05  main.c  Makefile  Readme.md  scan.c
      [root@zlzlinux 01_Scanner]# make
      cc -o scanner -g main.c scan.c
      [root@zlzlinux 01_Scanner]# ls
      data.h  decl.h  defs.h  input01  input02  input03  input04  input05  main.c  Makefile  Readme.md  scan.c  scanner
      [root@zlzlinux 01_Scanner]# ./scanner input01
      Token intlit, value 2
      Token +
      Token intlit, value 3
      Token *
      Token intlit, value 5
      Token -
      Token intlit, value 8
      Token /
      Token intlit, value 3
      [root@zlzlinux 01_Scanner]# cat input01
      2 + 3 * 5 - 8 / 3
      [root@zlzlinux 01_Scanner]# ./scanner input02
      Token intlit, value 13
      Token -
      Token intlit, value 6
      Token +
      Token intlit, value 4
      Token *
      Token intlit, value 5
      Token +
      Token intlit, value 8
      Token /
      Token intlit, value 3
      [root@zlzlinux 01_Scanner]# ./scanner input03
      Token intlit, value 12
      Token intlit, value 34
      Token +
      Token -
      Token intlit, value 56
      Token *
      Token /
      Token -
      Token -
      Token intlit, value 8
      Token +
      Token *
      Token intlit, value 2
      [root@zlzlinux 01_Scanner]# ./scanner input04
      Token intlit, value 23
      Token +
      Token intlit, value 18
      Token -
      Token intlit, value 45
      ՚µسѐϞ·¨ʶ±𶅗ַ 
      [root@zlzlinux 01_Scanner]# cat input04
      23 +
      18 -
      45.6 * 2
      / 18
      [root@zlzlinux 01_Scanner]# ./scanner input04
      Token intlit, value 23
      Token +
      Token intlit, value 18
      Token -
      Token intlit, value 45
      在第3行无法识别的字符.
      [root@zlzlinux 01_Scanner]# cat input04
      23 +
      18 -
      45.6 * 2
      / 18
      [root@zlzlinux 01_Scanner]# ./scanner input05
      Token intlit, value 23
      Token *
      Token intlit, value 456
      在第1行无法识别的字符a
      [root@zlzlinux 01_Scanner]# 
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53
      • 54
      • 55
      • 56
      • 57
      • 58
      • 59
      • 60
      • 61
      • 62
      • 63
      • 64
      • 65
      • 66
      • 67
      • 68
      • 69
      • 70
      • 71
      • 72
      • 73
      • 74
      • 75
      • 76
      • 77
      • 78
      • 79
      • 80
      • 81
      • 82
      • 83
      • 84
      • 85
      • 86
      • 87
      • 88

      第一篇简单的扫码器就是这样的,细读代码会发现扫描器的代码还是挺精妙的。

    • 相关阅读:
      扩散模型基本原理
      彻底搞懂Swagger&Knife4j使用方法
      DayDayUp:计算机技术与软件专业技术资格证书之《系统集成项目管理工程师》课程讲解之十大知识领域之4辅助—项目沟通管理
      6个团队建设管理游戏
      java-dubbo接口测试三种方式
      第四章 数字逻辑电路设计方法【Verilog】
      LeetCode简单题之出现最频繁的偶数元素
      微服务框架 SpringCloud微服务架构 16 SpringAMQP 16.6 FanoutExchange
      day26 java Stream
      SPA首屏加载速度慢
    • 原文地址:https://blog.csdn.net/zhanglianzhu_91/article/details/126691641