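Develop a RISC-V assembler that automatically translates the following RISC-V assembly instructions into RISC-V machine code. The requirements are:
(1) The assembler must support the instructions lui, sw, addi, and add.
(2) How the assembler is run: create an assembly source file named source.s, pick any four test statements using lui, sw, addi, and add, and write them into source.s. Run the assembler on that file and write the generated machine code to another text file named binary.txt. Finally, compare the generated machine code with the machine code in Attachment 1 to verify the assembler (the machine code in binary.txt is written in hexadecimal).
(3) Functional requirements:
a) Syntax checking: for example, the input luik a3, 0x200 must be reported as an error, pointing out that the offending token is luik.
b) Range checking of operands: for example, for lui a3, 0x2000000 the assembler must warn that 0x2000000 is outside the legal range.
c) Compressed instructions are not considered, so every instruction encodes to 4 bytes (32 bits).
(4) Any implementation language may be used. This article uses Python.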
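We now define the Assembler class, the core of the assembler, which contains one method for each instruction to encode. It starts with the initializer, then the encode method (which takes an assembly instruction string and calls the matching encoding method based on the instruction type), then the individual encode methods for lui, sw, addi, and add. Each of those methods performs the syntax check, the operand range check, and the conversion to machine code.
The register names and numbers follow the RISC-V ABI convention used with the RV32I base integer instruction set. The dictionary keys are the RISC-V register names and the values are the corresponding 5-bit binary encodings.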
-     def __init__(self):
-         # Register names and their 5-bit register numbers
-         self.registers = {
-             "zero": "00000",
-             "ra": "00001",
-             "sp": "00010",
-             "gp": "00011",
-             "tp": "00100",
-             "t0": "00101",
-             "t1": "00110",
-             "t2": "00111",
-             "s0": "01000",
-             "s1": "01001",
-             "a0": "01010",
-             "a1": "01011",
-             "a2": "01100",
-             "a3": "01101",
-             "a4": "01110",
-             "a5": "01111",
-             "a6": "10000",
-             "a7": "10001",
-             "s2": "10010",
-             "s3": "10011",
-             "s4": "10100",
-             "s5": "10101",
-             "s6": "10110",
-             "s7": "10111",
-             "s8": "11000",
-             "s9": "11001",
-             "s10": "11010",
-             "s11": "11011",
-             "t3": "11100",
-             "t4": "11101",
-             "t5": "11110",
-             "t6": "11111"
-         }
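Each of these registers has a designated role in RISC-V. For example:
zero (x0): hard-wired to 0; any write to it is ignored.
ra (x1): return address.
sp (x2): stack pointer.
gp (x3): global pointer.
tp (x4): thread pointer.
Of the remaining registers, t0-t6 are temporaries for general use, a0-a7 carry function arguments (a0 and a1 also carry return values), and s0-s11 are saved registers whose values must be preserved across function calls.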
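Because the ABI names are listed in x0 to x31 order, the same table can be generated instead of typed by hand. The following is a minimal sketch; the names `ABI_NAMES` and `build_register_table` are hypothetical and not part of the assembler in this article.

```python
# ABI register names in numeric order: the name at index i is register x{i}.
ABI_NAMES = [
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6", "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8", "s9", "s10", "s11", "t3", "t4", "t5", "t6",
]


def build_register_table():
    """Map each ABI name to its 5-bit binary string (hypothetical helper)."""
    return {name: format(index, "05b") for index, name in enumerate(ABI_NAMES)}


registers = build_register_table()
print(registers["a5"])  # a5 is x15
print(registers["t6"])  # t6 is x31
```

Generating the table this way avoids typos in the 32 hand-written bit strings, at the cost of hiding the explicit values from the reader.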
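The encode method takes an assembly instruction string as input and calls the matching encoding method based on the instruction's mnemonic.
-     def encode(self, instruction):
-         # Dispatch on the mnemonic, i.e. the first whitespace-separated token
-         tokens = instruction.split()
-         mnemonic = tokens[0] if tokens else ""
-         if mnemonic == "lui":
-             return self.encode_lui(instruction)
-         elif mnemonic == "sw":
-             return self.encode_sw(instruction)
-         elif mnemonic == "addi":
-             return self.encode_addi(instruction)
-         elif mnemonic == "add":
-             return self.encode_add(instruction)
-         else:
-             raise ValueError(f"Unsupported instruction '{mnemonic}' in: {instruction}")
Here instruction is the assembly statement that will be passed in. This is only a first check: the statement must start with one of lui, sw, addi, or add. Anything else, such as luik, is rejected right away as a malformed statement, and the error message names the offending mnemonic.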
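Requirement (a) asks the assembler to point out where the syntax error is. One way to do that for the mnemonic is sketched below; the helper `check_mnemonic` and the 1-based column convention are assumptions for illustration, not part of the assembler in this article.

```python
SUPPORTED = ("lui", "sw", "addi", "add")  # the four mnemonics from the task


def check_mnemonic(line):
    """Return the mnemonic, or raise ValueError naming the bad token and its column."""
    stripped = line.lstrip()
    if not stripped:
        raise ValueError("Empty instruction")
    mnemonic = stripped.split()[0]
    if mnemonic not in SUPPORTED:
        column = len(line) - len(stripped) + 1  # 1-based column of the token
        raise ValueError(f"Unknown mnemonic '{mnemonic}' at column {column}")
    return mnemonic


try:
    check_mnemonic("luik a3, 0x200")
except ValueError as err:
    print(err)
```

Comparing the whole first token against the supported set also avoids accepting a line merely because a supported mnemonic appears somewhere inside it.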
-     def encode_lui(self, instruction):
-         # Check the syntax: lui rd, 0x<hex immediate>
-         if not re.match(r'^lui\s+[a-z0-9]+,\s*0x[a-fA-F0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         # Split on whitespace and commas so that "a3,0x200" also yields two tokens
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         rd = self.registers.get(tokens[1])
-         if rd is None:
-             raise ValueError(f"Unknown register: {tokens[1]}")
-         imm = bin(int(tokens[2], 16))[2:].zfill(20)
-         if len(imm) > 20:  # the immediate must fit in 20 bits
-             raise ValueError(f"Immediate value out of bounds: {tokens[2]}")
-         opcode = "0110111"
-         return imm + rd + opcode
-
-     def encode_sw(self, instruction):
-         # Check the syntax: sw rs2, offset(rs1), with a decimal offset
-         if not re.match(r'^sw\s+[a-z0-9]+,\s*-?[0-9]+\([a-z0-9]+\)$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,()]+', instruction) if token]
-         rs2 = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[3])
-         if rs1 is None or rs2 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         value = int(tokens[2])  # the offset is decimal, as the regex requires
-         if not -2048 <= value <= 2047:  # 12-bit signed range
-             raise ValueError(f"Offset out of bounds: {tokens[2]}")
-         offset = format(value & 0xFFF, '012b')  # 12-bit two's complement
-         opcode = "0100011"
-         funct3 = "010"
-
-         # Split the offset for the instruction format
-         imm_11_5 = offset[:7]
-         imm_4_0 = offset[7:]
-         return imm_11_5 + rs2 + rs1 + funct3 + imm_4_0 + opcode
-
-     def encode_addi(self, instruction):
-         # Check the syntax: addi rd, rs1, <decimal immediate>
-         if not re.match(r'^addi\s+[a-z0-9]+,\s*[a-z0-9]+,\s*-?[0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         rd = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[2])
-         if rd is None or rs1 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         value = int(tokens[3])
-         if not -2048 <= value <= 2047:  # 12-bit signed range
-             raise ValueError(f"Immediate value out of bounds: {tokens[3]}")
-         imm = format(value & 0xFFF, '012b')  # 12-bit two's complement
-         opcode = "0010011"
-         funct3 = "000"
-         return imm + rs1 + funct3 + rd + opcode
-
-     def encode_add(self, instruction):
-         # Check the syntax: add rd, rs1, rs2
-         if not re.match(r'^add\s+[a-z0-9]+,\s*[a-z0-9]+,\s*[a-z0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         rd = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[2])
-         rs2 = self.registers.get(tokens[3])
-         if rd is None or rs1 is None or rs2 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         opcode = "0110011"
-         funct3 = "000"
-         funct7 = "0000000"
-         return funct7 + rs2 + rs1 + funct3 + rd + opcode
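These four methods all follow the same pattern: first validate the syntax with a regular expression and raise an error on a malformed line; then split the instruction into tokens, which makes the conversion easier; then range-check the operands and assemble the bit fields. The field layouts and encodings follow the RV32I base integer instruction set.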
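The range check and the conversion to bits need some care for signed immediates: addi and sw take a 12-bit two's-complement value, so the legal range is -2048 to 2047, and a negative number cannot simply be passed through `bin()`. The following is a minimal sketch of one way to handle this; the helper name `encode_signed` is hypothetical.

```python
def encode_signed(value, bits):
    """Return `value` as a `bits`-wide two's-complement bit string.

    Hypothetical helper; raises ValueError if the value does not fit.
    """
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    # Masking maps negative values onto their two's-complement bit pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


print(encode_signed(12, 12))   # 000000001100
print(encode_signed(-1, 12))   # 111111111111
```

Checking the numeric range before formatting keeps the error message tied to the value the user wrote rather than to the length of a bit string.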
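Running the assembler on a file comes down to Python file I/O: from_file reads the instructions in source.s line by line into instructions, calls the dispatcher encode() on each one, and writes each result to binary.txt as an 8-digit hexadecimal word.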
-     def from_file(self, input_file="source.s"):
-         with open(input_file, 'r') as file:
-             # Skip blank lines so that a trailing newline does not raise an error
-             instructions = [line.strip() for line in file if line.strip()]
-         binary_codes = [self.encode(instruction) for instruction in instructions]
-         with open("binary.txt", 'w') as out_file:
-             for code in binary_codes:
-                 out_file.write(hex(int(code, 2))[2:].zfill(8) + '\n')
That completes the definition of the Assembler class, that is, of the assembler itself. Next we create an instance named assembler and call from_file():
- assembler = Assembler()
- assembler.from_file()
To check the result, prepare source.s with the following contents:
- lui a5, 0x20000
- addi a5, a5, 12
- sw a4, 4(a5)
- add a4, a4, a5
Running the assembler produces binary.txt with the following contents:
- 200007b7
- 00c78793
- 00e7a223
- 00f70733
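One way to double-check a line of binary.txt without a reference assembler is to split the word back into its bit fields. The following is a minimal sketch for the R-type layout; the helper name `decode_fields` is hypothetical, and for the other formats some of these bit ranges hold immediate bits instead of register numbers.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into its raw R-type bit fields."""
    return {
        "opcode": word & 0x7F,          # bits 6..0
        "rd": (word >> 7) & 0x1F,       # bits 11..7
        "funct3": (word >> 12) & 0x7,   # bits 14..12
        "rs1": (word >> 15) & 0x1F,     # bits 19..15
        "rs2": (word >> 20) & 0x1F,     # bits 24..20
        "funct7": (word >> 25) & 0x7F,  # bits 31..25
    }


fields = decode_fields(0x00F70733)  # the word produced for: add a4, a4, a5
print(fields)
```

For this word the fields come out as rd = 14 (a4), rs1 = 14 (a4), and rs2 = 15 (a5), which matches the source statement.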
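Comparing against the output of a real assembler shows that the assembler developed here produces the correct result. That settles Problem 1. The complete source code follows: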
- import re
-
- class Assembler:
-     def __init__(self):
-         # Register names and their 5-bit register numbers
-         self.registers = {
-             "zero": "00000",
-             "ra": "00001",
-             "sp": "00010",
-             "gp": "00011",
-             "tp": "00100",
-             "t0": "00101",
-             "t1": "00110",
-             "t2": "00111",
-             "s0": "01000",
-             "s1": "01001",
-             "a0": "01010",
-             "a1": "01011",
-             "a2": "01100",
-             "a3": "01101",
-             "a4": "01110",
-             "a5": "01111",
-             "a6": "10000",
-             "a7": "10001",
-             "s2": "10010",
-             "s3": "10011",
-             "s4": "10100",
-             "s5": "10101",
-             "s6": "10110",
-             "s7": "10111",
-             "s8": "11000",
-             "s9": "11001",
-             "s10": "11010",
-             "s11": "11011",
-             "t3": "11100",
-             "t4": "11101",
-             "t5": "11110",
-             "t6": "11111"
-         }
-
-     def encode(self, instruction):
-         # Dispatch on the mnemonic, i.e. the first whitespace-separated token
-         tokens = instruction.split()
-         mnemonic = tokens[0] if tokens else ""
-         if mnemonic == "lui":
-             return self.encode_lui(instruction)
-         elif mnemonic == "sw":
-             return self.encode_sw(instruction)
-         elif mnemonic == "addi":
-             return self.encode_addi(instruction)
-         elif mnemonic == "add":
-             return self.encode_add(instruction)
-         else:
-             raise ValueError(f"Unsupported instruction '{mnemonic}' in: {instruction}")
-
-     def encode_lui(self, instruction):
-         # Check the syntax: lui rd, 0x<hex immediate>
-         if not re.match(r'^lui\s+[a-z0-9]+,\s*0x[a-fA-F0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         # Split on whitespace and commas so that "a3,0x200" also yields two tokens
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         rd = self.registers.get(tokens[1])
-         if rd is None:
-             raise ValueError(f"Unknown register: {tokens[1]}")
-         imm = bin(int(tokens[2], 16))[2:].zfill(20)
-         if len(imm) > 20:  # the immediate must fit in 20 bits
-             raise ValueError(f"Immediate value out of bounds: {tokens[2]}")
-         opcode = "0110111"
-         return imm + rd + opcode
-
-     def encode_sw(self, instruction):
-         # Check the syntax: sw rs2, offset(rs1), with a decimal offset
-         if not re.match(r'^sw\s+[a-z0-9]+,\s*-?[0-9]+\([a-z0-9]+\)$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,()]+', instruction) if token]
-         if len(tokens) != 4:
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         rs2 = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[3])
-         if rs1 is None or rs2 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         value = int(tokens[2])  # the offset is decimal, as the regex requires
-         if not -2048 <= value <= 2047:  # 12-bit signed range
-             raise ValueError(f"Offset out of bounds: {tokens[2]}")
-         offset = format(value & 0xFFF, '012b')  # 12-bit two's complement
-         opcode = "0100011"
-         funct3 = "010"
-
-         # Split the offset for the instruction format
-         imm_11_5 = offset[:7]
-         imm_4_0 = offset[7:]
-         return imm_11_5 + rs2 + rs1 + funct3 + imm_4_0 + opcode
-
-     def encode_addi(self, instruction):
-         # Check the syntax: addi rd, rs1, <decimal immediate>
-         if not re.match(r'^addi\s+[a-z0-9]+,\s*[a-z0-9]+,\s*-?[0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         if len(tokens) != 4:
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         rd = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[2])
-         if rd is None or rs1 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         value = int(tokens[3])
-         if not -2048 <= value <= 2047:  # 12-bit signed range
-             raise ValueError(f"Immediate value out of bounds: {tokens[3]}")
-         imm = format(value & 0xFFF, '012b')  # 12-bit two's complement
-         opcode = "0010011"
-         funct3 = "000"
-         return imm + rs1 + funct3 + rd + opcode
-
-     def encode_add(self, instruction):
-         # Check the syntax: add rd, rs1, rs2
-         if not re.match(r'^add\s+[a-z0-9]+,\s*[a-z0-9]+,\s*[a-z0-9]+$', instruction):
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         tokens = [token for token in re.split(r'[\s,]+', instruction) if token]
-         if len(tokens) != 4:
-             raise ValueError(f"Syntax error in instruction: {instruction}")
-         rd = self.registers.get(tokens[1])
-         rs1 = self.registers.get(tokens[2])
-         rs2 = self.registers.get(tokens[3])
-         if rd is None or rs1 is None or rs2 is None:
-             raise ValueError(f"Unknown register in instruction: {instruction}")
-         opcode = "0110011"
-         funct3 = "000"
-         funct7 = "0000000"
-         return funct7 + rs2 + rs1 + funct3 + rd + opcode
-
-     def from_file(self, input_file="source.s"):
-         with open(input_file, 'r') as file:
-             # Skip blank lines so that a trailing newline does not raise an error
-             instructions = [line.strip() for line in file if line.strip()]
-         binary_codes = [self.encode(instruction) for instruction in instructions]
-         with open("binary.txt", 'w') as out_file:
-             for code in binary_codes:
-                 out_file.write(hex(int(code, 2))[2:].zfill(8) + '\n')
-
- # Example run
- assembler = Assembler()
- assembler.from_file()
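If &a = 0x20000000, &b = 0x20000004, and &c = 0x20000008, write out by hand the machine code of the assembly for the statement c = a + b + 0x6000 (note: the assembly implementation is not unique; any sequence that implements the function is acceptable).
lui a4, 0x20000    # a4 = &a = 0x20000000
lw a5, 0(a4)       # a5 = a
lw a6, 4(a4)       # a6 = b
add a6, a5, a6     # a6 = a + b
lui a7, 0x6        # a7 = 0x6000
add a6, a6, a7     # a6 = a + b + 0x6000
sw a6, 8(a4)       # c = a6
In lui a7, 0x6, the 0x6 stands for the 20-bit value 0x00006; as covered in an earlier lecture, the leading zeros can be omitted. The constant 0x6000 (24576) is outside the 12-bit signed immediate range of addi (-2048 to 2047), so it has to be built with lui, which loads the upper 20 bits. Since the low 12 bits of 0x6000 are zero, lui alone is enough.
The machine code follows. For each instruction, the first line is the assembly statement, the second is the field layout, and the third is the encoded bits. (a4 is register x14; likewise a5, a6, and a7 are x15, x16, and x17.)
lui a4, 0x20000    (load upper immediate)
imm[31:12] rd opcode
0010,0000,0000,0000,0000,0111,0011,0111
lw a5, 0(a4)
imm[11:0] rs1 funct3 rd opcode
0000,0000,0000,0111,0010,0111,1000,0011
Likewise, the machine code of lw a6, 4(a4) is:
0000,0000,0100,0111,0010,1000,0000,0011
add a6, a5, a6    # note that rs1 is a5 and rs2 is a6
funct7 rs2 rs1 funct3 rd opcode
0000,0001,0000,0111,1000,1000,0011,0011
lui a7, 0x6
imm[31:12] rd opcode
0000,0000,0000,0000,0110,1000,1011,0111
add a6, a6, a7
funct7 rs2 rs1 funct3 rd opcode
0000,0001,0001,1000,0000,1000,0011,0011
sw a6, 8(a4)
imm[11:5] rs2 rs1 funct3 imm[4:0] opcode
0000,0001,0000,0111,0010,0100,0010,0011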
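Hand-encoded words like the ones above are easy to get wrong by a bit, so it can help to pack the fields with shifts and compare. The following is a minimal sketch for the U-type and R-type layouts only; the helper names `u_type`, `r_type`, and `grouped` are hypothetical, and the register numbers are passed as plain integers (a4 = 14, a6 = 16, a7 = 17).

```python
def u_type(imm20, rd, opcode):
    """Pack a U-type word: imm[31:12] | rd | opcode."""
    return (imm20 << 12) | (rd << 7) | opcode


def r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack an R-type word: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def grouped(word):
    """Format a 32-bit word as comma-separated 4-bit groups."""
    bits = format(word, "032b")
    return ",".join(bits[i:i + 4] for i in range(0, 32, 4))


print(grouped(u_type(0x20000, 14, 0b0110111)))       # lui a4, 0x20000
print(grouped(r_type(0, 17, 16, 0, 16, 0b0110011)))  # add a6, a6, a7
```

The two printed lines can be compared directly with the bit groups written out for lui a4, 0x20000 and add a6, a6, a7.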