[Compiler Principles] Let’s Build A Simple Interpreter. Part 1. (Python/C/C++ versions) (notes)

Posted 2021-08-03 Dontla


Original article: Let’s Build A Simple Interpreter. Part 1.

Contents

- [Compiler Principles] Let’s Build A Simple Interpreter. Part 1. (Python/C/C++ versions) (part 1)
- - The process of breaking the input string into tokens is called lexical analysis; the lexical analyzer (lexer) is also called a scanner or tokenizer.
  - C/C++ implementation

[Compiler Principles] Let’s Build A Simple Interpreter. Part 1. (Python/C/C++ versions) (part 1)

The Pascal code below is the kind of program we want to build an interpreter for:

program factorial;

function factorial(n: integer): longint;
begin
    if n = 0 then
        factorial := 1
    else
        factorial := n * factorial(n - 1);
end;

var
    n: integer;

begin
    for n := 0 to 16 do
        writeln(n, '! = ', factorial(n));
end.

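To start, let's implement one small feature of the Pascal interpreter: adding two single-digit integers. We'll use Python here, but any other language would work just as well. Here is the code: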

# -*- coding: utf-8 -*-
"""
@File    : 1.py
@Time    : 2021/5/19 14:44
@Author  : Dontla
@Email   : sxana@qq.com
@Software: PyCharm
"""
# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


class Token(object):
    def __init__(self, type_, value):
        # token type: INTEGER, PLUS, or EOF
        self.type = type_
        # token value: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, '+', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3+5"
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None

    def error(self):
        raise Exception('Error parsing input')

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        text = self.text

        # is self.pos index past the end of the self.text ?
        # if so, then return EOF token because there is no more
        # input left to convert into tokens
        if self.pos > len(text) - 1:
            return Token(EOF, None)

        # get a character at the position self.pos and decide
        # what token to create based on the single character
        current_char = text[self.pos]

        # if the character is a digit then convert it to
        # integer, create an INTEGER token, increment self.pos
        # index to point to the next character after the digit,
        # and return the INTEGER token
        if current_char.isdigit():  # True only when the character is a digit
            token = Token(INTEGER, int(current_char))
            self.pos += 1
            return token

        if current_char == '+':
            token = Token(PLUS, current_char)
            self.pos += 1
            return token

        self.error()

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """expr -> INTEGER PLUS INTEGER"""
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be a single-digit integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be a '+' token
        op = self.current_token
        self.eat(PLUS)

        # we expect the current token to be a single-digit integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call the self.current_token is set to
        # EOF token

        # at this point INTEGER PLUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding two integers, thus
        # effectively interpreting client input
        result = left.value + right.value
        return result


def main():
    while True:
        try:
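            # on Python 2, call raw_input('calc> ') instead of input
            text = input('calc> ')  # read a line from the keyboard; the argument is the prompt
        except EOFError:  # raised when input() hits end-of-file (Ctrl+D, or Ctrl+Z then Enter on Windows)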
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

Output:

D:\python_virtualenv\my_flask\Scripts\python.exe C:/Users/Administrator/Desktop/新建文件夹/1.py
calc> 1+2
3
calc>

The process of breaking the input string into tokens is called lexical analysis; the lexical analyzer (lexer) is also called a scanner or tokenizer.

>>> from calc1 import Interpreter
>>>
>>> interpreter = Interpreter('3+5')
>>> interpreter.get_next_token()
Token(INTEGER, 3)
>>>
>>> interpreter.get_next_token()
Token(PLUS, '+')
>>>
>>> interpreter.get_next_token()
Token(INTEGER, 5)
>>>
>>> interpreter.get_next_token()
Token(EOF, None)
>>>
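
The session above pulls tokens out one call at a time. As a rough standalone sketch of the same idea, the function below collects every token of an input string into a list. The `tokenize` helper and the `(type, value)` tuple representation are assumptions made for illustration; they are not part of the calc1 code.

```python
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


def tokenize(text):
    """Return a list of (type, value) pairs for single digits and '+'."""
    tokens = []
    for char in text:
        if char.isdigit():
            # a digit becomes an INTEGER token holding its numeric value
            tokens.append((INTEGER, int(char)))
        elif char == '+':
            tokens.append((PLUS, char))
        else:
            # anything else (spaces included) is rejected
            raise Exception('Error parsing input')
    # mark the end of the input, as get_next_token does with EOF
    tokens.append((EOF, None))
    return tokens


print(tokenize('3+5'))
# [('INTEGER', 3), ('PLUS', '+'), ('INTEGER', 5), ('EOF', None)]
```

Like `get_next_token`, this sketch treats every digit as its own token, so multi-digit numbers are not recognised as a single integer.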

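Let's recap how the interpreter evaluates an arithmetic expression:

The interpreter accepts an input string, say "3+5".
The interpreter calls the expr method to find a structure in the stream of tokens returned by the lexical analyzer get_next_token. The structure it looks for has the form INTEGER PLUS INTEGER. Once that structure is confirmed, it interprets the input by adding the values of the two INTEGER tokens, because at that point it is clear that all it needs to do is add the two integers 3 and 5.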
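
The two steps of the recap can be sketched in a few lines that work on an already tokenized input: first confirm the INTEGER PLUS INTEGER shape, then add the values. The `evaluate` function and the list of `(type, value)` tuples are hypothetical names chosen for this sketch, not the structure used by the `Interpreter` class.

```python
INTEGER, PLUS, EOF = 'INTEGER', 'PLUS', 'EOF'


def evaluate(tokens):
    """Interpret a token list of the form INTEGER PLUS INTEGER."""
    # step 1: recognise the structure by comparing the first three token types
    types = [token_type for token_type, _ in tokens[:3]]
    if types != [INTEGER, PLUS, INTEGER]:
        raise Exception('Error parsing input')
    # step 2: the structure is confirmed, so add the two integer values
    left, right = tokens[0][1], tokens[2][1]
    return left + right


tokens = [(INTEGER, 3), (PLUS, '+'), (INTEGER, 5), (EOF, None)]
print(evaluate(tokens))  # 8
```

The `expr` method does the same check incrementally: each call to `eat` verifies one token type and then fetches the next token.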

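Check your understanding:
What is an interpreter?
What is a compiler?
What is the difference between an interpreter and a compiler?
What is a token?
What is the name of the process that breaks input apart into tokens?
What is the part of the interpreter that does lexical analysis called?
What are the other common names for that part of an interpreter or a compiler?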

C/C++ implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Token types (EOF itself is a macro in <stdio.h>, hence the prefix)
enum { TOKEN_INTEGER = 1, TOKEN_PLUS = 2, TOKEN_EOF = 3 };

struct Token
{
	int type;
	char value;
};

struct Interpreter
{
	char* text;
	int pos;
	struct Token (*get_next_token)(struct Interpreter*);
};

// Lexer: returns the next token found in pipt->text
struct Token get_next_token(struct Interpreter* pipt) {
	// past the end of the input: return the EOF token
	if ((size_t)pipt->pos >= strlen(pipt->text)) {
		struct Token token = { TOKEN_EOF, '\0' };
		return token;
	}
	char current_char = pipt->text[pipt->pos];
	if (current_char >= '0' && current_char <= '9') {
		struct Token token = { TOKEN_INTEGER, current_char };
		pipt->pos++;
		return token;
	}
	if (current_char == '+') {
		struct Token token = { TOKEN_PLUS, current_char };
		pipt->pos++;
		return token;
	}
	printf("Invalid input!\n");
	exit(EXIT_FAILURE); // any other character is an error, so quit
}

// If the current token has the expected type, advance to the next token
// and return the value of the token just consumed; otherwise quit.
char eat(struct Token* pcurrent_token, struct Interpreter* pipt, int type) {
	char former_token_value = pcurrent_token->value;
	if (pcurrent_token->type == type) {
		*pcurrent_token = pipt->get_next_token(pipt);
	}
	else {
		printf("Invalid input!\n");
		exit(EXIT_FAILURE);
	}
	return former_token_value;
}

// expr -> INTEGER PLUS INTEGER
int expr(char* text) {
	struct Interpreter ipt = { text, 0, get_next_token };
	struct Token current_token = ipt.get_next_token(&ipt);
	char temp;
	temp = eat(&current_token, &ipt, TOKEN_INTEGER); // the first character must be a digit
	int left = temp - '0';
	eat(&current_token, &ipt, TOKEN_PLUS); // the second character must be a plus sign
	temp = eat(&current_token, &ipt, TOKEN_INTEGER); // the third character must be a digit
	int right = temp - '0';
	int result = left + right;
	return result;
}

int main() {
	char text[10];
	while (1)
	{
		printf("Enter an expression:\n");
		// %9s leaves room for the terminating '\0' in text[10]
		if (scanf("%9s", text) != 1) {
			break; // stop at end of input
		}
		int result = expr(text);
		printf("= %d\n\n", result);
	}
	return 0;
}

Output:

Enter an expression:
2+8
= 10

Enter an expression:
1+5
= 6

Enter an expression:
3+4
= 7

Enter an expression:
