编译原理让我们来构建一个简单的解释器（Let’s Build A Simple Interpreter. Part 5.）（python/c/c++版）（笔记）Lexer词法分析程序

Posted 2021-08-17 Dontla

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了编译原理让我们来构建一个简单的解释器（Let’s Build A Simple Interpreter. Part 5.）（python/c/c++版）（笔记）Lexer词法分析程序相关的知识，希望对你有一定的参考价值。

【编译原理】让我们来构建一个简单的解释器（Let’s Build A Simple Interpreter. Part 5.）

文章目录

你如何处理像理解如何创建解释器或编译器这样复杂的事情？一开始，这一切看起来都像是一团乱七八糟的纱线，您需要将其解开才能获得完美的球。

到达那里的方法是将它解开一根线，一次解开一个结。但是，有时您可能会觉得自己没有立即理解某些内容，但您必须继续前进。如果你足够坚持，它最终会“咔哒”一声，我向你保证（哎呀，如果我每次不明白某事时我就留出 25 美分，我很久以前就会变得富有了：）。

在理解如何创建解释器和编译器的过程中，我能给你的最好建议之一就是阅读文章中的解释，阅读代码，然后自己编写代码，甚至将相同的代码编写几个一段时间后，使材料和代码对您感觉自然，然后才能继续学习新主题。不要着急，放慢脚步，花时间深入了解基本思想。这种方法虽然看似缓慢，但会在未来取得成效。相信我。

你最终会得到完美的毛线球。而且，你知道吗？即使它不是那么完美，它仍然比替代方案更好，即什么都不做，不学习主题或快速浏览并在几天内忘记它。

记住 - 继续解开：一个线程，一次一个结，通过编写代码来练习你学到的东西，很多：
在这里插入图片描述
今天，您将使用从本系列前几篇文章中获得的所有知识，并学习如何解析和解释具有任意数量的加法、减法、乘法和除法运算符的算术表达式。您将编写一个解释器，该解释器将能够计算诸如“14 + 2 * 3 - 6 / 2”之类的表达式。

在深入研究和编写一些代码之前，让我们先谈谈运算符的结合性和优先级。

按照惯例，7 + 3 + 1 与 (7 + 3) + 1 相同，并且 7 - 3 - 1 相当于 (7 - 3) - 1。这里没有意外。我们都在某个时候了解到这一点，并且从那时起就认为这是理所当然的。如果我们将 7 - 3 - 1 视为 7 - (3 - 1)，结果将是意外的 5 而不是预期的 3。

在普通算术和大多数编程语言中，加法、减法、乘法和除法都是左结合的：

7 + 3 + 1 等价于 (7 + 3) + 1
7 - 3 - 1 等价于 (7 - 3) - 1
8 * 4 * 2 相当于 (8 * 4) * 2
8 / 4 / 2 等价于 (8 / 4) / 2

运算符的左关联意味着什么？

当表达式 7 + 3 + 1 中像 3 这样的操作数两边都有加号时，我们需要一个约定来决定哪个运算符适用于 3。它是操作数 3 的左侧还是右侧？运算符 +与左侧相关联，因为两边都有加号的操作数属于其左侧的运算符，因此我们说运算符 + 是左结合的。这就是为什么 7 + 3 + 1 根据结合性约定等价于 (7 + 3) + 1 。

好的，对于像 7 + 5 * 2 这样的表达式，我们在操作数 5 的两边都有不同类型的运算符呢？表达式等价于 7 + (5 * 2) 还是 (7 + 5) * 2？我们如何解决这种歧义？

在这种情况下，结合性约定对我们没有帮助，因为它仅适用于一种运算符，即加法 (+, -) 或乘法 (*, /)。当我们在同一表达式中有不同类型的运算符时，我们需要另一种约定来解决歧义。我们需要一个约定来定义运算符的相对优先级。

它是这样的：我们说如果运算符 * 在 + 之前接受其操作数，那么它具有更高的优先级。在我们所知道和使用的算术中，乘法和除法的优先级高于加法和减法。因此，表达式 7 + 5 * 2 等价于 7 + (5 * 2)，而表达式 7 - 8 / 4 等价于 7 - (8 / 4)。

在我们有一个具有相同优先级的运算符的表达式的情况下，我们只使用结合性约定并从左到右执行运算符：

7 + 3 - 1 等价于 (7 + 3) - 1
8 / 4 * 2 相当于 (8 / 4) * 2

我希望你不会认为我想通过谈论运算符的结合性和优先级来让你厌烦。这些约定的好处是我们可以从一个表中构造算术表达式的语法，该表显示算术运算符的结合性和优先级。然后，我们可以按照我在第 4 部分中概述的准则将语法翻译成代码，我们的解释器将能够处理除结合性之外的运算符的优先级。

好的，这是我们的优先级表：
在这里插入图片描述
从表中，您可以看出运算符 + 和 - 具有相同的优先级，并且它们都是左关联的。您还可以看到运算符 * 和 / 也是左结合的，它们之间具有相同的优先级，但比加法和减法运算符具有更高的优先级。

以下是如何从优先级表构造文法的规则：

1、为每个优先级定义一个非终结符。非终结符的产生式主体应包含该级别的算术运算符和下一个更高优先级的非终结符。（啥意思？？）
2、为基本的表达单位创建一个额外的非终结因子，在我们的例子中是整数。一般规则是，如果您有 N 个优先级，则总共需要 N + 1 个非终结符：每个级别一个非终结符加上一个基本表达式单位的非终结符。
向前！

让我们遵循规则并构建我们的文法。

根据规则1，我们将定义两个非端子：非末端称为EXPR 2级和非末端称为术语为1级，并按照规则2我们将定义一个因子非终端的运算单元的基本表达式，整数。

我们新语法的开始符号将是expr并且expr产生式将包含一个主体，表示使用第 2 级的运算符，在我们的例子中是运算符 + 和 - ，并将包含用于下一个更高级别的术语非终结符优先级，级别 1：
在这里插入图片描述
该术语生产将有一个代表从1级，这是操作员使用操作员的身体*和/，并将包含非末端因子表达，整数的基本单位：

非终端因子的产生将是：

您已经在前面的文章中将上述产生式视为语法和语法图的一部分，但在这里我们将它们组合成一个语法来处理结合性和运算符的优先级：
在这里插入图片描述
这是对应于上述语法的语法图：

图中的每个矩形框都是对另一个图的“方法调用”。如果您采用表达式 7 + 5 * 2 并从最上面的图表expr开始，一直走到最底部的图表factor，您应该能够看到下图中的高优先级运算符 * 和 / 在运算符 + 之前执行和 - 在更高的图表中。

为了推动运算符的优先级，让我们看一下根据上面的语法和语法图完成的相同算术表达式 7 + 5 * 2 的分解。这只是表明优先级较高的运算符在优先级较低的运算符之前执行的另一种方式：
在这里插入图片描述
好的，让我们按照第 4 部分的指南将语法转换为代码，看看我们的新解释器是如何工作的，好吗？

这里又是语法：
在这里插入图片描述
这是一个计算器的完整代码，它可以处理包含整数和任意数量的加法、减法、乘法和除法运算符的有效算术表达式。

以下是与第 4 部分中的代码相比的主要变化：

该词法分析器类现在可以记号化+， - ，*，和/（这里没有什么新，我们只是结合的编码从之前的文章中为一类，它支持所有的标记）
回想一下，在语法中定义的每个规则（产生式）R变成了一个同名的方法，并且对该规则的引用变成了一个方法调用：R()。因此，Interpreter类现在具有三个对应于语法中非终结符的方法：expr、term和factor。

python代码

# -*- coding: utf-8 -*-
"""
@File    : calc5.py
@Time    : 2021/7/19 21:23
@Author  : Dontla
@Email   : sxana@qq.com
@Software: PyCharm
"""
# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, MUL, DIV, EOF = (
    'INTEGER', 'PLUS', 'MINUS', 'MUL', 'DIV', 'EOF'
)


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, MUL, DIV, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', '*', '/', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
            Token(MUL, '*')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Lexer(object):
    def __init__(self, text):
        # client string input, e.g. "3 * 5", "12 / 3 * 4", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Invalid character')

    def advance(self):
        """Advance the `pos` pointer and set the `current_char` variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')

            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')

            if self.current_char == '*':
                self.advance()
                return Token(MUL, '*')

            if self.current_char == '/':
                self.advance()
                return Token(DIV, '/')

            self.error()

        return Token(EOF, None)


class Interpreter(object):
    def __init__(self, lexer):
        self.lexer = lexer
        # set current token to the first token taken from the input
        self.current_token = self.lexer.get_next_token()

    def error(self):
        raise Exception('Invalid syntax')

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            self.error()

    def factor(self):
        """factor : INTEGER"""
        token = self.current_token
        self.eat(INTEGER)
        return token.value

    def term(self):
        """term : factor ((MUL | DIV) factor)*"""
        result = self.factor()

        while self.current_token.type in (MUL, DIV):
            token = self.current_token
            if token.type == MUL:
                self.eat(MUL)
                result = result * self.factor()
            elif token.type == DIV:
                self.eat(DIV)
                result = result / self.factor()

        return result

    def expr(self):
        """Arithmetic expression parser / interpreter.

        calc>  14 + 2 * 3 - 6 / 2
        17

        expr   : term ((PLUS | MINUS) term)*
        term   : factor ((MUL | DIV) factor)*
        factor : INTEGER
        """
        result = self.term()

        while self.current_token.type in (PLUS, MINUS):
            token = self.current_token
            if token.type == PLUS:
                self.eat(PLUS)
                result = result + self.term()
            elif token.type == MINUS:
                self.eat(MINUS)
                result = result - self.term()

        return result


def main():
    while True:
        try:
            # To run under Python3 replace 'raw_input' call
            # with 'input'
            # text = raw_input('calc> ')
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        lexer = Lexer(text)
        interpreter = Interpreter(lexer)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

运行结果：

D:\\python_virtualenv\\my_flask\\Scripts\\python.exe C:/Users/Administrator/Desktop/编译原理/python/calc5.py
calc>  3+ 3 *  1 - 4 /2 +1
5.0
calc>

C语言代码

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <string.h>
#include<math.h>

#define flag_digital 0
#define flag_plus 1
#define flag_minus 2
#define flag_multiply 3
#define flag_divide 4

#define flag_EOF 5

struct Token
{
	int type;
	int value;
};

struct Lexer
{
	char* text;
	int pos;
};

struct Interpreter
{
	struct Lexer* lexer;
	struct Token current_token;
};



void error() {
	printf("输入非法！\\n");
	exit(-1);
}

void skip_whitespace(struct Lexer* le) {
	while (le->text[le->pos] == ' ') {
		le->pos++;
	}
}

//判断Interpreter中当前pos是不是数字
int is_integer(char c) {
	if (c >= '0' && c <= '9')
		return 1;
	else
		return 0;
}

void advance(struct Lexer* le) {
	le->pos++;
}

char current_char(struct Lexer* le) {
	return(le->text[le->pos]);
}

//获取数字token的数值（把数字字符数组转换为数字）
int integer(struct Lexer* le) {
	char temp[20];
	int i = 0;
	while (is_integer(le->text[le->pos])) {
		temp[i] = le->text[le->pos];
		i++;
		advance(le);
	}
	int result = 0;
	int j = 0;
	int len = i;
	while (j < len) {
		result += (temp[j] - '0') * pow(10, len - j - 1);
		j++;
	}
	return result;
}

void get_next_token(struct Interpreter* pipt) {
	//先跳空格，再判断有没有结束符
	if (current_char(pipt->lexer) == ' ')
		skip_whitespace(pipt->lexer);
	if (pipt->lexer->pos > (strlen(pipt->lexer->text) - 1)) {
		pipt->current_token = { flag_EOF, NULL };
		return;
	}
	char current = current_char(pipt->lexer);
	if (is_integer(current)) {
		pipt->current_token = { flag_digital, integer(pipt->lexer)};
		return;
	}
	if (current == '+') {
		pipt->current_token = { flag_plus, NULL };
		pipt->lexer->pos++;
		return;
	}
	if (current == '-') {
		pipt->current_token = { flag_minus, NULL };
		pipt->lexer->pos++;
		return;
	}
	if (current == '*') {
		pipt->current_token = { flag_multiply, NULL };
		pipt->lexer->pos++;
		return;
	}
	if (current == '/') {
		pipt->current_token = { flag_divide, NULL };
		pipt->lexer->pos++;
		return;
	}
	error();//如果都不是以上的字符，则报错并退出程序
}



int eat(struct Interpreter* pipt, int type) {
	int current_token_value = pipt->current_token.value;
	if (pipt->current_token.type == type) {
		get_next_token(pipt);
		return current_token_value;
	}
	else {
		error();
	}
}

int factor(struct Interpreter* pipt) {
	return eat(pipt, flag_digital);
}

//判断乘除
int term(struct Interpreter* pipt) {
	int result = factor(pipt);
	while (true) {
		int token_type = pipt->current_token.type;
		if (token_type == flag_multiply) {
			eat(pipt, flag_multiply);
			result = result * factor(pipt);
		}
		else if (token_type == flag_divide) {
			eat(pipt, flag_divide);
			result = result / factor(pipt);
		}
		else {
			return result;
		}
	}
}
int expr(struct Interpreter* pipt) {
	get_next_token(pipt);
	int result = term(pipt);
	while (true) {
		int token_type = pipt->current_token.type;
		if (token_type == flag_plus) {
			eat(pipt, flag_plus);
			result = result + term(pipt);
		}else if (以上是关于编译原理让我们来构建一个简单的解释器（Let’s Build A Simple Interpreter. Part 5.）（python/c/c++版）（笔记）Lexer词法分析程序的主要内容，如果未能解决你的问题，请参考以下文章