A Sentence Generation Algorithm Based on Context-Free Grammars

Posted 2020-11-19 zhenlingcn


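Preface

The algorithm comes from a blog post by an expert developer abroad.
It involves no artificial-intelligence techniques; it is simply an approach to generating sentences from a context-free grammar.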

Context-Free Grammars

Basic Implementation

import random
from collections import defaultdict

class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)  # missing keys default to an empty list

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')
        """
        prods = rhs.split('|')  # split the alternatives on '|'
        for prod in prods:
            self.prod[lhs].append(tuple(prod.split()))  # split() splits on whitespace; the resulting tuple is stored as one production

    def gen_random(self, symbol):
        """ Generate a random sentence from the
            grammar, starting with the given
            symbol.
        """
        sentence = ''

        # select one production of this symbol randomly
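        rand_prod = random.choice(self.prod[symbol])  # pick one of this symbol's productions uniformly at random

        for sym in rand_prod:       # iterate over the symbols of the chosen production
            # for non-terminals, recurse
            if sym in self.prod:        # not a concrete word but a grammatical category, so recurse to expand it
                sentence += self.gen_random(sym)
            else:
                sentence += sym + ' '       # already a concrete word, so append it to the sentence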

        return sentence

cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')

for i in range(10):
    print(cfg1.gen_random('S'))

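The listing above is a basic Python implementation: starting from a symbol, it picks a production at random and recursively expands every non-terminal until only words remain.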
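
To make the stored grammar concrete, the short sketch below mirrors what add_prod does to a right-hand-side string and then separates non-terminals from terminals. The helper names parse_rhs and terminals_of, and the mini grammar itself, are illustrative assumptions and are not part of the original code.

```python
from collections import defaultdict


def parse_rhs(rhs):
    """Split 'A B | c' into [('A', 'B'), ('c',)], as add_prod does."""
    return [tuple(alt.split()) for alt in rhs.split('|')]


def terminals_of(prod):
    """Symbols that never appear as a left-hand side are terminals."""
    return {sym
            for alternatives in prod.values()
            for alt in alternatives
            for sym in alt
            if sym not in prod}


# Hypothetical mini grammar, reusing symbols from the listing above.
prod = defaultdict(list)
prod['S'] += parse_rhs('NP VP')
prod['NP'] += parse_rhs('Det N | Joe')
prod['VP'] += parse_rhs('V NP')
prod['Det'] += parse_rhs('a | the')
prod['N'] += parse_rhs('cat')
prod['V'] += parse_rhs('followed')

print(prod['NP'])                  # [('Det', 'N'), ('Joe',)]
print(sorted(terminals_of(prod)))  # ['Joe', 'a', 'cat', 'followed', 'the']
```

One consequence of this representation: a right-hand side such as 'Det N | Det N' stores the same production twice, so random.choice picks it twice as often as any single alternative.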

The Non-Termination Problem with Context-Free Grammars
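
The problem shows up as soon as a grammar is recursive enough. As a rough illustration, the sketch below runs the naive uniform-choice strategy on the arithmetic-expression grammar that appears later in this post and counts how often a derivation exceeds a depth cap. The function naive_expand, the DepthExceeded exception and the cap of 200 are assumptions made for this demonstration only; they are not from the original post.

```python
import random


class DepthExceeded(Exception):
    """Raised when a derivation nests deeper than the allowed cap."""


# Same productions as the expression grammar (cfg2) defined later.
GRAMMAR = {
    'EXPR': [('TERM', '+', 'EXPR'), ('TERM', '-', 'EXPR'), ('TERM',)],
    'TERM': [('FACTOR', '*', 'TERM'), ('FACTOR', '/', 'TERM'), ('FACTOR',)],
    'FACTOR': [('ID',), ('NUM',), ('(', 'EXPR', ')')],
    'ID': [('x',), ('y',), ('z',), ('w',)],
    'NUM': [(str(d),) for d in range(10)],
}


def naive_expand(symbol, depth=0, max_depth=200):
    """Expand `symbol` with uniformly random productions (no damping)."""
    if depth > max_depth:
        raise DepthExceeded(symbol)
    if symbol not in GRAMMAR:          # terminal: emit it unchanged
        return [symbol]
    tokens = []
    for sym in random.choice(GRAMMAR[symbol]):
        tokens.extend(naive_expand(sym, depth + 1, max_depth))
    return tokens


trials, runaway = 200, 0
for _ in range(trials):
    try:
        naive_expand('EXPR')
    except DepthExceeded:
        runaway += 1

print(f'{runaway} of {trials} derivations exceeded the depth cap')
```

Because each recursive production is as likely to be chosen as the non-recursive one, the derivation tree tends to grow faster than it closes, and a sizeable share of runs never finishes on its own. Without an explicit cap, the recursive gen_random method would surface this as a RecursionError.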

Solving the Non-Termination Problem

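The non-termination problem can be tackled with a probabilistic generation algorithm.
[Figure from the original author's post: a derivation tree in which the TERM - EXPR production has already been used by an ancestor node]
In the figure, taken from the original post, an ancestor of the current node has already used the TERM - EXPR production, so the chance of choosing that production again is reduced accordingly. The reduction factor in the figure is 0.5: each time a production is used on a branch, its weight for the next choice on that branch drops to 50% of its previous value.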
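
As a small worked example of this damping rule, the sketch below turns usage counts into selection probabilities for the three EXPR productions. It assumes the factor 0.5 from the figure and assumes that only TERM - EXPR has been used once on the current branch. The function name damped_probabilities is hypothetical.

```python
def damped_probabilities(productions, used_counts, cfactor=0.5):
    """Weight each production by cfactor ** (times used on this branch),
    then normalise the weights into probabilities."""
    weights = [cfactor ** used_counts.get(p, 0) for p in productions]
    total = sum(weights)
    return [w / total for w in weights]


expr_productions = [
    ('TERM', '+', 'EXPR'),
    ('TERM', '-', 'EXPR'),
    ('TERM',),
]

# Assumed state: an ancestor already expanded EXPR -> TERM - EXPR once.
used = {('TERM', '-', 'EXPR'): 1}

for prod, p in zip(expr_productions,
                   damped_probabilities(expr_productions, used)):
    print(' '.join(prod), round(p, 3))
# TERM + EXPR 0.4
# TERM - EXPR 0.2
# TERM 0.4
```

The factor applies to the weight, not directly to the probability: the weight of TERM - EXPR halves from 1.0 to 0.5, while its normalised probability falls from 1/3 to 1/5 because the other productions keep their full weight.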
The algorithm described above can be implemented as follows:

import random
from collections import defaultdict


# Weighted random selection: returns index i with probability weights[i] / sum(weights)
def weighted_choice(weights):
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i


class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)  # missing keys default to an empty list

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')
        """
        prods = rhs.split('|')  # split the alternatives on '|'
        for prod in prods:
            self.prod[lhs].append(tuple(prod.split()))  # split() splits on whitespace; the resulting tuple is stored as one production

    def gen_random_convergent(self,
                              symbol,
                              cfactor=0.25,
                              pcount=None
                              ):
        """ Generate a random sentence from the
            grammar, starting with the given symbol.

            Uses a convergent algorithm - productions
            that have already appeared in the
            derivation on each branch have a smaller
            chance to be selected.

            cfactor - controls how tight the
            convergence is. 0 < cfactor < 1.0

            pcount is used internally by the
            recursive calls to pass on the
            productions that have been used in the
            branch.
        """
        # Create the shared counter on the top-level call; a mutable
        # default argument would be shared across unrelated calls.
        if pcount is None:
            pcount = defaultdict(int)

        sentence = ''

        # The possible productions of this symbol are weighted
        # by their appearance in the branch that has led to this
        # symbol in the derivation
        #
        weights = []
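        for prod in self.prod[symbol]:  # compute a weight for every production of this symbol
            if prod in pcount:
                weights.append(cfactor ** (pcount[prod]))  # productions already used by an ancestor get their weight reduced by the factor
            else:
                weights.append(1.0)  # productions not yet seen keep full weight

        rand_prod = self.prod[symbol][weighted_choice(weights)]  # pick a production according to the weights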

        # pcount is a single object (created in the first call to
        # this method) that's being passed around into recursive
        # calls to count how many times productions have been
        # used.
        # Before recursive calls the count is updated, and after
        # the sentence for this call is ready, it is rolled-back
        # to avoid modifying the parent's pcount.
        #
        pcount[rand_prod] += 1

        for sym in rand_prod:
            # for non-terminals, recurse
            if sym in self.prod:  # not a concrete word (non-terminal), so expand it recursively
                sentence += self.gen_random_convergent(
                    sym,
                    cfactor=cfactor,
                    pcount=pcount)
            else:
                sentence += sym + ' '  # a concrete word (terminal), so append it to the sentence

        # backtracking: clear the modification to pcount
        pcount[rand_prod] -= 1  # pcount is shared by reference, so restore its previous state
        return sentence


cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')

for i in range(10):
    print(cfg1.gen_random_convergent('S'))

cfg2 = CFG()
cfg2.add_prod('EXPR', 'TERM + EXPR')
cfg2.add_prod('EXPR', 'TERM - EXPR')
cfg2.add_prod('EXPR', 'TERM')
cfg2.add_prod('TERM', 'FACTOR * TERM')
cfg2.add_prod('TERM', 'FACTOR / TERM')
cfg2.add_prod('TERM', 'FACTOR')
cfg2.add_prod('FACTOR', 'ID | NUM | ( EXPR )')
cfg2.add_prod('ID', 'x | y | z | w')
cfg2.add_prod('NUM', '0|1|2|3|4|5|6|7|8|9')
for i in range(10):
    print(cfg2.gen_random_convergent('EXPR'))
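
On Python 3.6 and later, the standard library's random.choices accepts a weights argument and can stand in for the hand-written weighted_choice helper. The sketch below is one possible substitution, not something the original post uses; the wrapper name weighted_index is made up for illustration.

```python
import random


def weighted_index(weights):
    """Return an index i with probability weights[i] / sum(weights)."""
    # random.choices returns a list of k picks; k=1 gives a one-item list.
    return random.choices(range(len(weights)), weights=weights, k=1)[0]


# Quick sanity check: index 1 carries 80% of the total weight.
counts = [0, 0, 0]
for _ in range(10_000):
    counts[weighted_index([1.0, 8.0, 1.0])] += 1
print(counts)  # index 1 should dominate
```

Swapping it in would only change the selection line to use weighted_index(weights); the damping logic around it stays the same.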

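Summary

With recursion, generating sentences from a context-free grammar is easy to implement. Note, however, that the naive algorithm may fail to terminate on recursive grammars. The probability-based generation algorithm addresses this by lowering the weight of productions already used on the current branch, which makes the derivation converge.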
