Fixing words with spaces using a dictionary lookup in Python?

Posted 2023-03-13


【Title】: fixing words with spaces using a dictionary look up in python? 【Posted】: 2013-11-09 14:30:48 【Question】:

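I have extracted a list of sentences from a document. I am preprocessing this list of sentences to make it more sensible, and I am facing the following problem.

I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "

I would like to correct such sentences using a lookup dictionary, to remove the unwanted spaces.

The final output should be "more recently the development, which is a potent "

I would assume that this is a straightforward task in text preprocessing? I need some pointers on where to look for such approaches. Thanks.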

【Question comments】:

【Answer 1】:

My index.py file looks like this:

from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))

My index.php file looks like this:

<html>

<head>
  <title>py script</title>
</head>

<body>
  <h1>Hey There!Python Working Successfully In A PHP Page.</h1>
  <?php
    $python = `python index.py`;
    echo $python;
    ?>
</body>

</html>

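I hope this works.

【Comments】:

【Answer 2】:

Take a look at word or text segmentation. The problem is to find the most probable split of a string into a sequence of words. Example: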

 thequickbrownfoxjumpsoverthelazydog

The most probable segmentation should of course be:

 the quick brown fox jumps over the lazy dog

Here is an article that includes prototype source code for the problem, using the Google Ngram corpus:

http://jeremykun.com/2012/01/15/word-segmentation/

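The key to making this algorithm work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here: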

https://gist.github.com/miku/7279824

Example usage:

$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']

Given suitable data, even this input can be regrouped into words:

$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']

Note that the algorithm is quite slow; it is only a prototype.

Another approach, using NLTK:

http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/

As for your problem, you could concatenate all the string parts you have into a single string and run a segmentation algorithm on it.
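As a rough illustration of that idea, the sketch below joins the fragments and picks the split with the highest unigram probability. It is not the code from the gist or the article: the `COUNTS` table is a made-up toy stand-in for a real word-frequency list, the penalty for unknown words is an assumed heuristic, and the comma from the question's sentence is left out to keep the example short.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real run would load a large word-frequency list.
COUNTS = {"more": 50, "recently": 20, "the": 500, "development": 15,
          "which": 80, "is": 300, "a": 400, "potent": 5, "he": 60, "i": 90}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)

def log_prob(word):
    """Log probability of a word; unknown words are penalised by their length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidate = (log_prob(head) + tail_score, (head,) + tail_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

fragments = "more recen t ly the develop ment wh ich is a po ten t".split()
score, words = segment("".join(fragments))
print(" ".join(words))
```

The memoisation via `lru_cache` keeps the number of distinct subproblems linear in the length of the string, which is what makes trying every split affordable on short inputs.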

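【Comments】:

But how does it work when a sentence can be split in more than one valid way? "The pen is mightier than the sword"

Elegant approach, but throwing away all the spaces turns this into a much harder problem. The OP's description ("remove the unwanted spaces") suggests that spaces are never missing; if that is correct, you should never look for a word break inside a fragment.

@alexis, you're right. I guess performance could be improved by at least an order of magnitude by calculating the probabilities of the various concatenations only, rather than of all splits. I might come back later and rework my answer.

@miku: Can you share how you got the data you use in your Python file, e.g. count_10M_gb.txt and count_1M_gb.txt.gz? I found another file, count_1w.txt, on Peter Norvig's website. Thanks.

@miku: By the way, your NLTK link is malware. You may want to remove it from the answer.

【Answer 3】:

You can iterate through a dictionary to find the best-fitting word. When no match is found, join the fragment with the ones that follow it.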

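def is_word(candidate, dictionary):
    # Ignore surrounding punctuation and case when looking a candidate up.
    return candidate.strip(",.;:!?").lower() in dictionary

def fix_spaces(sentence, dictionary):
    fragments = sentence.split()
    finished_sentence = []
    i = 0
    while i < len(fragments):
        word = fragments[i]
        j = i + 1
        # No match yet: keep adding the following fragments to the word.
        while not is_word(word, dictionary) and j < len(fragments):
            word += fragments[j]
            j += 1
        if is_word(word, dictionary):
            finished_sentence.append(word)
            i = j
        else:
            # Nothing matched, so keep the fragment as it was.
            finished_sentence.append(fragments[i])
            i += 1
    return " ".join(finished_sentence)

# Hypothetical file name: a word list with one English word per line.
with open("words.txt") as word_file:
    dictionary = set(line.strip().lower() for line in word_file)

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
print(fix_spaces(sentence, dictionary))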

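This should work. For the dictionary variable, download a txt file containing every English word and open it in your program.

【Comments】:

【Answer 4】:

I would recommend stripping out the spaces and looking for dictionary words to break the text into. There are a few things you can do to make this more accurate. To find the first word in text with no spaces, take the entire string and go through the dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), trying the longest ones first and then removing letters from the end of the string you want to segment. If you want it to work on a long string, you can make it automatically cut letters off the back, so that the string you search for the first word is only as long as the longest dictionary word. This should lead you to find the longest words first, and makes it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that reads the text to correct from standard input and uses a dictionary file called dictionary.txt: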

# Load the word list into a set so that every lookup is fast and repeatable.
with open("dictionary.txt", "r") as word_file:
    dictionary = set(line.strip().lower() for line in word_file)

words = input("enter text to correct spaces on: ")
words = "".join(words.split())       # removes every space, not only those at the ends
spaced = []                          # this is the list of newly broken up words
MAX_WORD_LENGTH = 45                 # the longest possible number of letters in a word

while words:                         # ends once all of the text has been broken into words
    # Go through the possible word lengths, starting from the biggest;
    # each iteration removes one letter from the back of the candidate.
    for length in range(min(MAX_WORD_LENGTH, len(words)), 0, -1):
        word = words[:length]
        if word.lower() in dictionary:
            break
    # If nothing matched, word is a single character, so the loop still advances.
    spaced.append(word)
    words = words[len(word):]        # takes the word away from the front of the text
print(" ".join(spaced))              # prints the output

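If you want it to be even more accurate, you could try a natural language parsing program; several are freely available online for Python.

【Comments】:

【Answer 5】:

--Solution #1:

Let's think of these chunks in your sentence as beads on an abacus, each bead consisting of a partial string. The beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments. In the current case, the beads would be: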

(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)

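This solves 2 subproblems:

a) A bead is a single unit, so we do not care about permutations within a bead, i.e. permutations of "more" are not possible.

b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always come before "recen" and so on.

Now generate all the permutations of these beads, which gives output like: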

morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent

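Then score these permutations by how many words from your relevant dictionary they contain, so that the most correct results can easily be filtered out. more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t

Code that does the bead permutation part: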

import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]

    # Join the first two beads, once without and once with a space between them.
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres

    # Combine both prefixes with every permutation of the remaining beads.
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res



broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))

Demo: http://ideone.com/pt4PSt
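The scoring step described above is not part of the posted code. A minimal sketch of it could look like the following, assuming a tiny hand-made `DICTIONARY` set in place of a real word list and using `itertools.product` to enumerate the spacings instead of the recursive generator, purely for brevity. The helper names `spacings` and `score` are illustrative.

```python
from itertools import product

# Hypothetical miniature dictionary, just for illustration.
DICTIONARY = {"more", "recently", "the", "development", "which", "is", "a", "potent"}

def spacings(frags):
    """Yield every way of keeping or dropping the space between adjacent fragments."""
    for seps in product(["", " "], repeat=len(frags) - 1):
        pieces = [frags[0]]
        for sep, frag in zip(seps, frags[1:]):
            pieces.append(sep + frag)
        yield "".join(pieces)

def score(candidate):
    """Number of tokens that are dictionary words, ignoring edge punctuation."""
    return sum(1 for token in candidate.split()
               if token.strip(",.;:!?").lower() in DICTIONARY)

frags = "more recen t ly the develop ment, wh ich is a po ten t".split()
best = max(spacings(frags), key=score)
print(best)
```

With n fragments there are 2^(n-1) candidates (8192 for this sentence), so exhaustive scoring is only practical for short inputs.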

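--Solution #2:

I would suggest an alternative approach that makes use of text-analysis intelligence already developed by people working on similar problems, who have worked with large data corpora that depend on dictionaries and grammar: search engines, for example.

I am not well aware of such public or paid APIs, so my example is based on Google search results.

Let's try to use Google:

You can keep submitting your invalid terms to Google over multiple passes, and keep scoring the results against your lookup dictionary. Here are two relevant outputs from using 2 passes of your text:

[screenshot not preserved]

This output is used for the second pass:

[screenshot not preserved]

This gives you the conversion "more recently the development, which is a potent".

To validate the conversion, you will have to use some similarity algorithm and scoring to filter out invalid or not-so-good results.

One crude technique could be to compare the normalized strings using difflib.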

>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
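The same check can be wrapped in a small filter over several candidate corrections. This is only a sketch: the second candidate and the `THRESHOLD` value are invented for illustration, and the helper names are not from the answer.

```python
import difflib
import re

def normalized(text):
    # Drop every run of non-word characters (spaces, punctuation), then lowercase.
    return re.sub(r"\W+", "", text).lower()

def similarity(original, candidate):
    """Similarity ratio of the two strings after normalization."""
    return difflib.SequenceMatcher(
        None, normalized(original), normalized(candidate)).ratio()

broken = "more recen t ly the develop ment, wh ich is a po ten t "
candidates = [
    "more recently the development, which is a potent ",
    "more recent developments, which are potent ",   # made-up bad suggestion
]
THRESHOLD = 0.95  # assumed cut-off; tune it for your data
accepted = [c for c in candidates if similarity(broken, c) >= THRESHOLD]
print(accepted)
```

A correction that only moves spaces leaves the normalized string unchanged and scores 1.0, while a suggestion that rewrites words falls below the threshold.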

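【Comments】:

The bottleneck will be the maximum of 100 queries that can be sent to the free Google API =)

【Answer 6】:

Your goal is to improve the text, not necessarily to make it perfect, so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: start with the first fragment and keep sticking pieces onto it as long as the result is in the dictionary; if the result is not, emit what you have so far and start over with the next fragment. Yes, occasionally you will make a mistake in cases like the me thod, so if you will be using this a lot, you could look for something more sophisticated. However, it is probably good enough.

What you mainly need is a large dictionary. If you will be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out whether a fragment is the start of a real word. nltk provides a Trie implementation.

Since this kind of spurious word break is inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, and now it is broken up.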
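A minimal sketch of the greedy idea with a prefix tree might look like this. It uses a hand-rolled nested-dict trie rather than the nltk one, the word list is a tiny made-up sample, and the join rule (keep gluing while the result is still the start of a real word) is one possible reading of the approach, not a definitive implementation.

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def lookup(trie, text):
    """Return (is_prefix, is_word) for text."""
    node = trie
    for ch in text:
        if ch not in node:
            return False, False
        node = node[ch]
    return True, None in node

def greedy_join(fragments, trie):
    out, current = [], ""
    for frag in fragments:
        joined = (current + frag).strip(",.;:!?").lower()
        is_prefix, _ = lookup(trie, joined)
        if current and is_prefix:
            current += frag          # still the start of a real word: keep gluing
        else:
            if current:
                out.append(current)  # emit what we have and start over
            current = frag
    if current:
        out.append(current)
    return " ".join(out)

# Hypothetical miniature word list, just for illustration.
trie = build_trie(["more", "recently", "the", "development",
                   "which", "is", "a", "potent"])
fragments = "more recen t ly the develop ment, wh ich is a po ten t".split()
result = greedy_join(fragments, trie)
print(result)
```

With a full-size dictionary the prefix test becomes more permissive (for instance "a" followed by "po" is the start of many real words), so some wrong joins of that kind should be expected.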

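【Comments】:

A trie is a good solution here, because you can check whether the t after recen is used in one of the child nodes (it is), so you can merge the "skip the space" and "find a possible word" algorithms.

【Answer 7】:

Here is something really basic: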

chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print(joined, end=' ')
        del chunks[:]

# deal with left overs
if chunks:
    print(''.join(chunks))

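I am assuming that you have a set of valid words available somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here is one way to do that: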

def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words

【Comments】:
