Chunking sentences using the word 'but' with RegEx


【Title】: Chunking sentences using the word 'but' with RegEx 【Posted】: 2019-01-31 12:37:46 【Question】:

I am trying to chunk a sentence at the word 'but' (or any other coordinating conjunction) using RegEx. It's not working...

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: .*<CC>.*')
for subtree in DigDug.parse(sentence).subtrees(): 
    if subtree.label() == 'CHUNK': print(subtree.node())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two parts:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I would also like the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.


【Answer 1】:

I think you can simply do

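import re
# re.split needs the raw string, not the POS-tagged token list from the question
sentence = "There are no large collections present but there is spinal canal stenosis."
result = re.split(r"\s+(?:but|and)\s+", sentence)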

where

`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Match the regular expression below, do not capture
            Match either the regular expression below (attempting the next alternative only if this one fails)
  `but`     Match the characters "but" literally
  `|`       Or match regular expression number 2 below (the entire group fails if this one fails to match)
  `and`     Match the characters "and" literally
`)`         End of the non-capturing group
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)

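You can add more conjunctions there, separated by the pipe character |. Just make sure the words do not contain characters that have a special meaning in a regex. If in doubt, escape them first with re.escape(word).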
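The advice above can be sketched as a small helper that builds the alternation from a list of words and escapes each one first. The function name `split_on_conjunctions`, the default word list, and the case-insensitive flag are assumptions made for illustration, not part of the original answer.

```python
import re

def split_on_conjunctions(text, conjunctions=("but", "and", "or")):
    """Split text at whitespace-delimited conjunctions (hypothetical helper)."""
    # Escape each word so regex metacharacters in it are matched literally.
    alternation = "|".join(re.escape(word) for word in conjunctions)
    # Same shape as the pattern above: whitespace, one of the words, whitespace.
    pattern = re.compile(r"\s+(?:" + alternation + r")\s+", flags=re.IGNORECASE)
    return pattern.split(text)

parts = split_on_conjunctions(
    "There are no large collections present but there is spinal canal stenosis."
)
print(parts)
```

Note that this splits at every listed word, so a phrase such as "salt and pepper" would also be cut in two; the regex has no notion of whether the conjunction joins clauses or single words.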


【Answer 2】:

If you want to avoid hardcoding conjunctions such as 'but' and 'and', try using chinking together with chunking:


import nltk

# Requires the NLTK tokenizer and POS-tagger data to be installed (see nltk.download).
Digdug = nltk.RegexpParser(r"""
CHUNK_AND_CHINK:
    {<.*>+}    # Chunk everything
    }<CC>+{    # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))

result = Digdug.parse(sentence)

# Print each chunk that remains after the CC tokens have been chinked out
for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)

Chinking basically excludes the tokens we don't want from a chunk, in this case 'but'. For more details see: http://www.nltk.org/book/ch07.html
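To make the effect of chinking on CC tags concrete, here is a plain-Python illustration that does not use NLTK at all. It is only a sketch of the idea, not how `RegexpParser` works internally: the `(word, tag)` pairs are written by hand as a stand-in for `nltk.pos_tag` output (the tags are assumed), and `split_at_cc` is a hypothetical helper.

```python
from itertools import groupby

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output (assumed tags).
tagged = [
    ("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
    ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
    ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"), ("canal", "NN"),
    ("stenosis", "NN"), (".", "."),
]

def split_at_cc(tagged_tokens):
    """Return the runs of words between coordinating conjunctions (tag CC)."""
    parts = []
    # groupby yields alternating runs of CC and non-CC tokens; keep the non-CC runs.
    for is_cc, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "CC"):
        if not is_cc:
            parts.append(" ".join(word for word, _ in group))
    return parts

parts = split_at_cc(tagged)
print(parts)
```

Because the words are rejoined with single spaces, punctuation tokens end up separated from the preceding word ("stenosis ."); a detokenizer would be needed to restore the original spacing.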

