Expanding English language contractions in Python
【Posted】: 2013-11-16 09:32:47 【Question】: English has a couple of contractions. For instance:
you've -> you have
he's -> he is
These sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?
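【Answer 1】: I turned Wikipedia's contraction-to-expansion page into a Python dictionary (see below).
Note that, as you might expect, you definitely want to use double quotes when querying the dictionary, because the keys themselves contain apostrophes.
Also, I left in the multiple options from the Wikipedia page. Feel free to modify it. Note that disambiguating to the right expansion would be a tricky problem!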
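As a minimal sketch of such a query, here is a lookup against a three-entry excerpt of the dictionary listed further down. The helper name `expansion_options` is made up for this illustration; it splits the " / "-separated alternatives into a list so that ambiguous entries are easy to spot.

```python
# Three-entry excerpt of the full dictionary; keys use double quotes
# because the contractions themselves contain apostrophes.
contractions = {
    "he's": "he has / he is",
    "you've": "you have",
    "'cause": "because",
}

def expansion_options(word):
    """Return the candidate expansions for a contraction, or [] if unknown.

    Hypothetical helper, not part of the original answer.
    """
    value = contractions.get(word)
    return value.split(" / ") if value else []

print(expansion_options("he's"))    # ['he has', 'he is']
print(expansion_options("you've"))  # ['you have']
```

An entry with more than one option is exactly the case that needs disambiguation before it can be substituted into text.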
contractions = {
    "ain't": "am not / are not / is not / has not / have not",
    "aren't": "are not / am not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he had / he would",
    "he'd've": "he would have",
    "he'll": "he shall / he will",
    "he'll've": "he shall have / he will have",
    "he's": "he has / he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how has / how is / how does",
    "I'd": "I had / I would",
    "I'd've": "I would have",
    "I'll": "I shall / I will",
    "I'll've": "I shall have / I will have",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it had / it would",
    "it'd've": "it would have",
    "it'll": "it shall / it will",
    "it'll've": "it shall have / it will have",
    "it's": "it has / it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she had / she would",
    "she'd've": "she would have",
    "she'll": "she shall / she will",
    "she'll've": "she shall have / she will have",
    "she's": "she has / she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as / so is",
    "that'd": "that would / that had",
    "that'd've": "that would have",
    "that's": "that has / that is",
    "there'd": "there had / there would",
    "there'd've": "there would have",
    "there's": "there has / there is",
    "they'd": "they had / they would",
    "they'd've": "they would have",
    "they'll": "they shall / they will",
    "they'll've": "they shall have / they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we had / we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what shall / what will",
    "what'll've": "what shall have / what will have",
    "what're": "what are",
    "what's": "what has / what is",
    "what've": "what have",
    "when's": "when has / when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where has / where is",
    "where've": "where have",
    "who'll": "who shall / who will",
    "who'll've": "who shall have / who will have",
    "who's": "who has / who is",
    "who've": "who have",
    "why's": "why has / why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you had / you would",
    "you'd've": "you would have",
    "you'll": "you shall / you will",
    "you'll've": "you shall have / you will have",
    "you're": "you are",
    "you've": "you have"
}
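【Comments】:
How do I disambiguate the right-hand side? Can I get an answer to this tricky problem?
@monkey see alko's comment on their own answer.
Actually I want to know how to resolve cases where multiple replacements are possible (e.g. "I'd": "I had / I would"). I am new to the NLP field.
@monkey that is what the comment and the link/article are about.
@arturomp Which Wikipedia page did you start from? Just to confirm: en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
【Answer 2】: The answer above will work perfectly well and may be better for ambiguous contractions (although I don't think there are that many ambiguous cases). I would use something more readable and easier to maintain: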
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.
It might have some flaws I didn't think of.
Adapted from my other answer.
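One flaw of suffix-only patterns is that they also fire on apostrophes used as quotation marks. The sketch below is a variant of the same idea, not the answer's own code: each pattern requires a word character before the apostrophe and a word boundary after the suffix. The name `decontracted_guarded` and the rule tables are assumptions made for this illustration, and the redundant `'t` rule is left out.

```python
import re

# Hypothetical guarded variant: (?<=\w) demands a letter/digit right before
# the apostrophe, and \b demands that the suffix ends the word.
SPECIFIC_RULES = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "can not"),
]
GENERAL_RULES = [
    (r"(?<=\w)n't\b", " not"),
    (r"(?<=\w)'re\b", " are"),
    (r"(?<=\w)'s\b", " is"),
    (r"(?<=\w)'d\b", " would"),
    (r"(?<=\w)'ll\b", " will"),
    (r"(?<=\w)'ve\b", " have"),
    (r"(?<=\w)'m\b", " am"),
]

def decontracted_guarded(phrase):
    # Apply the specific rules first, then the general suffix rules.
    for pattern, replacement in SPECIFIC_RULES + GENERAL_RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(decontracted_guarded("as 'real' science"))    # as 'real' science
print(decontracted_guarded("how're you, I'm fine")) # how are you, I am fine
```

The guards do nothing for possessives: "This is Amy's house" still becomes "This is Amy is house", since telling a possessive from "is" or "has" needs grammatical context.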
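【Comments】:
Talking about flaws: "as 'real' science" --> "as  areal' science"
@Arun Indeed, but single quotes should only be used inside double quotes, as in "She said: 'real' science". That is fairly rare. But if you happen to have text with lots of nested quotes, then this is not a good idea. Alternatively, you could have an RE that only replaces contractions outside quoted blocks "...".
At least for American English. I think British English uses single quotes more often.
Another flaw: "This is Amy's house" -> "This is Amy is house"
I don't think the backslashes ("\") are needed when the pattern is prefixed with "r", which already makes it a raw string.
【Answer 3】: You don't need a library; it can be done with a regular expression, for example.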
>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
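To make the pattern-building expression less opaque, the sketch below prints the intermediate string and spells out what the callback receives. It reuses the two-entry dictionary from the session above; the variable names `pattern_text` and `result` are introduced here only for illustration.

```python
import re

contractions_dict = {"didn't": "did not", "don't": "do not"}

# '|'.join(...) glues the keys together with the regex alternation operator,
# and '(%s)' % ... wraps the result in one capturing group.
pattern_text = '(%s)' % '|'.join(contractions_dict.keys())
print(pattern_text)  # (didn't|don't)

contractions_re = re.compile(pattern_text)

def replace(match):
    # re.sub calls this once per hit with an re.Match object;
    # match.group(0) is the exact text that matched, e.g. "don't".
    return contractions_dict[match.group(0)]

result = contractions_re.sub(replace, "You don't need a library")
print(result)  # You do not need a library
```

With keys that contain regex metacharacters, passing each key through `re.escape` before joining would be the safer choice.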
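【Comments】:
This is a good start, but I guess there are some corner cases too: "Jack's a good swimmer" versus "Jack's house is nice."
@Maarten A tool that disambiguates these and other cases would not be a library but a solution that includes at least a decent PoS tagger and an advanced NLP model, such as the parallel corpora approach here, or
@alko "I'd" can be expanded to 'I would' or 'I had'. How do you handle that?
I didn't understand the " '(%s)' % '|' " part. What exactly is happening there? And what gets passed in the match argument?
【Answer 4】: I found a library for this, contractions.
It is very simple to use.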
import contractions
print(contractions.fix("you've"))
print(contractions.fix("he's"))
Output:
you have
he is
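【Comments】:
Did you check this library against some of the more complex contractions mentioned in the first answer?
It is worth noting that this library does not work with certain special characters, see: github.com/kootenpv/contractions/issues/25
@martin36 Thanks for the heads-up, but it depends on the dataset and the task; in my case this answer was the solution.
【Answer 5】: Here is a very cool and easy-to-use library: https://pypi.python.org/pypi/pycontractions/1.0.1.
Example of use (detailed in the link):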
from pycontractions import Contractions
# Load your favorite word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')
# optional, prevents loading on first expand_texts call
cont.load_models()
out = list(cont.expand_texts(["I'd like to know how I'd done that!",
"We're going to the zoo and I don't think I'll be home for dinner.",
"Theyre going to the zoo and she'll be home for dinner."], precise=True))
print(out)
You will also need GoogleNews-vectors-negative300.bin; the download link is on the pycontractions page linked above. *The example code is Python 3.
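【Comments】:
Yes, as long as you can install one of its dependencies (language-check)...
This is a cool project; a pity it is currently English-only.
【Answer 6】: I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is fewer than 100. Granted, in real-world text the number may be higher, but I am fairly sure 200 to 300 entries would cover English contractions. Now, do you really want a separate library for those (I don't think what you are looking for actually exists, though)? You can solve this easily with a dictionary and regular expressions. I would recommend using a decent tokenizer such as the Natural Language Toolkit (NLTK); the rest you should have no problem implementing yourself.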
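【Comments】:
I don't think this problem is any harder than stemming, and there are several libraries for that. Yes, many contractions can be handled with a simple search and replace, but some are ambiguous, most notably "'s".
【Answer 7】:
import re

# Illustrative placeholder only; in practice pass a full mapping such as the dictionary in answer 1.
CONTRACTION_MAP = {"don't": "do not", "i'm": "i am"}

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # contraction_mapping is a dictionary mapping contracted forms to their expansions
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # try the exact-case key first, then fall back to the lowercase key
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        # keep the original first character so capitalization is preserved
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    # strip any apostrophes left over after expansion
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text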
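One thing to watch when joining dictionary keys into an alternation: the regex engine takes the first alternative that matches, so a short key such as "can't" can shadow a longer one such as "can't've". The sketch below contrasts the naive join with a longest-first ordering; the two-entry `mapping` and the helper `expand` are invented for this illustration.

```python
import re

# Two overlapping keys, chosen to show the ordering problem.
mapping = {"can't": "cannot", "can't've": "cannot have"}

def expand(pattern, text):
    # Keys are lowercase, so lowercase the matched text before the lookup.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

# Naive join: alternatives are tried in dictionary order.
naive = re.compile("|".join(mapping), flags=re.IGNORECASE)

# Longest key first, each one escaped in case it holds regex metacharacters.
longest_first = re.compile(
    "|".join(re.escape(key) for key in sorted(mapping, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

print(expand(naive, "I can't've known"))          # I cannot've known
print(expand(longest_first, "I can't've known"))  # I cannot have known
```

Sorting by length is a simple way to get longest-match behaviour without restructuring the dictionary.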
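【Comments】:
Sorry about the formatting; please tell me how to correct it.
Don't post code-only answers; add some text explaining what the code does.
【Answer 8】: Although this is an old question, I figured I might as well answer it, since as far as I can see there is still no real solution.
I had to deal with this on a related NLP project and decided to tackle the problem since there did not seem to be anything out there. If you are interested, you can check out my expander GitHub repository.
It is a (I think) very badly optimized program based on NLTK, the Stanford Core NLP models (which you have to download separately) and the dictionary in the previous answer. All the necessary information should be in the README and the heavily commented code. I know commented code is dead code, but this is simply how I write to keep things clear for myself.
The example input in expander.py is the following set of sentences: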
["I won't let you get away with that", # won't -> will not
"I'm a bad person", # 'm -> am
"It's his cat anyway", # 's -> is
"It's not what you think", # 's -> is
"It's a man's world", # 's -> is and 's possessive
"Catherine's been thinking about it", # 's -> has
"It'll be done", # 'll -> will
"Who'd've thought!", # 'd -> would, 've -> have
"She said she'd go.", # she'd -> she would
"She said she'd gone.", # she'd -> had
"Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
" My name is Jack.", # No replacements.
"'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
"As history tells, 'twas the night before Christmas.", # 'Twas -> It was
"Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have
for which the output is:
["I will not let you get away with that",
"I am a bad person",
"It is his cat anyway",
"It is not what you think",
"It is a man's world",
"Catherine has been thinking about it",
"It will be done",
"Who would have thought!",
"She said she would go.",
"She said she had gone.",
"You all would have a great time, would not it be so cold!",
"My name is Jack.",
"It is questionable whether Madam should be going.",
"As history tells, it was the night before Christmas.",
"Martha, Peter and Christine have been indulging in a menage-à-trois."]
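So for this small set of test sentences, in which I came up with some edge cases to test, it works well.
Since this project has lost its importance for now, I am no longer actively developing it. Any help on the project would be appreciated. The things to be done are written in the TODO list. And if you have any tips on how to improve my Python, I would be very thankful as well.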
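To give a feel for why "she'd go" and "she'd gone" need different expansions, here is a deliberately crude heuristic. It is not how the expander project works (that relies on NLTK and Stanford models); it only peeks at the word after 'd and guesses "had" when that word looks like a past participle. The participle list and the names `looks_like_participle` and `expand_d` are assumptions made for this sketch.

```python
import re

# Tiny hand-picked list, purely illustrative.
IRREGULAR_PARTICIPLES = {"gone", "been", "done", "seen", "known", "taken"}

def looks_like_participle(word):
    # Crude guess: known irregular form, or a regular "-ed" ending.
    lowered = word.lower()
    return lowered in IRREGULAR_PARTICIPLES or lowered.endswith("ed")

def expand_d(text):
    """Expand 'd to 'had' before a participle-like word, else to 'would'."""
    def repl(match):
        next_word = match.group(1)
        auxiliary = "had" if looks_like_participle(next_word) else "would"
        return f" {auxiliary} {next_word}"
    # (?<=\w) makes sure the apostrophe is attached to a preceding word.
    return re.sub(r"(?<=\w)'d\s+(\w+)", repl, text)

print(expand_d("She said she'd go."))    # She said she would go.
print(expand_d("She said she'd gone."))  # She said she had gone.
```

The guess misfires easily (for example on "I'd need", where "need" ends in "ed", or "I'd better"), which is why a part-of-speech tagger is the more credible tool for this decision.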
【Comments】:
Thanks Yannick. I have one doubt: if one of your sentences above is in a format like 'I'm a bad person', your approach does not apply.
Well, as long as the NLTK tokenizer can split it into words, everything should be fine, but other answers might provide better solutions.