NLTK relation extraction returns nothing
Posted: 2017-03-21 17:18:07
Question: I have recently been looking into using NLTK to extract relations from text. I built a sample text, "Tom is the cofounder of Microsoft.", and tested it with the program below, which returns nothing. I can't figure out why.
I am using NLTK version 3.2.1 and Python version 3.5.2.
Here is my code:
import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize

def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()  # "Tom is the cofounder of Microsoft"
    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]
    OF = re.compile(r'.*\bof\b.*')
    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()
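1. After some debugging, I found that when I changed the input to
"Gates was born in Seattle, Washington on October 28, 1955."
the nltk.chunk.ne_chunk() output is:
(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)
and test() returns:
[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']
2. After I changed the input to:
"Gates was born in Seattle on October 28, 1955."
test() returns nothing.
3. I dug into nltk/sem/relextract.py and found something strange.
The output is caused by the function semi_rel2reldict(pairs, window=5, trace=False), which only returns results when len(pairs) > 2. That is why nothing is returned when a sentence has fewer than three NEs.
Is this a bug, or am I using NLTK the wrong way?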
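Comments:
"pairs" in semi_rel2reldict are not necessarily NEs. Check tree2semi_rel, also in relextract. Dig deeper and you'll find the reason =)
By the way, use 'PERSON' and 'ORGANIZATION' for your NE classes instead of 'PER' and 'ORG', since the ACE classes are listed at github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L30
Also, your sentence has no ORGANIZATION once it is NE-tagged, so your pattern will not match.
@alvas, sorry, 'ORG' should have been 'GPE', but the problem still exists. If a sentence contains fewer than three NEs, extract_rels() returns no results.
Yes, you're on the right track. Dig deeper: look at tree2semi_rel and try to understand what it does =) Also, 'PER' may not match, because ne_chunk, which is trained with the ACE tags, uses 'PERSON'; see github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L30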
Answer 1:
First, the idiom for chunking NEs with ne_chunk looks like this:
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')])])
(see also https://***.com/a/31838373/610569)
Next, let's look at the extract_rels function:
def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.
    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
    """
When you call this function:
extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)
it performs 4 steps in sequence.
1. It checks whether your subjclass and objclass are valid, i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202:
if subjclass and subjclass not in NE_CLASSES[corpus]:
    if _expand(subjclass) in NE_CLASSES[corpus]:
        subjclass = _expand(subjclass)
    else:
        raise ValueError("your value for the subject type has not been recognized: %s" % subjclass)
if objclass and objclass not in NE_CLASSES[corpus]:
    if _expand(objclass) in NE_CLASSES[corpus]:
        objclass = _expand(objclass)
    else:
        raise ValueError("your value for the object type has not been recognized: %s" % objclass)
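The check in step 1 can be tried without NLTK installed. The sketch below is a simplified stand-in, not NLTK's code: the tables KNOWN_CLASSES and SHORT_TO_LONG and the helpers expand and validate_class are hypothetical names with illustrative contents. It only shows the shape of the logic: accept a known class, otherwise try the expanded form, otherwise raise.

```python
# Hypothetical stand-ins for illustration only; NLTK keeps its own tables
# in nltk/sem/relextract.py and their contents may differ.
KNOWN_CLASSES = {"ace": ["LOCATION", "ORGANIZATION", "PERSON", "GPE"]}
SHORT_TO_LONG = {"LOC": "LOCATION", "ORG": "ORGANIZATION", "PER": "PERSON"}


def expand(name):
    """Map a short class name to its long form; other names pass through."""
    return SHORT_TO_LONG.get(name, name)


def validate_class(name, corpus="ace"):
    """Return the class name as the corpus spells it, or raise ValueError."""
    if name in KNOWN_CLASSES[corpus]:
        return name
    expanded = expand(name)
    if expanded in KNOWN_CLASSES[corpus]:
        return expanded
    raise ValueError("class not recognized: %s" % name)


print(validate_class("PER"))  # PERSON
print(validate_class("GPE"))  # GPE
```

Under these assumed tables a short name such as 'PER' is accepted and rewritten to 'PERSON', while a name that appears in neither table raises ValueError.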
2. It extracts "pairs" from your NE-tagged input:
if corpus == 'ace' or corpus == 'conll2002':
    pairs = tree2semi_rel(doc)
elif corpus == 'ieer':
    pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
else:
    raise ValueError("corpus type not recognized")
Now let's see what tree2semi_rel() returns for your input sentence, Tom is the cofounder of Microsoft:
>>> from nltk.sem.relextract import tree2semi_rel, semi_rel2reldict
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
So it returns a list of 2 lists. The first inner list consists of an empty list and the Tree that carries the "PERSON" label:
[[], Tree('PERSON', [('Tom', 'NNP')])]
The second list consists of the phrase is the cofounder of and the Tree that carries the "ORGANIZATION" label.
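To make the shape of these pairs concrete, here is a minimal sketch of the grouping, written without NLTK so it runs anywhere. It is an approximation of the behaviour shown in the output above, not the library's implementation: the NE class is a hypothetical stand-in for an nltk Tree, and to_pairs is an illustrative name.

```python
class NE:
    """Hypothetical stand-in for an nltk Tree that wraps one named entity."""

    def __init__(self, label, tokens):
        self._label = label
        self._tokens = tokens

    def label(self):
        return self._label

    def leaves(self):
        return list(self._tokens)


def to_pairs(sentence):
    """Group a chunked sentence into [filler_tokens, entity] pairs."""
    pairs = []
    filler = []
    for item in sentence:
        if isinstance(item, NE):
            # Each entity closes a pair: the tokens seen since the previous entity, then the entity.
            pairs.append([filler, item])
            filler = []
        else:
            filler.append(item)
    # Tokens after the last entity are not attached to any pair in this sketch.
    return pairs


sentence = [
    NE("PERSON", [("Tom", "NNP")]),
    ("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN"),
    NE("ORGANIZATION", [("Microsoft", "NNP")]),
]
pairs = to_pairs(sentence)
print(len(pairs))                            # 2
print(pairs[0][0])                           # []
print([word for word, tag in pairs[1][0]])   # ['is', 'the', 'cofounder', 'of']
```

In this model there is exactly one pair per named entity, so a sentence with two entities yields two pairs.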
Let's move on.
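3. extract_rels then tries to turn these pairs into some sort of relation dictionary:
reldicts = semi_rel2reldict(pairs)
If we look at what the semi_rel2reldict function returns for your example sentence, we see that this is where the empty list comes from: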
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[]
So let's look at the code of semi_rel2reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144:
def semi_rel2reldict(pairs, window=5, trace=False):
    """
    Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
    stores information about the subject and object NEs plus the filler between them.
    Additionally, a left and right context of length =< window are captured (within
    a given input sentence).
    :param pairs: a pair of list(str) and ``Tree``, as generated by
    :param window: a threshold for the number of items to include in the left and right context
    :type window: int
    :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
    :rtype: list(defaultdict)
    """
    result = []
    while len(pairs) > 2:
        reldict = defaultdict(str)
        reldict['lcon'] = _join(pairs[0][0][-window:])
        reldict['subjclass'] = pairs[0][1].label()
        reldict['subjtext'] = _join(pairs[0][1].leaves())
        reldict['subjsym'] = list2sym(pairs[0][1].leaves())
        reldict['filler'] = _join(pairs[1][0])
        reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
        reldict['objclass'] = pairs[1][1].label()
        reldict['objtext'] = _join(pairs[1][1].leaves())
        reldict['objsym'] = list2sym(pairs[1][1].leaves())
        reldict['rcon'] = _join(pairs[2][0][:window])
        if trace:
            print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
        result.append(reldict)
        pairs = pairs[1:]
    return result
The first thing semi_rel2reldict() does is check whether the output of tree2semi_rel() has more than 2 elements, which is not the case for your example sentence:
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> len(tree2semi_rel(chunked))
2
>>> len(tree2semi_rel(chunked)) > 2
False
Ah ha, that's why extract_rels returns nothing.
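The effect of the loop condition is easy to see in isolation. The sketch below keeps only the loop skeleton (a `while len(pairs) > 2` test followed by `pairs = pairs[1:]`) and counts its iterations; count_relations is an illustrative name, and the list contents are placeholders because only the length matters here.

```python
def count_relations(pairs):
    """Count how many times a `while len(pairs) > 2` sliding loop runs."""
    count = 0
    while len(pairs) > 2:
        count += 1
        pairs = pairs[1:]  # slide the window forward by one pair
    return count


for n in range(5):
    print(n, count_relations(list(range(n))))
# 0 0
# 1 0
# 2 0
# 3 1
# 4 2
```

With n pairs the loop body runs max(0, n - 2) times, so a sentence that produces two pairs never enters it.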
Now the question is: how can we make extract_rels() return something even when tree2semi_rel() yields only two elements? Is that possible?
Let's try a different sentence:
>>> text = "Tom is the cofounder of Microsoft and now he is the founder of Marcohard"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')]), ('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN'), Tree('PERSON', [('Marcohard', 'NNP')])])
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])], [[('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN')], Tree('PERSON', [('Marcohard', 'NNP')])]]
>>> len(tree2semi_rel(chunked)) > 2
True
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': 'and/CC now/RB he/PRP is/VBZ the/DT', 'subjtext': 'Tom/NNP'})]
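But that only confirms that extract_rels can't extract anything when tree2semi_rel returns fewer than three pairs. What happens if we relax the while len(pairs) > 2 condition?
Why can't we use while len(pairs) > 1?
If we look closely at the code, we see the last line that populates the reldict, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169: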
reldict['rcon'] = _join(pairs[2][0][:window])
It tries to access the third element of pairs, and if pairs has length 2 you will get an IndexError.
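A tiny reproduction of that failure, using plain lists and strings in place of NLTK trees (the entity strings and the right_context helper are placeholders for illustration):

```python
# Two pairs, shaped like the output for "Tom is the cofounder of Microsoft".
# The entity slots hold placeholder strings instead of nltk Tree objects.
pairs = [
    [[], "PERSON:Tom"],
    [["is", "the", "cofounder", "of"], "ORGANIZATION:Microsoft"],
]


def right_context(pairs, window=5):
    """Return the tokens that follow the object entity, i.e. pairs[2][0][:window]."""
    return pairs[2][0][:window]


try:
    right_context(pairs)
except IndexError as error:
    print("IndexError:", error)  # IndexError: list index out of range
```

This is why simply loosening the loop condition is not enough; the right-context lookup has to be guarded or removed as well.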
So what happens if we drop the rcon lookup and simply change the condition to while len(pairs) >= 2?
To do that, we have to override the semi_rel2reldict() function:
>>> from nltk.sem.relextract import _join, list2sym
>>> from collections import defaultdict
>>> def semi_rel2reldict(pairs, window=5, trace=False):
...     """
...     Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
...     stores information about the subject and object NEs plus the filler between them.
...     Additionally, a left and right context of length =< window are captured (within
...     a given input sentence).
...     :param pairs: a pair of list(str) and ``Tree``, as generated by
...     :param window: a threshold for the number of items to include in the left and right context
...     :type window: int
...     :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
...     :rtype: list(defaultdict)
...     """
...     result = []
...     while len(pairs) >= 2:
...         reldict = defaultdict(str)
...         reldict['lcon'] = _join(pairs[0][0][-window:])
...         reldict['subjclass'] = pairs[0][1].label()
...         reldict['subjtext'] = _join(pairs[0][1].leaves())
...         reldict['subjsym'] = list2sym(pairs[0][1].leaves())
...         reldict['filler'] = _join(pairs[1][0])
...         reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
...         reldict['objclass'] = pairs[1][1].label()
...         reldict['objtext'] = _join(pairs[1][1].leaves())
...         reldict['objsym'] = list2sym(pairs[1][1].leaves())
...         reldict['rcon'] = []
...         if trace:
...             print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
...         result.append(reldict)
...         pairs = pairs[1:]
...     return result
...
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
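The override above discards the right context for every relation, including those that do have a following pair. One possible variant, offered as an assumption rather than as NLTK's behaviour, keeps rcon when a third pair exists and falls back to an empty string otherwise. The sketch is self-contained: NE is a hypothetical stand-in for an nltk Tree, and join and pairs_to_reldicts are illustrative names that mimic only part of the real function.

```python
from collections import defaultdict


class NE:
    """Hypothetical stand-in for an nltk Tree that wraps one named entity."""

    def __init__(self, label, tokens):
        self._label = label
        self._tokens = tokens

    def label(self):
        return self._label

    def leaves(self):
        return list(self._tokens)


def join(tokens):
    """Render (word, tag) tuples as space-separated 'word/tag' strings."""
    return " ".join("%s/%s" % (word, tag) for word, tag in tokens)


def pairs_to_reldicts(pairs, window=5):
    """Build one relation dict per adjacent pair of entities."""
    result = []
    while len(pairs) >= 2:
        reldict = defaultdict(str)
        reldict["lcon"] = join(pairs[0][0][-window:])
        reldict["subjclass"] = pairs[0][1].label()
        reldict["subjtext"] = join(pairs[0][1].leaves())
        reldict["filler"] = join(pairs[1][0])
        reldict["objclass"] = pairs[1][1].label()
        reldict["objtext"] = join(pairs[1][1].leaves())
        # Only read the right context when a third pair actually exists.
        reldict["rcon"] = join(pairs[2][0][:window]) if len(pairs) > 2 else ""
        result.append(reldict)
        pairs = pairs[1:]
    return result


tom = NE("PERSON", [("Tom", "NNP")])
microsoft = NE("ORGANIZATION", [("Microsoft", "NNP")])
filler = [("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN")]
two_pairs = [[[], tom], [filler, microsoft]]

reldicts = pairs_to_reldicts(two_pairs)
print(len(reldicts))               # 1
print(reldicts[0]["filler"])       # is/VBZ the/DT cofounder/NN of/IN
print(repr(reldicts[0]["rcon"]))   # ''
```

Keeping rcon a string in both branches also keeps its type consistent with the other values in the dict, whereas the override above stores a list there.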
Ah! It works, but there is still a fourth step in extract_rels().
4. Given the regex you passed to the pattern parameter, https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222, it applies a filter:
relfilter = lambda x: (x['subjclass'] == subjclass and
len(x['filler'].split()) <= window and
pattern.match(x['filler']) and
x['objclass'] == objclass)
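The four conditions in that filter can be exercised on a hand-written dict using only the standard library. In this sketch make_relfilter is a hypothetical helper that wraps the same four tests in a closure; the reldict holds just the three keys the filter reads.

```python
import re


def make_relfilter(subjclass, objclass, pattern, window=10):
    """Build a predicate over reldicts that applies the four tests shown above."""
    def relfilter(reldict):
        return bool(
            reldict["subjclass"] == subjclass
            and len(reldict["filler"].split()) <= window
            and pattern.match(reldict["filler"])
            and reldict["objclass"] == objclass
        )
    return relfilter


reldict = {
    "subjclass": "PERSON",
    "filler": "is/VBZ the/DT cofounder/NN of/IN",
    "objclass": "ORGANIZATION",
}
OF = re.compile(r".*\bof\b.*")

print(make_relfilter("PERSON", "ORGANIZATION", OF)(reldict))            # True
print(make_relfilter("PERSON", "GPE", OF)(reldict))                     # False
print(make_relfilter("PERSON", "ORGANIZATION", OF, window=3)(reldict))  # False
```

The second call fails on the object class and the third on the window, since the filler has four tagged tokens. Note that the pattern is matched against the tagged filler ('of/IN'), not the plain words.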
Now let's try it with the hacked version of semi_rel2reldict:
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
>>>
>>> import re
>>> pattern = re.compile(r'.*\bof\b.*')
>>> reldicts = semi_rel2reldict(tree2semi_rel(chunked))
>>> relfilter = lambda x: (x['subjclass'] == subjclass and
... len(x['filler'].split()) <= window and
... pattern.match(x['filler']) and
... x['objclass'] == objclass)
>>> relfilter
<function <lambda> at 0x112e591b8>
>>> subjclass = 'PERSON'
>>> objclass = 'ORGANIZATION'
>>> window = 5
>>> list(filter(relfilter, reldicts))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
It works! Now let's view it in tuple form:
>>> from nltk.sem.relextract import rtuple
>>> rels = list(filter(relfilter, reldicts))
>>> for rel in rels:
...     print(rtuple(rel))
...
[PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']
Comments:
Thanks alvas for the great answer! How can I get results that cover multiple subjclasses and objclasses?