n-grams from text in python

Posted: 2018-08-12 00:20:56

【Problem description】:

An update to my previous post, with some changes:
Let's say that I have 100 tweets. In those tweets I need to extract: 1) food names, and 2) beverage names. I also need to attach the type (drink or food) and an id number (each item has a unique id) to every extraction.
I already have a lexicon with the names, types and id numbers:
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}
}
Tweet example:
After various processing of "tweet_1", I have sentences like this:
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
The output I am asking for (it can be a type other than list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],
"tweet_id_1",,
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]
Importantly, the output should NOT extract unigrams that are part of ngrams (n>1), i.e. not like this:
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters such as lemmatize() and pos_tag() BEFORE the extraction, to get output like the one below. But with the regex solution, if I do that, then all the words are split into unigrams, or they will generate 1 unigram and 1 bigram from the string "coca cola", which would produce output that I don't want (as in the example above). The ideal output (again, the type of the output doesn't matter):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_124"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
【Question discussion】:

Duplicate of ***.com/questions/49064114/…? — Not a duplicate, but very similar.

【Answer 1】:

Probably not the most efficient solution, but this will definitely get you started -
sentences = [
'dr pepper is better than coca cola and suits banana split with ice cream',
'coca cola and banana is not a good combo']
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}
}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)
chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            sentence = sentence.replace(lex, '')
print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
Explanation

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)

This takes the list of phrases that need to be searched for and sorts them by length (so that the bigger chunks are found first).

The output is a list of dicts, where each dict has a list value.
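For reference, here is what that sort is expected to produce (assuming Python 3.7+ insertion-ordered dicts; the sort is stable, so the two-word phrases come first in their original order):

>>> lexicon_list
['dr pepper', 'coca cola', 'banana split', 'ice cream', 'cola', 'banana', 'cream']

Because each match is then blanked out of the sentence with replace(), 'banana' can no longer match inside 'banana split', which is exactly the "no unigrams inside ngrams" requirement.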
【Discussion】:

【Answer 2】:

Unfortunately I cannot make comments due to my low reputation, but Vivek's answer can be improved by 1) regex, 2) including the pos_tag tokens as NN, and 3) a dictionary structure in which you can select a tweet's results by tweet:
import re
import nltk
from collections import OrderedDict
tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream', 'coca cola and banana is not a good combo']}
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}
}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key = lambda s: len(s.split()), reverse=True)
# regex will be much faster than the "in" operator
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)
# Here we make the dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# if you train a model so that it recognizes e.g. "banana split" as ("banana split", "NN")
# rather than ("banana", "NN") and ("split", "NN"), you could use the following
# (the phrase is passed as a single-element list so it stays one token):
# lexicon_pos_tag = {word: nltk.pos_tag([word]) for word in lexicon_list}
# chunks will register the tweets as the keys
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
The final output is:
OrderedDict([('tweet_1',
[OrderedDict([('dr pepper',
[[('dr', 'NN'), ('pepper', 'NN')],
['drink', 'd_123']]),
('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana split',
[[('banana', 'NN'), ('split', 'NN')],
['food', 'f_567']]),
('ice cream',
[[('ice', 'NN'), ('cream', 'NN')],
['food', 'f_789']])]),
OrderedDict([('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana',
[[('banana', 'NN')], ['food', 'f_456']])])])])
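One caveat worth hedging: the alternation pattern above matches raw substrings, so an entry like 'cola' could also fire inside an unrelated word such as 'colada'. A safer sketch of the same pattern, assuming the lexicon entries should only match as whole words:

import re

# Escape each phrase and require word boundaries; longer phrases still
# win because the length-sorted list puts them first in the alternation.
pattern = re.compile(r"\b(" + "|".join(re.escape(p) for p in lexicon_list) + r")\b")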
【Discussion】:

Thanks for your reply. However, the point of pos_tag is not to say that every "banana" should be an NN, but to find only those bananas that are of type NN according to a pre-trained model. — Sure, but as I pointed out in the comment above lexicon_pos_tag... if you execute the code above after training the pos_tag model, the code lexicon_pos_tag = {word: nltk.pos_tag([word]) for word in lexicon_list} will create an entry like "banana split": ("banana split", "NN"). It will then be used correctly in the code temp[word] = [lexicon_pos_tag[word], ...].
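A minimal illustration of that point (assuming nltk's default tagger; the exact tag may vary):

import nltk

# Passing the phrase as a single-element list keeps it as one token,
# so the tagger labels the whole phrase rather than its words:
print(nltk.pos_tag(['banana split']))   # e.g. [('banana split', 'NN')]
# Note: nltk.pos_tag('banana split') on a bare string would tag each
# character individually, which is almost never what you want.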
Thanks! At the moment I'm working on your original regex solution. But I will also try the update! Very good input! :)

【Answer 3】:
I would filter with a for loop ..
Use an if statement to look for the string in the keys. If you want to include unigrams as well, remove

len(key.split()) > 1

If you want to include only unigrams, change it to:

len(key.split()) == 1
filtered_list = ['tweet_id_1']

for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
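Running this sketch against the sentences and lexicon above should print something along these lines (assuming insertion-ordered dict iteration; 'coca cola' appears twice because it occurs in both sentences):

['tweet_id_1', 'dr pepper', {'type': 'drink', 'id': 'd_123'},
 'coca cola', {'type': 'drink', 'id': 'd_234'},
 'coca cola', {'type': 'drink', 'id': 'd_234'},
 'banana split', {'type': 'food', 'id': 'f_567'},
 'ice cream', {'type': 'food', 'id': 'f_789'}]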
【Discussion】:

This won't find "banana" in the second sentence. — It is supposed to detect all ngrams, but not produce duplicates of the same string.