Extracting n-grams from tweets in Python
Let's say I have 100 tweets. From these tweets, I need to extract: 1) food names, and 2) beverage names.
Example tweet:
"Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"
I have two lexicons to work with: one with food names and one with beverage names.
Example entries in the food-name lexicon: "hot dog", "banana", "banana split"
Example entries in the beverage-name lexicon: "coke", "cola", "coca cola"
What I should be able to extract:
[[["coca cola", "beverage"], ["hot dog", "food"], ["banana split", "food"]], [["coke", "beverage"], ["banana", "food"], ["banana split", "food"]]]
Names in the lexicons can be 1 to 5 words long. How can I use my lexicons to extract n-grams from the tweets?
Answer
A simple solution:
import re

def lexicon_by_word(lexicons):
    # Invert the lexicons: map each word/phrase to its category
    return {word: key for key in lexicons.keys() for word in lexicons[key]}

def split_sentences(st):
    sentences = re.split(r'[.?!]\s*', st)
    if sentences[-1]:
        return sentences
    else:
        return sentences[:-1]

def ngrams_finder(lexicons, text):
    lexicons_by_word = lexicon_by_word(lexicons)

    def pattern(lexicons):
        pattern = "|".join(lexicons_by_word.keys())
        pattern = re.compile(pattern)
        return pattern

    pattern = pattern(lexicons)
    ngrams = []
    for sentence in split_sentences(text):
        try:
            ngram = []
            for result in pattern.findall(sentence):
                ngram.append([result, lexicons_by_word[result]])
            ngrams.append(ngram)
        except IndexError:  # if re.findall does not find anything
            continue
    return ngrams
# You could customize it
text = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"
lexicons = {
    "food": ["hot dog", "banana", "banana split"],
    "beverage": ["coke", "cola", "coca cola"],
}
print(ngrams_finder(lexicons, text))
The split_sentences function is taken from here: Splitting a sentence by ending characters
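One caveat with this regex approach: `re` alternation tries branches left to right, so a short name like "cola" can shadow the longer "coca cola" depending on dictionary order. A minimal sketch (the word list here is illustrative) that sorts the lexicon entries longest-first before building the pattern:

```python
import re

# Illustrative lexicon entries; sorting longest-first makes
# multi-word names win over their single-word substrings.
words = ["coke", "cola", "coca cola"]
pattern = re.compile("|".join(sorted(words, key=len, reverse=True)))

print(pattern.findall("i liked the coca cola and the coke"))
# ['coca cola', 'coke']
```

Without the sort, "cola" could match inside "coca cola" and the longer name would never be found.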
Another answer
Not sure what you have tried so far; below is a solution using ngrams from nltk and a dict():
from nltk import ngrams
tweet = "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe"
# Your lexicons
lexicon_food = ["hot dog", "banana", "banana split"]
lexicon_beverage = ["coke", "cola", "coca cola"]
lexicon_dict = {x: [x, 'Food'] for x in lexicon_food}
lexicon_dict.update({x: [x, 'Beverage'] for x in lexicon_beverage})
# Function to extract lexicon items
def extract(g, lex):
    if ' '.join(g) in lex.keys():
        return lex.get(' '.join(g))
    elif g[0] in lex.keys():
        return lex.get(g[0])
    else:
        pass
# Your task
out = [[extract(g, lexicon_dict) for g in ngrams(sentence.split(), 2) if extract(g, lexicon_dict)]
       for sentence in tweet.replace(',', '').lower().split('.')]
print(out)
Output:
[[['coca cola', 'Beverage'], ['cola', 'Beverage'], ['hot dog', 'Food']],
[['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]
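For reference, `ngrams(tokens, 2)` from nltk simply yields adjacent token pairs; a dependency-free equivalent (a sketch using only the standard library) is:

```python
def bigrams(tokens):
    # Pairs each token with its successor, like nltk.ngrams(tokens, 2)
    return list(zip(tokens, tokens[1:]))

print(bigrams("i had a coca cola".split()))
# [('i', 'had'), ('had', 'a'), ('a', 'coca'), ('coca', 'cola')]
```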
Approach 2 (avoids matching both "coca cola" and "cola")
def extract2(sentence, lex):
    extracted_words = []
    words = sentence.split()
    i = 0
    while i < len(words):
        if ' '.join(words[i:i+2]) in lex.keys():
            extracted_words.append(lex.get(' '.join(words[i:i+2])))
            i += 2
        elif words[i] in lex.keys():
            extracted_words.append(lex.get(words[i]))
            i += 1
        else:
            i += 1
    return extracted_words
out = [extract2(s, lexicon_dict) for s in tweet.replace(',', '').lower().split('.')]
print(out)
Output:
[[['coca cola', 'Beverage'], ['hot dog', 'Food']],
[['coke', 'Beverage'], ['banana', 'Food'], ['banana split', 'Food']]]
Note that nltk is not needed here.
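The question says lexicon names can be 1 to 5 words long, while extract2 only checks bigrams and then unigrams. A generalization (a sketch, not from the original answers) scans window sizes from 5 down to 1 at each position, so the longest match always wins:

```python
def extract_longest(sentence, lex, max_len=5):
    # Greedy longest-match scan: at each position, try the widest
    # window first so multi-word names beat their prefixes.
    words = sentence.split()
    found = []
    i = 0
    while i < len(words):
        for n in range(max_len, 0, -1):
            phrase = ' '.join(words[i:i+n])
            if phrase in lex:
                found.append(lex[phrase])
                i += n
                break
        else:  # no window starting at i matched
            i += 1
    return found

# Illustrative lexicon in the same {name: [name, category]} shape
lex = {
    "hot dog": ["hot dog", "Food"],
    "banana split": ["banana split", "Food"],
    "banana": ["banana", "Food"],
}
print(extract_longest("a hot dog and a banana split", lex))
# [['hot dog', 'Food'], ['banana split', 'Food']]
```

Note "banana split" is returned as one item rather than a separate "banana" match, because the two-word window is tried before the one-word window.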