Retrieving sentence strings from NLTK corpus
Posted: 2015-07-22 03:04:03

Question: Here is my dataset:
emma=gutenberg.sents('austen-emma.txt')
It gives me the sentences as:
[[u'she',u'was',u'happy',[u'It',u'was',u'her',u'own',u'good']]
But this is what I actually want to get:
['she was happy','It was her own good']
Comments:
Is that the correct output? Shouldn't it be [[u'she', u'was', u'happy'], [u'It', u'was', u'her', u'own', u'good']]?
Answer 1:

According to the nltk docs, it looks like you are getting the correct output:
sents(fileids=None)
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
So you just need to join each list of word strings back into a space-separated sentence:
sentences = [" ".join(list_of_words) for list_of_words in emma]
Comments:
Thank you so much! You saved me!

Answer 2:

As alvas and AShelly pointed out, what you are seeing is the correct behavior. However, their approach of simply joining the words of each sentence has two drawbacks:
- You end up with whitespace around punctuation (e.g. "Emma Woodhouse , handsome , clever , and rich , with a comfortable [...]"); a naive fix-up is sketched after this list.
- You make the PlaintextCorpusReader perform word tokenization only to undo it immediately afterwards, which is avoidable computational overhead.
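The first drawback can be patched up naively; here is a rough sketch (it only covers the punctuation marks listed in the pattern, so it is not a real detokenizer):

import re

def naive_detokenize(words):
    # Join the tokens with spaces, then delete the space that word
    # tokenization inserted before common punctuation marks.
    return re.sub(r'\s+([.,;:!?])', r'\1', " ".join(words))

# naive_detokenize([u'she', u'was', u'happy', u'.']) -> u'she was happy.'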
Given the implementation of PlaintextCorpusReader, it is easy to derive a function that performs exactly the same steps as PlaintextCorpusReader.sents(), but without the word tokenization:
def sentences_from_corpus(corpus, fileids=None):
    from nltk.corpus.reader.plaintext import read_blankline_block, concat

    def read_sent_block(stream):
        # Sentence-tokenize each paragraph block, but skip word tokenization;
        # newlines inside a sentence are replaced by spaces.
        sents = []
        for para in corpus._para_block_reader(stream):
            sents.extend([s.replace('\n', ' ')
                          for s in corpus._sent_tokenizer.tokenize(para)])
        return sents

    # Build one lazy corpus view per file, exactly as PlaintextCorpusReader does.
    return concat([corpus.CorpusView(path, read_sent_block, encoding=enc)
                   for (path, enc, fileid)
                   in corpus.abspaths(fileids, True, True)])
In contradiction to what I said above, this function does perform one additional step: since we no longer do word tokenization, we have to replace the newline characters with spaces ourselves.
Passing the gutenberg corpus to this function yields:
['[Emma by Jane Austen 1816]',
'VOLUME I',
'CHAPTER I',
'Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.',
"She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.",
...]
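For reference, the call producing this output would look roughly like the following (assuming the function above has been defined and the Gutenberg data is available; passing a fileid restricts it to one file):

>>> from nltk.corpus import gutenberg
>>> sentences_from_corpus(gutenberg)[:5]  # or: sentences_from_corpus(gutenberg, 'austen-emma.txt')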
Answer 3:

A corpus accessed through the nltk.corpus API typically returns a stream of documents, i.e. a list of sentences, where each sentence is a list of tokens:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> emma[0]
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
>>> emma[1]
[u'VOLUME', u'I']
>>> emma[2]
[u'CHAPTER', u'I']
>>> emma[3]
[u'Emma', u'Woodhouse', u',', u'handsome', u',', u'clever', u',', u'and', u'rich', u',', u'with', u'a', u'comfortable', u'home', u'and', u'happy', u'disposition', u',', u'seemed', u'to', u'unite', u'some', u'of', u'the', u'best', u'blessings', u'of', u'existence', u';', u'and', u'had', u'lived', u'nearly', u'twenty', u'-', u'one', u'years', u'in', u'the', u'world', u'with', u'very', u'little', u'to', u'distress', u'or', u'vex', u'her', u'.']
For the nltk.corpus.gutenberg corpus, this loads a PlaintextCorpusReader; see
https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L114
and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py

So it reads a directory of text files, one of which is 'austen-emma.txt', and processes the corpus with default sent_tokenize and word_tokenize functions. In the code these are instantiated as tokenizers/punkt/english.pickle and WordPunctTokenizer(); see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L40
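As an aside, the same reader class can be pointed at a directory of your own text files, and the same default tokenizers are applied; a minimal sketch (the directory path is a hypothetical placeholder):

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# By default PlaintextCorpusReader sentence-splits with the punkt model and
# word-tokenizes with WordPunctTokenizer(), just like the gutenberg corpus.
reader = PlaintextCorpusReader('/path/to/texts', r'.*\.txt')
print(reader.sents()[:2])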
So to get the desired list of sentence strings, use:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> sents_list = [" ".join(sent) for sent in emma]
>>> sents_list[0]
u'[ Emma by Jane Austen 1816 ]'
>>> sents_list[1]
u'VOLUME I'
>>> sents_list[:1]
[u'[ Emma by Jane Austen 1816 ]']
>>> sents_list[:2]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I']
>>> sents_list[:3]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I', u'CHAPTER I']
Comments:
Thank you so much! You guys are great!