Using my own corpus instead of movie_reviews corpus for Classification in NLTK

Posted: 2015-05-30 07:13:59

Question:

I am using the code below, which I got from "Classification using movie review corpus in NLTK/Python":

import string
from itertools import chain

import nltk
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc

stop = stopwords.words('english')
# One (tokens, label) pair per file; the label is the top-level folder name ('neg' or 'pos').
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Use the 100 most frequent words as features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]

# 90% of the documents for training, the remaining 10% for testing.
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

Output:

0.655
Most Informative Features
                 bad = True              neg : pos    =      2.0 : 1.0
              script = True              neg : pos    =      1.5 : 1.0
               world = True              pos : neg    =      1.5 : 1.0
             nothing = True              neg : pos    =      1.5 : 1.0
                 bad = False             pos : neg    =      1.5 : 1.0

I want to create my own folder in NLTK in place of movie_reviews and put my own files in it.

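Comments:

What does your folder look like? Can you post a snippet of the files in it, or a link to your dataset?

It is exactly like the movie_reviews folder, with pos and neg folders, but I choose the contents of the .txt files myself.

Hope the answer helps.

@alvas Yes, it was very helpful. Thank you. Could you please answer this question: link

Answer 1: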

If your data has exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to "hack" your way through:

1. Put your corpus directory where your nltk.data is saved

First, check where your nltk.data is saved:

>>> import nltk
>>> nltk.data.find('corpora/movie_reviews')
FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')

Then move your directory to the location where nltk_data/corpora is saved:

# Let's make a test corpus like `nltk.corpus.movie_reviews`
~$ mkdir my_movie_reviews
~$ mkdir my_movie_reviews/pos
~$ mkdir my_movie_reviews/neg
~$ echo "This is a great restaurant." > my_movie_reviews/pos/1.txt
~$ echo "Had a great time at chez jerome." > my_movie_reviews/pos/2.txt
~$ echo "Food fit for the ****" > my_movie_reviews/neg/1.txt
~$ echo "Slow service." > my_movie_reviews/neg/2.txt
~$ echo "README please" > my_movie_reviews/README
# Move it to `nltk_data/corpora/`
~$ mv my_movie_reviews/ nltk_data/corpora/
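
If a shell is not available, the same test corpus can be built from Python with the standard library. This is only a sketch: `build_test_corpus` is a hypothetical helper name, and it writes to a temporary directory rather than to `nltk_data/corpora/`, so the target path is an assumption to adjust.

```python
import tempfile
from pathlib import Path

# Same toy reviews as in the shell commands above.
SAMPLES = {
    "pos/1.txt": "This is a great restaurant.",
    "pos/2.txt": "Had a great time at chez jerome.",
    "neg/1.txt": "Food fit for the ****",
    "neg/2.txt": "Slow service.",
    "README": "README please",
}

def build_test_corpus(parent):
    """Create my_movie_reviews/{pos,neg}/*.txt under `parent` (hypothetical helper)."""
    root = Path(parent) / "my_movie_reviews"
    for relative_name, text in SAMPLES.items():
        target = root / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text + "\n", encoding="ascii")
    return root

# The temporary parent directory is an assumption; pass your own path instead.
corpus_root = build_test_corpus(tempfile.mkdtemp())
created = sorted(p.relative_to(corpus_root).as_posix()
                 for p in corpus_root.rglob("*") if p.is_file())
print(created)
```

Passing the directory that holds `nltk_data/corpora` as `parent` would place the corpus where the first approach expects it.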

In your Python code:

>>> import string
>>> from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
>>> from nltk.corpus import stopwords
>>> my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> mr = my_movie_reviews
>>>
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> for i in documents:
...     print(i)
... 
([u'Food', u'fit', u'****'], u'neg')
([u'Slow', u'service'], u'neg')
([u'great', u'restaurant'], u'pos')
([u'great', u'time', u'chez', u'jerome'], u'pos')

(For details, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L144)
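
The two regular expressions passed to the reader decide which files belong to the corpus and which category each file gets. The sketch below imitates that selection with only the `re` module. It approximates what the reader does and is not NLTK's actual code; `select_files` is a hypothetical helper.

```python
import re

FILEIDS_PATTERN = r'(?!\.).*\.txt'   # any path ending in .txt that does not start with a dot
CAT_PATTERN = r'(neg|pos)/.*'        # the first path component is the category

def select_files(paths):
    """Map each matching file to its category (hypothetical helper, approximates the reader)."""
    selected = {}
    for path in paths:
        if not re.fullmatch(FILEIDS_PATTERN, path):
            continue  # skips README and hidden files
        match = re.match(CAT_PATTERN, path)
        if match:
            selected[path] = match.group(1)
    return selected

paths = ['README', '.hidden.txt', 'neg/1.txt', 'neg/2.txt', 'pos/1.txt', 'pos/2.txt']
print(select_files(paths))
```

If your folders are not called `neg` and `pos`, the alternation in `cat_pattern` is the part to change.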

2. Create your own CategorizedPlaintextCorpusReader

If you do not have access to the nltk.data directory and want to use your own corpus, try this:

# Let's say that your corpus is saved on `/home/alvas/my_movie_reviews/`

>>> import string; from nltk.corpus import stopwords
>>> from nltk.corpus import CategorizedPlaintextCorpusReader
>>> mr = CategorizedPlaintextCorpusReader('/home/alvas/my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
>>> stop = stopwords.words('english')
>>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
>>> 
>>> for doc in documents:
...     print(doc)
... 
([u'Food', u'fit', u'****'], 'neg')
([u'Slow', u'service'], 'neg')
([u'great', u'restaurant'], 'pos')
([u'great', u'time', u'chez', u'jerome'], 'pos')
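
Once the reader exists, it can also report the categories and the files in each one, which is a quick way to check that `cat_pattern` matched your folders. The sketch below assumes NLTK is installed and writes a throwaway two-file copy of the toy corpus to a temporary directory (an assumption made for illustration; point `root` at your own corpus instead).

```python
import os
import tempfile

from nltk.corpus import CategorizedPlaintextCorpusReader

# Throwaway copy of the toy corpus; the temporary directory is an assumption for illustration.
root = tempfile.mkdtemp()
samples = {
    'pos/1.txt': 'This is a great restaurant.',
    'neg/1.txt': 'Slow service.',
}
for fileid, text in samples.items():
    path = os.path.join(root, *fileid.split('/'))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w', encoding='ascii') as handle:
        handle.write(text + '\n')

reader = CategorizedPlaintextCorpusReader(root, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

print(reader.categories())                 # every category found by cat_pattern
print(reader.fileids(categories='pos'))    # files filed under 'pos'
print(list(reader.words('neg/1.txt')))     # tokens of a single file
```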

Similar questions were asked in "Creating a custom categorized corpus in NLTK and Python" and "Using my own corpus for category classification in Python NLTK".


Here is the full code:

import string
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader

# Directory that contains the pos/ and neg/ folders.
mydir = '/home/alvas/my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
# One (tokens, label) pair per file; the label is the top-level folder name ('neg' or 'pos').
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Use the 100 most frequent words as features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]

# 90% of the documents for training, the remaining 10% for testing.
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
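
The feature step in the full code packs a lot into one line: each document becomes a dictionary that records, for every feature word, whether the word occurs in that document. The sketch below reproduces this on the four toy documents, with `collections.Counter` standing in for `FreqDist` (an assumption made so it runs without NLTK); `document_features` is a hypothetical helper.

```python
from collections import Counter
from itertools import chain

# (tokens, label) pairs, as printed for the toy corpus earlier in this answer.
documents = [
    (['Food', 'fit', '****'], 'neg'),
    (['Slow', 'service'], 'neg'),
    (['great', 'restaurant'], 'pos'),
    (['great', 'time', 'chez', 'jerome'], 'pos'),
]

# Count every token across all documents and keep the most frequent ones.
counts = Counter(chain.from_iterable(tokens for tokens, _ in documents))
word_features = [word for word, _ in counts.most_common(3)]

def document_features(tokens):
    """Return {word: True/False} saying whether each feature word occurs in the document."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}

featuresets = [(document_features(tokens), label) for tokens, label in documents]
print(word_features[0])
print(featuresets[2])
```

A list of `(feature_dict, label)` pairs like `featuresets` is the shape that `NaiveBayesClassifier.train` takes as input.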

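Comments:

Add the other lines after this =). Use exactly the same code after the documents = ... line.

Can you explain what you mean? I don't understand.

To get the most informative features, use classifier.show_most_informative_features(5)

Thank you very much. It works. Now I will try to understand the code ;)

@alvas I get an error on the line word_features = word_features.keys()[:100], saying TypeError: 'dict_keys' object is not subscriptable. What could be the cause?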
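
The `TypeError` in the last comment is what Python 3 raises when a `dict.keys()` view is sliced; in Python 2, `keys()` returned a list, so slicing it worked. Below is a minimal sketch of the failure and two possible workarounds. It uses `collections.Counter` as a stand-in for `FreqDist` (an assumption, although NLTK 3 implements `FreqDist` as a `Counter` subclass).

```python
from collections import Counter

counts = Counter(['great', 'great', 'food', 'slow'])

try:
    counts.keys()[:100]          # Python 3: keys() is a view, not a list
except TypeError as error:
    print(error)

# Workaround 1: materialise the view first (order is insertion order, not frequency).
first_keys = list(counts.keys())[:100]

# Workaround 2: ask for the most frequent words explicitly.
top_words = [word for word, _ in counts.most_common(100)]
print(top_words[0])
```

The second workaround is probably closer to the intent of the original line, which was to use the most frequent words as features.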
