How to train a Naive Bayes classifier for n-grams (movie_reviews)


Posted: 2018-06-08 18:58:54

Question:

Below is code that trains a Naive Bayes classifier on the movie_reviews dataset with a unigram model. I would like to train it with bigram and trigram models and analyze its performance. How can we do that?

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")] 
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

pos_data = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_data.append((create_word_features(words), "positive"))    

neg_data = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_data.append((create_word_features(words), "negative")) 

train_set = pos_data[:800] + neg_data[:800]
test_set =  pos_data[800:] + neg_data[800:]

classifier = NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.util.accuracy(classifier, test_set)

Comments:

Have you seen these posts: n-grams with Naive Bayes classifier? Or: n-grams with Naive Bayes classifier Error

Answer 1:

Simply change your featurizer:

from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict
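For intuition, `nltk.ngrams` lazily yields tuples of n consecutive tokens. A minimal pure-Python equivalent (a sketch, written without NLTK so it runs standalone; `simple_ngrams` is a made-up name, not an NLTK function) behaves like this:

```python
def simple_ngrams(tokens, n):
    """Yield tuples of n consecutive tokens, like nltk.ngrams does."""
    # zip over n staggered views of the token list
    return zip(*(tokens[i:] for i in range(n)))

tokens = ["this", "movie", "was", "great"]
# The featurizer turns these tuples into a {ngram: True} feature dict.
bigram_features = {ng: True for ng in simple_ngrams(tokens, 2)}
print(sorted(bigram_features))
# → [('movie', 'was'), ('this', 'movie'), ('was', 'great')]
```

Note that the feature keys are now tuples instead of strings, which is fine for NLTK's classifiers since they only need hashable keys.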

By the way, your code will be a lot faster if you change your featurizer to use a set for the stopword list and initialize it only once:

stoplist = set(stopwords.words("english"))

def create_word_features(words):
    useful_words = [word for word in words if word not in stoplist] 
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

Someone should really tell the NLTK folks to convert the stopword list to a set type, since it is "technically" a unique list (i.e. a set):

>>> from nltk.corpus import stopwords
>>> type(stopwords.words('english'))
<class 'list'>
>>> type(set(stopwords.words('english')))
<class 'set'>

For a fun benchmark:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in [1,2,3,4,5]:
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))    

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative")) 

    train_set = pos_data[:800] + neg_data[:800]
    test_set =  pos_data[800:] + neg_data[800:]

    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print(str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram accuracy: 0.735
2-gram accuracy: 0.7625
3-gram accuracy: 0.8275
4-gram accuracy: 0.8125
5-gram accuracy: 0.74

Your original code returns an accuracy of 0.725.

Using more n-gram orders at once:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import everygrams

def create_ngram_features(words, n=2):
    ngram_vocab = everygrams(words, 1, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in range(1,6):
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))    

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative")) 

    train_set = pos_data[:800] + neg_data[:800]
    test_set =  pos_data[800:] + neg_data[800:]
    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('1-gram to', str(n)+'-gram accuracy:', accuracy)

[out]:

1-gram to 1-gram accuracy: 0.735
1-gram to 2-gram accuracy: 0.7625
1-gram to 3-gram accuracy: 0.7875
1-gram to 4-gram accuracy: 0.8
1-gram to 5-gram accuracy: 0.82
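The difference from the first benchmark is that `everygrams(words, 1, n)` yields the n-grams of every order from 1 up to n, so the "1-gram to 2-gram" feature set contains both unigrams and bigrams. A pure-Python sketch of the idea (`simple_everygrams` is a made-up name written without NLTK so the example runs standalone):

```python
def simple_everygrams(tokens, min_len, max_len):
    """Yield all n-grams of orders min_len..max_len, like nltk.everygrams."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

features = {ng: True for ng in simple_everygrams(["a", "b", "c"], 1, 2)}
print(sorted(features))
# → [('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
```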

Comments:

I don't think the set() call is needed in stoplist = set(stopwords.words("english")), since stopwords.words("english") is already a set.

stopwords.words("english") is technically a "set" in the sense that it is a list of unique items, but the native Python type is a list. Converting it to a set and initializing it only once really does speed up the code =)

Oh thanks, I see.
