Why did NLTK NaiveBayes classifier misclassify one record?

Posted 2023-03-12

【Title】: Why did NLTK NaiveBayes classifier misclassify one record? 【Posted】: 2018-06-28 08:40:54 【Question】:

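This is my first time building a sentiment analysis machine learning model with the nltk NaiveBayesClassifier in Python. I know the model is too simple, but it is only a first step for me, and I will try tokenized sentences next time.

The real issue with my current model is this: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each (lowercased) sentence in the list ['awesome movie', 'i like it', 'it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.

INPUT: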

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word,True)])
#NOTE THAT THE FUNCTION 'word_feat(word)' I WROTE HERE IS DIFFERENT FROM THE 'word_feats(words)' FUNCTION I DEFINED EARLIER. THIS FUNCTION IS USED TO ITERATE OVER EACH OF THE THREE ELEMENTS IN THE LIST ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()

OUTPUT:

awesome movie is pos

i like it is pos

it is so bad is pos

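To make sure that the function 'word_feat(word)' iterates over each sentence rather than over each word or letter, I added some diagnostic code to see what each element in 'word_feat(word)' is:

for word in words:
    print(word_feat(word))

Then it printed out:

{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}

So it seems like the function 'word_feat(word)' is correct?

Does anyone know why the classifier classified 'it is so bad' as positive? As mentioned, I explicitly labeled the word 'bad' as negative in my training data.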

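【Comments】:

Can you try a neutral word and see whether the output is neutral or positive? For example, 'breaking bad is really a good drama', bad -> neutral?

This is a statistical model. Many things can lead to an output you might not want, yet that output might not be wrong: e.g. preprocessing, data bias, backoff strategy, etc.

You can't expect a machine learning model to classify every instance correctly. You need to produce some metrics (e.g. accuracy, confusion matrix, etc.) to evaluate its performance. Once you have calculated these metrics, you can analyze the misclassified points and see whether you can improve performance by (for example) introducing more features.

Is there a copy-paste mistake in your listing? word_feats, positive_vocab, negative_vocab and neutral_vocab are all defined twice.

【Answer 1】: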

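This particular failure happens because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.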
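The difference is easy to reproduce without NLTK. A minimal sketch, reusing the word_feats() definition from the question; the sample inputs are only illustrative:

```python
def word_feats(words):
    # Same definition as in the question: one feature per element of `words`.
    return dict([(word, True) for word in words])

# A list of words yields one feature per word.
print(word_feats(['it', 'is', 'so', 'bad']))
# {'it': True, 'is': True, 'so': True, 'bad': True}

# A plain string is iterated character by character, so the features are letters.
print(word_feats('bad'))
# {'b': True, 'a': True, 'd': True}
```

Printing the argument and the resulting dictionary like this is usually the quickest way to check what a feature function really iterates over.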

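You are probably in this predicament because you don't pay attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contains what its name claims. To understand and improve your program, start by naming things properly.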

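Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.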
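As a rough sketch of that data shape: each training example pairs the features of one tokenized sentence with a label. The three sentences and their labels below are made up for illustration, and tokenizing with str.split() is a simplifying assumption. A list of (feature dict, label) pairs like train_set is the kind of input NaiveBayesClassifier.train() accepts.

```python
def word_feats(words):
    # Same definition as in the question.
    return dict([(word, True) for word in words])

# Hypothetical labeled sentences, invented for this example.
labeled_sentences = [
    ("awesome movie", "pos"),
    ("it is so bad", "neg"),
    ("the actors did not know the words", "neu"),
]

# Tokenize each sentence, then pair its feature dict with its label.
train_set = [(word_feats(text.split()), label) for text, label in labeled_sentences]

for feats, label in train_set:
    print(label, sorted(feats))
```

A sentence to classify would go through the same split-then-word_feats() steps before being handed to the classifier.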

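【Discussion】:

I think my word_feats() function iterates over words, not letters. For example, when I run the code 'word_feats(positive_vocab)', it returns '{'nice': True, 'outstanding': True, 'great': True, 'terrific': True, ':)': True, 'good': True, 'awesome': True, 'fantastic': True}'. So it is iterating over words, right? I agree that I should build the training data on tokenized sentences, but as I mentioned, I am still new to this field. I will implement tokenized sentences once I am more familiar with NLP.

The example in your comment iterates over words because you pass it a list of words. The code in your question passes word_feats() a string, because you iterate over the list before calling it. Have word_feats() print out its argument and the dictionary it builds, and you'll see for yourself.

@Darren's comment under your question is spot on: you actually define two classifiers (the second overwrites the first), one with word-list input and one with string input. But your main loop classifies strings. Clean up your code, name your variables appropriately, and pay attention to your data structures! Even more so when asking a question.

I have fixed my code and updated it in my question. The output still misclassifies the sentence 'it is so bad' as positive. When I print out 'word_feats(words)', where 'words' refers to the list ['awesome movie', 'i like it', 'it is so bad'], it correctly prints '{'awesome movie': True, 'i like it': True, 'it is so bad': True}'. So that means it must be iterating over each sentence in the list rather than over a string, right?

If you want to know what your code iterates over, print out some diagnostic output. Random strangers on the internet, however experienced, are not nearly as reliable.

【Answer 2】:

Here is the modified code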

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # this is actually a list of sentences

stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords') once

for sent in sentences:
    if sent != "":
        # Split the sentence into words and drop English stop words.
        words = [word for word in sent.split(" ") if word not in stop_words]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()

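I modified the places where you were treating a 'list of words' as the input to your classifier. You actually need to pass sentences one at a time, which means you need to start from a 'list of sentences'.

Also, for each sentence you need to pass 'words as features', which means you need to split the sentence on whitespace.

In addition, if you want your classifier to work properly for sentiment analysis, you need to give less weight to 'stop words' such as 'it', 'they', 'is', etc., because these words are not enough to decide whether a sentence is positive, negative, or neutral.

The code above gives the output below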

awesome movie --> pos

 i like it --> pos

 it is so bad --> neg

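So for any classifier, the input format used for training and the input format used for prediction should be the same. Since you provide lists of words while training, use the same method to convert your test set.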
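One way to keep the two formats in step is to route both training text and test text through a single conversion function. A minimal sketch, assuming whitespace tokenization and a small hand-picked STOP_WORDS set that stands in for stopwords.words('english'); the names to_features and STOP_WORDS are hypothetical, not part of NLTK:

```python
# Small illustrative stop-word set; an assumption, not NLTK's English list.
STOP_WORDS = {"i", "it", "is", "so", "the", "they", "was"}

def to_features(sentence):
    """Convert raw text to a {word: True} dict, dropping stop words."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return {word: True for word in words}

# The same function prepares a training example and a sentence to classify.
train_example = (to_features("Awesome movie"), "pos")
test_features = to_features("It is so bad")

print(train_example)   # ({'awesome': True, 'movie': True}, 'pos')
print(test_features)   # {'bad': True}
```

Because there is only one conversion path, a change to tokenization or stop-word handling automatically applies to both training and prediction.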

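【Discussion】:

Thanks @Gunjan. This helps me a lot. If I am correct, I think one of the issues with my original script (among other mistakes) is that I passed whole sentences rather than individual words to 'word_feats', which confused the ML model and left it unable to classify the sentiment correctly.

@Stanleyrr: Yes, so basically when you say you are passing words, you are actually converting your sentence into a list of features (in our case, the features are words). An ML model works entirely on the features you provide. Removing stop words also makes your features (words) more refined. This affects your output as well, because the model now ignores words like 'it' and 'so'.

【Answer 3】:

Let me show a rewrite of your code. The only change I made near the top was adding import re, because it is easier to tokenize with a regex. Everything else up to the definition of classifier is the same as your code.

I added one more test case (something really, really negative), but more importantly I used proper variable names, so that it is much harder to get confused about what is going on: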

test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')

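So sentences now contains 4 sentence strings (plus a trailing empty string after the final period, which the loop below skips). I didn't change your word_feat() function.

To use the classifier I did quite a big rewrite: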

for sentence in sentences:
    if(len(sentence) == 0):continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n"%(sentence,pos,neg))

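The outer loop again uses descriptive names, so sentence contains one sentence.

I then have an inner loop where we classify each word in the sentence; I'm using a regex to split the sentence on whitespace and punctuation: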

 for word in re.findall(r"[\w']+", sentence):
     classResult = classifier.classify(word_feat(word))
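To see what that pattern produces on its own, here is a small standalone check. The first input is one of the sentence strings from the test data; the second is an extra illustration that does not come from the question:

```python
import re

TOKEN_PATTERN = r"[\w']+"  # runs of word characters and apostrophes

tokens = re.findall(TOKEN_PATTERN, " it is so bad")
print(tokens)   # ['it', 'is', 'so', 'bad']

# Apostrophes stay inside a token; commas and spaces act as separators.
more_tokens = re.findall(TOKEN_PATTERN, "don't like it, sorry")
print(more_tokens)   # ["don't", 'like', 'it', 'sorry']
```

Unlike split(' '), this never yields empty strings for leading spaces or repeated separators.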

The rest is just basic adding up and reporting. I get this output:

awesome pos
movie neu

awesome movie: 1 vs -0

i pos
like pos
it pos

 i like it: 3 vs -0

it pos
is neu
so pos
bad neg

 it is so bad: 2 vs -1

i pos
hate neg
this pos
terrible neg
useless neg
movie neu

 i hate this terrible useless movie: 2 vs -3

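I still get the same result as you: 'it is so bad' is considered positive. With the extra debug lines we can see that this is because 'it' and 'so' are considered positive words and 'bad' is the only negative word, so overall the sentence comes out positive.

I suspect this is because it hasn't seen those words in its training data.

...yes, if I add 'it' and 'so' to the list of neutral words, I get 'it is so bad: 0 vs -1'.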
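The tally behind those counts can be reproduced in isolation. A minimal sketch that takes the per-word labels from the debug output for 'it is so bad' as given, and assumes 'it' and 'so' come out as 'neu' once they are added to the neutral list; the helper name tally is hypothetical:

```python
from collections import Counter

def tally(labels):
    """Count 'pos' and 'neg' votes; 'neu' words contribute to neither."""
    counts = Counter(labels)
    return counts["pos"], counts["neg"]

# Per-word labels from the debug output, in order: it, is, so, bad.
before = tally(["pos", "neu", "pos", "neg"])
print("%d vs -%d" % before)   # 2 vs -1

# Assumed labels after 'it' and 'so' are added to the neutral list.
after = tally(["neu", "neu", "neu", "neg"])
print("%d vs -%d" % after)    # 0 vs -1
```

With so few words per sentence, a single mislabeled filler word is enough to flip the vote, which is why the per-word debug output is worth printing.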

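As next things to try, I'd suggest:

- Try more training data; toy examples like this carry the risk that the noise will swamp the signal.
- Look into removing stop words.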

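【Discussion】:

s/re.findall(r"[\w']+",/nltk.word_tokenize(/. As a matter of principle and for future use...

@Darren, thank you! This is super helpful information. Printing out the classification of each word in the sentence, as you did, is a great idea; I should do that more often. So I added the three words 'it', 'so' and 'really' to my 'neutral_vocab' variable and tried the classification again. The weird thing is that the words 'it', 'so' and 'really' are each classified as neutral on their own, but when I classify the sentence 'it is so bad', it still comes out positive. At this point, I will try some other sentiment analysis functions in Python, add more training data to the model, and remove stop words.

【Answer 4】:

You can try this code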

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Note: if `words` is a single string, this iterates over its characters.
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic','terrific','good','nice','great', ':)','love' ]
negative_vocab = [ 'bad', 'terrible','useless','hate',':(','kill','steal']
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0

sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify( word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1


print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

The results are: Positive: 0.7142857142857143 Negative: 0.14285714285714285
