使用 NLTK 的 SklearnClassifier 和 ClassifierBasedPOSTagger 构建自己的基于分类器的词性标注器

Posted

技术标签:

【中文标题】使用 NLTK 的 SklearnClassifier 和 ClassifierBasedPOSTagger 构建自己的基于分类器的词性标注器【英文标题】:Building own classifier based POS tagger using NLTK's SklearnClassifier and ClassifierBasedPOSTagger 【发布时间】:2021-10-14 16:47:39 【问题描述】:

我正在尝试使用SklearnClassifierClassifierBasedPOSTagger 构建我自己的基于分类器的词性标注器。我试过的代码如下。

from nltk.corpus import treebank
nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from nltk.tag.sequential import ClassifierBasedPOSTagger

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
                                      classifier_builder=bnb.train)

# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

此代码产生以下错误:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\ABDULL~1.IMR\AppData\Local\Temp/ipykernel_6580/266992580.py in <module>
      4 
      5 bnb = SklearnClassifier(BernoulliNB())
----> 6 bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
      7                                       classifier_builder=bnb.train)
      8 

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in __init__(self, feature_detector, train, classifier_builder, classifier, backoff, cutoff_prob, verbose)
    637 
    638         if train:
--> 639             self._train(train, classifier_builder, verbose)
    640 
    641     def choose_tag(self, tokens, index, history):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, classifier_builder, verbose)
    673         if verbose:
    674             print("Training classifier ( instances)".format(len(classifier_corpus)))
--> 675         self._classifier = classifier_builder(classifier_corpus)
    676 
    677     def __repr__(self):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\classify\scikitlearn.py in train(self, labeled_featuresets)
    110 
    111         X, y = list(zip(*labeled_featuresets))
--> 112         X = self._vectorizer.fit_transform(X)
    113         y = self._encoder.fit_transform(y)
    114         self._clf.fit(X, y)

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in fit_transform(self, X, y)
    288             Feature vectors; always 2-d.
    289         
--> 290         return self._transform(X, fitting=True)
    291 
    292     def inverse_transform(self, X, dict_type=dict):

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in _transform(self, X, fitting)
    233                     if feature_name in vocab:
    234                         indices.append(vocab[feature_name])
--> 235                         values.append(self.dtype(v))
    236 
    237             indptr.append(len(indices))

TypeError: float() argument must be a string or a number, not 'NoneType'

怎么做才对?

【问题讨论】:

【参考方案1】:

根据issue 的评论,这是 scikit-learn 中的错误的结果。当输入参数X 包含到None 的映射时,Scikit-learn 的DictVectorizersklearn/feature_extraction/_dict_vectorizer.py 中的_transform 方法失败。根据Tom Aarsen,我们现在可以使用以下示例来完成工作:

import nltk
from nltk.corpus import treebank

from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from nltk.tag.sequential import ClassifierBasedPOSTagger

nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3]
test_data = data[3:]

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return 
            key: str(value) # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=bnb.train,
                                            verbose=True)

sentence = "This is a sample sentence which I just made for fun."
# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

输出会是这样的:

[nltk_data] Downloading package treebank to C:\Users\Tom/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
Constructing training corpus for classifier.
Training classifier (58 instances)
0.09338289371682999
[('This', 'NNP'), ('is', 'NNP'), ('a', 'NNP'), ('sample', 'NNP'), ('sentence', 'NNP'), ('which', 'NNP'), ('I', 'NNP'), ('just', 'NNP'), ('made', 'NNP'), ('for', 'NNP'), ('fun', 'NNP'), ('.', 'NNP')]

【讨论】:

以上是关于使用 NLTK 的 SklearnClassifier 和 ClassifierBasedPOSTagger 构建自己的基于分类器的词性标注器的主要内容,如果未能解决你的问题,请参考以下文章

NLTK 和语言检测

使用 NLTK 创建新语料库

使用 NLTK 创建新语料库

如何使用 NLTK 检查不可读的 OCRed 文本

如何在 Python-NLTK 中使用 Gale-Church 算法?

在 Python 中使用 NLTK 和 Pandas 解压的值太多