Loading pickle NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted

Posted: 2019-12-04 09:25:49

Question:

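Multi-label classification

I am trying to predict multi-label classifications using scikit-learn, pandas, OneVsRestClassifier and logistic regression. Building and evaluating the model works, but classifying new sample text does not.

Scenario 1: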

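I built a model, saved it under the name sample.pkl and restarted my kernel. When I load the saved model (sample.pkl) and try to predict on sample text, I get this error:

 NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

In short: I build and evaluate the model, save it as sample.pkl, restart the kernel, load the model, and predicting on sample text then raises the NotFittedError above.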
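The error can be reproduced without any saved model. A freshly constructed TfidfVectorizer has learned no vocabulary, so calling transform on it raises NotFittedError, while a fitted one transforms new text normally. A minimal sketch, using made-up sample sentences that are not from the original data:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample texts, only for illustration
docs = ["a story about a family", "a band and its biggest fan"]

fitted = TfidfVectorizer().fit(docs)  # learns the vocabulary and idf weights from docs
fresh = TfidfVectorizer()             # same class, but nothing has been learned yet

# The fitted vectorizer can transform unseen text
vec = fitted.transform(["a family band"])
print(vec.shape)

# The fresh vectorizer cannot: there is no vocabulary to map words to columns
try:
    fresh.transform(["a family band"])
    raised = False
except NotFittedError as exc:
    raised = True
    print("NotFittedError:", exc)
```

The exact wording of the message may differ between scikit-learn versions, but the exception type is the same.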

Inference

import collections
import csv
import json
import os
import pickle
import re
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score  # performance metric
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier  # binary relevance
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def cleanHtml(sentence):
    """Remove HTML tags."""
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext


def cleanPunc(sentence):
    """Clean the text of punctuation and special characters."""
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned

def keepAlpha(sentence):
    """Keep only alphabetic characters in each word."""
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def remove_stopwords(text):
    """Remove stop words."""
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

test1 = pd.read_csv("C:\\Users\\abc\\Downloads\\test1.csv")
test1.columns

test1.head()
siNo  plot                              movie_name       genre_new
1     The story begins with Hannah...   sing             [drama,teen]
2     Debbie's favorite band is Dream.. the bigeest fan  [drama]
3     This story of a Zulu family is .. come back,africa [drama,Documentary]

Error: this is where I get the error, when I run inference on sample text.

def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)
    q = remove_stopwords(q)
    multilabel_binarizer = MultiLabelBinarizer()
    tfidf_vectorizer = TfidfVectorizer()
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)


for i in range(5):
    print(i)
    k = test1.sample(1).index[0] 
    print("Movie: ", test1['movie_name'][k], "\nPredicted genre: ", infer_tags(test1['plot'][k]))
    print("Actual genre: ", test1['genre_new'][k], "\n")

Solved

I solved it by also saving the fitted tfidf vectorizer and the MultiLabelBinarizer to pickle files, and loading them at inference time.

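import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package

# After training: save the fitted vectorizer and binarizer
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pickle")
joblib.dump(multilabel_binarizer, "multibinirizer_vectorizer.pickle")

# In the new session: load the fitted objects from wherever the files were saved
vectorizer = joblib.load('/abc/downloads/tfidf_vectorizer.pickle')
multilabel_binarizer = joblib.load('/abc/downloads/multibinirizer_vectorizer.pickle')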


def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)      
    q = remove_stopwords(q)
    q_vec = vectorizer.transform([q])
    # rf_model: assumed to be the classifier loaded from sample.pkl
    q_pred = rf_model.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

I found the solution through this question: How do I store a TfidfVectorizer for future use in scikit-learn?
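A possible alternative, not the approach used above, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline so that one file holds both fitted steps. The sketch below assumes toy data, hypothetical file names (model.joblib, binarizer.joblib) and a temporary directory; the MultiLabelBinarizer is still saved separately because it transforms the labels rather than the features.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the plot and genre columns (hypothetical values)
plots = ["teen drama at school", "family documentary in africa",
         "school band drama", "africa wildlife documentary"]
genres = [["drama", "teen"], ["documentary"], ["drama"], ["documentary"]]

multilabel_binarizer = MultiLabelBinarizer()
y = multilabel_binarizer.fit_transform(genres)

# The pipeline fits the vectorizer and the classifier together on raw text
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(plots, y)

# Persist the fitted pipeline and the fitted binarizer (hypothetical file names)
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")
binarizer_path = os.path.join(out_dir, "binarizer.joblib")
joblib.dump(model, model_path)
joblib.dump(multilabel_binarizer, binarizer_path)

# Later, e.g. after a kernel restart: load both and predict on raw text
loaded_model = joblib.load(model_path)
loaded_binarizer = joblib.load(binarizer_path)
pred = loaded_model.predict(["a teen drama"])
labels = loaded_binarizer.inverse_transform(pred)
print(labels)
```

With this layout, infer_tags only needs the cleaning steps followed by loaded_model.predict([q]), since the pipeline applies the fitted vectorizer itself.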

Answer 1:

This happens because you only dumped the classifier to the pickle, not the vectorizer.

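During inference, when you call

 tfidf_vectorizer = TfidfVectorizer()

you create a new vectorizer that has not been fitted on the training vocabulary, which causes the error.

What you should do is dump both the classifier and the vectorizer to pickle, and load both of them during inference.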
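One way to follow this advice is to pickle all fitted objects together, so the classifier can never be loaded without the vectorizer it was trained with. This is a minimal sketch with toy data; the dictionary keys and the temporary path are assumptions for illustration, while the variable names tfidf_vectorizer, multilabel_binarizer and clf follow the question.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the plot and genre columns (hypothetical values)
plots = ["teen drama at school", "family documentary in africa",
         "school band drama", "africa wildlife documentary"]
genres = [["drama", "teen"], ["documentary"], ["drama"], ["documentary"]]

# Training: fit the vectorizer, the label binarizer and the classifier
tfidf_vectorizer = TfidfVectorizer()
multilabel_binarizer = MultiLabelBinarizer()
X = tfidf_vectorizer.fit_transform(plots)
y = multilabel_binarizer.fit_transform(genres)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Save all three fitted objects in one file (keys are hypothetical)
path = os.path.join(tempfile.mkdtemp(), "sample.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": tfidf_vectorizer,
                 "binarizer": multilabel_binarizer,
                 "classifier": clf}, f)

# Inference in a new session: load the bundle, do not construct new objects
with open(path, "rb") as f:
    bundle = pickle.load(f)

q_vec = bundle["vectorizer"].transform(["a drama about a school band"])
q_pred = bundle["classifier"].predict(q_vec)
tags = bundle["binarizer"].inverse_transform(q_pred)
print(tags)
```

The key point is that transform and inverse_transform are called on the loaded, already fitted objects, never on newly constructed ones.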

Comments:

I solved this issue by using joblib pickle. Thanks for your support @0x5050.

from sklearn.externals import joblib
pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(multilabel_binarizer, open("multibinirizer_vectorizer.pickle", "wb"))
multilabel_binarizer = joblib.load('multibinirizer_vectorizer.pickle')
vectorizer = joblib.load('tfidf_vectorizer.pickle')
