ValueError: Input has n_features=12 while the model has been trained with n_features=2494

Posted 2023-03-12

Asked: 2021-11-06 19:29:24

[Question]:

I trained a model using CountVectorizer, TfidfTransformer, and an SGD classifier.

This is the tokenizer part:

from keras.preprocessing.text import Tokenizer
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`|~', lower=True)
tokenizer.fit_on_texts(master_df['Observation'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Then I trained the model:

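from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline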
cv=CountVectorizer(max_df=1.0,min_df=1, stop_words=stop_words, max_features=10000, ngram_range=(1,3))
X=cv.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42, stratify=y)
sgd = Pipeline([('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
               ])
sgd.fit(X_train, y_train)


y_pred = sgd.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))

This part works fine. However, when I try to use the model for prediction with this code:

sentence="Drill was not in operation in the mine at the time of visit."
test=preprocess_text(sentence)
test=test.lower()
print(test)
test=[test] 
tokenizer.fit_on_texts(test)
word_index = tokenizer.word_index
#print(word_index)
test1=cv.fit_transform(test)
print(test1)
output=sgd.predict(test1)
output

it gives me this error:

ValueError: Input has n_features=12 while the model has been trained with n_features=2494
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18044/596445027.py in <module>
      9 test1=cv.fit_transform(test)
     10 print(test1)
---> 11 output=sgd.predict(test1)
     12 output

~\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    118 
    119         # lambda, but not partial, allows help() to work with update_wrapper
--> 120         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    121         # update the docstring of the returned function
    122         update_wrapper(out, self.fn)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
    416         Xt = X
    417         for _, name, transform in self._iter(with_final=False):
--> 418             Xt = transform.transform(Xt)
    419         return self.steps[-1][-1].predict(Xt, **predict_params)
    420 

~\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, X, copy)
   1491             expected_n_features = self._idf_diag.shape[0]
   1492             if n_features != expected_n_features:
-> 1493                 raise ValueError("Input has n_features=%d while the model"
   1494                                  " has been trained with n_features=%d" % (
   1495                                      n_features, expected_n_features))

ValueError: Input has n_features=12 while the model has been trained with n_features=2494

I think the problem is in the word_index=tokenizer line, but I don't know how to fix it.

[Answer 1]:

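Never call fit_transform on test or new data; use transform only. As the traceback shows, the failing cell calls cv.fit_transform(test), which re-learns the vocabulary from the single sentence and so produces 12 features instead of the 2494 the model was trained on. Change that line to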

test1=cv.transform(test)
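
To illustrate the difference, here is a minimal sketch on a tiny made-up corpus (the texts and variable names are assumptions for illustration, not the asker's data). It compares the number of columns produced by reusing the fitted vectorizer with the number produced by fitting a fresh one on the new sentence alone.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for the training text (illustrative only).
train_texts = [
    "drill was not in operation",
    "the mine was inspected at the time of visit",
    "no hazards were observed in the mine",
]
new_text = ["drill was not in operation in the mine"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)  # learns the vocabulary once
n_train_features = X_train.shape[1]

# Reusing the fitted vocabulary keeps the column count equal to training.
X_new = cv.transform(new_text)

# A fresh fit learns a vocabulary from the new sentence only,
# so the column count no longer matches what a trained model expects.
X_refit = CountVectorizer().fit_transform(new_text)

print(n_train_features, X_new.shape[1], X_refit.shape[1])
```

Only the matrix from transform has the width that a model trained on X_train would accept.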

Likewise, you should not refit the tokenizer on the test data with tokenizer.fit_on_texts(test); change it to

tokenizer.texts_to_sequences(test)

For more on Tokenizer, see the documentation and the SO thread "What does Keras Tokenizer method exactly do?".
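
One way to make this kind of mismatch harder to hit is to put the CountVectorizer inside the Pipeline, so that fit and predict both take raw strings and the vectorizer is only ever fitted on training text. The sketch below assumes this restructuring, which is not part of the question's code; the texts, labels, and the name text_clf are made up for illustration. Note also that in the question's code the Keras tokenizer's output is never passed to cv or sgd, so it plays no part in the scikit-learn prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up training data; texts and labels are illustrative only.
texts = [
    "drill was not in operation in the mine",
    "drill was idle during the visit",
    "ventilation fan was running normally",
    "fan and ventilation system working",
]
labels = ["drill", "drill", "ventilation", "ventilation"]

# The vectorizer is the first pipeline step, so it is fitted together
# with the classifier and reused unchanged at prediction time.
text_clf = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
text_clf.fit(texts, labels)

sentence = "Drill was not in operation in the mine at the time of visit."
prediction = text_clf.predict([sentence])  # raw text in, label out
print(prediction)
```

With this layout, train_test_split would be applied to the raw texts rather than to an already vectorized matrix.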
