SKlearn SGD Partial Fit error: Number of features 378 does not match previous data 4598

Posted 2023-03-12

【Title】: SKlearn SGD Partial Fit error: Number of features 378 does not match previous data 4598 【Posted】: 2017-10-17 10:09:27 【Question】:

I pickled my classifier, loaded it in another notebook, and tried to call partial_fit on it, but I get the error: Number of features 378 does not match previous data 4598.

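import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

# Load the fitted vectorizer (pickle files should be opened in binary mode)
with open("models/count_vect_Item Group.pkl", 'rb') as f:
    global count_vect_item_group
    count_vect_item_group = joblib.load(f)

# Load the previously trained classifier
with open("models/model_Item Group.pkl", 'rb') as f:
    global model_predicted_item_group
    model_predicted_item_group = joblib.load(f)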

count_matrix_X_train = count_vect_item_group.fit_transform(X_test)
X_train_tf_idf = tf_idf(count_matrix_X_train)

model_predicted_item_group.partial_fit(X_train_tf_idf, labels_test)

I am unable to train my classifier on the new dataset.

【Question comments】:

【Answer 1】:

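This error occurs because, before you pickled the classifier, you trained it on 4598 features (the number of columns in X), and now you are passing it only 378. The number of features must be the same as in the data the model was originally trained on.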

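To get matching features, just call count_vect_item_group.transform(). Right now you are calling fit_transform() on count_vect_item_group again, which makes it forget the vocabulary it learned earlier and fit to the new data instead, so it finds fewer features than before.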
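The difference between the two calls can be seen on a toy corpus. This is only a minimal sketch: the documents and variable names below are made up for illustration and are not the OP's data, and the vectorizer is assumed to be a plain CountVectorizer with default settings.

```python
import copy

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data only (not from the original question).
old_docs = ["red apple pie", "green apple juice", "blue cheese plate"]
new_docs = ["red apple", "green juice"]

# Fit once on the original training text: this fixes the vocabulary.
vect = CountVectorizer()
X_old = vect.fit_transform(old_docs)

# transform() reuses the learned vocabulary, so the column count is unchanged.
X_new_ok = vect.transform(new_docs)

# fit_transform() learns a new vocabulary from the new text only.
# A copy is re-fitted here so that `vect` itself stays intact.
X_new_bad = copy.deepcopy(vect).fit_transform(new_docs)

print(X_old.shape[1], X_new_ok.shape[1], X_new_bad.shape[1])
```

Words in the new documents that are not in the stored vocabulary are simply ignored by transform(), which is why the column count stays fixed.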

Change your code to:

count_matrix_X_train = count_vect_item_group.transform(X_test)
X_train_tf_idf = tf_idf(count_matrix_X_train)

model_predicted_item_group.partial_fit(X_train_tf_idf, labels_test)
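For a fuller picture, here is a hedged end-to-end sketch of the same workflow: train, save with joblib, reload, and continue training with partial_fit. The corpus, labels, and file names are invented for the example, and the tf-idf step is left out to keep it short, so treat it as an illustration of the idea rather than the OP's actual pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Illustrative data only (not from the original question).
train_docs = ["red apple pie", "green apple juice",
              "blue cheese plate", "aged cheese board"]
train_labels = ["fruit", "fruit", "dairy", "dairy"]
new_docs = ["green apple pie", "blue cheese board"]
new_labels = ["fruit", "dairy"]

# First session: fit the vectorizer, train the classifier, save both.
vect = CountVectorizer()
clf = SGDClassifier(random_state=0)
# The first partial_fit call must be told the full set of classes.
clf.partial_fit(vect.fit_transform(train_docs), train_labels,
                classes=np.unique(train_labels))

with tempfile.TemporaryDirectory() as tmp_dir:
    vect_path = os.path.join(tmp_dir, "vect.pkl")  # hypothetical file names
    clf_path = os.path.join(tmp_dir, "clf.pkl")
    joblib.dump(vect, vect_path)
    joblib.dump(clf, clf_path)

    # Second session: load both objects back.
    vect2 = joblib.load(vect_path)
    clf2 = joblib.load(clf_path)

# Correct: transform() keeps the stored vocabulary, so shapes match.
X_new = vect2.transform(new_docs)
clf2.partial_fit(X_new, new_labels)

# Incorrect: a freshly fitted vectorizer yields a different column count.
try:
    clf2.partial_fit(CountVectorizer().fit_transform(new_docs), new_labels)
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("partial_fit rejected the input:", exc)
```

The exact wording of the ValueError depends on the scikit-learn version; the message quoted in the question comes from an older release.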

【Comments】:

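What is tf_idf() on the second line?

@wonhee It is a method the OP implemented in his own code.

count_vect_item_group is the classifier he stored in the file, so count_vect_item_group.transform() will return a matrix. What is tf_idf supposed to do here? Maybe just something like count_matrix_X_train.toarray()?

@wonhee count_matrix_X_train is, as the name suggests, a count matrix. tf_idf (as the name suggests) should compute the tf-idf values of the matrix from the given counts. You can ping the OP for more information.

OK.. so the original post says he pickled his classifier, but I think it is actually a vectorizer? I don't think the SGD classifier has anything like fit_transform() or transform() at all.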
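Since the OP's tf_idf() is never shown, the following is only a guess at what such a helper might look like, using scikit-learn's TfidfTransformer. The function body and the sample documents are assumptions made for illustration. The point is that a tf-idf re-weighting changes the values in the matrix but not its number of columns, so the feature count is still decided by the vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def tf_idf(count_matrix):
    """Hypothetical stand-in for the OP's helper: re-weight raw counts by tf-idf.

    Assumption: the idf weights are computed from the matrix passed in.
    The OP's real implementation may differ.
    """
    return TfidfTransformer().fit_transform(count_matrix)


# Illustrative data only (not from the original question).
docs = ["red apple pie", "green apple juice", "blue cheese plate"]
counts = CountVectorizer().fit_transform(docs)
weighted = tf_idf(counts)

# Same rows and columns as the count matrix; only the values differ.
print(counts.shape, weighted.shape)
```

If the idf weights should stay consistent between sessions, one option would be to fit the TfidfTransformer once, save it alongside the vectorizer, and call only its transform() afterwards.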
