Predict unseen data by previously trained model

Posted 2023-03-12

【Title】: Predict unseen data by previously trained model
【Posted】: 2021-07-10 20:59:00

【Question】:

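I am doing supervised machine learning with Scikit-learn. I have two datasets. The first contains X features together with Y labels. The second contains only X features and has no Y labels. I can successfully fit a LinearSVC on the train/test data and obtain Y labels for the test set.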

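Now I want to use the model I trained on the first dataset to predict labels for the second dataset. How can I apply a model trained on the first dataset to the second (unlabeled, unseen) dataset in Scikit-learn?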

Code snippet I tried (updated following the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle


# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]

# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text': some_text,
                   'y_variable': y_variable})

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape


# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
                                                 labels,
                                                 train_size=0.5,
                                                 random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))


# ----------- Dataset 2: UNSEEN DATASET ----------- #

some_text2 = ['Harry potter books are amazing',
             'Gluten free diet is getting popular']

unseen_df = pd.DataFrame({'text': some_text2})  # Note: no y_variable here. This is the dataset whose y_variable labels (1 or 0) I am trying to predict.


# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26


print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)


# Looking for an output like this for UNSEEN data
# Looking for results after predicting unseen and no label data. 
# text                                   y_variable
# Harry potter books are amazing         1
# Gluten free diet is getting popular    0

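It does not have to be the pickle code I tried above. I am looking for suggestions, or for any built-in scikit-learn functionality that can make this prediction.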

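【Comments】:

X_unseen must have the same features, in the same order, as X_train and X_test.

We train models, not datasets (title edited); the question clearly has nothing to do with tensorflow - please do not spam irrelevant tags (removed).

【Answer 1】: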

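As you can see, your first tfidf transforms your input into 26 features, while your second tfidf transforms it into 11 features. The error occurs because X_train and X_unseen have different shapes. The message tells you that each observation in X_unseen has fewer features than the number model was trained to receive.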
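The mismatch can be reproduced without the classifier at all. The following is only a minimal sketch using the sample sentences from the question; the names `tfidf_train` and `tfidf_unseen` are made up here for illustration. Each vectorizer builds its own vocabulary from whatever text it is fitted on, so the two matrices end up with different column counts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences copied from the question
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# Vectorizer fitted on the training text: one column per distinct training word
tfidf_train = TfidfVectorizer()
X_all = tfidf_train.fit_transform(train_text)

# A second, independent vectorizer fitted on the unseen text:
# it only knows the words of the unseen text
tfidf_unseen = TfidfVectorizer()
X_unseen = tfidf_unseen.fit_transform(unseen_text)

n_train_cols = X_all.shape[1]
n_unseen_cols = X_unseen.shape[1]
print(n_train_cols, n_unseen_cols)  # 26 11, the numbers in the error message
```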

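After loading model in the second script, you fit another vectorizer on the text. In other words, the tfidf in the first script and the tfidf in the second script are different objects. To predict with model, you need to transform X_unseen with the original tfidf. To do that, export the original vectorizer, load it in the new script, and use it to transform the new data before passing it to model.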

### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))

# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)
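As a possible alternative to saving two separate files, the vectorizer and the classifier can be bundled into a single scikit-learn Pipeline, so that one pickled object carries both. This is not part of the original answer, only a sketch under that assumption; the name `text_clf` is hypothetical, and the object is pickled in memory here just to keep the example self-contained (pickle.dump to a file works the same way).

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Sample data copied from the question (books = 1, food = 0)
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1, 1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier together on raw text
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC()),
])
text_clf.fit(some_text, y_variable)

# Round-trip through pickle: the fitted vocabulary travels with the model
blob = pickle.dumps(text_clf)
restored = pickle.loads(blob)

# Predict on raw unseen text; the pipeline calls transform (not fit) internally
unseen_df = pd.DataFrame({'text': ['Harry potter books are amazing',
                                   'Gluten free diet is getting popular']})
unseen_df['y_variable'] = restored.predict(unseen_df['text'])
print(unseen_df)
```

With this layout there is no separate tfidf object to forget about, which removes the chance of refitting it on the new data by accident.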

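【Comments】:

I see. So is there a way to predict "unseen" data with the model I trained? I have the same X features in both datasets.

Could you update the question by adding samples of X_train and X_unseen? Then this might be a reshaping problem.

I updated my answer. This happens because you fitted a brand-new tfidf on the unseen data, and that object is not compatible with model.

Thank you very much for the explanation, it makes much more sense to me now. I tried your solution on the sample data above and it works well. However, when I use it on my real dataset I get the error NotFittedError: The TF-IDF vectorizer is not fitted for the unseen

Glad to help! (; Regarding your error, you must fit the vectorizer on the training data and then use the fitted vectorizer on the test data. So always use fit_transform the first time and only transform on later iterations.

【Answer 2】: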

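Imagine you trained an AI to recognize airplanes using pictures of engines, ***, wings, and the pilot's bow tie. Now you call the same AI and ask it to identify an airplane model from the bow tie alone. That is what scikit-learn is telling you: X_unseen has far fewer features (= columns) than X_train or X_test.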
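To check the column counts directly, here is a small sketch on the question's sample sentences (the variable names `same_width`, `nonzero_per_row`, and `raised` are made up for illustration). It shows that calling transform on the already fitted vectorizer keeps the training columns, that words never seen during fitting are simply dropped, and that calling transform on a vectorizer that was never fitted raises NotFittedError.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences copied from the question
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

tfidf = TfidfVectorizer()
X_train_all = tfidf.fit_transform(train_text)  # learn the vocabulary once
X_unseen = tfidf.transform(unseen_text)        # reuse it, do not refit

# Same number of columns as the training matrix
same_width = X_train_all.shape[1] == X_unseen.shape[1]

# Words outside the training vocabulary get no column, so they are ignored:
# in the second sentence only 'is' was seen during fitting
nonzero_per_row = (X_unseen.toarray() != 0).sum(axis=1).tolist()
print(same_width, nonzero_per_row)  # True [5, 1]

# transform on a vectorizer that was never fitted fails
fresh = TfidfVectorizer()
try:
    fresh.transform(unseen_text)
    raised = False
except NotFittedError:
    raised = True
print(raised)  # True
```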

【Comments】:

【Answer 3】:

Ignore the second dataset and use train_test_split to create your test set.

【Comments】:
