How to predict new values when a machine learning model was trained on standardized data (StandardScaler)

【Title】: how to predict new values when a machine learning model was standardized StandardScaler 【Posted】: 2020-06-17 07:03:34 【Question】:

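I am working on a machine learning model, and my data is in a DataFrame.

I standardize the data (zero mean, unit variance) with StandardScaler: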

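scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it so the column names survive (assumes pandas is imported as pd)
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)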

I split the dataset into features and target:

X_df = df[X_characteristics_list]
y_df = df[target]

I split it into training and test sets, then train the model:

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size = 0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)

I predict on the test set to check how well the model performs:

y_test_pred = forest.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)

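But once testing is done, the model has to be ready to make predictions in real life.

If I want to predict a single record, say [100, 20, 34], I can't: the record has to be standardized first, and running StandardScaler on that record alone does not work, because the transformation depends on the mean and standard deviation of the data, so I would need the original dataset.

What is the best way to solve this?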

【Answer 1】:

See below:

>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split train-test... "test" will be production/unobserved/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ...,  0.21650905,
         0.68952722,  0.61365789],
       [-0.31143977, -1.87817904,  0.08287492, ..., -0.41332943,
        -0.58967179,  1.7239411 ],
       [-1.62287589,  1.10691318, -0.630556  , ..., -0.35060008,
         1.11270562,  0.08106694],
       ...,
       [-0.59797041,  0.90218081,  0.89983074, ..., -0.54374315,
         1.18534841, -0.03397969],
       [-1.2006559 ,  1.01890955, -1.21617181, ...,  1.76263322,
         1.38280423, -1.0192972 ],
       [ 0.11883425,  1.42952643, -1.23647358, ...,  1.02509208,
        -1.14308885,  0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ...,  0.22075623,
         0.57844552,  0.46487917],
       [-0.10736984, -1.34896243,  0.00808597, ..., -0.37670234,
        -0.6045418 ,  1.57819736],
       [-1.26479555,  0.91071257, -0.78086855, ..., -0.3171979 ,
         0.96979563, -0.06916763],
       ...,
       [-0.36025134,  0.7557329 ,  0.91152449, ..., -0.50041152,
         1.03697478, -0.18452874],
       [-0.89215959,  0.84409499, -1.42847749, ...,  1.68739437,
         1.21957946, -1.17253964],
       [ 0.27237431,  1.15492649, -1.4509284 , ...,  0.98777012,
        -1.116335  ,  0.57247992]])
# Fit the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
# Now let's use the already-fitted StandardScaler object to simply transform
# *not fit_transform* the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0])
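
The single-record case from the question works the same way. The fitted scaler keeps the per-feature mean and standard deviation it learned from the training data (`mean_` and `scale_`), so the original dataset is not needed at prediction time. Below is a minimal sketch, assuming three numeric features; the synthetic data, the seeds and the record [100, 20, 34] are illustrative only and not taken from the asker's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption): 200 rows, 3 features on different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[100.0, 20.0, 30.0], scale=[15.0, 5.0, 8.0], size=(200, 3))
y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=200)

# Fit the scaler on the training features only, then train on the scaled features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train_scaled, y_train)

# A single new record must be 2D: shape (1, n_features)
new_record = np.array([[100.0, 20.0, 34.0]])
new_record_scaled = scaler.transform(new_record)  # transform, not fit_transform

# transform() applies (x - mean_) / scale_ using the stored training statistics
manual = (new_record - scaler.mean_) / scaler.scale_

prediction = forest.predict(new_record_scaled)
print(prediction)
```

In this sketch only the features are scaled and the target stays in its original units, so the prediction needs no inverse transform. If the target were scaled as well, a separate scaler fitted on the target and its `inverse_transform` would be needed to get back to the original units.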

Note that you can save the scaler object with joblib or pickle and reload it later to scale new data "live".
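
A minimal sketch of that save-and-reload step with joblib follows. The tiny training array and the file name `scaler.joblib` are hypothetical; in practice the path would point to wherever the prediction service can read it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training time: fit the scaler on the training features and write it to disk
X_train = np.array([[90.0, 18.0, 30.0], [110.0, 22.0, 38.0], [100.0, 20.0, 34.0]])
scaler = StandardScaler().fit(X_train)

scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")  # hypothetical path
joblib.dump(scaler, scaler_path)

# Prediction time (possibly a different process): reload and only transform
loaded_scaler = joblib.load(scaler_path)
scaled = loaded_scaler.transform(np.array([[100.0, 20.0, 34.0]]))

# The record equals the training column means here, so it scales to zeros
print(scaled)
```

The reloaded object carries the same `mean_` and `scale_` as the one fitted during training, which is what makes the later transform consistent with what the model saw.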

【Comments】:

Of course, the trick is to save both the scaler and the model.
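
One way to keep the scaler and the model together is a scikit-learn `Pipeline`, which is a different technique from the two separate objects used in the answer. The sketch below assumes synthetic data and a hypothetical file name `model.joblib`; the pipeline scales and predicts in one call, so raw records can be passed straight to `predict`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 200 rows, 3 features, unscaled target
rng = np.random.default_rng(0)
X = rng.normal(loc=[100.0, 20.0, 30.0], scale=[15.0, 5.0, 8.0], size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0])

# The pipeline fits the scaler and the regressor in one step
model = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X, y)

# One file now holds both the fitted scaler and the fitted model
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # hypothetical path
joblib.dump(model, model_path)

# Later: reload and predict on a raw, unscaled record
loaded_model = joblib.load(model_path)
raw_record = np.array([[100.0, 20.0, 34.0]])
prediction = loaded_model.predict(raw_record)
print(prediction)
```

With this layout there is no separate scaler file to keep in sync with the model, at the cost of having to refit the whole pipeline when either step changes.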
