Why couldn't I predict directly using the Features matrix?

Posted: 2019-02-04 21:43:19

Question:

[Solved] Below is how I processed new data and tried to predict on it with a previously trained model, which failed.

First, my imports:

import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math

%matplotlib inline

Loading and cleaning the data:

##test
##prepare test_data
x_test_data = pd.read_csv('AW_test.csv')
x_test_data.loc[:,x_test_data.dtypes==object].isnull().sum()

##dropnan
cols_of_interest = ['Title','MiddleName','Suffix','AddressLine2']
x_test_data.drop(cols_of_interest,axis=1,inplace=True)

##dropduplicate
x_test_data.drop_duplicates(subset = 'CustomerID', keep = 'first', 
inplace=True)
print(x_test_data.shape)

Then I convert the categorical features into a one-hot encoded matrix:

##change categorical variables to numeric variables
def encode_string(cat_features):
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))
    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

categorical_columns = ['CountryRegionName','Education','Occupation','Gender','MaritalStatus']
Features = encode_string(x_test_data['CountryRegionName'])
for col in categorical_columns:
    temp = encode_string(x_test_data[col])
    Features = np.concatenate([Features, temp],axis=1)
print(Features)

Then I append the remaining numeric features to the matrix:

##add numeric variables
Features = np.concatenate([Features, 
np.array(x_test_data[['HomeOwnerFlag','NumberCarsOwned',
'TotalChildren','YearlyIncome']])], axis=1)

Next, I scale the feature matrix:

##scale numeric variables
import pickle
with open('./lin_reg_scaler.pickle', 'rb') as file:
    scaler = pickle.load(file)
Features[:,-5:] = scaler.transform(Features[:,-5:])

I load the linear regression model that I trained in another file (I can post it if needed):

# Loading the saved linear regression model pickle
import pickle
loaded_model = pickle.load(open('./lin_reg_mod.pickle', 'rb'))

I pass my feature matrix directly to predict:

#predict
loaded_model.predict(Features)

However, this is what I get:

array([-5.71697209e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
   -4.64634881e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
   -5.71697209e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
   -4.64634881e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
   -4.64634881e+12, -5.71697209e+12, -5.71697209e+12, -5.71697209e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
   -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
   -5.71697209e+12, -4.64634881e+12, -5.71697209e+12, -5.71697209e+12,
   -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
   -5.71697209e+12, -5.71697209e+12, -4.64634881e+12, -4.64634881e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -5.71697209e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
   -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
   -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -5.71697209e+12,
   -5.71697209e+12, -5.71697209e+12, -5.71697209e+12, -4.64634881e+12,............

In my other file, I trained the model successfully and tested it on my test data.

This is what I get in that file when I feed x_test to the model (the kind of result I want):

[83.75482221 66.31820493 47.22211384 ... 69.65032224 88.45908874
  58.45193545]

I don't know what is going wrong. Any help would be appreciated.

[Update] Below is the code I used to train the model:

custs = pd.read_csv('combined_custs.csv')
custs.dtypes

##avemonthspend data
ams = pd.read_csv('AW_AveMonthSpend.csv')
ams.drop_duplicates(subset='CustomerID', keep='first', inplace=True)
##merge
combined_custs=custs.merge(ams)
combined_custs.to_csv('./ams_combined_custs.csv')
combined_custs.head(20)
##change categorical variables to numeric variables
def encode_string(cat_features):
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))
    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

categorical_columns = ['CountryRegionName','Education','Occupation','Gender','MaritalStatus']
Features = encode_string(combined_custs['CountryRegionName'])
for col in categorical_columns:
    temp = encode_string(combined_custs[col])
    Features = np.concatenate([Features, temp],axis=1)
print(Features.shape)
print(Features[:2,:])

##add numeric variables
Features = np.concatenate([Features,
    np.array(combined_custs[['HomeOwnerFlag', 'NumberCarsOwned',
                             'TotalChildren', 'YearlyIncome']])], axis=1)

print(Features.shape)
print(Features)

##train_test_split
nr.seed(9988)
labels = np.array(combined_custs['AveMonthSpend'])
indx = range(Features.shape[0])
indx = ms.train_test_split(indx, test_size = 300)
x_train = Features[indx[0],:]
y_train = np.ravel(labels[indx[0]])
x_test = Features[indx[1],:]
y_test = np.ravel(labels[indx[1]])
print(x_test.shape)

##scale numeric variables
scaler = preprocessing.StandardScaler().fit(x_train[:,-5:])

x_train[:,-5:] = scaler.transform(x_train[:,-5:])
x_test[:,-5:] = scaler.transform(x_test[:,-5:])
x_train[:2,]

import pickle
file = open('./lin_reg_scaler.pickle', 'wb')
pickle.dump(scaler, file)
file.close()

##define and fit the linear regression model
lin_mod = linear_model.LinearRegression(fit_intercept=False)
lin_mod.fit(x_train,y_train)
print(lin_mod.intercept_)
print(lin_mod.coef_)

import pickle
file = open('./lin_reg_mod.pickle', 'wb')
pickle.dump(lin_mod, file)
file.close()

lin_mod.predict(x_test)

The predictions from my trained model are:

array([ 78.20673535,  91.11860042,  75.27284767,  63.69507673,
   102.10758616,  74.64252358,  92.84218321,  77.9675721 ,
   102.18989779,  96.98098962,  87.61415378,  39.37006326,
    85.81839618,  78.41392293,  45.49439829,  48.0944897 ,
    36.06024114,  70.03880373, 128.90267485,  54.63235443,
    52.20289729,  82.61123334,  41.58779815,  57.6456416 ,
    46.64014991,  78.38639454,  77.61072157,  94.5899366 ,.....

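Comments:

You need to apply the same processing to the test data as you did to the training data. That is, on the training data you also did one-hot encoding, scaling and so on. Just as you saved the final LR model from the earlier training, you need to save those other objects and use them here. Most likely you are getting wrong results because the scaling differs here.

OK, so I saved the previous scaler and used it here. The results changed but still look like this: 4.62561314e+12, -5.22531829e+13, -5.22531828e+13, -5.22531824e+13, -5.22531841e+13, -5.22531838e+13, 4.62561299e+12, -5.22531829e+13, 4.62561197e+12, -5.22531837e+13, 4.62561329e+12, ..., -5.22531835e+13, 4.62561234e+12, -5.22531831e+13, -5.22531827e+13, .... The two numbers 4.62561314e+12 and -5.22531829e+13 just keep repeating.

I have added my training code to the question.

Answer 1: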

You are using this function in both training and testing:

def encode_string(cat_features):
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))
    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

by calling:

Features = encode_string(combined_custs['CountryRegionName'])
for col in categorical_columns:
    temp = encode_string(combined_custs[col])
    Features = np.concatenate([Features, temp],axis=1)

But as I said in my comment above, you need to apply exactly the same preprocessing to the test data as you did in training.

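What happens here is that during testing the encoders are fitted again on x_test_data, so the encoding depends on the values present in that data. A string value that got the number 0 during training may now get the number 1, and the order of the columns in the final Features changes.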
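A minimal sketch of that effect, using made-up country values rather than the question's data: refitting a LabelEncoder on test data that lacks one category shifts the integer codes, while reusing the encoder fitted during training keeps them stable.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only
train_values = ["Australia", "Canada", "France"]
test_values = ["Canada", "France"]  # "Australia" does not occur at test time

train_enc = LabelEncoder().fit(train_values)
refit_enc = LabelEncoder().fit(test_values)  # what encode_string does on test data

train_codes = train_enc.transform(["Canada"])     # code learned during training
refit_codes = refit_enc.transform(["Canada"])     # code after refitting on test data
reused_codes = train_enc.transform(test_values)   # reusing the training encoder

print(train_codes, refit_codes, reused_codes)
```

The same string ends up with a different code once the encoder is refitted, so the one-hot column it lands in no longer matches the coefficient the model learned for it.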

To fix this, you need to save the fitted LabelEncoder and OneHotEncoder for each column separately.

So during training, do this:

import pickle

def encode_string(cat_features):
    # File names need a string, so use the column's name (see the comments below)
    column_name = str(cat_features.name)

    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)

    # Save the fitted LabelEncoder for this column
    with open('./' + column_name + '_encoder.pickle', 'wb') as encoder_file:
        pickle.dump(enc, encoder_file)

    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))

    # Same for the fitted OneHotEncoder
    with open('./' + column_name + '_ohe.pickle', 'wb') as ohe_file:
        pickle.dump(ohe, ohe_file)

    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

Then, during testing:

def encode_string(cat_features):
    column_name = str(cat_features.name)

    # Load the previously saved LabelEncoder
    with open('./' + column_name + '_encoder.pickle', 'rb') as file:
        enc = pickle.load(file)

    # No fitting, only transform
    enc_cat_features = enc.transform(cat_features)

    # Same for the OneHotEncoder
    with open('./' + column_name + '_ohe.pickle', 'rb') as file:
        ohe = pickle.load(file)

    return ohe.transform(enc_cat_features.reshape(-1,1)).toarray()
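As a possible alternative to one pickle per column, here is a sketch that fits a single OneHotEncoder on all categorical columns at once and reuses it at test time. The two small data frames and the shortened column list are hypothetical stand-ins, and handle_unknown='ignore' is my own choice so that unseen categories encode as all zeros instead of raising.

```python
import pickle
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['Education', 'Gender']  # shortened list for the sketch

# Hypothetical stand-ins for the training and test data frames
train_df = pd.DataFrame({'Education': ['Bachelors', 'High School', 'Graduate Degree'],
                         'Gender': ['M', 'F', 'M']})
test_df = pd.DataFrame({'Education': ['High School', 'Bachelors'],
                        'Gender': ['F', 'F']})

# Training: fit once on the training data and keep the fitted encoder
ohe = OneHotEncoder(handle_unknown='ignore')
train_cat = ohe.fit_transform(train_df[categorical_columns]).toarray()
saved = pickle.dumps(ohe)  # pickle.dump(ohe, file) would write it to disk instead

# Testing: load the fitted encoder and only transform, never fit
loaded_ohe = pickle.loads(saved)
test_cat = loaded_ohe.transform(test_df[categorical_columns]).toarray()

print(train_cat.shape, test_cat.shape)
```

Note that this produces a different column layout from the Features matrix in the question, so the scaler and the model would have to be refitted on features built this way.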

Comments:

Thank you so much!!!!!!!!!! I am getting the result I wanted now. By the way, I added the line "column_name = str(cat_features.name)" to the encode_string function and changed "encoder_file = open('./'+cat_features+'_encoder.pickle', 'wb')" to " with open('./'+column_name+'_encoder.pickle', 'rb') as file:", because the original caused a type error.
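For anyone wondering where that type error comes from, a small sketch with a made-up column: adding a string to a pandas Series concatenates element-wise and returns another Series, which open() cannot use as a path, whereas the Series name is a plain string.

```python
import pandas as pd

# Hypothetical column, for illustration only
col = pd.Series(['M', 'F'], name='Gender')

bad_path = './' + col + '_encoder.pickle'  # element-wise concatenation, still a Series
column_name = str(col.name)
good_path = './' + column_name + '_encoder.pickle'  # a single string path

print(type(bad_path).__name__)
print(good_path)
```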
