Why do I get a feature mismatch error in XGBoost?


【Posted】: 2021-09-23 00:34:43 【Question】:

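I am using XGBoost for incremental learning, but I get an error when I run the code below.

The columns used in the initial training and in the later training rounds are the same, and so are their data types. I still get the error.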

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_absolute_error as mae

data = pd.read_excel("./Precipitation.xlsx")
# Mark negative readings as NaN so that dropna() below removes those rows
for i in range(len(data.S07)):
    if data.S07[i] < 0:
        data.S07[i] = None  # negative value becomes NaN

for j in range(len(data.S30)):
    if data.S30[j] < 0:
        data.S30[j] = None  # negative value becomes NaN
        
data.dropna(inplace=True)

data.drop_duplicates(subset=['S07','S30'], inplace=True)

data.reset_index(drop=True, inplace=True)

X=data.S07
y=data.S30

X=pd.DataFrame(X)
y = pd.Series(y,index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
train_idx, test_idx = next(rs.split(X))  # n_splits=1, so take the only split
    
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)


params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)


params.update({'process_type': 'update',
               'updater'     : 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
model_2_v2_update.save_model('model_1.model')


print('full train\t',mae(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mae(model_1.predict(xg_test), y_test))  
print('model 2 \t',mae(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t',mae(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t',mae(model_2_v2_update.predict(xg_test), y_test))  # "after"


#Loading dataset 2
data2 = pd.read_excel('./June-1-2020-To-June-25-2021_no_grpah.xlsx')
data2.dropna(inplace=True)

# Mark negative readings as NaN so that dropna() below removes those rows
for i in range(len(data2.S07)):
    if (data2.S07[i]<0):
        data2.S07[i]=None #Setting all the rows that contained negative value as NaN
for j in range(len(data2.S30)):
    if (data2.S30[j]<0):
        data2.S30[j]= None  #Setting all the rows that contained negative value as NaN        
        
data2.dropna(inplace=True)        

data2.sort_values("S07", inplace=True)
data2.drop_duplicates(subset=['S07','S30'], inplace=True)
data2.reset_index(drop=True, inplace=True)  


x=data2.S07
y=data2.S30

X=pd.DataFrame(X)
y = pd.Series(y,index=X.index)

xg_train_latest = xgb.DMatrix(x, label=y)


model_3_v1 = xgb.train(params, xg_train_latest, 30, xgb_model=model_2_v2_update)  # --> raises the error below




ValueError: feature_names mismatch: ['S07'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59']
expected S07 in input data
training data did not have the following fields: f8, f18, f31, f43, f56, f34, f49, f21, f27, f35, f44, f22, f14, f2, f19, f9, f55, f20, f39, f47, f26, f3, f32, f15, f40, f50, f23, f57, f13, f41, f30, f38, f48, f28, f54, f45, f58, f59, f4, f29, f25, f36, f33, f46, f17, f52, f1, f42, f7, f0, f37, f5, f10, f12, f16, f53, f51, f11, f6, f24

Please tell me how to fix this. X and Y have the same data types and column names.

Thanks and regards, Varun

【Comments】:

【Answer 1】:

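The issue is resolved. The problem was the argument passed in xg_train_latest = xgb.DMatrix(x, label=y): lowercase x (the raw S07 Series) was passed where uppercase X (the DataFrame whose column is named S07) was intended.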

【Discussion】:
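For reading the error message in the question: the first list, ['S07'], holds the names the existing model was trained with, and the second list, f0 to f59, holds the names found in the new DMatrix. Names of the form f0, f1, ... are the defaults XGBoost assigns when the input carries no column names, and 60 of them suggests the input was read as many columns rather than one. The sketch below reproduces the two set differences behind the "expected ... in input data" and "training data did not have the following fields" lines. The function name explain_feature_mismatch is hypothetical and is not part of XGBoost.

```python
def explain_feature_mismatch(model_features, data_features):
    """Return (missing_from_data, unknown_to_model) as sorted lists.

    missing_from_data: names the model expects but the new data lacks.
    unknown_to_model:  names in the new data the model never saw.
    """
    model_set = set(model_features)
    data_set = set(data_features)
    missing_from_data = sorted(model_set - data_set)
    unknown_to_model = sorted(data_set - model_set)
    return missing_from_data, unknown_to_model


# Names taken from the error message in the question.
model_features = ["S07"]
data_features = [f"f{i}" for i in range(60)]

missing, unknown = explain_feature_mismatch(model_features, data_features)
print("expected in input data:", missing)
print("columns not seen in training:", len(unknown))
```

Running the same comparison on the model's and the DMatrix's feature names before calling xgb.train shows at a glance whether the new data was built from the intended object.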
