Why does a feature mismatch error occur in XGBoost?
[Posted]: 2021-09-23 00:34:43
[Question]: I am using XGBoost for incremental learning, and I get an error when I run the code below.
The columns used in the initial training and in the subsequent training are the same, and so are their data types. I still get the error.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
data = pd.read_excel("./Precipitation.xlsx")
# Mark negative values as NaN so that those rows can be dropped below
for i in range(len(data.S07)):
    if data.S07[i] < 0:
        data.S07[i] = None  # negative value becomes NaN
for j in range(len(data.S30)):
    if data.S30[j] < 0:
        data.S30[j] = None  # negative value becomes NaN
data.dropna(inplace=True)
data.drop_duplicates(subset=['S07','S30'], inplace=True)
data.reset_index(drop=True, inplace=True)
X=data.S07
y=data.S30
X=pd.DataFrame(X)
y = pd.Series(y,index=X.index)
# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # n_splits=1, so the loop only unpacks the single split
    pass
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]
xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
params.update({'process_type': 'update',
               'updater': 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
model_2_v2_update.save_model('model_1.model')
print('full train\t',mae(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mae(model_1.predict(xg_test), y_test))
print('model 2 \t',mae(model_2_v1.predict(xg_test), y_test)) # "before"
print('model 1+2\t',mae(model_2_v2.predict(xg_test), y_test)) # "after"
print('model 1+update2\t',mae(model_2_v2_update.predict(xg_test), y_test)) # "after"
#Loading dataset 2
data2 = pd.read_excel('./June-1-2020-To-June-25-2021_no_grpah.xlsx')
data2.dropna(inplace=True)
# Mark negative values as NaN so that those rows can be dropped below
for i in range(len(data2.S07)):
    if data2.S07[i] < 0:
        data2.S07[i] = None  # negative value becomes NaN
for j in range(len(data2.S30)):
    if data2.S30[j] < 0:
        data2.S30[j] = None  # negative value becomes NaN
data2.dropna(inplace=True)
data2.sort_values("S07", inplace=True)
data2.drop_duplicates(subset=['S07','S30'], inplace=True)
data2.reset_index(drop=True, inplace=True)
x=data2.S07
y=data2.S30
X=pd.DataFrame(X)
y = pd.Series(y,index=X.index)
xg_train_latest = xgb.DMatrix(x, label=y)
model_3_v1 = xgb.train(params, xg_train_latest, 30, xgb_model=model_2_v2_update)  # --> raises the error below
ValueError: feature_names mismatch: ['S07'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59']
expected S07 in input data
training data did not have the following fields: f8, f18, f31, f43, f56, f34, f49, f21, f27, f35, f44, f22, f14, f2, f19, f9, f55, f20, f39, f47, f26, f3, f32, f15, f40, f50, f23, f57, f13, f41, f30, f38, f48, f28, f54, f45, f58, f59, f4, f29, f25, f36, f33, f46, f17, f52, f1, f42, f7, f0, f37, f5, f10, f12, f16, f53, f51, f11, f6, f24
Please tell me how to resolve this. X and Y have the same data types and column names.
Thanks and regards, Varun
[Answer 1]: The problem is solved. The issue was that the wrong argument was passed in xg_train_latest = xgb.DMatrix(x, label=y): the lowercase x was passed instead of the uppercase X.
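The mix-up described above can be guarded against with a small check before the DMatrix is built. The following is a minimal sketch, not the asker's actual fix: the helper name as_feature_frame and the three-row data2 frame are hypothetical, introduced only for illustration. It converts a 1-D Series into a one-column DataFrame and verifies that the column names match what the earlier model was trained on. Note also that in the question's code the line X=pd.DataFrame(X) rebuilds X from the first dataset, so a complete fix presumably needs the frame to be built from the new column (the lowercase x) before it is passed on.

```python
import pandas as pd


def as_feature_frame(features, expected_names):
    """Return `features` as a 2-D DataFrame whose columns equal `expected_names`.

    Hypothetical helper for illustration; it is not part of pandas or xgboost.
    """
    # A Series is 1-D; turn it into a one-column DataFrame so that the
    # column name (here "S07") travels with the data.
    if isinstance(features, pd.Series):
        features = features.to_frame()
    if not isinstance(features, pd.DataFrame):
        raise TypeError(
            f"expected a pandas Series or DataFrame, got {type(features).__name__}"
        )
    actual = list(features.columns)
    if actual != list(expected_names):
        raise ValueError(
            f"feature names {actual} do not match expected {list(expected_names)}"
        )
    return features


# Assumption: a tiny stand-in for the second spreadsheet in the question.
data2 = pd.DataFrame({"S07": [0.0, 1.5, 3.2], "S30": [0.1, 1.4, 3.0]})

x = data2.S07                                    # 1-D Series, as in the question
X = as_feature_frame(x, expected_names=["S07"])  # 2-D frame with one named column

print(x.ndim, X.ndim)   # 1 2
print(list(X.columns))  # ['S07']
```

With X built this way, the call from the question would become xgb.DMatrix(X, label=data2.S30). The expected names are hard-coded here; they could instead be read from the previously trained booster (for example its feature_names attribute), so that the check fails early with a readable message rather than inside xgb.train.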