使用 XGBoost 特征重要性分数打印出特征选择中使用的特征
Posted
技术标签:
【中文标题】使用 XGBoost 特征重要性分数打印出特征选择中使用的特征【英文标题】:Printing out Features used in Feature Selection with XGBoost Feature Importance Scores 【发布时间】:2022-01-19 09:05:09 【问题描述】:我正在使用 XGBoost 特征重要性分数在我的 KNN 模型中使用以下代码 (taken from this article) 执行特征选择:
# this section for training and testing the algorithm after feature selection
#dataset spliting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]
# spliting the dataset into train, test and validate for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_bin_train)
# using normalization technique to feature scale the training data
norm = MinMaxScaler()
X_train= norm.fit_transform(X_train)
X_test= norm.transform(X_test)
#oversampling
smote= SMOTE()
X_train, y_bin_train = smote.fit_resample(X_train,y_bin_train)
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(select_X_train, y_bin_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = knn.predict(select_X_test)
report = classification_report(y_bin_test,y_pred)
print("Thresh= , n= \n " .format(thresh, select_X_train.shape[1], report))
cm = confusion_matrix(y_bin_test, y_pred)
print(cm)
我得到的输出显示了每次迭代使用的特征数量select_X_train.shape[1]
、每次删除特征时使用的阈值thresh
、分类报告和混淆矩阵:
Thresh= 0.0 , n= 17
precision recall f1-score support
0 0.98 0.96 0.97 42930
1 0.87 0.92 0.89 11996
accuracy 0.95 54926
macro avg 0.92 0.94 0.93 54926
weighted avg 0.95 0.95 0.95 54926
[[41226 1704]
[ 909 11087]]
Thresh= 0.007143254857510328 , n= 16
precision recall f1-score support
0 0.98 0.96 0.97 42930
1 0.87 0.92 0.89 11996
accuracy 0.95 54926
macro avg 0.92 0.94 0.93 54926
weighted avg 0.95 0.95 0.95 54926
[[41226 1704]
[ 909 11087]]
此输出将一直持续到使用的特征数达到 1 (n=1)。 我想要做的是我还想在每次迭代中包含使用(或删除)的功能的名称,但我无法弄清楚。 有没有办法完成它?
【问题讨论】:
【参考方案1】:你可以使用
X.columns[selector.get_support()].to_list()
提取所选特征的名称列表,其中X
是具有特征值的pandas 数据框,selector
是SelectFromModel
元转换器。另见this answer。
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
# generate some data
df = pd.DataFrame(
'x1': np.random.normal(0, 1, 100),
'x2': np.random.normal(2, 3, 100),
'x3': np.random.normal(4, 5, 100),
'y': np.random.choice([0, 1], 100),
)
# extract the features and target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
# scale the data
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# resample the data
smote = SMOTE()
X_train, y_train = smote.fit_resample(X_train, y_train)
# fit the XGBoost classifier using all the features
model = XGBClassifier()
model.fit(X_train, y_train)
# fit the KNN classifier using each feature importance
# value as a feature selection threshold
thresholds = np.sort(model.feature_importances_)
for threshold in thresholds:
# select the features
selector = SelectFromModel(model, threshold=threshold, prefit=True)
X_train_ = selector.transform(X_train)
X_test_ = selector.transform(X_test)
# extract the names of the selected features
selected_features = X.columns[selector.get_support()].to_list()
# train the model
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train_, y_train)
# generate the model predictions
y_pred = knn.predict(X_test_)
# calculate the model performance metrics
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print('Threshold: '.format(threshold))
print('Selected features: \n '.format(selected_features))
print('Confusion matrix: \n '.format(cm))
print('Classification report: \n '.format(report))
print('----------------------------')
# Threshold: 0.2871088981628418
# Selected features:
# ['x1', 'x2', 'x3']
# Confusion matrix:
# [[6 0]
# [7 7]]
# Classification report:
# precision recall f1-score support
#
# 0 0.46 1.00 0.63 6
# 1 1.00 0.50 0.67 14
#
# accuracy 0.65 20
# macro avg 0.73 0.75 0.65 20
# weighted avg 0.84 0.65 0.66 20
#
# ----------------------------
# Threshold: 0.34210699796676636
# Selected features:
# ['x1', 'x3']
# Confusion matrix:
# [[ 4 2]
# [10 4]]
# Classification report:
# precision recall f1-score support
#
# 0 0.29 0.67 0.40 6
# 1 0.67 0.29 0.40 14
#
# accuracy 0.40 20
# macro avg 0.48 0.48 0.40 20
# weighted avg 0.55 0.40 0.40 20
#
# ----------------------------
# Threshold: 0.37078407406806946
# Selected features:
# ['x1']
# Confusion matrix:
# [[3 3]
# [5 9]]
# Classification report:
# precision recall f1-score support
#
# 0 0.38 0.50 0.43 6
# 1 0.75 0.64 0.69 14
#
# accuracy 0.60 20
# macro avg 0.56 0.57 0.56 20
# weighted avg 0.64 0.60 0.61 20
#
# ----------------------------
【讨论】:
以上是关于使用 XGBoost 特征重要性分数打印出特征选择中使用的特征的主要内容,如果未能解决你的问题,请参考以下文章