Recalculating feature importance after removing a feature


【Posted】:2021-09-07 09:27:24 【Question】:

I used the random forest algorithm to classify the iris dataset and then generated feature importance scores. I then removed the least relevant feature and ran the adjusted data through the RF algorithm again. I want to recalculate the feature importance scores, but the code I'm using still expects 4 features, because it indexes by the original iris dataset's feature names rather than by the new pandas dataframe containing only the 3 features I trained the model on. How can I fix my code so I don't get this error:

Traceback (most recent call last):
  File "/Users/userPycharmProjects/Iris_classifier_RF/feature_import_reclassify.py", line 61, in <module>
    feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
  File "/Users/user/.conda/envs/GST/lib/python3.8/site-packages/pandas/core/series.py", line 350, in __init__
    raise ValueError(
ValueError: Length of passed values is 3, index implies 4.
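The mismatch can be reproduced in isolation: pd.Series raises whenever the number of values and the length of the index disagree. A minimal sketch, independent of the iris code:

import pandas as pd

# 3 values paired with a 4-entry index -> the same ValueError as in the traceback above
pd.Series([0.5, 0.3, 0.2], index=['a', 'b', 'c', 'd'])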

The full code is below:

# Importing required libraries
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
import sklearn.metrics as metrics


# Loading datasets
iris = load_iris()

# Convert to pandas dataframe, keeping only the 3 features retained after
# dropping sepal width (iris.data columns 0, 2, 3)
iris_data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
iris_data.head()

# printing categories (setosa, versicolor, virginica)
print(iris.target_names)
# print flower features
print(iris.feature_names)

# setting independent (X) and dependent (Y) variables
X = iris_data[['sepal length', 'petal length', 'petal width']]  # Features
Y = iris_data['species']  # Labels


# printing feature data
print(X[0:3])
# printing dependent variable values (0 = setosa, 1 = versicolor, 2 = virginica)
print(Y)

# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

# defining random forest classifier
clfr = RandomForestClassifier(random_state=100)
clfr.fit(X_train, y_train)

# making prediction
Y_pred = clfr.predict(X_test)

# checking model accuracy
print("Accuracy:", metrics.accuracy_score(y_test, Y_pred))
cm = np.array(confusion_matrix(y_test, Y_pred))
print(cm)

# making a prediction on new data (sepal length, petal length, petal width)
species_id = clfr.predict([[5.1, 1.4, 0.2]])
print(iris.target_names[species_id])

# determining feature importance (e.g. model participation)
feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
print(feature_imp)

import matplotlib.pyplot as plt
import seaborn as sns

# Creating a bar plot to visualize feature participation in model
sns.barplot(x=feature_imp, y=feature_imp.index)

# use '%matplotlib inline' to plot inline in jupyter notebooks
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

【Comments】:

【Answer 1】:

The model is trained on X, which is only a subset of iris, but feature_imp still references index=iris.feature_names. That should be changed to index=X.columns:

feature_imp = pd.Series(clfr.feature_importances_, index=X.columns).sort_values(ascending=False)
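If the drop-and-retrain step is repeated, indexing by the training frame's own columns keeps the importance Series aligned with however many features remain. A minimal sketch of that pattern (the drop_least_important helper is illustrative, not from the original answer):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def drop_least_important(X, y, model):
    # fit on the current feature set and rank importances by the frame's own columns
    model.fit(X, y)
    imp = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    # return the frame without its least important column, plus the scores
    return X.drop(columns=[imp.index[-1]]), imp

# usage, starting from any feature subset:
# X_reduced, imp = drop_least_important(X, Y, RandomForestClassifier(random_state=100))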

【Discussion】:

Excellent. Thank you.
