Recalculating feature importance after removing a feature

Posted: 2021-09-07 09:27:24

Question: So I'm using the random forest algorithm to classify the iris dataset and then generating feature importance scores. I then removed the least important feature and ran the adjusted data through the RF algorithm again. I want to recalculate the feature importance scores, but the code I'm using still expects 4 features, because it indexes by the original iris dataset rather than the new pandas dataframe with only the 3 features I trained the model on. How can I fix my code so that I don't get this error:
Traceback (most recent call last):
File "/Users/userPycharmProjects/Iris_classifier_RF/feature_import_reclassify.py", line 61, in <module>
feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
File "/Users/user/.conda/envs/GST/lib/python3.8/site-packages/pandas/core/series.py", line 350, in __init__
raise ValueError(
ValueError: Length of passed values is 3, index implies 4.
The full code is below:
# Importing required libraries
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
import sklearn.metrics as metrics
# Loading datasets
iris = load_iris()
# Convert to pandas dataframe
iris_data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'petal length': iris.data[:, 2],  # column 1 (sepal width) was dropped as the least important feature
    'petal width': iris.data[:, 3],
    'species': iris.target
})
iris_data.head()
# printing categories (setosa, versicolor, virginica)
print(iris.target_names)
# print flower features
print(iris.feature_names)
# setting independent (X) and dependent (Y) variables
X = iris_data[['sepal length', 'petal length', 'petal width']] # Features
Y = iris_data['species'] # Labels
# printing feature data
print(X[0:3])
# printing dependent variable values (0 = setosa, 1 = versicolor, 2 = virginica)
print(Y)
# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
# defining random forest classifier
clfr = RandomForestClassifier(random_state = 100)
clfr.fit(X_train, y_train)
# making prediction
Y_pred = clfr.predict(X_test)
# checking model accuracy
print("Accuracy:", metrics.accuracy_score(y_test, Y_pred))
cm = np.array(confusion_matrix(y_test, Y_pred))
print(cm)
# making a prediction on new data (sepal length, petal length, petal width)
species_id = clfr.predict([[5.1, 1.4, 0.2]])
print(iris.target_names[species_id])
# determining feature importance (e.g. model participation)
feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
print(feature_imp)
import matplotlib.pyplot as plt
import seaborn as sns
# Creating a bar plot to visualize feature participation in model
sns.barplot(x=feature_imp, y=feature_imp.index)
# use '%matplotlib inline' to plot inline in jupyter notebooks
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
Answer 1:

The model was trained on X, which is only a subset of iris, but feature_imp still indexes by iris.feature_names. That should be changed to index=X.columns:
feature_imp = pd.Series(clfr.feature_importances_, index=X.columns).sort_values(ascending=False)
Comment: Excellent, thank you.