Random Forest Feature Importance in Python
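【Title】: Random Forest Feature Importance Python
【Posted】: 2021-08-29 09:03:08
【Question】: After tuning hyperparameters and obtaining the best parameters for my classifier, I am trying to get the feature importances from my data. I have also fitted a model with the best parameters on the training set, and now I am trying to extract the important features, but I keep getting an error, even after trying every solution I could find online.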
See my code below:
# Imports (not shown in the original post; assumed from the names used below)
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold
# enable_halving_search_cv must be imported before HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_features': ['auto', 'sqrt'],
    'n_estimators': [100, 1000],
}

# Create a base model
rf = RandomForestClassifier()

# Instantiate the grid search model
grid_search = HalvingGridSearchCV(estimator=rf, param_grid=param_grid,
                                  cv=3, n_jobs=-1, verbose=2)

# Note: this splitter is defined but never passed to the search (which uses cv=3)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

steps_3 = [('over', RandomOverSampler()),
           ('chi_square', SelectKBest(chi2, k=7000)),
           ('estimator', grid_search)]
pipeline_3 = Pipeline(steps=steps_3)

# Fit the pipeline (oversampling, feature selection, hyperparameter search)
rf_hyperparameter = pipeline_3.fit(X_train, y_train)
print(rf_hyperparameter)
print("Best Score: " + str(grid_search.best_score_))
print("Best Parameters: " + str(grid_search.best_params_))
best_parameters = grid_search.best_params_

# Fit a new model with the best parameters on the training data
rf_best = RandomForestClassifier(bootstrap=True, max_features='auto', n_estimators=1000)
rf_best.fit(X_train, y_train)

# This is the statement that raises the AttributeError shown below
feature_importances = pd.DataFrame(rf_best.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances
After running the code above, this is the error I get:
AttributeError Traceback (most recent call last)
<ipython-input-159-563c7c3e7fc5> in <module>
1 feature_importances = pd.DataFrame(rf_best.feature_importances_,
----> 2 index=X_train.columns,columns=['importance']).sort_values('importance',ascending = False)
3 feature_importances
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
I would greatly appreciate any input. Thank you!
【Comments】:
The part of the code that calls train_test_split is missing. Could you add it?
Yes. Please see that part here:
# Split Train and Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)
【Answer 1】:
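The code that performs train_test_split is missing from the question. Judging by the error, X_train is a NumPy array rather than a pandas DataFrame, so X_train.columns fails. (train_test_split returns the same container type it is given, so X must already have been an array before the split.) Taking df.columns from the original pandas DataFrame as a list and passing it as index should work.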
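A minimal sketch of the fix described above, assuming the features originally lived in a pandas DataFrame (called df here) before being converted to an array. The toy data, the column names f0 to f3, and the smaller n_estimators value are hypothetical stand-ins and are not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's X and Y.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((200, 4)), columns=["f0", "f1", "f2", "f3"])
Y = (df["f0"] + 0.1 * rng.random(200) > 0.5).astype(int)

# Save the column names while the data is still a DataFrame.
feature_names = list(df.columns)

# Mimic the asker's situation: X is a plain ndarray with no .columns attribute.
X = df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

rf_best = RandomForestClassifier(n_estimators=100, random_state=42)
rf_best.fit(X_train, y_train)

# Pass the saved names as the index instead of X_train.columns.
feature_importances = pd.DataFrame(
    rf_best.feature_importances_,
    index=feature_names,
    columns=["importance"],
).sort_values("importance", ascending=False)
print(feature_importances)
```

This only works when the model is fitted on the same columns that feature_names describes. If a step such as SelectKBest removes columns before the forest sees the data, the list of names would have to be filtered in the same way first.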