How to perform cross-validation of a random-forest model in scikit-learn?

Posted: 2020-04-08 23:20:56

Question:

I need to perform leave-one-out cross-validation of an RF model. I have successfully built a model with high predictive power, and now I need to run a LOO test before publication. Here is my code:

import pandas as pd 
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()

# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features 
FC_data1 = FC_data.drop(to_drop, axis=1)

y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
X_train = X.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
           "4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
X_train.head(21)

y_train = y.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
           "4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
y_train.head(21)

X_test = X.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
           "4,7,4'-Trimethylallopsoralen", "Psoralen"]]
X_test.head(5)

y_test = y.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
           "4,7,4'-Trimethylallopsoralen", "Psoralen"]]
y_test.head(5)

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)

from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = "n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100], "max_depth":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100]
grid_search_cv_clf = GridSearchCV(clf_rf, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)

from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)

grid_search_cv_clf.best_params_

best_clf = grid_search_cv_clf.best_estimator_
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)

feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
                                       'feature_importances': feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(25)

Now I need the q2 value.

Finally, I wrote the following code and got a fairly high score of 0.9071543776303185.

from sklearn.model_selection import LeaveOneOut
parameters = "n_estimators":[4], "max_depth":[20]

loo_clf = GridSearchCV(best_clf, parameters, cv=LeaveOneOut())
loo_clf.fit(features_important, y_train)
loo_clf.score(features_important, y_train)  

I am not sure whether this is actually q2. What do you think?

I also decided to get 5-fold cross-validation scores. However, it gives absurd values, for example: -36.58997717, 0.76801832, -1.59900448, 0.1834304, -2.38256389, with a mean of -7.924019361863889.

from sklearn.model_selection import cross_val_score
cvs=cross_val_score(best_clf, features_important, y_train)
mean_cross_val_score = cvs.mean()
mean_cross_val_score

Presumably there is a way to fix this?

Comments:

What is the q2 score? It is the leave-one-out cross-validation score. Then see my answer below.

Answer 1:

You should not run a hyperparameter search before doing the model evaluation. Instead, you should perform two cross-validation loops (i.e. a nested cross-validation); otherwise, you are leaking information. To learn more, have a look at the following example in the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py

So, in your particular use case, you should use GridSearchCV, SelectFromModel, and cross_val_score together:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100)

feature_selector = SelectFromModel(
    RandomForestRegressor(n_jobs=-1), threshold="mean"
)
pipe = make_pipeline(
    feature_selector, RandomForestRegressor(n_jobs=-1)
)

param_grid = {
    # define the grid of the random-forest used for the feature selection
    "selectfrommodel__estimator__n_estimators": [10, 20],
    "selectfrommodel__estimator__max_depth": [3, 5],
    # define the grid of the random-forest used for the prediction
    "randomforestregressor__n_estimators": [10, 20],
    "randomforestregressor__max_depth": [5, 8],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, cv=3)
# You can use LOO as the outer loop in this way. Be aware that this is not good
# practice: it leads to a large variance when evaluating your model.
# scores = cross_val_score(grid_search, X, y, cv=LeaveOneOut(), error_score='raise')
scores = cross_val_score(grid_search, X, y, cv=2, error_score='raise')  # nested CV: grid_search is the inner loop
scores.mean()
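
As a possible follow-up (a sketch of mine, not part of the original answer): once the outer cross-validation has given the performance estimate, the grid search can be refit on the full data set to obtain a final model:

# Refit the inner search on all available data to get the deployable model
grid_search.fit(X, y)
print(grid_search.best_params_)              # hyperparameters chosen on the full data
final_model = grid_search.best_estimator_    # fitted pipeline: feature selection + forest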

Comments:

Answer 2:

You need to specify the scoring and cv arguments.


Use this:

from sklearn.model_selection import cross_val_score

mycv = LeaveOneOut()
cvs = cross_val_score(best_clf, features_important, y_train, scoring='r2', cv=mycv)

mean_cross_val_score = cvs.mean()
print(mean_cross_val_score)

This will return the mean cross-validated R2 score using LOOCV.


For more scoring options, see here: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
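
As a side note (my own sketch, not part of the original answer): R2 is not well defined on a single held-out sample, so scoring each LOO fold individually can produce warnings or NaN values. Assuming q2 is meant as 1 - PRESS/TSS over the leave-one-out predictions, an alternative is to collect all LOO predictions with cross_val_predict and compute R2 once over them:

from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

# Collect one prediction per sample, each made while that sample was held out
y_pred_loo = cross_val_predict(best_clf, features_important, y_train, cv=LeaveOneOut())

# q2 = 1 - PRESS/TSS, i.e. R2 computed over the full set of LOO predictions
q2 = r2_score(y_train, y_pred_loo)
print(q2)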

Comments:
