How can I tune the parameters in a Random Forest Classifier inside a pipeline?
【Posted】: 2021-01-13 06:12:34
【Question】: I am trying to apply RandomForestClassifier() by using a pipeline and tuning its parameters. This is the dataset being used: https://www.kaggle.com/gbonesso/enem-2016
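Here is the code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# enem is the DataFrame loaded from the dataset linked above
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
rf = RandomForestClassifier()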
features = [
    "NU_IDADE",
    "TP_ESTADO_CIVIL",
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_COMP1",
    "NU_NOTA_COMP2",
    "NU_NOTA_COMP3",
    "NU_NOTA_COMP4",
    "NU_NOTA_COMP5",
    "NU_NOTA_REDACAO",
]
X = enem[features]
y = enem[["IN_TREINEIRO"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
pipeline = make_pipeline(imputer, scaler, rf)
pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
gridsearch = GridSearchCV(
    pipeline, param_grid=pipe_params, cv=3, n_jobs=-1, verbose=1000
)
gridsearch.fit(X_train, y_train)
It seems to work for some parameters, but then I get the following error message:
ValueError: Invalid parameter randomforestregressor for estimator Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler()),
('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
Another problem is that I cannot seem to get the CV results. I tried running the following code:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values("rank_test_score").head()
score = pipeline.score(X_test, y_test)
score
But I got this error:
AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'
Any ideas on how to fix these errors?
【Answer 1】: Your problem is most likely this dictionary:
pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
Your pipeline has no randomforestregressor parameter, as your error suggests. Since you are using RandomForestClassifier, the keys should be:
pipe_params = {
    "randomforestclassifier__n_estimators": [100, 500, 1000],
    "randomforestclassifier__max_depth": [1, 5, 10, 25],
    "randomforestclassifier__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
If you run the suggestion from the error message, pipeline.get_params().keys(), you will see the options available for your pipeline.
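As a minimal sketch of that check, the snippet below rebuilds the question's three-step pipeline (no data is needed) and lists the parameter names that belong to the random forest step. make_pipeline names each step after its lowercased class name, which is where the randomforestclassifier prefix comes from. The variable names step_names and rf_keys are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), RandomForestClassifier()
)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Grid-search keys take the form "<step name>__<parameter name>".
rf_keys = sorted(
    key
    for key in pipeline.get_params()
    if key.startswith("randomforestclassifier__")
)
print(rf_keys)
```

Any key in the grid that does not appear in pipeline.get_params() triggers the "Invalid parameter" ValueError shown in the question.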
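【Answer 2】: Nick's answer is absolutely correct and will indeed solve your problem. In your case, you could also drop make_pipeline in favor of instantiating the Pipeline class directly, which lets you name each step yourself. I believe it is more readable and concise: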
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])
Then access the model parameters through the name you gave the classifier step:
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
Below is a complete example based on the iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
import numpy as np
# Data preparation
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)
# Build a pipeline object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])
# Declare a hyperparameter grid
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
# Perform grid search, fit it, and print the score on the held-out test set
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1000)
gs.fit(x_train, y_train)
print(gs.score(x_test, y_test))
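The question's second error follows from the first: cv_results_ is only set once fit() finishes, so when fit() raises the ValueError the attribute never exists. GridSearchCV also fits clones of the estimator it is given, so the original pipeline object stays unfitted and the search object is the one to score. Here is a minimal sketch on the iris data, assuming a deliberately small grid to keep the run short (the grid values and random_state=0 are illustrative choices, not from the original post):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid (an assumption, chosen only to keep the run short).
param_grid = {
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [1, 5],
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=3)
gs.fit(X_train, y_train)

# cv_results_ exists only after fit() has completed without raising.
results = pd.DataFrame(gs.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "rank_test_score"]].head())

# Score with the fitted search object; it delegates to best_estimator_.
print(gs.best_params_)
print(gs.score(X_test, y_test))
```

With the default refit=True, gs.best_estimator_ is the pipeline refitted on the full training split with the best parameters, and it can be used directly for predictions.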