Writing best GridSearch classifiers into a table
Posted: 2018-10-22 02:35:21
Question: I found and successfully tested the following script, which applies Pipeline and GridSearchCV to classifier selection. The script outputs the best classifier and its accuracy.
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # Augmenting test data
y_test = iris.target[:10] # Augmenting test data
#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])
# Create space of candidate learning algorithms and their hyperparameters
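# Each dict is one sub-grid; the 'classifier' entry replaces the pipeline step itself.
# solver='liblinear' supports both 'l1' and 'l2'; it was the default when this was posted,
# while newer scikit-learn versions default to 'lbfgs', which rejects 'l1'.
search_space = [{'classifier': [LogisticRegression(solver='liblinear')],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]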
# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
average='weighted'))
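How can I adapt the script so that it outputs not only the overall best classifier (LogReg in our example) but also the best one for each of the other classifiers? In the example above, I would also like to see the output for RandomForestClassifier().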
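Ideally, the solution would show the best classifier for each algorithm (LogReg, RandomForest, ...) and collect these best classifiers in one table. The first column or index should be the model, with the precision_recall_fscore_support values in the row to its right. The table should then be sorted by F-score.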
PS: Although the script works, I am not sure what the LogisticRegression() in the pipeline does, since it is defined again later in the search space.
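On the PS, a minimal sketch (not part of the original script) suggesting that the estimator passed to the Pipeline acts mainly as a placeholder for the 'classifier' step. It assumes that GridSearchCV applies each candidate through set_params on a clone of the pipeline, so a 'classifier' entry in the search space replaces the whole step. The n_estimators value is only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The estimator passed here only fills the 'classifier' slot.
pipe = Pipeline([('classifier', LogisticRegression())])
initial_type = type(pipe.named_steps['classifier']).__name__

# Setting the step name itself swaps in a different estimator,
# which is what a 'classifier' key in the search space does per candidate.
pipe.set_params(classifier=RandomForestClassifier(n_estimators=10))
swapped_type = type(pipe.named_steps['classifier']).__name__

print(initial_type, '->', swapped_type)
```

Any estimator could therefore sit in the pipeline initially, as long as every sub-grid in the search space sets 'classifier' explicitly.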
Solution (simplified):
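from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV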
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]
seed=1
models = [
'RFC',
'logisticRegression'
]
clfs = [
RandomForestClassifier(random_state=seed,n_jobs=-1),
LogisticRegression()
]
# Map each model name to its own parameter grid
params = {
    models[0]: {'n_estimators': [100]},
    models[1]: {'C': [1000]}
}
# Run one grid search per model and report its test-set metrics
for name, estimator in zip(models, clfs):
    print(name)
    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)
    clf.fit(X_train, y_train)
    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))
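The loop above only prints its results. A possible way to get the table described in the question is sketched below: one row per model, the weighted precision_recall_fscore_support values as columns, sorted by F-score. The `candidates` dictionary, its grids, the `max_iter` setting, and the column names are assumptions made for illustration, not part of the original post.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X_train, y_train = iris.data, iris.target
X_test, y_test = iris.data[:10], iris.target[:10]  # same slice as in the question

# Hypothetical candidates and grids; adjust to your own search.
candidates = {
    'RandomForest': (RandomForestClassifier(random_state=1), {'n_estimators': [10, 100]}),
    'LogisticRegression': (LogisticRegression(max_iter=1000), {'C': [1, 10, 100]}),
}

rows = {}
for name, (estimator, grid) in candidates.items():
    # One independent grid search per algorithm, refit on the best parameters.
    search = GridSearchCV(estimator, grid, scoring='f1_weighted', cv=5)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_test, y_pred, average='weighted', zero_division=0)
    rows[name] = {'precision': precision, 'recall': recall, 'fscore': fscore,
                  'best_params': str(search.best_params_)}

# One row per model, sorted so the highest F-score comes first.
table = pd.DataFrame.from_dict(rows, orient='index').sort_values('fscore', ascending=False)
table.index.name = 'model'
print(table)
```

Running a separate search per algorithm keeps each model's best estimator available for prediction, at the cost of not sharing a single GridSearchCV object.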
Comments:
@Cristopher, I hope my answer helps.
Answer 1: If I understand correctly, this should work.
import pandas as pd
import numpy as np
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
# The first line contains the best model and its parameters
df_final.to_csv('sorted_table.csv')
# Or, to leave the index out when writing
df_final.to_csv('sorted_table2.csv',index=False)
Result:
However, in this case the rows are not sorted by F-score. To sort by F-score, set the scoring parameter of GridSearchCV to f1_weighted and repeat my code.
Example:
...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0,scoring='f1_weighted')
best_model = clf.fit(X_train, y_train)
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
df_final.to_csv('F_sorted_table.csv')
Result:
Discussion:
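Thanks. I think this is almost the answer. For the example above, I expect only two rows: Logistic Regression and RandomForest, each with its best model and results. I think the models could be written into a dictionary, and the best result then retrieved for each dictionary element. I tried that but ran into several syntax problems.
Hi. In that case you can filter the df_final pandas DataFrame to keep the first Logistic and the first RandomForest found in the second column.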
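The filtering suggested in the last comment could be sketched as follows: label each row of cv_results_ with the class name of its classifier, sort by rank, and keep the first row per classifier. The reduced grids, the `max_iter` setting, and the `model` column name are assumptions chosen to keep the example short.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
pipe = Pipeline([('classifier', LogisticRegression())])

# Small hypothetical grids to keep the run short.
search_space = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [1, 10, 100]},
    {'classifier': [RandomForestClassifier(random_state=1)],
     'classifier__n_estimators': [10, 50]},
]
best_model = GridSearchCV(pipe, search_space, cv=5,
                          scoring='f1_weighted').fit(iris.data, iris.target)

results = pd.DataFrame(best_model.cv_results_)
# Label each row with the class name of the estimator it used.
results['model'] = [type(p['classifier']).__name__
                    for p in best_model.cv_results_['params']]

# Sort by rank, then keep only the first (best-ranked) row per model.
best_per_model = (results.sort_values('rank_test_score')
                  .drop_duplicates('model')
                  .set_index('model')[['mean_test_score', 'rank_test_score', 'params']])
print(best_per_model)
```

Because the search uses scoring='f1_weighted', mean_test_score here is the cross-validated weighted F-score, so the resulting two-row table is already ordered by it.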