Is there a Python Pipeline function for running multiple classifiers?

【Title】Is there a Python Pipeline function for running multiple Classifiers? 【Posted】2021-06-03 09:07:51 【Question】:

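As a general rule of thumb, baseline models should be run on a dataset first. I know that H2O AutoML and other AutoML packages can do this, but I want to try it with a Scikit-learn Pipeline.

Here is what I have done so far: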

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, make_scorer
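import pandas as pd

# Shared keyword arguments so every estimator and split uses the same seed
rs = {'random_state': 42}

# 60% train, then split the remaining 40% evenly into validation and test
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.6, **rs)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, train_size=0.5, **rs)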
# Classification - Model Pipeline
def train_models(X_train, X_val, X_test, y_train, y_val, y_test):
    log_reg = LogisticRegression(**rs)
    nb = BernoulliNB()
    knn = KNeighborsClassifier()
    svm = SVC(**rs)
    mlp = MLPClassifier(max_iter=5000, **rs)
    dt = DecisionTreeClassifier(**rs)
    et = ExtraTreesClassifier(**rs)
    rf = RandomForestClassifier(**rs)
    xgb = XGBClassifier(**rs, verbosity=0)
    scorer = make_scorer(f1_score)

    clfs = [('Logistic Regression', log_reg), ('Naive Bayes', nb),
            ('K-Nearest Neighbors', knn), ('SVM', svm), 
            ('MLP', mlp), ('Decision Tree', dt), ('Extra Trees', et), 
            ('Random Forest', rf), ('XGBoost', xgb)]
    pipelines = []
    rows = []  # one dict per model; DataFrame.append was removed in pandas 2.0
    for clf_name, clf in clfs:
        # Scale the features, then fit the classifier
        pipeline = Pipeline(steps=[
            ('scaler', StandardScaler()),
            ('classifier', clf)])
        pipeline.fit(X_train, y_train)
        # cross_val_score clones the pipeline and refits it on folds of the validation set
        val_score = cross_val_score(pipeline, X_val, y_val, scoring=scorer, cv=3).mean()
        print(f'{clf_name}\n{"-" * 30}\nModel Val-Score: {val_score:.4f}')
        test_score = f1_score(y_test, pipeline.predict(X_test))
        print(f'Model F1-Score: {test_score:.4f}\n\n')
        pipelines.append(pipeline)
        rows.append({'Model': clf_name,
                     'Val_Score': val_score,
                     'F1_Score': test_score})
    scores_df = pd.DataFrame(rows, columns=['Model', 'Val_Score', 'F1_Score'])
    return pipelines, scores_df

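I am hoping to learn from more experienced programmers by discussing this. I am looking for a suggestion, a reference, or an efficient way to do it.

What is an efficient way to build a pipeline for a machine learning classification problem?

【Comments】: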

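Have you looked at bagging strategies?

No, could you give me any reference links?

Bagging is a way to improve accuracy. The challenge is to learn from errors or new data without overfitting or learning the bias in the classifier.

Check the variance. High-variance models lead to overfitting. There is a trade-off between model complexity and variance and bias: the more complex the model, the lower the bias but the higher the variance. High variance with low bias means the function overshoots the target by picking up noise.

I included a voting classifier. It combines the predictions of the individual classifiers by voting and uses the combined result.

【Answer 1】: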

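Typically, a pipeline is built to solve a specific task that requires one or more classifiers working together. In your case, however, there are many classifiers working independently rather than jointly. If you want to learn more about pipelines, you can look at several examples from Huggingface.

Below is an example of a pipeline for a sentiment-analysis task: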

>>> from transformers import pipeline

>>> nlp = pipeline("sentiment-analysis")

>>> result = nlp("I hate you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991

>>> result = nlp("I love you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
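
For the case described above, where many classifiers run independently, one option is to wrap each classifier in its own scikit-learn Pipeline and cross-validate them in a loop. The following is only a minimal sketch under stated assumptions: the data is synthetic (`make_classification` stands in for the asker's `features` and `target`), and `baseline_scores` is a hypothetical helper name, not a scikit-learn function.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data replaces the asker's features/target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)


def baseline_scores(classifiers, X, y, cv=3):
    """Hypothetical helper: mean cross-validated F1 for each classifier."""
    scores = {}
    for name, clf in classifiers.items():
        # The scaler sits inside the pipeline, so it is refit on every
        # training fold and never sees the held-out fold.
        pipe = Pipeline([('scaler', StandardScaler()), ('classifier', clf)])
        scores[name] = cross_val_score(pipe, X, y, scoring='f1', cv=cv).mean()
    return scores


classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

scores = baseline_scores(classifiers, X, y)
# Print the models from best to worst mean F1
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.4f}')
```

Keeping the scaler inside each pipeline means the scaling statistics come only from the training folds, which is the main reason to use a Pipeline here rather than scaling the whole dataset up front.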

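【Comments】:

【Answer 2】:

Try a voting classifier. It is an ensemble strategy that combines the predictions of several classifiers by voting, instead of relying on a single classifier.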

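from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt. A float min_samples_leaf must be a fraction in (0, 1);
# the original 1.3 is not a valid value, so 0.13 is assumed here.
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list of classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Instantiate a VotingClassifier vc (majority vote by default)
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('{:s} : {:.3f}'.format('Voting Classifier', accuracy))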
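
The voting approach above can also be tried end to end on synthetic data. This is a minimal sketch, assuming `make_classification` data in place of the asker's dataset; the estimator names and seed are illustrative. It compares `voting='hard'` (majority vote on predicted labels) with `voting='soft'` (average of predicted probabilities).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Assumption: synthetic binary data replaces the asker's dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)

estimators = [
    ('lr', LogisticRegression(random_state=SEED, max_iter=1000)),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(random_state=SEED)),
]

results = {}
for voting in ('hard', 'soft'):
    # VotingClassifier clones the estimators when fitting, so the same
    # list can be reused for both voting modes.
    vc = VotingClassifier(estimators=estimators, voting=voting)
    vc.fit(X_train, y_train)
    results[voting] = accuracy_score(y_test, vc.predict(X_test))
    print(f'{voting} voting accuracy: {results[voting]:.3f}')
```

Soft voting requires every estimator to implement `predict_proba`, which holds for the three used here; an `SVC` would need `probability=True` to take part in it.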

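Try BaggingClassifier to improve accuracy. Other ensemble classifiers include AdaBoostClassifier and GradientBoostingClassifier, although these are boosting rather than bagging methods. At that point you may also want to consider PyTorch to address accuracy problems. I see you have MLPClassifier, which is a neural network, but it is not used on its own and its configuration is not defined.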

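from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 1

# Stratified 70/30 split so both sets keep the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,
                                                    random_state=SEED)

# A float min_samples_leaf must be a fraction in (0, 1);
# the original 1.6 is not a valid value, so 0.16 is assumed here.
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=SEED)

# The base estimator is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.
bc = BaggingClassifier(dt, n_estimators=300, n_jobs=-1)

# Fit bc to the training set
bc.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred = bc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('{:s} : {:.3f}'.format('Bagging Classifier', accuracy))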
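
Whether bagging actually helps depends on the data, so it is worth checking against a single tree. The sketch below is a minimal example under assumptions: the data is synthetic (`make_classification`), the settings are illustrative, and `oob_score=True` is used to get an out-of-bag accuracy estimate without a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Assumption: synthetic binary data replaces the asker's dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)

# A single unpruned tree as the reference model
tree = DecisionTreeClassifier(random_state=SEED)
tree.fit(X_train, y_train)

# 100 trees, each fit on a bootstrap sample of the training set.
# BaggingClassifier clones the base estimator, so `tree` itself is untouched.
bagged = BaggingClassifier(tree, n_estimators=100, oob_score=True,
                           random_state=SEED)
bagged.fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
bag_acc = accuracy_score(y_test, bagged.predict(X_test))

print(f'Single tree accuracy : {tree_acc:.3f}')
print(f'Bagged trees accuracy: {bag_acc:.3f}')
print(f'Out-of-bag estimate  : {bagged.oob_score_:.3f}')
```

The out-of-bag estimate scores each training sample using only the trees that did not see it in their bootstrap sample, so it can serve as a rough check against the held-out test accuracy.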

【Comments】:
