Neuraxle AutoML - Why is it erroring?

Posted: 2021-05-08 03:16:20

Question:

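I am experimenting with Neuraxle by following the AutoML example. The unmodified example works fine. When I modify it to include my own pipeline components before ChooseOneStepOf(classifiers), it fails, and I don't understand why.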

from neuraxle.base import BaseTransformer
from neuraxle.pipeline import Pipeline
from neuraxle.hyperparams.space import HyperparameterSpace
from neuraxle.steps.numpy import NumpyRavel
from neuraxle.steps.output_handlers import OutputTransformerWrapper
from typing import List

from sklearn.preprocessing import OneHotEncoder
from neuraxle.pipeline import Pipeline
from neuraxle.union import FeatureUnion
from sklearn.impute import SimpleImputer

# sklearn classifiers, and sklearn wrapper for neuraxle
from neuraxle.steps.sklearn import SKLearnWrapper
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.linear_model import RidgeClassifier, LogisticRegression

# neuraxle distributions
from neuraxle.hyperparams.distributions import Choice, RandInt, Boolean, LogUniform

from neuraxle.steps.flow import ChooseOneStepOf
from neuraxle.base import BaseTransformer, ForceHandleMixin
from neuraxle.metaopt.auto_ml import ValidationSplitter
from neuraxle.metaopt.callbacks import ScoringCallback
from sklearn.metrics import accuracy_score
from neuraxle.metaopt.callbacks import MetricCallback
from sklearn.metrics import f1_score, precision_score, recall_score
from neuraxle.metaopt.auto_ml import InMemoryHyperparamsRepository
from neuraxle.plotting import TrialMetricsPlottingObserver
from neuraxle.metaopt.tpe import TreeParzenEstimatorHyperparameterSelectionStrategy
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
from neuraxle.metaopt.auto_ml import AutoML
import os

classifiers: List[BaseTransformer] = [
    SKLearnWrapper(DecisionTreeClassifier(), HyperparameterSpace({
        'criterion': Choice(['gini', 'entropy']),
        'splitter': Choice(['best', 'random']),
        'min_samples_leaf': RandInt(2, 5),
        'min_samples_split': RandInt(1, 3)
    })).set_name('DecisionTreeClassifier'),
    Pipeline([
        OutputTransformerWrapper(NumpyRavel()),
        SKLearnWrapper(RidgeClassifier(), HyperparameterSpace({
            'alpha': Choice([(0.0, 1.0, 10.0), (0.0, 10.0, 100.0)]),
            'fit_intercept': Boolean(),
            'normalize': Boolean()
        }))
    ]).set_name('RidgeClassifier'),
    Pipeline([
        OutputTransformerWrapper(NumpyRavel()),
        SKLearnWrapper(LogisticRegression(), HyperparameterSpace({
            'C': LogUniform(0.01, 10.0),
            'fit_intercept': Boolean(),
            'dual': Boolean(),
            'penalty': Choice(['l1', 'l2']),
            'max_iter': RandInt(20, 200)
        }))
    ]).set_name('LogisticRegression')
]


class ColumnSelectTransformer(BaseTransformer, ForceHandleMixin):

    def __init__(self, required_columns):
        BaseTransformer.__init__(self)
        ForceHandleMixin.__init__(self)
        self.required_columns = required_columns

    def inverse_transform(self, processed_outputs):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X[self.required_columns]


columns = ['BEDCERT', 'RESTOT', 'INHOSP', 'CCRC_FACIL',
           'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS',
           'EXP_TOTAL', 'ADJ_TOTAL']

simple_features = Pipeline([ColumnSelectTransformer(columns),
                            SimpleImputer(missing_values=np.nan,
                                          strategy='mean')])

categorical_features = Pipeline([ColumnSelectTransformer(['OWNERSHIP', 'CERTIFICATION']),
                                 OneHotEncoder(sparse=False)
                                 ])
business_features = FeatureUnion([simple_features,
                                  categorical_features])

p: Pipeline = Pipeline([
    business_features,
    ChooseOneStepOf(classifiers)
])


validation_splitter = ValidationSplitter(test_size=0.20)

scoring_callback = ScoringCallback(
    metric_function=accuracy_score,
    name='accuracy',
    higher_score_is_better=False,
    print_metrics=False
)

callbacks = [
    MetricCallback('f1', metric_function=f1_score, higher_score_is_better=True, print_metrics=False),
    MetricCallback('precision', metric_function=precision_score, higher_score_is_better=True, print_metrics=False),
    MetricCallback('recall', metric_function=recall_score, higher_score_is_better=True, print_metrics=False)
]

hyperparams_repository = InMemoryHyperparamsRepository(cache_folder='cache')

hyperparams_repository.subscribe(TrialMetricsPlottingObserver(
    plotting_folder_name='metric_results',
    save_plots=False,
    plot_trial_on_next=False,
    plot_all_trials_on_complete=True,
    plot_individual_trials_on_complete=False
))

hyperparams_optimizer = TreeParzenEstimatorHyperparameterSelectionStrategy(
    number_of_initial_random_step=10,
    quantile_threshold=0.3,
    number_good_trials_max_cap=25,
    number_possible_hyperparams_candidates=100,
    prior_weight=0.,
    use_linear_forgetting_weights=False,
    number_recent_trial_at_full_weights=25
)

tmpdir = 'cache'
if not os.path.exists(tmpdir):
    os.makedirs(tmpdir)

n_trials = 10
n_epochs = 10

auto_ml = AutoML(
    pipeline=p,
    validation_splitter=validation_splitter,
    refit_trial=True,
    n_trials=n_trials,
    epochs=n_epochs,
    cache_folder_when_no_handle=str(tmpdir),
    scoring_callback=scoring_callback,
    callbacks=callbacks,
    hyperparams_repository=hyperparams_repository
)


def generate_classification_data():
    # data_inputs, expected_outputs = make_classification(
    #     n_samples=10000,
    #     n_repeated=0,
    #     n_classes=3,
    #     n_features=4,
    #     n_clusters_per_class=1,
    #     class_sep=1.5,
    #     flip_y=0,
    #     weights=[0.5, 0.5, 0.5]
    # )

    data = pd.read_csv('./ml-data/providers-train.csv', encoding='latin1')
    fine_counts = data.pop('FINE_CNT')
    fine_totals = data.pop('FINE_TOT')
    cycle_2_score = data.pop('CYCLE_2_TOTAL_SCORE')

    X_train, X_test, y_train, y_test = train_test_split(
        data,
        fine_counts > 1,
        test_size=0.20
    )

    return X_train, y_train, X_test, y_test


X_train, y_train, X_test, y_test = generate_classification_data()

auto_ml = auto_ml.fit(X_train, y_train)



The output is as follows:

/Users/simon/venvs/wqu_q4/bin/python /Users/simon/Dev/wqu_q4/main.py
new trial: "ChooseOneStepOf": "choice": "RidgeClassifier"
trial 1/10
Traceback (most recent call last):
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 785, in _fit_data_container
    repo_trial_split = self.trainer.execute_trial(
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/trial.py", line 290, in __exit__
    raise exc_val
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 785, in _fit_data_container
    repo_trial_split = self.trainer.execute_trial(
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 595, in execute_trial
    self.print_func('success trial score: '.format(
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/trial.py", line 570, in __exit__
    raise exc_val
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 574, in execute_trial
    trial_split_description = _get_trial_split_description(
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 876, in _get_trial_split_description
    json.dumps(repo_trial.hyperparams, sort_keys=True, indent=4)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/simon/Dev/wqu_q4/main.py", line 210, in <module>
    auto_ml = auto_ml.fit(X_train, y_train)
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/base.py", line 3475, in fit
    new_self = self.handle_fit(data_container, context)
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/base.py", line 980, in handle_fit
    new_self = self._fit_data_container(data_container, context)
  File "/Users/simon/venvs/wqu_q4/lib/python3.9/site-packages/neuraxle/metaopt/auto_ml.py", line 802, in _fit_data_container
    repo_trial_split=repo_trial_split,
UnboundLocalError: local variable 'repo_trial_split' referenced before assignment

Process finished with exit code 1
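For readers following the traceback: the first exception is raised by json.dumps(repo_trial.hyperparams, ...), and "Object of type type" means that some hyperparameter value is a class object rather than a number, string, or list. Below is a minimal sketch for locating such values, assuming the hyperparameters can be viewed as a plain nested dict. The helper find_unserializable, the "__" path separator, and the sample dictionary are hypothetical and not part of Neuraxle.

```python
import json


def find_unserializable(params, prefix=""):
    """Return (path, value) pairs for leaf values that json.dumps cannot encode."""
    problems = []
    for key, value in params.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the path.
            problems.extend(find_unserializable(value, prefix=path + "__"))
            continue
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            problems.append((path, value))
    return problems


# Hypothetical hyperparameters: one value is a type, as in the TypeError above.
sample = {
    "RidgeClassifier": {"alpha": 1.0, "fit_intercept": True},
    "Encoder": {"dtype": float, "sparse": False},
}
bad = find_unserializable(sample)
print(bad)
```

In a pipeline like the one above, a scikit-learn parameter that holds a type (for example a dtype argument) would be one candidate to check, but that is a guess to verify rather than a confirmed diagnosis.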



Comments:

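What version of Neuraxle are you using? Can you try updating it? I can't see a reason why this would fail. It seems to be trying to use a variable that doesn't exist, which is strange. Maybe reset your .pyc precompiled files, or try reinstalling your venv?

0.5.6 in PyCharm, which is odd because GitHub indicates you only have 0.5.5.

I wonder whether it has to do with my ColumnTransformer not returning data in the right format. Does it need to be in a numpy array, or can I return a pandas DataFrame?

I created the virtual environment just for this purpose, but I can try again.

Answer 1: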

A few notes that may help you with your current problem:

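    "UnboundLocalError: local variable 'repo_trial_split' referenced before assignment" is the error raised when a crash occurs inside the pipeline during the AutoML loop. The real error should be logged above the one you posted here. In addition, Neuraxle version 0.5.7 (not yet released, but available on GitHub) addresses this by adding a parameter named "continue_loop_on_error", which you should set to False.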

    You appear to be using ForceHandleMixin for your ColumnSelectTransformer instance. Using ForceHandleMixin means you should define the functions _fit_data_container, _transform_data_container and _fit_transform_data_container instead of fit/fit_transform/transform.

    You may need to write a Neuraxle class that wraps scikit-learn's SimpleImputer.
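The masking effect described in the first note can be reproduced without Neuraxle. The sketch below is a standard-library illustration only, not Neuraxle's actual code: run_trial, fit_loop, and root_cause are hypothetical names. It shows how an error handler that reads a never-assigned variable raises UnboundLocalError, and how the original exception can still be recovered by walking the `__context__` chain.

```python
import json


def run_trial(hyperparams):
    """Hypothetical stand-in for one trial: it only serializes the hyperparameters."""
    return json.dumps(hyperparams, sort_keys=True, indent=4)


def fit_loop(hyperparams):
    """Mimics a loop whose error handler uses a variable that was never assigned."""
    try:
        trial_result = run_trial(hyperparams)
    except Exception:
        # run_trial raised, so trial_result was never bound. Reading it here
        # raises UnboundLocalError while the real error is still being handled.
        print("trial failed:", trial_result)
    return trial_result


def root_cause(exc):
    """Follow the implicit exception chain back to the first error."""
    while exc.__context__ is not None:
        exc = exc.__context__
    return exc


try:
    # A class object (here `float`) is not JSON serializable.
    fit_loop({"dtype": float})
except UnboundLocalError as masking_error:
    original = root_cause(masking_error)
    print(type(masking_error).__name__, "hides", type(original).__name__)
    print(original)
```

When reading a chained traceback like the one in the question, the same idea applies: the exception printed first, above "During handling of the above exception, another exception occurred", is usually the one to fix.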

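Hope this helps. Once you have made these changes, feel free to post an update here and I will be happy to help. You can also post on Neuraxle's Slack, where I will probably answer faster.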

Cheers!

P.S. On another note, I will be releasing version 0.5.7 in the next few days.

Comments:

Thanks for the reply, I will investigate further.
