XGBoost - n_estimators = 1 equal to single-tree classifier?

Posted 2023-03-12


[Posted]: 2019-04-13 06:48:32

[Question]:

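I have a few training pipelines that rely heavily on XGBoost rather than scikit-learn, purely because of how cleanly XGBoost handles null values.

However, I have been tasked with introducing non-technical people to machine learning, and I thought it would be good to take the idea of a single-tree classifier and talk about how XGBoost generally takes that data structure and "puts it on steroids". Specifically, I want to plot this single-tree classifier to show the cut points.

Would specifying n_estimators=1 be roughly equivalent to using scikit-learn's DecisionTreeClassifier?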

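[Comments]:

AFAIK it is, but why not try it and provide an example, as done here: Why is Random Forest with a single tree much better than a Decision Tree classifier? Otherwise, this sounds like a theoretical question and hence not quite suitable for SO...

Sure, why not :) Unless there is an error in my findings, they look the same.

Indeed; I'm not quite sure about the exact meaning of reg_lambda - maybe it should be set to 0 as well (see this discussion)?

Now, I seriously suggest you take the update and post it as an answer to your original question... :)

[Answer 1]: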

import subprocess

import numpy as np
from xgboost import XGBClassifier, plot_tree

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import metrics

import matplotlib.pyplot as plt

RANDOM_STATE = 100
# Shared settings (not used below; each model is given its values directly)
params = {
    'max_depth': 5,
    'min_samples_leaf': 5,
    'random_state': RANDOM_STATE
}

X, y = make_classification(
    n_samples=1000000,
    n_features=5,
    random_state=RANDOM_STATE
)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=RANDOM_STATE)

# __init__(self, max_depth=3, learning_rate=0.1,
# n_estimators=100, silent=True,
# objective='binary:logistic', booster='gbtree',
# n_jobs=1, nthread=None, gamma=0,
# min_child_weight=1, max_delta_step=0,
# subsample=1, colsample_bytree=1, colsample_bylevel=1,
# reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
# base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
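# XGBoost has no min_samples_leaf parameter; min_child_weight (default 1)
# is its closest analogue, so only the genuinely shared settings are passed.
xgb_model = XGBClassifier(
    n_estimators=1,
    max_depth=3,
    random_state=RANDOM_STATE
)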

# __init__(self, criterion='gini',
# splitter='best', max_depth=None,
# min_samples_split=2, min_samples_leaf=1,
# min_weight_fraction_leaf=0.0, max_features=None,
# random_state=None, max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# class_weight=None, presort=False)
sk_model = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    random_state=RANDOM_STATE
)

xgb_model.fit(Xtrain, ytrain)
xgb_pred = xgb_model.predict(Xtest)

sk_model.fit(Xtrain, ytrain)
sk_pred = sk_model.predict(Xtest)

print(metrics.classification_report(ytest, xgb_pred))
print(metrics.classification_report(ytest, sk_pred))

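# Plot the single XGBoost tree, laid out left to right
plot_tree(xgb_model, rankdir='LR')
plt.show()

# Export the scikit-learn tree to Graphviz and render it as a PNG
# (requires the Graphviz `dot` executable on the PATH)
export_graphviz(sk_model, 'sk_model.dot')
subprocess.call('dot -Tpng sk_model.dot -o sk_model.png'.split())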

Some performance metrics (I know, I haven't fully calibrated the classifiers)...

>>> print(metrics.classification_report(ytest, xgb_pred))
              precision    recall  f1-score   support

           0       0.86      0.82      0.84    125036
           1       0.83      0.87      0.85    124964

   micro avg       0.85      0.85      0.85    250000
   macro avg       0.85      0.85      0.85    250000
weighted avg       0.85      0.85      0.85    250000

>>> print(metrics.classification_report(ytest, sk_pred))
              precision    recall  f1-score   support

           0       0.86      0.82      0.84    125036
           1       0.83      0.87      0.85    124964

   micro avg       0.85      0.85      0.85    250000
   macro avg       0.85      0.85      0.85    250000
weighted avg       0.85      0.85      0.85    250000

And some pictures:

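So, barring any investigative errors or overgeneralizations, an XGBClassifier (and, I would assume, the regressor) with one estimator seems to behave the same as a scikit-learn DecisionTreeClassifier with the same shared parameters.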


[Answer 2]:

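If you set n_estimators=1, that is exactly how a decision tree works. There are several ways of splitting nodes (such as the Gini index and entropy); I'm not sure which one scikit-learn uses and which one xgboost uses, but it doesn't matter.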
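To make the two splitting criteria mentioned above concrete, here is a minimal pure-Python sketch of Gini impurity and entropy on hypothetical toy labels. The function names and the data are illustrative assumptions, not code from either library. For reference, the DecisionTreeClassifier signature quoted in answer 1 lists criterion='gini' as the default, while XGBoost scores splits with a gradient-based gain rather than either of these measures.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Hypothetical toy labels, split two different ways.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
split_a = ([0, 0, 0, 1], [0, 1, 1, 1])  # children are mostly pure
split_b = ([0, 0, 1, 1], [0, 0, 1, 1])  # children look like the parent

for name, (left, right) in [("A", split_a), ("B", split_b)]:
    print(name,
          round(weighted_impurity(left, right, gini), 3),
          round(weighted_impurity(left, right, entropy), 3))
```

On this toy data both measures prefer split A over split B, which illustrates why the choice of criterion often changes the resulting tree very little.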

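You want to show the core features and the deep ideas behind building a decision tree. I recommend the following lecture by Prof. Patrick Winston. I used it myself to demonstrate to my peers how decision trees work, and it went well.

Then you can add the idea of boosting to the mix. Patrick also lectures on it in here.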


[Answer 3]:

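Setting n_estimators=1 in XGBoost makes the algorithm generate a single tree (essentially no boosting takes place), which is similar to sklearn's single-tree algorithm, DecisionTreeClassifier.

However, the hyperparameters that can be tuned and the tree generation process differ between the two. While sklearn's DecisionTreeClassifier lets you tune more hyperparameters than xgboost, xgboost yields better accuracy after hyperparameter tuning. A single tree generated by xgboost outperforms a single tree generated by sklearn's DecisionTreeClassifier.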
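One concrete way the tree generation differs: XGBoost fits leaf weights and scores splits from the gradients and hessians of the loss, using the formulas w* = -G / (H + lambda) and gain = 0.5 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - G^2/(H + lambda)] - gamma from the XGBoost paper. The sketch below works through them in plain Python, assuming binary log loss and a starting probability of 0.5 (the base_score, learning_rate=0.1 and reg_lambda=1 values shown in the signature quoted in answer 1). All function names and the toy labels are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(labels, margin=0.0):
    """Summed gradient and hessian of log loss at a constant margin.

    margin=0.0 corresponds to a starting probability of 0.5.
    """
    p = sigmoid(margin)
    g = sum(p - y for y in labels)
    h = len(labels) * p * (1.0 - p)
    return g, h


def score(g, h, reg_lambda):
    """Structure score G^2 / (H + lambda) of one node."""
    return g * g / (h + reg_lambda)


def leaf_weight(labels, reg_lambda=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    g, h = grad_hess(labels)
    return -g / (h + reg_lambda)


def split_gain(left, right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into `left` and `right`."""
    gl, hl = grad_hess(left)
    gr, hr = grad_hess(right)
    children = score(gl, hl, reg_lambda) + score(gr, hr, reg_lambda)
    parent = score(gl + gr, hl + hr, reg_lambda)
    return 0.5 * (children - parent) - gamma


def leaf_probability(labels, learning_rate=0.1, reg_lambda=1.0):
    """Probability a one-tree model predicts for rows landing in this leaf."""
    return sigmoid(learning_rate * leaf_weight(labels, reg_lambda))


left = [0, 0, 0, 1]   # hypothetical leaf, 25% positives
right = [0, 1, 1, 1]  # hypothetical leaf, 75% positives
print(split_gain(left, right))
print(leaf_probability(left), leaf_probability(right))
```

Under these assumptions the single tree's leaf probabilities stay close to 0.5 instead of matching the class fractions 0.25 and 0.75 that a plain decision tree would report, yet they fall on the same side of the 0.5 threshold. That is consistent with the two classification reports in answer 1 being identical even though the models are not the same internally.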

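Another advantage of xgboost is that it handles missing values by itself. With DecisionTreeClassifier, we have to explicitly define a function to handle missing values, which may produce different results.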
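A rough sketch of the idea behind that built-in handling, following the sparsity-aware split finding described in the XGBoost paper: for a candidate split, rows whose feature is missing are tried on both sides, and the side with the higher gain becomes the default direction. The helper names, the toy rows, and the tie-breaking rule are assumptions for illustration; the real implementation is considerably more involved.

```python
def score(g, h, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) of one child node."""
    return g * g / (h + reg_lambda)


def default_direction(rows, threshold, reg_lambda=1.0):
    """Pick the side that rows with a missing feature should be sent to.

    rows: list of (feature_value_or_None, gradient, hessian) tuples.
    """
    gl = hl = gr = hr = gm = hm = 0.0
    for x, g, h in rows:
        if x is None:          # missing value: its side is decided below
            gm, hm = gm + g, hm + h
        elif x < threshold:
            gl, hl = gl + g, hl + h
        else:
            gr, hr = gr + g, hr + h

    # The parent term of the gain is identical for both options,
    # so comparing the summed child scores is enough.
    missing_left = score(gl + gm, hl + hm, reg_lambda) + score(gr, hr, reg_lambda)
    missing_right = score(gl, hl, reg_lambda) + score(gr + gm, hr + hm, reg_lambda)
    # Ties go left here; that tie-break is an assumption of this sketch.
    return "left" if missing_left >= missing_right else "right"


def logloss_row(x, y, p=0.5):
    """(feature, gradient, hessian) of log loss for one row at probability p."""
    return (x, p - y, p * (1.0 - p))


# Hypothetical data: low values are class 0; high and missing values are class 1.
rows = [logloss_row(x, y) for x, y in
        [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (None, 1), (None, 1)]]

direction = default_direction(rows, threshold=5.0)
print(direction)
```

Because the rows with missing values share the label of the high-value rows in this toy data, sending them right gives the purer children, so the sketch picks that side as the default.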

So, choose xgboost with n_estimators=1 over sklearn's DecisionTreeClassifier!
