SHAP: XGBoost and LightGBM difference in shap_values calculation
Posted: 2022-01-23 18:40:58

Problem description: I have this code in Visual Studio Code:
import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, accuracy_score
df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=22)
xgb_model = xgb.XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)
explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")
When I run this code, I get this error:
Summary plots need a matrix of shap_values, not a vector.
on the shap.summary_plot line.
What is the problem, and how can I fix it?
The code above is based on this code example: https://github.com/slundberg/shap.
The dataset looks like this:
Cat1,Cat2,Age,Cat3,Cat4,target
0,0,18,1,0,1
0,0,17,1,0,1
0,0,15,1,1,1
0,0,15,1,0,1
0,0,16,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,17,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,0,15,1,0,1
0,0,15,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,1,17,1,0,0
0,1,16,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,15,0,0,1
0,0,16,1,0,1
0,1,15,1,0,1
Note that the actual data has 700 rows; I copied a small portion just to show what the data looks like.
Edit 1
The main point of this question is to understand how the code should change when a different classifier is used.
I originally had sample code using LightGBM, but when I changed it to XGBoost, it produced an error on the summary plot.
To illustrate what I mean, I put together the following sample code:
import pandas as pd
import shap
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import train_test_split
df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
# select one of the two models
model = xgb.XGBClassifier()
#model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.Explainer(model_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")
If I use the LGBM model it works fine, but if I use XGBoost it fails. What is the difference, and how should I change the code so that XGBoost behaves like LGBM and the application works?
Comments:
What is mydataset.csv?
@SergeyBushmanov The file contains a sample dataset with some features and a target. I don't think the problem is with the dataset.
@mans Can you provide some sample data? This error may happen because shap_values may be (m, n)-dimensional while shap_values[1] may be a one-dimensional vector.
@mans I am familiar with the DS process. The point is "show your data", or provide a minimal reproducible example. Side note: target=df['target'] should be target=df.pop('target'). Please show us your data.
@MiguelTrejo, please find the requested information in the edited question. Note that if I use shap.plot.bar() then I don't have any problem.
Answer 1:
Note that with summary_plot() you want to visualize which features are, overall, more important to the model, so it needs a matrix:

For single output explanations this is a matrix of SHAP values (# samples x # features).

The result of shap_values = explainer.shap_values(X_test) is a matrix of shape (n_samples, 5) (the columns in the sample data). When you take the first sample, shap_values[0] is a vector explaining the feature contributions of that single prediction, and that is why Summary plots need a matrix of shap_values, not a vector. is raised.
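As a quick check of this in the question's setup (a sketch, assuming the binary XGBClassifier and the 5 feature columns from the sample data, and a shap version where XGBoost returns a single array):

shap_values = explainer.shap_values(X_test)
print(np.shape(shap_values))     # (n_test_samples, 5): a matrix, fine for summary_plot
print(np.shape(shap_values[1]))  # (5,): a single row, which triggers the "not a vector" error
shap.summary_plot(shap_values, X_test)  # pass the whole matrix instead of one row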
If you want to visualize a single prediction, shap_values[0], you can use force_plot:
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0])
Edit
The difference between the two models' outputs is how the out result is computed. Checking the source code of the lightgbm computation: once the variable phi has been computed, it concatenates the values

phi = np.concatenate((0-phi, phi), axis=-1)

producing an array of shape (n_samples, 2*(n_features+1)). This shape differs from X_test, that is phi.shape[1] != X.shape[1] + 1, so it reshapes phi into a three-dimensional array

phi = phi.reshape(X.shape[0], phi.shape[1]//(X.shape[1]+1), X.shape[1]+1)

and finally the output is a list of length two:

out = [phi[:, i, :-1] for i in range(phi.shape[1])]
out
>>>
[array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
...
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
...
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])]
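To make those shapes concrete, here is a toy walk-through of the same arithmetic (not from the shap source; it just assumes 14 samples, 5 features and a binary objective):

import numpy as np

n_samples, n_features = 14, 5
phi = np.zeros((n_samples, n_features + 1))    # pred_contrib output: one column per feature plus the bias term
phi = np.concatenate((0 - phi, phi), axis=-1)  # binary objective: shape becomes (14, 12)
phi = phi.reshape(n_samples, phi.shape[1] // (n_features + 1), n_features + 1)  # (14, 2, 6)
out = [phi[:, i, :-1] for i in range(phi.shape[1])]  # a list of two (14, 5) matrices, one per class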
See the examples below for how the out computation differs.

LightGBM example
import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb
import shap.explainers as explainers
from sklearn.model_selection import train_test_split
df = pd.read_csv("test_data.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)
model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)
# Calculate phi from https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L347
tree_limit = -1 if explainer.model.tree_limit is None else explainer.model.tree_limit
phi = explainer.model.original_model.predict(X_test, num_iteration=tree_limit, pred_contrib=True)
# Objective is binary: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L349
if explainer.model.original_model.params['objective'] == 'binary':
phi = np.concatenate((0-phi, phi), axis=-1)
# Phi shape is different from X_test:
if phi.shape[1] != X_test.shape[1] + 1:
phi = phi.reshape(X_test.shape[0], phi.shape[1]//(X_test.shape[1]+1), X_test.shape[1]+1)
# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L370
expected_value = [phi[0, i, -1] for i in range(phi.shape[1])]
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
expected_value
>>> [-0.8109302162163288, 0.8109302162163288]
out
>>>
[array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])]
XGBoost example
import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb
import shap.explainers as explainers
from sklearn.model_selection import train_test_split
df = pd.read_csv("test_data.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)
model = xgb.XGBClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)
# Transform data to DMatrix: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L326
if not isinstance(X_test, xgb.core.DMatrix):
X_test = xgb.DMatrix(X_test)
tree_limit = explainer.model.tree_limit
# Calculate phi: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L331
phi = explainer.model.original_model.predict(
X_test, ntree_limit=tree_limit, pred_contribs=True,
approx_contribs=False, validate_features=False
)
# Model output is "raw": https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L339
model_output_vals = explainer.model.original_model.predict(
X_test, ntree_limit=tree_limit, output_margin=True,
validate_features=False
)
model_output_vals
>>> array([-0.11323176, -0.11323176, 0.5436669 , 0.87637275, 1.5332711 ,
-0.11323176, 1.5332711 , 0.5436669 , 1.5332711 , 0.5436669 ,
0.87637275, 0.87637275, -0.11323176, 0.5436669 ], dtype=float32)
# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L374
expected_value_ = phi[0, -1]
expected_value_
>>> 0.817982
out_ = phi[:, :-1]
out_
>>>
array([[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , 0.4087782 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, 0.4087782 , 0. , 0. ],
[ 0. , -0.35038763, -0.5808259 , 0. , 0. ],
[ 0. , 0.3065111 , -0.5808259 , 0. , 0. ]],
dtype=float32)
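So, for the question's second snippet, a minimal way to make the same plotting code work with either classifier (a sketch, not from the answer) is to branch on whether shap_values comes back as a list:

shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):   # LightGBM binary: a list of two (n_samples, n_features) arrays
    values = shap_values[1]         # SHAP values for class 1
else:                               # XGBoost binary: a single (n_samples, n_features) array
    values = shap_values
shap.summary_plot(values, X_test)
shap.summary_plot(values, X_test, plot_type="bar")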
Comments:

Please see the updated question. Why does the same code work for LGBM but not for XGBoost?
@mans Checking the source code, the results for lightgbm and xgboost are computed differently; see the updated answer for an example.

Answer 2:
Assuming you have copied the data from the question above, you can do the following:
import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import (
train_test_split,
StratifiedKFold,
cross_validate,
cross_val_score,
)
from sklearn.metrics import (
classification_report,
ConfusionMatrixDisplay,
accuracy_score,
)
df = pd.read_clipboard(sep=",")
target = df.pop("target")
X_train, X_test, y_train, y_test = train_test_split(
df, target, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=42
)
xgb_model = xgb.XGBClassifier(eval_metric="mlogloss", use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)
explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
# shap.summary_plot(shap_values, X_test, plot_type="bar")
The code you pasted assumed two ["identical"] arrays of SHAP values, one per class "0" and "1". The way explainer.shap_values calculates SHAP values for XGBoost has changed since then, so now supplying shap_values (without a class index) is enough.
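A quick way to confirm which form your installed versions return (a sketch; the exact shapes depend on the shap and xgboost versions):

shap_values = explainer.shap_values(X_test)
print(type(shap_values))      # list -> one array per class; ndarray -> a single matrix
print(np.shape(shap_values))  # e.g. (n_test_samples, 5) here, so no class index is needed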
Comments:

Thanks, but what does the plot show? Is it class 0 or class 1? If I use the same plot command for xgboost, I get a different type of plot explaining each class.
These are for class "1". Assuming you run the same code as I did, we should get the same plot.
How can I get the information for class 0?
They are identical up to a sign: increasing the chance of 1 decreases the chance of 0 by the same amount (I guess due to additivity).
@mans I believe the original question has been answered as asked; the code now works and the reason is explained. If you want more, you might consider opening another question.