How to get independent probabilities of all classes for each sample with predict_proba?

Posted 2023-03-12

Asked: 2021-05-08 02:35:32

Question:

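In my work there is a feature set that consists entirely of boolean data, and each feature vector belongs to a class. The classes are strings.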

     feature set              class (String)
[True False True   ...]        "A"
[True True  True   ...]        "B"
[True True  False   ...]       "C"

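When I train on this data with the random forest algorithm,

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Encode the string labels as integers; factor[1] keeps the original label names
factor = pd.factorize(classes)
classes = factor[0]

classifier = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
classifier.fit(x_train, classes)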

the classifier detects 97% of the classes correctly. When I run

classifier.predict_proba(sample1_feature_set)

it gives the relative probability of each class for sample1. For example:

 [0.80    0.05    0.15]
   ↓        ↓        ↓
  Prob.    Prob.    Prob.
   of       of       of
  "A"      "B"      "C" 
  for      for      for
sample1   sample1  sample1

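So when I add up the values in the list (0.80 + 0.05 + 0.15), the result is always 1. This shows that the evaluation is relative: the probability of one class affects the probability of the others.

I want to get independent probabilities of all classes for sample1, like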
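The sum-to-1 behaviour described above can be reproduced with a small sketch. The data here is synthetic and the labelling rule is made up purely for illustration; neither comes from the original post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the boolean feature set (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8)).astype(bool)
# Hypothetical labelling rule, used only so the forest has something to learn
y = np.where(X[:, 0] & X[:, 1], "A", np.where(X[:, 2], "B", "C"))

clf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, n_classes)
row_sums = proba.sum(axis=1)      # each row is a distribution over the classes
print(proba)
print(row_sums)
```

Every row of `proba` is a probability distribution over the classes, which is why raising one entry necessarily lowers the others.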

 [0.95    0.69    0.87]
   ↓        ↓        ↓
  Prob.    Prob.    Prob.
   of       of       of
  "A"      "B"      "C" 
  for      for      for
sample1   sample1  sample1

Sample1 is 95% class "A", 69% class "B" and 87% class "C". Do you know how I can do this?

Answer 1:

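predict_proba computes the probability of each class for a single sample. [0.95 0.05] means that in 95% of the model's decision trees the output for that sample is class 0, and in 5% it is class 1. (Strictly, scikit-learn averages the class probabilities predicted by each tree; when the leaves are pure, as with fully grown trees, this is the same as the share of votes.) So you are evaluating each sample one at a time.

When you do this: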

classifier.predict_proba(example_feature_set)[0]

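you are getting the probability of each class for the first sample in example_feature_set.

I think what you want is the precision or recall of each class. (If you are not familiar with these scores, look up what they mean.)

To compute them, I recommend the following code: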
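To make the indexing concrete, here is a minimal sketch of how the columns of `predict_proba` line up with class labels. The toy data is invented for illustration. It also relies on the fact that scikit-learn classifiers accept string labels directly, so the column order can be read from `classes_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy boolean data with string labels (illustrative only)
X = np.array(
    [[1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0]],
    dtype=bool,
)
y = np.array(["A", "B", "C", "A", "B", "C"])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X)               # shape (n_samples, n_classes)
first = proba[0]                           # probabilities for the first sample only
by_label = dict(zip(clf.classes_, first))  # column i belongs to clf.classes_[i]
print(by_label)
```

If the labels were encoded with `pd.factorize` as in the question, the second element of its return value plays the same role as `classes_` for mapping columns back to names.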

from sklearn.metrics import classification_report
# Assumes example_feature_set holds more than one sample and y_test holds their true labels
y_pred = classifier.predict(example_feature_set)
print(classification_report(y_test, y_pred))

You will then get some metrics that may help you.
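For readers who want the per-class numbers programmatically rather than as printed text, the following is a self-contained sketch on the iris dataset (chosen here only as a stand-in; it is not the asker's data). It uses the `output_dict=True` option of `classification_report`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(
    y_test, y_pred, target_names=iris.target_names, output_dict=True
)
for name in iris.target_names:
    print(name, report[name]["precision"], report[name]["recall"])
```

Note that precision and recall describe the classifier's behaviour per class over a whole test set; they are not per-sample probabilities.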

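Comments:

Thanks @Alex, but I want to get independent probabilities of all classes for each sample. I have edited the post.

Oh, sorry, I misunderstood your question. As far as I can see, you cannot get the probabilities you want. Let's see if someone else can help more :)

Answer 2: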

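Random forest is an ensemble method. Basically, it builds individual decision trees on different subsets of the data (this is called bagging) and averages the predictions across all trees to give you the probabilities. The help page is actually a good place to start:

In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, ...

Hence the probabilities always sum to 1. Here is an example of how you can access the individual predictions of each tree: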

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)
model.fit(X_train, y_train)

pred = model.predict_proba(X_test)
pred[:5,:]

array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1]])

This is the prediction of the first tree:

model.estimators_[0].predict(X_test)
Out[42]: 
array([1., 0., 2., 2., 1., 0., 1., 2., 2., 1., 2., 0., 0., 0., 0., 2., 2.,
       1., 1., 2., 0., 2., 0., 2., 2., 2., 2., 2., 0., 0., 0., 0., 1., 0.,
       0., 2., 1., 0., 0., 0., 2., 2., 1., 0., 0., 1., 1., 2., 1., 2.])

We tally the votes across all trees:

import numpy as np

# Count, for every test sample, how many trees vote for each class
result = np.zeros((len(X_test), 3))
for tree in model.estimators_:
    p = tree.predict(X_test).astype(int)
    result[np.arange(len(X_test)), p] += 1

result[:5,:]
Out[63]: 
array([[ 0., 10.,  0.],
       [10.,  0.,  0.],
       [ 0.,  0., 10.],
       [ 0.,  9.,  1.],
       [ 0.,  9.,  1.]])

Dividing this by the number of trees gives the probabilities you obtained earlier:

result/10
Out[65]: 
array([[0. , 1. , 0. ],
       [1. , 0. , 0. ],
       [0. , 0. , 1. ],
       [0. , 0.9, 0.1],
       [0. , 0.9, 0.1],
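The vote counting above can be cross-checked in a more direct way. As a sketch, assuming the same iris split and a fixed `random_state` (added here only for reproducibility), averaging each tree's own `predict_proba` should reproduce the forest's output, since that is how scikit-learn documents the forest's probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Stack every tree's class probabilities: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X_test) for tree in model.estimators_])
averaged = per_tree.mean(axis=0)       # mean over trees
forest = model.predict_proba(X_test)   # what the forest itself reports
print(np.allclose(averaged, forest))
```

Because each tree's output is itself a distribution over the classes, their mean is one too, so the rows cannot be made "independent" by post-processing the forest.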

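Comments:

Thanks @StupidWolf. I guess it is not possible to get independent probabilities of all classes for each sample with the random forest algorithm. OK, do you know any other suitable algorithm for that?

Logistic regression? You can get the log-odds. That is a separate question altogether, and it is really not clear what you want.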
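One possible reading of the logistic regression suggestion is a one-vs-rest setup, in which each class gets its own binary "this class or not" model. The sketch below uses iris as a stand-in and is an illustration under that assumption, not something the thread itself proposes in detail. It reads the probabilities from the individual binary models because `OneVsRestClassifier.predict_proba` renormalizes the rows to sum to 1 in the multiclass case.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)

# One binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Column k: P(class k vs. not class k) from the k-th binary model, left unnormalized
independent = np.column_stack(
    [est.predict_proba(X_test)[:, 1] for est in ovr.estimators_]
)
# Per-class decision values; for logistic regression these are log-odds
log_odds = ovr.decision_function(X_test)

print(independent[:5])
print(independent[:5].sum(axis=1))  # rows are not forced to sum to 1
```

Whether these per-class scores are what the question means by "independent probabilities" depends on the application; they answer "how strongly does this sample look like class k versus everything else", one class at a time.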
