Getting uncalibrated probability outputs with Vowpal Wabbit, ad-conversion prediction

Posted 2023-03-13

【Published】: 2016-10-08 17:17:51 【Question】:

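I am trying to use Vowpal Wabbit to predict the conversion rate of ad impressions, and I am getting non-intuitive probability outputs: they are centered around 36%, while the global frequency of the positive class is below 1%.

The positive/negative imbalance in my dataset is 1/100 (I have already undersampled the negative class), so I put a weight of 100 on the positive examples.

Negative examples are labeled -1 and positive examples 1. I shuffled the positive and negative examples with shuf so that online learning works properly.

Example lines from the vw file: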

1 100 'c4ac3440|i search_delay_log:3.58351893846 click_count_log:3.58351893846 banner_impression_count_log:3.98898404656 |c es i_type_2 xvertical_1_61 vertical_1 creat_size_728x90 retargeting
-1 1 'a4d25cf1|i search_delay_log:11.2825684591 click_count_log:11.2825684591 banner_impression_count_log:4.48863636973 |c br i_type_1 xvertical_1_960 vertical_1 creat_size_300x600 retargeting
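The two lines above follow the layout label, importance weight, a tag introduced by a single quote, then one `|namespace feature[:value] ...` group per namespace. As a minimal sketch of how such lines could be produced, the helper below is hypothetical (its name and argument shapes are assumptions for illustration, not part of the original question or of the vw tooling):

```python
def format_vw_line(label, weight, tag, namespaces):
    """Build one example line: label weight 'tag|ns feat[:value] ... |ns ...

    namespaces maps a namespace name to a dict of feature -> value,
    where a value of None means a categorical feature with no explicit value.
    """
    groups = []
    for ns, feats in namespaces.items():
        tokens = [name if value is None else f"{name}:{value}"
                  for name, value in feats.items()]
        groups.append(f"|{ns} " + " ".join(tokens))
    # The tag is attached directly to the first namespace bar, as in the lines above.
    return f"{label} {weight} '{tag}" + " ".join(groups)


example = format_vw_line(
    1, 100, "c4ac3440",
    {"i": {"click_count_log": 3.5},
     "c": {"es": None, "retargeting": None}},
)
print(example)
```

Positive examples would get the weight 100 and negatives the weight 1, matching the weighting described above.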

I now create a model from the training set with the following command:

vw -d impressions_rand.aa --loss_function logistic -c -k --passes 12 -f model.vw

Output:

final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.0000       11
0.510760 0.328374            2            2.0  -1.0000  -0.9449       11
0.387521 0.264282            4            4.0  -1.0000  -1.1825       11
1.765374 1.818883            8          107.0   1.0000  -1.7020       11
2.152669 2.444504           51          249.0   1.0000  -3.2953       11
1.289870 0.427071          201          498.0  -1.0000  -3.5498       11
0.878843 0.528943          588         1083.0   1.0000  -1.3394        9
0.852358 0.825872         1176         2166.0  -1.0000  -6.7918       11
0.871977 0.891597         2451         4332.0  -1.0000  -2.7031       11
0.689428 0.506878         4110         8664.0  -1.0000  -2.7525       11
0.638008 0.586589         8517        17328.0  -1.0000  -5.8017       11
0.580220 0.522713        17515        34741.0   1.0000   2.1519       11
0.526281 0.472343        35525        69482.0  -1.0000  -6.2931        9
0.497601 0.468921        71050       138964.0  -1.0000  -7.6245        9
0.479305 0.461008       143585       277928.0  -1.0000  -0.8296       11
0.443734 0.443734       288655       555856.0  -1.0000  -2.5795       11 h
0.438806 0.433925       578181      1111791.0   1.0000   0.8503       11 h

finished run
number of examples per pass = 216000
passes used = 5
weighted example sum = 2072475.000000
weighted label sum = -67475.000000
average loss = 0.432676 h
best constant = -0.065138
best constant's loss = 0.692617
total feature number = 11548690

Now I predict on the test set. --link logistic should transform the vw output into probabilities in the [0, 1] range.

vw -d impressions_rand.ab --link logistic -i model.vw -p preds_ab.txt

Output:

predictions = preds_ab.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
68.282379 68.282379            1            1.0  -1.0000   0.0001        9
38.748867 9.215355            2            2.0  -1.0000   0.0174       11
21.256140 3.763414            4            4.0  -1.0000   0.8345       11
11.685329 2.114518            8            8.0  -1.0000   0.3508       11
9.457854 7.230378           16           16.0  -1.0000   0.0069       11
7.371087 5.284320           32           32.0  -1.0000   0.3561       11
7.061980 6.752873           64           64.0  -1.0000   0.6549       11
5.423309 3.784638          128          128.0  -1.0000   0.2597       11
3.252394 1.725597          211          310.0   1.0000   0.7686       11
2.140099 1.052366          330          627.0   1.0000   0.7143       11
1.671550 1.203000          660         1254.0  -1.0000   0.8054       11
1.788466 1.905383         1320         2508.0  -1.0000   0.0676        9
1.508163 1.234410         2502         5076.0   1.0000   0.3921       11
1.282862 1.060063         5061        10209.0   1.0000   0.4258        9
1.119420 0.955977        11013        20418.0  -1.0000   0.6892       11
1.017911 0.916403        22323        40836.0  -1.0000   0.5301        9
0.888435 0.758960        42171        81672.0  -1.0000   0.3500       11
0.787709 0.686983        84243       163344.0  -1.0000   0.2360        9
0.703270 0.618831       170268       326688.0  -1.0000   0.5707       11

finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 397936.000000
weighted label sum = -12936.000000
average loss = 0.684043
best constant = -0.032508
best constant's loss = 0.998943
total feature number = 2216941

This writes a prediction file preds_ab.txt such as:

0.000095 7c14ae23
0.017367 3e9558bd
0.139393 6a1cd72f
0.834518 dfe76f6e
0.089810 2b88b547

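If I compute the ROC-AUC score of these predictions I get 0.85, which is close to the value I get with scikit-learn (0.90). However, the probability outputs are not calibrated at all: they are far higher than what I would expect (close to 1%). Here is the histogram.

Here is the reliability curve:

Here is a plot of the mean predicted probability and the frequency of positives when the examples are binned by probability:

It is clear that the output probabilities are much higher than what a well-calibrated classifier would produce.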
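The binned comparison described above can be sketched in a few lines. This is a minimal illustration, assuming equal-width bins and labels in {-1, 1}; the function name and the bin count are assumptions, not what the asker actually ran:

```python
def reliability_bins(probs, labels, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # a probability of exactly 1.0 goes in the last bin
        sums[b] += p
        positives[b] += 1 if y == 1 else 0
        counts[b] += 1
    return [(sums[b] / counts[b], positives[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b]]


# Toy data only: four predictions with labels in {-1, 1}.
for mean_p, pos_rate, n in reliability_bins([0.05, 0.05, 0.95, 0.95], [-1, -1, 1, -1]):
    print(f"mean predicted={mean_p:.2f}  observed positive rate={pos_rate:.2f}  n={n}")
```

For a well-calibrated model the first two numbers in each bin are close to each other. In this setting the probabilities would come from the first column of preds_ab.txt and the labels from the first token of each test line.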

What am I doing wrong here? What should I investigate?

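Update

If I do not use the weight of 100 on the positive-class examples, I get similarly non-intuitive results. The average probability output is 0.27 (far from the expected 1%), the reliability plot looks even worse, and the ROC-AUC is 0.76.

I can confirm that I have 237805 negative examples and 2195 positive examples.

Training output: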

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.0000       11
0.546724 0.400300            2            2.0  -1.0000  -0.7087       11
0.398553 0.250382            4            4.0  -1.0000  -1.3963       11
0.284506 0.170460            8            8.0  -1.0000  -2.2595       11
0.181406 0.078306           16           16.0  -1.0000  -2.8225       11
0.108136 0.034865           32           32.0  -1.0000  -4.2696       11
0.063156 0.018176           64           64.0  -1.0000  -4.7412       11
0.036415 0.009675          128          128.0  -1.0000  -4.2940       11
0.020325 0.004235          256          256.0  -1.0000  -5.9903       11
0.043248 0.066171          512          512.0  -1.0000  -5.5540       11
0.045276 0.047304         1024         1024.0  -1.0000  -4.7065       11
0.044606 0.043935         2048         2048.0  -1.0000  -6.6253       11
0.048938 0.053270         4096         4096.0  -1.0000  -5.9119       11
0.048711 0.048485         8192         8192.0  -1.0000  -2.3949       11
0.048157 0.047603        16384        16384.0  -1.0000  -9.6219       11
0.044306 0.040454        32768        32768.0  -1.0000  -8.8800       11
0.044029 0.043752        65536        65536.0  -1.0000  -5.9218        9
0.042739 0.041450       131072       131072.0  -1.0000  -3.8306       11
0.042986 0.042986       262144       262144.0  -1.0000  -6.0941       11 h
0.042321 0.041655       524288       524288.0  -1.0000  -4.0276       11 h
0.042654 0.042988      1048576      1048576.0  -1.0000  -9.9169       11 h

finished run
number of examples per pass = 216000
passes used = 7
weighted example sum = 1512000.000000
weighted label sum = -1484504.000000
average loss = 0.042763 h
best constant = -4.691161
best constant's loss = 0.051789
total feature number = 16166472

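The test output is below. I have read that an average loss greater than the best constant's loss indicates that something is wrong with how the model is learning.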

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
78.141266 78.141266            1            1.0  -1.0000   0.0001       11
54.228148 30.315029            2            2.0  -1.0000   0.0015       11
33.279501 12.330854            4            4.0   1.0000   0.0472       11
20.358767 7.438034            8            8.0  -1.0000   0.0527       11
15.780043 11.201319           16           16.0  -1.0000   0.1657       11
13.783271 11.786498           32           32.0  -1.0000   0.0012        9
9.318714 4.854158           64           64.0  -1.0000   0.7268       11
6.797651 4.276587          128          128.0  -1.0000   0.1404        9
4.674237 2.550824          256          256.0  -1.0000   0.0516       11
3.269198 1.864159          512          512.0  -1.0000   0.4092       11
2.153033 1.036868         1024         1024.0  -1.0000   0.0425       11
1.481920 0.810807         2048         2048.0  -1.0000   0.2792       11
1.005869 0.529817         4096         4096.0  -1.0000   0.2422       11
0.676574 0.347279         8192         8192.0  -1.0000   0.3003       11
0.452924 0.229274        16384        16384.0  -1.0000   0.2579       11
0.295262 0.137600        32768        32768.0  -1.0000   0.2833       11
0.191513 0.087763        65536        65536.0  -1.0000   0.2616        9
0.126758 0.062003       131072       131072.0  -1.0000   0.2670       11

finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.099565
best constant = -0.981009
best constant's loss = 0.037621
total feature number = 2217159

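【Comments】:

The first thing I would do to improve this result is avoid hash collisions: you have over 200k examples and about 10 times as many features (roughly 10 features per example). Keeping the default -b 18 (about 262k unique features) seems insufficient. Try -b 24 as a start. Does it improve the results?

Also: unless there is some serious irregularity that makes the positive labels appear together, there is no need to shuffle examples that arrive in their natural chronological order.

You should also use -t when testing, so that you do not keep training on the test data.

The updated test is really strange: as you noticed, the average loss is worse than the best constant's loss. What is the test loss if you omit --link=logistic? (It should stay the same, I would say, but I am not sure and cannot try it right now.)

This is unrelated to the question, but I think the downsampled class should be given a compensating weight.

【Answer 1】: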

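You say that the training set contains, on average, one positive example for every 100 negative examples. However, you put a weight of 100 on the positive examples, which is (almost) equivalent to repeating each positive example 100 times in the training set. With that weighting, the average predicted probability should be around 50%, so you should not be surprised that it is not around 1%.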
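The arithmetic behind the 50% figure can be written out in terms of odds. The sketch below is an illustration only: the second function is a standard odds rescaling that was not proposed in the thread, and it assumes the model is well calibrated with respect to the weighted training distribution. Both function names are hypothetical.

```python
def weighted_base_rate(base_rate, pos_weight):
    """Positive share of total weight once positives carry pos_weight each."""
    odds = base_rate / (1.0 - base_rate) * pos_weight
    return odds / (1.0 + odds)


def undo_positive_weight(p_weighted, pos_weight):
    """Map a probability learned under positive weighting back to the unweighted scale."""
    odds = p_weighted / (1.0 - p_weighted) / pos_weight
    return odds / (1.0 + odds)


# One positive per 100 negatives, positives weighted by 100.
print(weighted_base_rate(1 / 101, 100))   # close to 0.5
print(undo_positive_weight(0.5, 100))     # close to 1/101, i.e. about 1%
```

Weighting the positives multiplies the odds by the weight, so dividing the predicted odds by the same weight is one way to recover probabilities on the original scale.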

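Judging from the vw output you provided, the training set impressions_rand.aa seems to contain more than 100 negative examples per positive example, which is why the "weighted label sum" is negative (otherwise it would be around 0). As a result, the average predicted probability is not 50% but around 36%.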
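The weighted share of positives can be read off the summary lines of the training logs. The sketch below assumes labels in {-1, 1}, so that "weighted example sum" is the positive weight plus the negative weight and "weighted label sum" is their difference; the function names are hypothetical. Note that this yields the weighted base rate of the training stream, which need not equal the mean of the model's predictions.

```python
import math


def positive_weight_fraction(weighted_example_sum, weighted_label_sum):
    """Share of total weight carried by positives, for labels in {-1, 1}."""
    # example_sum = W_pos + W_neg and label_sum = W_pos - W_neg,
    # so W_pos = (example_sum + label_sum) / 2.
    return (weighted_example_sum + weighted_label_sum) / (2.0 * weighted_example_sum)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Values copied from the two training logs in the question.
weighted_run = positive_weight_fraction(2072475.0, -67475.0)      # about 0.484
unweighted_run = positive_weight_fraction(1512000.0, -1484504.0)  # about 0.0091

# The reported "best constant" values map to the same fractions through the sigmoid.
print(weighted_run, sigmoid(-0.065138))
print(unweighted_run, sigmoid(-4.691161))
```

In both runs the sigmoid of the reported best constant agrees with the weighted positive share, which is consistent with the weighted training stream being close to balanced in the first run and close to 1% positive in the second.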

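【Discussion】:

Thanks for your answer, but unfortunately leaving the positive examples unweighted also produces unreliable probabilities. I have updated the question with the training and test output obtained without weighting.

【Answer 2】: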

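Thanks to the comments from Martin Popel and arielf, I solved the problem. :)

At test time I was not passing:

-t

--loss_function logistic

As a result, the model was being updated during testing, using the default loss function rather than the logistic loss, which corrupted the model and produced wrong results.

Takeaways: at test time, always pass

--loss_function logistic

-t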
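The "best constant" lines in the logs give a quick way to see which loss was in effect. The sketch below assumes that vw reports the loss-minimizing constant prediction: under squared loss with labels in {-1, 1} that is the mean label, and under logistic loss it is the log-odds of the positive class. The function names are hypothetical; the input numbers are copied from the unweighted test logs.

```python
import math


def best_constant_squared(mean_label):
    """Minimizer of E[(y - c)^2] and its loss, for labels in {-1, 1}."""
    return mean_label, 1.0 - mean_label ** 2


def best_constant_logistic(pos_fraction):
    """Minimizer of E[log(1 + exp(-y * c))] and its loss (the label entropy)."""
    const = math.log(pos_fraction / (1.0 - pos_fraction))
    loss = -(pos_fraction * math.log(pos_fraction)
             + (1.0 - pos_fraction) * math.log(1.0 - pos_fraction))
    return const, loss


example_sum, label_sum = 207361.0, -203423.0  # unweighted test set summary
mean_label = label_sum / example_sum
pos_fraction = (example_sum + label_sum) / (2.0 * example_sum)

print(best_constant_squared(mean_label))     # compare with -0.981009 and 0.037621
print(best_constant_logistic(pos_fraction))  # compare with -4.647395 and 0.053670
```

The first pair matches the summary of the earlier test run made without --loss_function logistic, and the second pair matches the run that includes it, which is consistent with the earlier run having used squared loss.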

Here is what the test output looks like now (without example weighting):

$ vw -d impressions_rand.ab --link logistic --loss_function logistic -i model.vw -t -p preds.txt
only testing
predictions = preds.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000053 0.000053            1            1.0  -1.0000   0.0001       11
0.000370 0.000687            2            2.0  -1.0000   0.0007       11
1.252868 2.505366            4            4.0   1.0000   0.0067       11
0.638249 0.023630            8            8.0  -1.0000   0.0036       11
0.322060 0.005872           16           16.0  -1.0000   0.0031       11
0.164750 0.007439           32           32.0  -1.0000   0.0000        9
0.084911 0.005072           64           64.0  -1.0000   0.0081       11
0.076905 0.068899          128          128.0  -1.0000   0.0004        9
0.055126 0.033347          256          256.0  -1.0000   0.0000       11
0.052986 0.050847          512          512.0  -1.0000   0.0133       11
0.038351 0.023715         1024         1024.0  -1.0000   0.0000       11
0.037059 0.035767         2048         2048.0  -1.0000   0.0167       11
0.038848 0.040637         4096         4096.0  -1.0000   0.0112       11
0.038903 0.038957         8192         8192.0  -1.0000   0.0281       11
0.041625 0.044348        16384        16384.0  -1.0000   0.0001       11
0.042526 0.043426        32768        32768.0  -1.0000   0.0218       11
0.042538 0.042551        65536        65536.0  -1.0000   0.0000        9
0.042150 0.041763       131072       131072.0  -1.0000   0.0019       11

finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.042438
best constant = -4.647395
best constant's loss = 0.053670
total feature number = 2217159

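You can see that the reported average loss is now smaller than the best constant's loss, and that the progressive average losses are also within the expected range.

In addition, the output probabilities now make a lot of sense:

【Discussion】: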
