Search Ads - Imbalanced Data

Posted change_world


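[IJCAI-2018] Search Ads - Imbalanced Data

I am not good at competitions, not good at building features, not good at tuning parameters, and I have no server to run things in parallel. Everyone's baseline beats my model. I am writing this post mainly to share how I understand the data and the rough framework I have been thinking in, hoping it gives you a little inspiration or help.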

 

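I have no experience, no track record, and no teammates. My feature engineering stops at dummy variables, my dimensionality reduction stops at PCA, my models stop at LR and SVM, my tuning stops at CV, and my ensembling stops at averaging. My role in every competition is to enlarge the denominator. So when people shared their baselines on the forum I was thrilled, and when I saw them build all kinds of clever features that actually improved the model's logloss, I was truly impressed.

Having neither a clever head nor enough patience for detail, I went with plain borrowing: I copied a baseline from the forum and ran it on my machine. Wow: 0.084 on the first try, only 0.004 behind the top-ranked 0.080. I was about to storm the leaderboard! The excitement gave me even more energy. I hunted for models, hammered out code, felt like I had my own BGM and the wind at my back, as if the whole world belonged to me. But then I wanted to see which samples were predicted as 1, and I was stunned: no one! Out of my 18,000 rows of test data, the model predicted not a single 1. Not one!

And that is the problem. All you clever people probably knew about the class imbalance in CTR data long ago, but dim-witted me had not noticed it at all!

Rant over.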

 

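So how should we handle imbalanced data, such as the binary classification in this CTR task?

The two classes are severely imbalanced here: the ratio of negative to positive samples reaches about 50:1. If we predict directly on data like this, recall for the smaller class will be extremely low.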
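To make the point concrete, here is a minimal sketch, assuming hypothetical toy labels at the 50:1 ratio (not the competition data). It scores a "model" that ignores every feature and always outputs the positive rate. The helper `binary_log_loss` is written out for illustration and is not taken from the original code.

```python
import math


def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss, clipping probabilities away from 0 and 1."""
    total = 0.0
    for label, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical toy labels at a 50:1 ratio: 5000 negatives, 100 positives.
y_true = [0] * 5000 + [1] * 100

# A "model" that ignores the features and always outputs the positive rate.
base_rate = sum(y_true) / len(y_true)
constant_probs = [base_rate] * len(y_true)

constant_loss = binary_log_loss(y_true, constant_probs)
# Count how many samples would be labeled 1 at the usual 0.5 threshold.
predicted_positives = sum(1 for p in constant_probs if p > 0.5)

print(f"base rate           = {base_rate:.4f}")
print(f"constant-model loss = {constant_loss:.4f}")
print(f"predicted positives = {predicted_positives}")
```

On this toy data the constant model gets a logloss a little under 0.1 while predicting zero positives, so a logloss in the 0.08 range says little about whether the minority class is being found.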

 

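This happens because conventional learning methods aim to maximize overall classification accuracy (that is, to minimize overall error) and treat every sample the same. The resulting classifier is accurate on the majority class and poor on the minority class. Take the 50:1 CTR example: an algorithm that predicts the majority class for every sample still reaches about 98% accuracy (50/51). Conventional learning algorithms are therefore quite limited on imbalanced datasets. Their predictions favor the majority: the minority is small and is treated the same as everything else, so the cost of missing the minority is tiny, and the result is a model that favors the majority.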
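The 50/51 arithmetic can be checked with a short sketch. The labels below are hypothetical toy data at the 50:1 ratio, and `accuracy_and_recall` is an illustrative helper, not part of the original code.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on class 1) for binary labels in {0, 1}."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    positives = sum(y_true)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = correct / len(y_true)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


# Hypothetical toy labels at a 50:1 ratio: 5000 negatives, 100 positives.
y_true = [0] * 5000 + [1] * 100

# A classifier that always predicts the majority class.
y_all_negative = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_all_negative)
print(f"accuracy = {accuracy:.4f}, minority recall = {recall:.4f}")
```

Accuracy comes out at 50/51 while recall on the minority class is zero, which is why accuracy alone is a poor yardstick here.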

 

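The solutions fall into two broad groups.

The first group works at the data level, mainly through sampling. Since the samples are imbalanced, we can resample with some strategy so that the data become more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two. Over-sampling increases the number of minority samples; under-sampling decreases the number of majority samples.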
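As a rough illustration of the two resampling directions, the sketch below implements plain random under-sampling and random over-sampling with the standard library only. It assumes binary labels where 1 is the minority class; the function names and toy data are made up for this example.

```python
import random


def split_indices(y):
    """Return (minority_indices, majority_indices), assuming label 1 is the minority."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    return minority, majority


def random_undersample(X, y, seed=0):
    """Keep every minority row plus an equal-sized random subset of majority rows."""
    rng = random.Random(seed)
    minority, majority = split_indices(y)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def random_oversample(X, y, seed=0):
    """Keep every row and repeat random minority rows until the classes match."""
    rng = random.Random(seed)
    minority, majority = split_indices(y)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    kept = majority + minority + extra
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data at a 50:1 ratio: 500 negatives, 10 positives.
X = [[i] for i in range(510)]
y = [1 if i < 10 else 0 for i in range(510)]

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(len(y_under), sum(y_under))  # total rows, positive rows after under-sampling
print(len(y_over), sum(y_over))    # total rows, positive rows after over-sampling
```

Under-sampling throws away most of the majority rows, while over-sampling only repeats existing minority rows, so neither adds new information. The imbalanced-learn package, which the listing in this post already imports, provides ready-made samplers for real use.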

 

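The second group works at the algorithm level: take into account that different kinds of misclassification carry different costs, and optimize the algorithm accordingly, so that it also performs well on imbalanced data. In practice this means rewriting the cost function to give a large cost to misclassifying minority labels.

PS: the attachment contains Python code that compares logloss and AUC. It runs, and it does not hit a memory error.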
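One simple way to picture such a cost function is a class-weighted log loss. The sketch below is an illustration, not the loss used inside any particular library. The weights 1 for positives and 0.02 for negatives mirror the sample weights used in the listing later in this post; the toy labels and predictions are hypothetical.

```python
import math


def weighted_log_loss(y_true, probs, pos_weight=1.0, neg_weight=1.0, eps=1e-15):
    """Weighted mean binary log loss; each sample counts by its class weight."""
    total = 0.0
    weight_sum = 0.0
    for label, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        weight = pos_weight if label == 1 else neg_weight
        total -= weight * (label * math.log(p) + (1 - label) * math.log(1.0 - p))
        weight_sum += weight
    return total / weight_sum


# Hypothetical toy data: 50 negatives, 1 positive, and a model that
# gives every sample the same low probability of being positive.
y_true = [0] * 50 + [1]
always_low = [0.02] * len(y_true)

plain = weighted_log_loss(y_true, always_low)
reweighted = weighted_log_loss(y_true, always_low, pos_weight=1.0, neg_weight=0.02)
print(f"plain loss      = {plain:.4f}")
print(f"reweighted loss = {reweighted:.4f}")
```

The same predictions look cheap under the plain loss and expensive once negatives are down-weighted. That extra pressure is what pushes a weighted model to stop ignoring the minority class.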
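Since the listing below compares logloss with AUC, it may help to recall what AUC measures: the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it by brute-force pair counting on hypothetical toy scores; `pairwise_auc` is an illustrative helper, and the listing itself uses scikit-learn's `roc_curve` and `auc`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic time, for illustration only.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: the one positive is ranked highest,
# yet every score is far below the 0.5 decision threshold.
y_true = [0, 0, 0, 0, 1]
scores = [0.01, 0.02, 0.03, 0.04, 0.10]

ranking_auc = pairwise_auc(y_true, scores)
constant_auc = pairwise_auc(y_true, [0.02] * len(y_true))
predicted_positives = sum(1 for s in scores if s > 0.5)
print(ranking_auc, constant_auc, predicted_positives)
```

AUC depends only on the ordering of the scores, so it can separate a model that ranks positives well from one that outputs a constant, even when neither predicts a single 1 at the 0.5 threshold.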


 

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : [email protected]

What does this script do?
    Some ideas for improving the accuracy of imbalanced data classification.
Data characteristics:
    imbalanced data.
The models:
    model_baseline  : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other notes:
Besides the basic features, the features also include click-count statistics
for each user within the current hour and the current day, plus the current
hour itself.
    context_day, context_hour,
    user_query_day, user_query_hour, user_query_day_hour,
non_feat = [
    'instance_id', 'user_id', 'context_id', 'item_category_list',
    'item_property_list', 'predict_category_property',
    'context_timestamp', 'TagTime', 'context_day'
]
"""

import time
import itertools

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, roc_curve
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

try:
    from scipy import interp  # not available in recent SciPy releases
except ImportError:
    interp = np.interp  # same call signature: interp(x, xp, fp)


def read_bigcsv(filename, **kw):
    """Read a large CSV file chunk by chunk, then concatenate the chunks."""
    with open(filename) as rf:
        reader = pd.read_csv(rf, **kw, iterator=True)
        chunk_size = 100000
        chunks = []
        while True:
            try:
                chunk = reader.get_chunk(chunk_size)
                chunks.append(chunk)
            except StopIteration:
                print("Iteration is stopped.")
                break
        df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
    return df


def timestamp2datetime(value):
    """Convert a Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' string (local time)."""
    value = time.localtime(value)
    dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
    return dt


'''
from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The plot shows that the timestamps are not sorted; some are out of order.
# For an online model, the data must be sorted by time first.
# aa = data[data['user_id']==24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

c2 = data_train[data_train.duplicated(subset=None, keep=False)]

Verified: instance_id contains duplicates.
a3 = Xdata[Xdata['instance_id']==1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. convert timestamp to datetime.
    2. no sort, no reindex.
    data.duplicated(subset=None, keep='first')
    TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
    user_query_day, user_query_day_hour, hour,
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    # data['TagTime'][0], data['TagTime'][len(data) - 1]
    # x = data['TagTime'][len(data) - 1]
    # Day of month and hour of day, sliced out of the formatted string.
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # Rows per user per day.
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, how='left',
                    on=['user_id', 'context_day'])
    # Rows per user per hour of day (summed over all days).
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, how='left',
                    on=['user_id', 'context_hour'])
    # Rows per user per day and hour.
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, how='left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting normalize=True.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]  # 18,19,20,21,22,23,24
    test = data.loc[data['context_day'] == 24]  # for now, use day 24 as the validation set
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
    ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test
# x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
    ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6,)
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names,)
    prob = clf.predict_proba(x_test,)[:, 1]
    # Scores are rounded to two decimals, so tiny probabilities become 0.0.
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # Refit with sample weights chosen according to the labels.
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test,)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # select the figure
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('fig1')  # select the figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    # Plain bagging versus bagging with balanced bootstrap samples.
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # balanced_bagging
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # select the figure
    name = 'base_Balanced_Bagging'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # select the figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        'metric': 'multi_logloss',  # LightGBM's log-loss metric for multiclass
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        # n_estimators and num_boost_round are aliases for the same setting,
        # so only one of the two values takes effect.
        'n_estimators': 85,
        'num_leaves': 63,
        'feature_fraction': 0.8,
        'num_boost_round': 160,
    }
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    # With the multiclass objective, predict returns one column per class.
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
             (x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    # Retrain with sample weights chosen according to the labels.
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, valid_sets=lgb_eval)
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # select the figure
    name = 'base_lgb_weighted'
    plt.plot(
        mean_fpr, mean_tpr, linestyle='--',
        label='{} (area = %0.2f, logloss = %0.2f)'.format(name) %
        (x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix basemodel')
    plt.figure('weighted')  # select the figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
1. logloss VS AUC
Although the baseline logloss of 0.0819 is indeed very small, the confusion
matrix shows that y_pred is all 0: the model tends to put every sample into
the majority class. Adding weights makes it slightly better?

AUC is only 0.64~0.67.
An AUC this low seems wrong at first, so why does it happen?
Because the labels are extremely imbalanced: only about 2% of them are 1,
a ratio of 50:1.
AUC is a friendlier measure of classification performance on imbalanced
data, and selecting features by AUC may give better results.
This is only a rough direction for thinking about improvements.

2. handling imbalanced data:
    1. resampling, over- or under-,
       over- is increasing # of minority, under- is decreasing # of majority.
    2. revalue the loss function by giving large loss of misclassifying the
       minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')

 
