2022 CCF BDCI Competition: Predicting People Returning to Their Hometowns for Work | Applying StratifiedKFold and LightGBM

Posted 2022-12-07

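I. Competition Overview

1. Background

In recent years China's economy has grown rapidly. As the pressure of living in first-tier cities keeps rising, more and more young people are choosing to return to their hometowns to build their careers. China Unicom's big data platform now comprises more than 3,000 tags in nine major categories, can collect and process data on the order of hundreds of billions of records per day, and has petabyte-scale storage capable of providing full-sample data for 400 million users. China Unicom has also continued to research innovative big data applications and service models, contributing considerable experience and guidance to the data industry. Its big data business has ranked first among telecom operators for many consecutive years.

2. Task

Building on China Unicom's big data capabilities, the task is to model China Unicom's signaling data, call data, internet behavior data, and similar sources in order to predict whether an individual will return to their hometown to work.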

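II. Approach

The solution relies on two tools: StratifiedKFold from scikit-learn and LightGBM.

StratifiedKFold (for splitting the dataset): unlike the KFold familiar from machine learning courses, which cuts the data into n_splits folds without looking at the labels, StratifiedKFold splits so that the class distribution in each training and validation set stays as close as possible to that of the full dataset. It takes the same parameters as KFold: n_splits sets the number of folds, shuffle controls whether the data are shuffled before splitting, and random_state fixes the shuffling seed so that the split is reproducible.
LightGBM (for model training): GBDT (Gradient Boosting Decision Tree) is a long-standing workhorse of machine learning. Its core idea is to train weak learners (decision trees) iteratively, with each new tree correcting the errors of the ensemble built so far; it typically fits well and, with suitable regularization, resists overfitting. LightGBM (Light Gradient Boosting Machine) is a framework that implements GBDT. It supports efficient parallel training and offers faster training, lower memory usage, better accuracy, and distributed training for quickly processing very large datasets.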
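To make the boosting idea concrete, here is a minimal sketch of gradient boosting written from scratch. It is not how LightGBM is implemented: it assumes a one-dimensional regression problem with squared loss and uses depth-1 trees (stumps), whereas the competition model is a binary classifier. The function names and the toy data are made up for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the threshold split on 1-D x that best fits residual (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def fit_gbdt(x, y, n_rounds=50, learning_rate=0.1):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Toy data: a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, fitted = fit_gbdt(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

Each round shrinks the training error a little, which is the "iteratively train weak learners" idea described above; production libraries add histogram-based split finding, regularization, and other loss functions on top of it.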

III. Implementation

1. Import libraries

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

LightGBM: used for model training

StratifiedKFold: used for splitting the dataset
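To see what stratified splitting does in practice, the sketch below compares the positive-class rate of each validation fold under KFold and StratifiedKFold. The labels are a toy array with a 10% positive rate, not the competition data, and the helper name is made up.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 negatives followed by 10 positives (10% positive rate).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features


def positive_rates(splitter):
    """Return the positive-class rate of each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


kfold_rates = positive_rates(KFold(n_splits=5, shuffle=True, random_state=4000))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=4000))

print("KFold:          ", kfold_rates)
print("StratifiedKFold:", stratified_rates)
```

With StratifiedKFold every validation fold of 20 rows holds exactly 2 positives, so each rate equals the overall 10%. Plain KFold gives no such guarantee, and its per-fold rates depend on how the shuffle happens to fall.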

2. Load the data

train = pd.read_csv('E:/ccf返乡发展人员预测/dataTrain.csv')
test = pd.read_csv('E:/ccf返乡发展人员预测/dataA.csv')
# f3 is a categorical text column; encode it as ordered integers.
train['f3'] = train['f3'].map({'low': 1, 'mid': 2, 'high': 3})
test['f3'] = test['f3'].map({'low': 1, 'mid': 2, 'high': 3})
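One thing worth knowing about encoding a column with Series.map and a dict: any value that is missing from the dict becomes NaN rather than raising an error. The small sketch below shows this on a toy series; the 'unknown' value is hypothetical and is included only to trigger that behavior.

```python
import pandas as pd

# Toy column; 'unknown' is a made-up value that the mapping does not cover.
levels = pd.Series(['low', 'mid', 'high', 'unknown'])
mapping = {'low': 1, 'mid': 2, 'high': 3}

encoded = levels.map(mapping)
print(encoded.tolist())

# Values missing from the mapping become NaN, so it is worth checking for them.
unmapped = levels[encoded.isna()].unique().tolist()
print("unmapped categories:", unmapped)
```

Running the same check on the real f3 column before training is a cheap way to confirm that the three categories cover every row.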

3. Build additional features

# For every pair of columns, add the result of each operator as a new feature.
# The bitwise operators (&, |, ^) require integer columns, and division
# produces inf or NaN wherever the denominator is 0.
loc_f = ['f1', 'f2', 'f4', 'f5', 'f6']
for i in range(len(loc_f)):
    for j in range(i + 1, len(loc_f)):
        train[f'{loc_f[i]}+{loc_f[j]}'] = train[loc_f[i]] + train[loc_f[j]]
        train[f'{loc_f[i]}-{loc_f[j]}'] = train[loc_f[i]] - train[loc_f[j]]
        train[f'{loc_f[i]}*{loc_f[j]}'] = train[loc_f[i]] * train[loc_f[j]]
        train[f'{loc_f[i]}/{loc_f[j]}'] = train[loc_f[i]] / train[loc_f[j]]
        train[f'{loc_f[i]}&{loc_f[j]}'] = train[loc_f[i]] & train[loc_f[j]]
        train[f'{loc_f[i]}|{loc_f[j]}'] = train[loc_f[i]] | train[loc_f[j]]
        train[f'{loc_f[i]}^{loc_f[j]}'] = train[loc_f[i]] ^ train[loc_f[j]]
        test[f'{loc_f[i]}+{loc_f[j]}'] = test[loc_f[i]] + test[loc_f[j]]
        test[f'{loc_f[i]}-{loc_f[j]}'] = test[loc_f[i]] - test[loc_f[j]]
        test[f'{loc_f[i]}*{loc_f[j]}'] = test[loc_f[i]] * test[loc_f[j]]
        test[f'{loc_f[i]}/{loc_f[j]}'] = test[loc_f[i]] / test[loc_f[j]]
        test[f'{loc_f[i]}&{loc_f[j]}'] = test[loc_f[i]] & test[loc_f[j]]
        test[f'{loc_f[i]}|{loc_f[j]}'] = test[loc_f[i]] | test[loc_f[j]]
        test[f'{loc_f[i]}^{loc_f[j]}'] = test[loc_f[i]] ^ test[loc_f[j]]
com_f = ['f43', 'f44', 'f45', 'f46']
for i in range(len(com_f)):
    for j in range(i + 1, len(com_f)):
        train[f'{com_f[i]}+{com_f[j]}'] = train[com_f[i]] + train[com_f[j]]
        train[f'{com_f[i]}-{com_f[j]}'] = train[com_f[i]] - train[com_f[j]]
        train[f'{com_f[i]}*{com_f[j]}'] = train[com_f[i]] * train[com_f[j]]
        train[f'{com_f[i]}/{com_f[j]}'] = train[com_f[i]] / train[com_f[j]]
        train[f'{com_f[i]}&{com_f[j]}'] = train[com_f[i]] & train[com_f[j]]
        train[f'{com_f[i]}|{com_f[j]}'] = train[com_f[i]] | train[com_f[j]]
        train[f'{com_f[i]}^{com_f[j]}'] = train[com_f[i]] ^ train[com_f[j]]
        test[f'{com_f[i]}+{com_f[j]}'] = test[com_f[i]] + test[com_f[j]]
        test[f'{com_f[i]}-{com_f[j]}'] = test[com_f[i]] - test[com_f[j]]
        test[f'{com_f[i]}*{com_f[j]}'] = test[com_f[i]] * test[com_f[j]]
        test[f'{com_f[i]}/{com_f[j]}'] = test[com_f[i]] / test[com_f[j]]
        test[f'{com_f[i]}&{com_f[j]}'] = test[com_f[i]] & test[com_f[j]]
        test[f'{com_f[i]}|{com_f[j]}'] = test[com_f[i]] | test[com_f[j]]
        test[f'{com_f[i]}^{com_f[j]}'] = test[com_f[i]] ^ test[com_f[j]]

Optimization 1: combining pairs of features with arithmetic and bitwise operators builds additional features, which gives the model more inputs to learn from and improves accuracy.
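The same pairwise construction can be written more compactly with itertools.combinations and the operator module. The sketch below is one possible refactoring, not the author's code: the helper name add_pairwise_features and the toy values are made up, and only the column names mirror the article.

```python
import operator
from itertools import combinations

import pandas as pd

# Operators applied to every pair of columns, keyed by the symbol used in the new column name.
OPS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
    '&': operator.and_,
    '|': operator.or_,
    '^': operator.xor,
}


def add_pairwise_features(df, columns):
    """Add one new column per (column pair, operator); bitwise ops need integer columns."""
    for a, b in combinations(columns, 2):
        for symbol, op in OPS.items():
            df[f'{a}{symbol}{b}'] = op(df[a], df[b])
    return df


# Toy frame; the values are made up, and f4 contains a 0 to show division by zero.
toy = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'f4': [7, 8, 0]})
toy = add_pairwise_features(toy, ['f1', 'f2', 'f4'])
print(toy.shape)
print(toy[['f1+f2', 'f1&f2', 'f1/f4']])
```

Three columns give three pairs and seven operators each, so 21 new columns are added. Calling the helper once for train and once for test with the same column list keeps the two frames aligned. Note the inf in f1/f4 where the denominator is 0; whether to leave, clip, or replace such values is a modelling decision.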

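4. Remove noisy data

# Keep only the first 50,000 rows of the training set.
train = train.iloc[:50000]

Optimization 2: testing showed that the last 10,000 rows of the training set are noisy, so they are dropped and only the first 50,000 rows are kept.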
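The article does not describe how the noisy rows were identified. One way such a check could be run, shown here purely as an assumption, is an ablation: hold out some trusted rows, train once with the suspect block and once without it, and compare validation AUC. The sketch uses synthetic data with a deliberately corrupted tail and a scikit-learn decision tree in place of LightGBM.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set: 6,000 rows whose label depends on two features.
n_rows, n_clean = 6000, 5000
X = rng.normal(size=(n_rows, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
# Simulate a noisy tail: the last 1,000 labels are replaced with coin flips.
y[n_clean:] = rng.integers(0, 2, size=n_rows - n_clean)

# Hold out part of the trusted rows for validation; the suspect tail is never validated on.
val_idx = np.arange(0, 1000)
clean_idx = np.arange(1000, n_clean)
all_idx = np.arange(1000, n_rows)


def holdout_auc(train_idx):
    """Fit on train_idx and score AUC on the clean hold-out rows."""
    model = DecisionTreeClassifier(max_depth=8, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1])


auc_with_tail = holdout_auc(all_idx)
auc_without_tail = holdout_auc(clean_idx)
print(f"AUC with suspect rows:    {auc_with_tail:.4f}")
print(f"AUC without suspect rows: {auc_without_tail:.4f}")
```

If the score without the suspect block is consistently higher across several seeds or folds, that supports dropping it. A single run on its own is weak evidence.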

5. Train with StratifiedKFold

features = [i for i in train.columns if i not in ['label', 'id']]
y = train['label']
KF = StratifiedKFold(n_splits=5, random_state=4000, shuffle=True)
feat_imp_df = pd.DataFrame({'feat': features, 'imp': 0.0})
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': 30,
    'learning_rate': 0.05,
    'num_leaves': 64,
    'max_depth': 8,
    'tree_learner': 'serial',
    'subsample_freq': 1,
    'subsample': 0.9,
    'max_bin': 255,
    'verbose': -1,
    'seed': 4000,
    'bagging_seed': 4000,
    'feature_fraction_seed': 4000,
}
# At most 3000 boosting rounds; stop early after 300 rounds without improvement in validation AUC.
num_round = 3000

oof_lgb = np.zeros(len(train))          # out-of-fold predictions on the training set
predictions_lgb = np.zeros(len(test))   # test predictions averaged over the folds
for fold_, (trn_idx, val_idx) in enumerate(KF.split(train.values, y.values)):
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=y.iloc[val_idx])
    clf = lgb.train(
        params,
        trn_data,
        num_round,
        valid_sets=[trn_data, val_data],
        # Callbacks take the place of the verbose_eval / early_stopping_rounds arguments,
        # which LightGBM 4.0 removed from lgb.train.
        callbacks=[lgb.early_stopping(stopping_rounds=300), lgb.log_evaluation(period=100)],
    )

    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    predictions_lgb[:] += clf.predict(test[features], num_iteration=clf.best_iteration) / KF.n_splits
    feat_imp_df['imp'] += clf.feature_importance() / KF.n_splits

The advantage of training with StratifiedKFold rather than KFold is that the class distribution in each training and validation split stays as close as possible to that of the original dataset.

6. Save the results to a CSV file

test['label'] = predictions_lgb
test[['id', 'label']].to_csv('E:/ccf返乡发展人员预测/submission.csv', index=False)
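Before uploading, it can help to sanity-check the frame that is about to be written. The sketch below is a hypothetical helper, not part of the original solution; it assumes the expected layout is exactly the two columns id and label, with label holding a probability, which is what the code above produces. The ids and scores are made up, and the CSV is written to an in-memory buffer instead of disk.

```python
import io

import numpy as np
import pandas as pd


def check_submission(submission, expected_ids):
    """Raise ValueError if the submission frame looks malformed."""
    if list(submission.columns) != ['id', 'label']:
        raise ValueError("columns must be exactly ['id', 'label']")
    if submission['id'].tolist() != list(expected_ids):
        raise ValueError("ids do not match the test set")
    if submission['label'].isna().any():
        raise ValueError("label contains missing values")
    if not submission['label'].between(0, 1).all():
        raise ValueError("label must be a probability in [0, 1]")


# Toy test set and predictions; ids and scores are made up.
test = pd.DataFrame({'id': [101, 102, 103]})
test['label'] = np.array([0.12, 0.87, 0.5])

submission = test[['id', 'label']]
check_submission(submission, expected_ids=[101, 102, 103])

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Passing index=False matters here: without it pandas writes the row index as an extra unnamed first column.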

7. Full code

# Import libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

# Load the data
train = pd.read_csv('E:/ccf返乡发展人员预测/dataTrain.csv')
test = pd.read_csv('E:/ccf返乡发展人员预测/dataA.csv')
train['f3'] = train['f3'].map({'low': 1, 'mid': 2, 'high': 3})
test['f3'] = test['f3'].map({'low': 1, 'mid': 2, 'high': 3})

# Build additional features (bitwise operators require integer columns)
loc_f = ['f1', 'f2', 'f4', 'f5', 'f6']
for i in range(len(loc_f)):
    for j in range(i + 1, len(loc_f)):
        train[f'{loc_f[i]}+{loc_f[j]}'] = train[loc_f[i]] + train[loc_f[j]]
        train[f'{loc_f[i]}-{loc_f[j]}'] = train[loc_f[i]] - train[loc_f[j]]
        train[f'{loc_f[i]}*{loc_f[j]}'] = train[loc_f[i]] * train[loc_f[j]]
        train[f'{loc_f[i]}/{loc_f[j]}'] = train[loc_f[i]] / train[loc_f[j]]
        train[f'{loc_f[i]}&{loc_f[j]}'] = train[loc_f[i]] & train[loc_f[j]]
        train[f'{loc_f[i]}|{loc_f[j]}'] = train[loc_f[i]] | train[loc_f[j]]
        train[f'{loc_f[i]}^{loc_f[j]}'] = train[loc_f[i]] ^ train[loc_f[j]]
        test[f'{loc_f[i]}+{loc_f[j]}'] = test[loc_f[i]] + test[loc_f[j]]
        test[f'{loc_f[i]}-{loc_f[j]}'] = test[loc_f[i]] - test[loc_f[j]]
        test[f'{loc_f[i]}*{loc_f[j]}'] = test[loc_f[i]] * test[loc_f[j]]
        test[f'{loc_f[i]}/{loc_f[j]}'] = test[loc_f[i]] / test[loc_f[j]]
        test[f'{loc_f[i]}&{loc_f[j]}'] = test[loc_f[i]] & test[loc_f[j]]
        test[f'{loc_f[i]}|{loc_f[j]}'] = test[loc_f[i]] | test[loc_f[j]]
        test[f'{loc_f[i]}^{loc_f[j]}'] = test[loc_f[i]] ^ test[loc_f[j]]
com_f = ['f43', 'f44', 'f45', 'f46']
for i in range(len(com_f)):
    for j in range(i + 1, len(com_f)):
        train[f'{com_f[i]}+{com_f[j]}'] = train[com_f[i]] + train[com_f[j]]
        train[f'{com_f[i]}-{com_f[j]}'] = train[com_f[i]] - train[com_f[j]]
        train[f'{com_f[i]}*{com_f[j]}'] = train[com_f[i]] * train[com_f[j]]
        train[f'{com_f[i]}/{com_f[j]}'] = train[com_f[i]] / train[com_f[j]]
        train[f'{com_f[i]}&{com_f[j]}'] = train[com_f[i]] & train[com_f[j]]
        train[f'{com_f[i]}|{com_f[j]}'] = train[com_f[i]] | train[com_f[j]]
        train[f'{com_f[i]}^{com_f[j]}'] = train[com_f[i]] ^ train[com_f[j]]
        test[f'{com_f[i]}+{com_f[j]}'] = test[com_f[i]] + test[com_f[j]]
        test[f'{com_f[i]}-{com_f[j]}'] = test[com_f[i]] - test[com_f[j]]
        test[f'{com_f[i]}*{com_f[j]}'] = test[com_f[i]] * test[com_f[j]]
        test[f'{com_f[i]}/{com_f[j]}'] = test[com_f[i]] / test[com_f[j]]
        test[f'{com_f[i]}&{com_f[j]}'] = test[com_f[i]] & test[com_f[j]]
        test[f'{com_f[i]}|{com_f[j]}'] = test[com_f[i]] | test[com_f[j]]
        test[f'{com_f[i]}^{com_f[j]}'] = test[com_f[i]] ^ test[com_f[j]]

# Drop the noisy rows and keep only the first 50,000 rows.
train = train.iloc[:50000]

# Train with StratifiedKFold
features = [i for i in train.columns if i not in ['label', 'id']]
y = train['label']
KF = StratifiedKFold(n_splits=5, random_state=4000, shuffle=True)
feat_imp_df = pd.DataFrame({'feat': features, 'imp': 0.0})
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': 30,
    'learning_rate': 0.05,
    'num_leaves': 64,
    'max_depth': 8,
    'tree_learner': 'serial',
    'subsample_freq': 1,
    'subsample': 0.9,
    'max_bin': 255,
    'verbose': -1,
    'seed': 4000,
    'bagging_seed': 4000,
    'feature_fraction_seed': 4000,
}
# At most 3000 boosting rounds; stop early after 300 rounds without improvement in validation AUC.
num_round = 3000

oof_lgb = np.zeros(len(train))          # out-of-fold predictions on the training set
predictions_lgb = np.zeros(len(test))   # test predictions averaged over the folds
for fold_, (trn_idx, val_idx) in enumerate(KF.split(train.values, y.values)):
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=y.iloc[val_idx])
    clf = lgb.train(
        params,
        trn_data,
        num_round,
        valid_sets=[trn_data, val_data],
        # Callbacks take the place of the verbose_eval / early_stopping_rounds arguments,
        # which LightGBM 4.0 removed from lgb.train.
        callbacks=[lgb.early_stopping(stopping_rounds=300), lgb.log_evaluation(period=100)],
    )

    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    predictions_lgb[:] += clf.predict(test[features], num_iteration=clf.best_iteration) / KF.n_splits
    feat_imp_df['imp'] += clf.feature_importance() / KF.n_splits

# Save the results to a CSV file
test['label'] = predictions_lgb
test[['id', 'label']].to_csv('E:/ccf返乡发展人员预测/submission.csv', index=False)

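IV. Results and Summary

After gathering background material and preprocessing the data, the team split the dataset with StratifiedKFold, trained the model with LightGBM, generated the prediction file, and submitted the results. We then analyzed the results and improved the solution as far as we could.
Optimizations:

Combine features with operators to build additional features, which gives the model more inputs to learn from and improves accuracy.
Testing showed that the last 10,000 rows of the training set are noisy, so they were dropped and only the first 50,000 rows were kept.

One person may slack off, but for the team you have to work hard.
Hope this helps~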
