Untitled
Posted by 尤尔小屋的猫
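User Profiling: A Hands-On Case Study
Competition background
Using transaction data between the payment brand Elo and its merchants, build a machine learning model that captures the customer lifecycle and reveals each customer's shopping preferences and loyalty.
Competition data
- train.csv and test.csv: the training and test sets
- sample_submission.csv: a sample submission file in the correct format; it contains every card_id that participants must predict
- historical_transactions.csv: historical transactions for each card_id at the given merchants; up to 3 months per card
- merchants.csv: additional information on all merchants in the dataset
- new_merchant_transactions.csv: each card's purchases at new merchants; up to 2 months per card
- Data_Dictionary.xlsx: the data dictionary; explains the fields of the tables above
Task
Train a model on the data above to predict a loyalty score for each card_id in the test set.
Evaluation metric
Submissions are scored by root mean squared error (RMSE):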
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
- $y_i$: the true loyalty score of credit card $i$
- $\hat{y}_i$: the model's predicted loyalty score for credit card $i$
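The metric can be checked locally before submitting. The following is a minimal sketch of the RMSE formula in NumPy; the function name `rmse` and the toy values are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted loyalty scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Mean of squared residuals, then square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, made up for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
score = rmse(y_true, y_pred)  # sqrt((0 + 0 + 4) / 3)
print(score)
```

Because the residuals are squared before averaging, a few large errors dominate the score, so cards with extreme targets deserve attention.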
Data exploration
import pandas as pd
import numpy as np
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')
dd = pd.read_excel("data/Data_Dictionary.xlsx")
hist = pd.read_csv("data/historical_transactions.csv")
mer = pd.read_csv("data/merchants.csv")
new_mer = pd.read_csv("data/new_merchant_transactions.csv")
print(train.shape, test.shape, submission.shape)
(201917, 6) (123623, 5) (123623, 2)
train.head()
| | first_active_month | card_id | feature_1 | feature_2 | feature_3 | target |
|---|---|---|---|---|---|---|
| 0 | 2017-06 | C_ID_92a2005557 | 5 | 2 | 1 | -0.820283 |
| 1 | 2017-01 | C_ID_3d0044924f | 4 | 1 | 0 | 0.392913 |
| 2 | 2016-08 | C_ID_d639edf6cd | 2 | 2 | 0 | 0.688056 |
| 3 | 2017-09 | C_ID_186d6a6901 | 4 | 3 | 0 | 0.142495 |
| 4 | 2017-11 | C_ID_cdbd2c0db2 | 1 | 3 | 0 | -0.159749 |
test.head()
| | first_active_month | card_id | feature_1 | feature_2 | feature_3 |
|---|---|---|---|---|---|
| 0 | 2017-04 | C_ID_0ab67a22ab | 3 | 3 | 1 |
| 1 | 2017-01 | C_ID_130fd0cbdd | 2 | 3 | 0 |
| 2 | 2017-08 | C_ID_b709037bc5 | 5 | 1 | 1 |
| 3 | 2017-12 | C_ID_d27d835a9f | 2 | 1 | 0 |
| 4 | 2015-12 | C_ID_2b5e3df5c2 | 5 | 1 | 1 |
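Field descriptions
train/test
- card_id: credit card ID
- first_active_month: month in which the card was first used (YYYY-MM)
- feature_1/feature_2/feature_3: anonymized discrete features
- target: loyalty score, the prediction target (train only)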
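The field descriptions above suggest two routine preparation steps: parsing first_active_month as a date and treating feature_1/2/3 as categorical. The sketch below applies both to the five rows shown in train.head(); the `reference` date and the `months_active` column are hypothetical, chosen only to illustrate a date-derived feature.

```python
import pandas as pd

# The five rows shown in train.head() above
train_head = pd.DataFrame({
    "first_active_month": ["2017-06", "2017-01", "2016-08", "2017-09", "2017-11"],
    "card_id": ["C_ID_92a2005557", "C_ID_3d0044924f", "C_ID_d639edf6cd",
                "C_ID_186d6a6901", "C_ID_cdbd2c0db2"],
    "feature_1": [5, 4, 2, 4, 1],
    "feature_2": [2, 1, 2, 3, 3],
    "feature_3": [1, 0, 0, 0, 0],
    "target": [-0.820283, 0.392913, 0.688056, 0.142495, -0.159749],
})

# Parse the YYYY-MM strings into timestamps (first day of each month)
train_head["first_active_month"] = pd.to_datetime(
    train_head["first_active_month"], format="%Y-%m"
)

# Hypothetical reference date, an assumption made only for this example
reference = pd.Timestamp("2018-02-01")
train_head["months_active"] = (
    (reference.year - train_head["first_active_month"].dt.year) * 12
    + (reference.month - train_head["first_active_month"].dt.month)
)

# The three features are discrete codes, so store them as categories
for col in ["feature_1", "feature_2", "feature_3"]:
    train_head[col] = train_head[col].astype("category")

print(train_head[["card_id", "first_active_month", "months_active"]])
```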
hist.shape
(29112361, 14)
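With about 29 million rows, this table can be memory-hungry when loaded with default dtypes. One possible mitigation, sketched below under stated assumptions, is to pass `usecols`, `dtype`, and `parse_dates` to `pd.read_csv`. The in-memory CSV stands in for data/historical_transactions.csv and its values are made up; the chosen dtypes are assumptions that should be checked against the real value ranges.

```python
import io
import pandas as pd

# Toy CSV standing in for data/historical_transactions.csv (made-up values)
toy_csv = io.StringIO(
    "authorized_flag,card_id,month_lag,purchase_amount,purchase_date\n"
    "Y,C_ID_a,-3,-0.70,2017-11-02 10:15:00\n"
    "N,C_ID_a,-1,-0.55,2018-01-20 18:40:12\n"
    "Y,C_ID_b,0,0.12,2018-02-11 14:57:38\n"
)

hist_small = pd.read_csv(
    toy_csv,
    # Load only the columns needed for the current analysis
    usecols=["authorized_flag", "card_id", "month_lag",
             "purchase_amount", "purchase_date"],
    # Compact dtypes: repeated strings as categories, small ints, 32-bit floats
    dtype={"authorized_flag": "category", "card_id": "category",
           "month_lag": "int8", "purchase_amount": "float32"},
    parse_dates=["purchase_date"],
)

print(hist_small.dtypes)
print(hist_small.memory_usage(deep=True).sum(), "bytes")
```

For the real file, replace `toy_csv` with the path used earlier and compare `memory_usage` with and without the `dtype` argument.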
historical_transactions
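- authorized_flag: whether the transaction was authorized, Y or N
- card_id: credit card identifier
- city_id: city ID
- category_1: categorical feature, Y or N
- installments: number of installments for the purchase
- category_3: anonymized categorical feature with letter codes (A/B/C)
- merchant_category_id: merchant category ID
- merchant_id: merchant ID
- month_lag: number of months relative to the reference date, e.g. [-12, -1] or [0, 2]
- purchase_amount: normalized purchase amount
- purchase_date: purchase timestamp, e.g. 2018-02-11 14:57:38
- category_2: categorical feature with values 1 to 5
- state_id: state ID
- subsector_id: merchant category group ID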
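Since the target is defined per card while transactions are per purchase, a common next step is to encode the Y/N flags and aggregate transactions by card_id. The sketch below shows one way to do that on made-up rows that reuse the column names listed above; the output column names (`n_transactions`, `authorized_rate`, and so on) are illustrative assumptions.

```python
import pandas as pd

# Toy transactions (made-up values) with columns from historical_transactions.csv
hist_toy = pd.DataFrame({
    "card_id": ["C_ID_a", "C_ID_a", "C_ID_a", "C_ID_b"],
    "authorized_flag": ["Y", "N", "Y", "Y"],
    "category_1": ["N", "N", "Y", "N"],
    "installments": [0, 1, 3, 1],
    "month_lag": [-3, -1, 0, -2],
    "purchase_amount": [-0.70, -0.55, 0.12, -0.60],
})

# Encode the Y/N flags as 1/0 so they can be averaged
for col in ["authorized_flag", "category_1"]:
    hist_toy[col] = hist_toy[col].map({"Y": 1, "N": 0})

# One row per card: counts, rates, and simple amount statistics
card_features = hist_toy.groupby("card_id").agg(
    n_transactions=("purchase_amount", "count"),
    authorized_rate=("authorized_flag", "mean"),
    amount_sum=("purchase_amount", "sum"),
    amount_mean=("purchase_amount", "mean"),
    max_installments=("installments", "max"),
    last_month_lag=("month_lag", "max"),
)

print(card_features)
```

The resulting frame is indexed by card_id, so it can be joined to train and test on that key.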
hist["authorized_flag"].value_counts()
Y 26595452
N 2516909
Name: authorized_flag, dtype: int64
hist["category_2"].value_counts()
1.0 15177199
3.0 3911795
5.0 3725915
4.0 2618053
2.0 1026535
Name: category_2, dtype: int64
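The authorized_flag counts add up to all 29,112,361 rows, but the category_2 counts add up to only 26,459,497. Since `value_counts` drops NaN by default, the gap most likely corresponds to missing values. The sketch below redoes the arithmetic from the printed counts and then shows `dropna=False` on a made-up series.

```python
import numpy as np
import pandas as pd

# Counts copied from the outputs printed above
total_rows = 29112361
category_2_counts = {1.0: 15177199, 3.0: 3911795, 5.0: 3725915,
                     4.0: 2618053, 2.0: 1026535}
missing_rows = total_rows - sum(category_2_counts.values())
print(missing_rows)  # rows not covered by the default value_counts output

# Toy series (made-up values) to show the effect of dropna
s = pd.Series([1.0, 1.0, 3.0, np.nan, 5.0, np.nan], name="category_2")
print(s.value_counts())              # NaN rows are excluded
print(s.value_counts(dropna=False))  # NaN rows get their own entry
print(s.isna().mean())               # share of missing values
```

On the real data, `hist["category_2"].value_counts(dropna=False)` or `hist.isna().sum()` would confirm whether the gap is indeed NaN.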
merchants
- merchant_id: merchant ID
- merchant_group_id: merchant group ID
- merchant_category_id: merchant category ID
- subsector_id: merchant category group ID
- numerical_1: anonymized numeric feature
- numerical_2
- category_1: categorical feature, Y or N
- most_recent_sales_range: sales (revenue) tier in the most recent active month; A through E in descending order
- most_recent_purchases_range: transaction-count tier in the most recent active month; A through E in descending order
- avg_sales_lag3
- avg_sales_lag6
- avg_sales_lag12
- avg_purchases_lag3
- active_months_lag3
- avg_purchases_lag6
- active_months_lag6
- avg_purchases_lag12
- active_months_lag12
- category_4
- city_id
- state_id
- category_2
mer.columns.tolist()
['merchant_id',
'merchant_group_id',
'merchant_category_id',
'subsector_id',
'numerical_1',
'numerical_2',
'category_1',
'most_recent_sales_range',
'most_recent_purchases_range',
'avg_sales_lag3',
'avg_purchases_lag3',
'active_months_lag3',
'avg_sales_lag6',
'avg_purchases_lag6',
'active_months_lag6',
'avg_sales_lag12',
'avg_purchases_lag12',
'active_months_lag12',
'category_4',
'city_id',
'state_id',
'category_2']
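Because most_recent_sales_range and most_recent_purchases_range are ordered tiers (A highest, E lowest), they can be mapped to integers that preserve the order instead of being one-hot encoded. The sketch below is one such mapping on made-up merchants; the `_rank` column names and the 4-to-0 scale are assumptions for illustration.

```python
import pandas as pd

# Toy merchants (made-up values) with columns from merchants.csv
mer_toy = pd.DataFrame({
    "merchant_id": ["M_ID_1", "M_ID_2", "M_ID_3"],
    "category_1": ["N", "Y", "N"],
    "most_recent_sales_range": ["E", "A", "C"],
    "most_recent_purchases_range": ["D", "A", "C"],
})

# A is the highest tier and E the lowest, so a larger rank means a larger merchant
range_order = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}
for col in ["most_recent_sales_range", "most_recent_purchases_range"]:
    mer_toy[col + "_rank"] = mer_toy[col].map(range_order)

# Binary Y/N flag as 1/0
mer_toy["category_1"] = mer_toy["category_1"].map({"Y": 1, "N": 0})

print(mer_toy)
```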