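ML (Regression): A detailed guide to predicting used-car transaction prices through data preprocessing (handling missing, outlier, and special values / long-tail to normal transformation / log transform of the target / bar, box, and violin plot visualization / feature construction / feature selection)
Posted by 一个处女座的程序猿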
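Contents
Predicting used-car transaction prices by regression with LightGBM after data preprocessing
# 1.4 Merge the training and test sets (tagging each row's source) so that operations such as feature processing and feature construction are applied to both in sync
# T2 Remove outlier samples with the 3-sigma (standard deviation) rule, with a box-plot comparison
# 2.5.1 Compute and visualize the skewness (skew) and kurtosis (kurt) of all variables
# 2.7.2 Bar, box, and violin plots of each feature against the target variable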
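Used-car transaction price prediction
Official page: Tianchi (Alibaba Cloud) learning competition "零基础入门数据挖掘 - 二手车交易价格预测" (Data Mining for Beginners: Used-Car Transaction Price Prediction), problem statement and data
Competition background
The competition is set in the used-car market and asks participants to predict the transaction price of used cars.
Field descriptions
The data come from used-car transaction records on a trading platform. The full dataset has more than 400,000 records and 31 columns, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are anonymized.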
Field | Description |
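SaleID | Transaction ID, unique identifier |
name | Car listing name, anonymized (car code) |
regDate | Car registration date, e.g. 20160101 means January 1, 2016 |
model | Model code, anonymized |
brand | Car brand, anonymized |
bodyType | Body type: limousine: 0, microcar: 1, van: 2, bus: 3, convertible: 4, two-door car: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, liquefied petroleum gas: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [0, 600] |
kilometer | Distance driven, in units of 10,000 km |
notRepairedDamage | Car has unrepaired damage: yes: 0, no: 1 |
regionCode | Region code, anonymized |
seller | Seller: individual: 0, non-individual: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Listing date, i.e. the date the car went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features, 15 in total: v_0 through v_14 |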
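Because regDate and creatDate are stored as integers in YYYYMMDD form, they usually have to be parsed before they can be used, for example to derive how long a car had been registered when it was listed. The following is a minimal sketch of that idea and not necessarily the author's approach. The helper name `parse_yyyymmdd` and the column `used_days` are hypothetical. Note that the minimum regDate reported in the summary statistics later in this article is 19910001, which has month 00, so invalid dates are coerced to NaT here instead of raising an error.

```python
import pandas as pd

def parse_yyyymmdd(col: pd.Series) -> pd.Series:
    """Convert integer dates such as 20160101 to timestamps; invalid ones become NaT."""
    return pd.to_datetime(col.astype(str), format="%Y%m%d", errors="coerce")

# Toy rows: the first pair comes from the first sample record in this article,
# the second regDate (19910001) is the minimum shown in the summary statistics.
toy = pd.DataFrame({
    "regDate":   [20040402, 19910001],
    "creatDate": [20160404, 20150618],
})

# Hypothetical derived feature: days between registration and listing
toy["used_days"] = (
    parse_yyyymmdd(toy["creatDate"]) - parse_yyyymmdd(toy["regDate"])
).dt.days
print(toy)
```

Rows whose registration date cannot be parsed end up with a missing `used_days`, which a tree model such as LightGBM can handle directly; other models would need an imputation step.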
Predicting used-car transaction prices by regression with LightGBM after data preprocessing
# 1. Define the dataset
# 1.1 Load the training and test sets
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
0 | 736 | 20040402 | 30 | 6 | 1 | 0 | 0 | 60 | 12.5 | 0 | 1046 | 0 | 0 | 20160404 | 1850 | 43.35779631 | 3.966344166 | 0.050257094 | 2.159744094 | 1.143786187 | 0.235675907 | 0.101988241 | 0.129548661 | 0.022816367 | 0.097461829 | -2.881803239 | 2.804096771 | -2.420820793 | 0.795291943 | 0.9147625 |
1 | 2262 | 20030301 | 40 | 1 | 2 | 0 | 0 | 0 | 15 | - | 4366 | 0 | 0 | 20160309 | 3600 | 45.30527302 | 5.236111898 | 0.137925324 | 1.38065746 | -1.422164921 | 0.264777256 | 0.121003594 | 0.135730707 | 0.026597448 | 0.020581663 | -4.900481882 | 2.096337644 | -1.030482837 | -1.722673775 | 0.245522411 |
2 | 14874 | 20040403 | 115 | 15 | 1 | 0 | 0 | 163 | 12.5 | 0 | 2806 | 0 | 0 | 20160402 | 6222 | 45.97835906 | 4.823792215 | 1.319524152 | -0.998467274 | -0.996911035 | 0.251410148 | 0.114912277 | 0.165147493 | 0.062172837 | 0.027074824 | -4.84674926 | 1.803558941 | 1.565329625 | -0.832687327 | -0.229962856 |
3 | 71865 | 19960908 | 109 | 10 | 0 | 0 | 1 | 193 | 15 | 0 | 434 | 0 | 0 | 20160312 | 2400 | 45.6874782 | 4.492574134 | -0.050615843 | 0.883599671 | -2.228078725 | 0.274293171 | 0.110300085 | 0.121963746 | 0.033394547 | 0 | -4.509598824 | 1.285939744 | -0.501867908 | -2.438352737 | -0.478699379 |
4 | 111080 | 20120103 | 110 | 5 | 1 | 0 | 0 | 68 | 5 | 0 | 6977 | 0 | 0 | 20160313 | 5200 | 44.38351084 | 2.031433258 | 0.572168948 | -1.571239028 | 2.246088325 | 0.228035622 | 0.073205054 | 0.091880479 | 0.078819385 | 0.121534241 | -1.896240279 | 0.910783134 | 0.931109559 | 2.83451782 | 1.923481963 |
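The loading step that produces a table like the one above can be sketched as follows. This is an illustration under stated assumptions: the article does not show the file names or the delimiter, so the helper `load_used_car` is hypothetical, the space separator is an assumption, and an in-memory stand-in with a subset of the columns replaces the real files.

```python
import io
import pandas as pd

def load_used_car(source, sep=" "):
    """Read one split of the data; `source` may be a path or a file-like object.

    The default separator is an assumption; adjust it to match the actual files.
    """
    return pd.read_csv(source, sep=sep)

# Stand-in for a real file: two records with a subset of the columns shown above
sample = io.StringIO(
    "SaleID name regDate power kilometer notRepairedDamage price\n"
    "0 736 20040402 60 12.5 0.0 1850\n"
    "1 2262 20030301 0 15.0 - 3600\n"
)
used_car = load_used_car(sample)
print(used_car.shape)
print(used_car.dtypes)
```

With real data the call would take a file path instead of the `StringIO` object. The `-` in the second record keeps notRepairedDamage from being parsed as a number, which matches the object dtype reported for that column in the next step.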
# 1.2 Take a quick look at the data
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 150000 non-null int64
1 name 150000 non-null int64
2 regDate 150000 non-null int64
3 model 149999 non-null float64
4 brand 150000 non-null int64
5 bodyType 145494 non-null float64
6 fuelType 141320 non-null float64
7 gearbox 144019 non-null float64
8 power 150000 non-null int64
9 kilometer 150000 non-null float64
10 notRepairedDamage 150000 non-null object
11 regionCode 150000 non-null int64
12 seller 150000 non-null int64
13 offerType 150000 non-null int64
14 creatDate 150000 non-null int64
15 price 150000 non-null int64
16 v_0 150000 non-null float64
17 v_1 150000 non-null float64
18 v_2 150000 non-null float64
19 v_3 150000 non-null float64
20 v_4 150000 non-null float64
21 v_5 150000 non-null float64
22 v_6 150000 non-null float64
23 v_7 150000 non-null float64
24 v_8 150000 non-null float64
25 v_9 150000 non-null float64
26 v_10 150000 non-null float64
27 v_11 150000 non-null float64
28 v_12 150000 non-null float64
29 v_13 150000 non-null float64
30 v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
used_car.info:
None
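The listing above shows gaps in model, bodyType, fuelType, and gearbox, and it reports notRepairedDamage as a fully populated object column even though the second sample record shows `-` in that field. The dash therefore probably stands in for a missing value. A minimal sketch of counting missing values and converting that placeholder, using a toy frame instead of the competition data:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking columns that have gaps in the info() listing
df = pd.DataFrame({
    "bodyType": [1.0, 2.0, np.nan, 0.0],
    "fuelType": [0.0, np.nan, np.nan, 0.0],
    "notRepairedDamage": ["0.0", "-", "0.0", "1.0"],
})

# errors="coerce" turns every non-numeric string, including "-", into NaN,
# so the column becomes float and the placeholder is counted as missing.
df["notRepairedDamage"] = pd.to_numeric(df["notRepairedDamage"], errors="coerce")

missing = df.isnull().sum()
missing_ratio = missing / len(df)
print(pd.DataFrame({"missing": missing, "ratio": missing_ratio}))
```

Coercion silently hides any other unexpected strings as well, so it is worth checking the distinct values of the column (for example with `value_counts()`) before converting.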
used_car.shape: (150000, 31) 31 150000
used_car.columns:
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')
used_car.dtypes:
float64 20
int64 10
object 1
dtype: int64
used_car.head:
SaleID name regDate model ... v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 ... 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 ... 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 ... 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 ... 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 ... 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 ... -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 ... -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 ... -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 ... -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 ... -3.179913 0.031724 -1.483350 -0.342674
[10 rows x 31 columns]
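Output of this shape can be produced with a handful of pandas calls. The sketch below is a guess at an equivalent helper, not the author's code, and `quick_look` is a hypothetical name. The `None` printed after `used_car.info:` above is most likely the return value of `DataFrame.info()`, which prints its report and returns nothing, so the sketch calls it without wrapping it in `print`.

```python
import pandas as pd

def quick_look(df: pd.DataFrame, name: str = "used_car", n: int = 5) -> pd.DataFrame:
    """Print a brief overview and return the first and last n rows."""
    print(f"{name}.info:")
    df.info()  # prints directly and returns None
    print(f"{name}.shape:", df.shape, df.shape[1], df.shape[0])
    print(f"{name}.columns:\n", df.columns)
    print(f"{name}.dtypes:\n", df.dtypes.value_counts())
    head_tail = pd.concat([df.head(n), df.tail(n)])
    print(f"{name}.head:\n", head_tail)
    return head_tail

# Small demo frame; the real call would pass the loaded training set
demo = pd.DataFrame({"SaleID": range(12), "price": range(100, 1300, 100)})
preview = quick_look(demo)
```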
statistic | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
count | 150000 | 150000 | 150000 | 149999 | 150000 | 145494 | 141320 | 144019 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 |
mean | 74999.5 | 68349.17287 | 20034170.51 | 47.12902086 | 8.052733333 | 1.792369445 | 0.375842061 | 0.224942542 | 119.3165467 | 12.59716 | 2583.077267 | 6.67E-06 | 0 | 20160330.79 | 5923.327333 | 44.40626753 | -0.044809123 | 0.080765058 | 0.078833423 | 0.017874615 | 0.248203528 | 0.044923004 | 0.124692461 | 0.058143855 | 0.061995895 | -0.001000239 | 0.009034543 | 0.004812595 | 0.000312612 | -0.000688231 |
std | 43301.41453 | 61103.87509 | 53649.87926 | 49.53603965 | 7.864956341 | 1.760639503 | 0.548676623 | 0.417545932 | 177.1684192 | 3.919575532 | 1885.363218 | 0.002581989 | 0 | 106.7328088 | 7501.998477 | 2.457547906 | 3.641893018 | 2.929617945 | 2.026514036 | 1.193661387 | 0.045803971 | 0.051742787 | 0.20140953 | 0.029185756 | 0.035691979 | 3.772386394 | 3.286071221 | 2.517477676 | 1.288987639 | 1.038685151 |
min | 0 | 0 | 19910001 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | 20150618 | 11 | 30.45197649 | -4.295588903 | -4.47067143 | -7.275036707 | -4.364565242 | 0 | 0 | 0 | 0 | 0 | -9.16819241 | -5.558206704 | -9.639552114 | -4.153898796 | -6.546555965 |
25% | 37499.75 | 11156 | 19990912 | 10 | 1 | 0 | 0 | 0 | 75 | 12.5 | 1018 | 0 | 0 | 20160313 | 1300 | 43.13579888 | -3.192349286 | -0.9706712 | -1.462580044 | -0.921191484 | 0.243615353 | 3.81E-05 | 0.062473533 | 0.035333687 | 0.033930177 | -3.72230288 | -1.951543007 | -1.871845761 | -1.057788984 | -0.437033668 |
50% | 74999.5 | 51638 | 20030912 | 30 | 6 | 1 | 0 | 0 | 110 | 15 | 2196 | 0 | 0 | 20160321 | 3250 | 44.61026572 | -3.052671416 | -0.38294689 | 0.099721985 | -0.075910429 | 0.257797966 | 0.000812059 | 0.095865898 | 0.057013598 | 0.058483667 | 1.624076331 | -0.358052697 | -0.130753318 | -0.036244604 | 0.141245993 |
75% | 112499.25 | 118841.25 | 20071109 | 66 | 13 | 3 | 1 | 0 | 150 | 15 | 3843 | 0 | 0 | 20160329 | 7700 | 46.0047209 | 4.000669795 | 0.241334852 |
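In the summary statistics above, the mean price (about 5923) sits well above the median (3250) and the standard deviation (about 7502) exceeds the mean, which is consistent with a long right tail. The title of this article mentions a log transform of the target for that situation. Below is a minimal sketch of the idea on synthetic log-normal data, which is not the competition data; using `log1p`/`expm1` rather than a plain logarithm is an assumption made here.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw); NOT the competition data
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000), name="price")

# log(1 + price): compresses the long right tail and is defined at zero
log_price = np.log1p(price)

print("skew before:", price.skew(), "kurt before:", price.kurt())
print("skew after: ", log_price.skew(), "kurt after: ", log_price.kurt())

# Predictions made on the log scale are mapped back with the inverse, expm1
restored = np.expm1(log_price)
```

If a model is trained on the transformed target, its predictions have to be passed through `np.expm1` before they are compared with real prices or submitted.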