如何开启一个kaggle项目（流程有哪些？）销量预测项目

Posted 2021-12-09 小乖乖的臭坏坏

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了如何开启一个kaggle项目（流程有哪些？）销量预测项目相关的知识，希望对你有一定的参考价值。

看了几篇kaggle的solution，总结了一下，完成一个机器学习项目，通常分为以下几步：
1.载入数据&查看数据信息
2.数据预处理1（数据清洗，异常值处理）
3.特征工程（在现有数据中挖掘对任务有用的新特征）
4.数据处理2（将处理过的不同数据表连接、合并…最终划分为训练集和测试集，形成可以用来训练的数据集）
5.模型的选择（试用几个模型来比较效果）
6.模型的比较与融合（挑选好的模型互相融合）
7.模型的训练与保存
8.模型的调用与测试

接下来，以一个例子来分享一下这个项目：

1 项目介绍

项目地址如下：

https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data?select=test.csv

数据集介绍：
在这个网址中，有涉及到任务的数据集，

在旁边可以查看数据的构成和基本信息，不必也做不到用excel打开看一看。因为这个数据集非常大，总共100MB大小，所以excel打开必然会崩。

上述文件的描述可以在下图中看到：

在6个csv文件中的各个数据字段的含义，下图也有表述：

思路：

有用的特征是：往期每个月的数据。由于原数据中给的是每一天的数据，因此需要对每一个商店中每一个商品的数据都按月先求和。
模型可以用之前所有月来预测最后一个月的数据。

代码解析：
载入数据

df_train = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
df_shops = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
df_items = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/items.csv')
df_item_categories = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
df_test = pd.read_csv('./kaggle/input/competitive-data-science-predict-future-sales/test.csv')

查看数据基本信息

df_train.head()
df_test.head()
df_train.info()
df_train.describe()
df_train.isnull().sum()
df_test.isna().sum()

数据预处理

#在df_train里剔除date_block_num和item_price
df_train.drop(['date_block_num','item_price'], axis=1, inplace=True)



df_train['date'] = pd.to_datetime(df_train['date'], dayfirst=True)
df_train['date'] = df_train['date'].apply(lambda x: x.strftime('%Y-%m'))
df_train.head()



df = df_train.groupby(['date','shop_id','item_id']).sum()# sql group by
#透视表是一种可以对数据动态排布并且分类汇总的表格格式。或许大多数人都在Excel使用过数据透视表，也体会到它的强大功能，而在pandas中它被称作pivot_table
#index先按照shop_id排序，在shop_id里再按照item_id排序；column是date，也就是日期；
df = df.pivot_table(index=['shop_id','item_id'], columns='date', values='item_cnt_day', fill_value=0)
# 	        0	1	2	3	4
# date
# shop_id	0	0	0	0	0
# item_id	30	31	32	33	35
# 2013-01	0	0	6	3	1
# 2013-02（31） 11	10	3	14
# 2013-03	0	0	0	0	0
# 2013-04	0	0	0	0	0
# 2013-05	0	0	0	0	0
#举例，0号商店的30号商品在2013年2月的第一天迈出了31份

#inplace=True ：是指重置索引的结果是否作用在前面的数据上
df.reset_index(inplace=True)
print(df.head().T)

#测试集处理，
df_test = pd.merge(df_test, df, on=['shop_id','item_id'], how='left')#类似concat
df_test.drop(['ID', '2013-01'], axis=1, inplace=True)#将['ID', '2013-01']两lie去除
df_test = df_test.fillna(0)#填充nan
print('Test data:', df_test.head().T)

数据集划分

# split into train and test sets
Y_train = df['2015-10'].values#以最后一维数据作为y
X_train = df.drop(['2015-10'], axis = 1)#除了y以外的其他数据作为特征
X_test = df_test

print(X_train.shape, Y_train.shape)
print(X_test.shape)


#设置训练测试集划分
x_train, x_test, y_train, y_test = train_test_split( X_train, Y_train, test_size=0.20, random_state=1)



print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

模型选择与参数处理

LR = LinearRegression()
LR.fit(x_train,y_train)

print('Train set mse:', mean_squared_error(y_train, LR.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, LR.predict(x_test)))
print('Test set score:', LR.score(x_train,y_train))



# %time
RFR = RandomForestRegressor(n_estimators = 100)
RFR.fit(x_train,y_train)

print('Train set mse:', mean_squared_error(y_train, RFR.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, RFR.predict(x_test)))
print('Test set score:', RFR.score(x_train,y_train))



# %time
XGB = XGBRegressor(max_depth=16,n_estimators=200,seed=1)
XGB.fit(x_train,y_train)

print('Train set mse:', mean_squared_error(y_train, XGB.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, XGB.predict(x_test)))
print('Test set score:', XGB.score(x_train,y_train))



# %time
LGBM = LGBMRegressor(max_depth=16,n_estimators=200,seed=1)
LGBM.fit(x_train,y_train)

print('Train set mse:', mean_squared_error(y_train, LGBM.predict(x_train)))
print('Test set mse:', mean_squared_error(y_test, LGBM.predict(x_test)))
print('Test set score:', LGBM.score(x_train,y_train))



prediction = XGB.predict(X_test)



prediction = list(map(round, prediction))#四舍五入

提交结果

df_submission['item_cnt_month'] = prediction
df_submission.to_csv('prediction.csv', index=False)
df_submission.head()

这里给出另一个类似的kaggle项目代码

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


#导入数据包
import datetime as dt
import logging
import warnings
warnings.filterwarnings('ignore')
import os
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import catboost
from catboost import CatBoostRegressor
from catboost import Pool, cv
from sklearn import metrics
from sklearn.model_selection import GridSearchCV



### 读取数据
items=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
item_cat=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
train_df=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
shops=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
test_df=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
submission=pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')


#打印数据尺寸信息
display(items.shape)
display(item_cat.shape)
display(train_df.shape)
display(shops.shape)
display(test_df.shape)
display(submission.shape)


#打印前5行数据
print(items.head())
print(item_cat.head())
print(train_df.head())
print(shops.head())
print(test_df.head())
print(submission.head())


#打印数据基本信息的同时，检测是否有异常值
train_df.describe()
# 	date_block_num	shop_id	item_id	item_price	item_cnt_day
# count	2.935849e+06	2.935849e+06	2.935849e+06	2.935849e+06	2.935849e+06
# mean	1.456991e+01	3.300173e+01	1.019723e+04	8.908532e+02	1.242641e+00
# std	9.422988e+00	1.622697e+01	6.324297e+03	1.729800e+03	2.618834e+00
# min	0.000000e+00	0.000000e+00	0.000000e+00	-1.000000e+00	-2.200000e+01
# 25%	7.000000e+00	2.200000e+01	4.476000e+03	2.490000e+02	1.000000e+00
# 50%	1.400000e+01	3.100000e+01	9.343000e+03	3.990000e+02	1.000000e+00
# 75%	2.300000e+01	4.700000e+01	1.568400e+04	9.990000e+02	1.000000e+00
# max	3.300000e+01	5.900000e+01	2.216900e+04	3.079800e+05	2.169000e+03
train_df.isnull().sum()##查看是否有异常值
train_df.head()
## 删掉不必要的数据列
train_df.drop(['date_block_num','item_price'],axis=1,inplace=True)
### 将日期时间数据列转换为datetime类型数据
train_df['date'] = pd.to_datetime(train_df['date'])
train_df.head()
## 将datetime类型的数据中的年月赋值给数据新的月（年）份列
train_df['month']=train_df['date'].dt.month
train_df['year']=train_df['date'].dt.year
# 	date	shop_id	item_id	item_cnt_day	month	year
# 0	2013-02-01	59	22154	1.0	2	2013
# 1	2013-03-01	25	2552	1.0	3	2013
# 2	2013-05-01	25	2552	-1.0	5	2013
# 3	2013-06-01	25	2554	1.0	6	2013
# 4	2013-01-15	25	2555	1.0	1	2013
# ...	...	...	...	...	...	...
# 2935844	2015-10-10	25	7409	1.0	10	2015
# 2935845	2015-09-10	25	7460	1.0	9	2015
# 2935846	2015-10-14	25	7459	1.0	10	2015
# 2935847	2015-10-22	25	7440	1.0	10	2015
# 2935848	2015-03-10	25	7460	1.0	3	2015
#不管是几号的数据，一律模糊处理成为某一月份的数据，最后用于相加
train_df['date'] = train_df['date'].apply(lambda x: x.strftime('%Y-%m'))
### 按照时间->商店名->商品名分组并聚合，形成新的数据
data = train_df.groupby(['date','shop_id','item_id']).sum()
#形成一张新的数据透视表，处理成为期望输入模型的数据
data = data.pivot_table(index=['shop_id','item_id'], columns='date', values='item_cnt_day', fill_value=0)
data.reset_index(inplace=True)
# date	shop_id	item_id	2013-01	2013-02	2013-03	2013-04	2013-05	2013-06	2013-07	2013-08	...	2015-03	2015-04	2015-05	2015-06	2015-07	2015-08	2015-09	2015-10	2015-11	2015-12
# 0	0	30	0	31	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
# 1	0	31	0	11	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
# 2	0	32	4	6	2	0	1	1	0	0	...	0	0	0	0	0	0	0	0	0	0
# 3	0	33	2	2	1	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
# 4	0	35	2	6	0	1	0	2	0	2	...	0	0	0	0	0	0	0	0	0	0
# ...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
# 424119	59	22154	0	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
# 424120	59	22155	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
# 424121	59	22162	0	0	0	0	0	0	0	0	...	2	1	1	0	1	0	1	0	1	0
# 424122	59	22164	0	0	0	0	0	0	0	0	...	1	2	0	0	1	0	0	0	1	0
# 424123	59	22167	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
display(test_df.shape)
# 将测试集和数据进行表的连接
test_df=pd.merge(test_df, data, on=['shop_id','item_id'], how='left')
display(test_df.shape)
test_df = test_df.fillna(0)




# 接着将数据划分成为训练集和测试集
#将最后一个数据作为标签，其他数据作为特征
Y_train = data['2015-12'].values
X_train = data.drop(['2015-12'], axis = 1)
X_test = test_df
print(X_train.shape, Y_train.shape)
print(X_test.shape)
x_train, x_test, y_train, y_test = train_test_split( X_train, Y_train, test_size=0.20, random_state=123)


#打印训练集测试集尺寸
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)



## Prepara a pool of train and validation set
pool_train=Pool(x_train,y_train)
pool_val=Pool(x_test,y_test)



### Define a cv function to fit on data and find the optimal number of iteration keeping other parameters fixed
### Function takes input = catboost object with default params , train data ,train y data 
def modelfit(params,poolX,useTrainCV=True,cv_folds=5,early_stopping_rounds=40):
    if useTrainCV:
        cvresult = cv(params=params, pool=poolX,nfold=cv_folds,early_stopping_rounds=early_stopping_rounds,plot=True)
    return cvresult ## return dataframe for the iteration till the optimal iteration is reached



## Prepara a cv class
params={
    'loss_function':'RMSE'
}



### Object return the optimal number of trees to grow
n_est=modelfit(params,pool_train)



n_est.shape[0]### best iteration =664



### Fit the model with iteration=664
cboost1=CatBoostRegressor(iterations=664,loss_function='RMSE',random_seed=123)
cboost1.fit(x_train,y_train)



#Predict training set:
train_predictions = cboost1.predict(x_train)
#Print model report:
print("\\nModel Report Train")
print("Root Mean Square Error : %.4g" % metrics.mean_squared_error(y_train, train_predictions))
print("R^2 Score (Train): %f" % metrics.r2_score(y_train, train_predictions))



#Predict training set:
test_predictions = cboost1.predict(x_test)
#Print model report:
print("\\nModel Report Test")
print("Root Mean Square Error : %.4g" % metrics.mean_squared_error(y_test, test_predictions))
print("R^2 Score (Train): %f" % metrics.r2_score(y_test, test_predictions))



prediction = cboost1.predict(X_test)
prediction = list(map(round, prediction))
len(prediction)



submission
submission['item_cnt_month'] = prediction
submission.to_csv('prediction.csv', index=False)
submission.tail()
submission.head(50)

作于：
2021-6-9
22：33

以上是关于如何开启一个kaggle项目（流程有哪些？）销量预测项目的主要内容，如果未能解决你的问题，请参考以下文章