High Score in Train Test Split but Low Score in CV in Python Scikit-Learn

Posted: 2020-10-25 20:15:11

Question:

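I'm new to data science and have been struggling with a Kaggle problem. When I use random forest regression to predict the rating, I get a high score with a train/test split but a low score with cross-validation (CV).

with train test split_randomforest 0.8746277302652172
with no train test split_randomforest 0.8750717943467078
with CV randomforest 10.713885026374156 %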

https://www.kaggle.com/data13/machine-learning-model-to-predict-app-rating-94

import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sklearn.model_selection as ms
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn import svm
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV


from xgboost import XGBClassifier
from xgboost import XGBRegressor 
from lightgbm import LGBMClassifier


database = pd.read_csv(r"C:\Users\Anson\Downloads\49864_274957_bundle_archive\googleplaystore.csv")  # load the Google Play Store dataset



## Size - Strip the M and k value 
database['Size'] = database['Size'].apply(lambda x : x.strip('M'))
database['Size'] = database['Size'].apply(lambda x : x.strip('k'))
##

## Rating - Fill the Blank Value with median
database['Rating'].fillna(database['Rating'].median(),inplace=True)
database['Rating'].replace(19,database['Rating'].median(),inplace=True) 

###


## Reviews -  replace the blank cell
database['Reviews'].replace('3.0M',3000000,inplace=True) 
database['Reviews'].replace('0',float("NaN"),inplace=True) 
database.dropna(subset=['Reviews'],inplace=True)
##


## Strip the + value
database['Installs'] = database['Installs'].apply(lambda x : x.strip('+'))
database['Installs'] = database['Installs'].apply(lambda x : x.replace(',',''))
database['Price'] = database['Price'].apply(lambda x : x.strip('$'))
###

## Drop Blank 
database['Content Rating'].fillna("NaN",inplace=True)
database.dropna(subset=['Content Rating'],inplace=True)
##

## Drop Wrong Number 
database['Last Updated'].replace('1.0.19',float("NaN"),inplace=True) 
database.dropna(subset=['Last Updated'],inplace=True)
database['Last Updated'] = database['Last Updated'].apply(lambda x : time.mktime(datetime.datetime.strptime(x, '%B %d, %Y').timetuple()))
##




le = preprocessing.LabelEncoder()
database['App'] = le.fit_transform(database['App'])
database['Category'] = le.fit_transform(database['Category'])
database['Content Rating'] = le.fit_transform(database['Content Rating'])
database['Type'] = le.fit_transform(database['Type'])
database['Genres'] = le.fit_transform(database['Genres'])




###############################
##feature engineering

features = ['App', 'Reviews', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated']

X=database[features]
y=database['Rating']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=None)


rfc= RandomForestRegressor()


rfc.fit(X_train,y_train)
rfc.fit(X,y)

rfc_score=rfc.score(X_test,y_test)
rfc_score1=rfc.score(X,y)
score_CV_randomforest = cross_val_score(rfc,X,y,cv=KFold(n_splits=5, shuffle=True),scoring='r2')

score_CV_randomforest = score_CV_randomforest.mean()*100


print("with train test split_randomforest", rfc_score)
print("with no train test split_randomforest", rfc_score1)
print("with CV randomforest", score_CV_randomforest, "%")

Comments:

Your very long title is hard to follow :) Maybe "Score" should be "score", and there are probably too many occurrences of "Score" in it.

Answer 1:

Train/test split: you are using an 80:20 ratio, so 80% of the rows are used for training and 20% for testing.

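Cross-validation: the dataset is randomly split into k groups. One group is used as the test set and the remaining groups as the training set. The model is trained on the training set and scored on the test set. The process is then repeated until every group has been used as the test set once. You are using 5-fold cross-validation, so the dataset is split into 5 groups and the model is trained and tested 5 separate times, giving each group a turn as the test set.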
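The fold rotation described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper name `k_fold_indices`, the seed, and the strided way the folds are cut are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once before cutting folds
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one row.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: train on {len(train_idx)} rows, test on rows {sorted(test_idx)}")
```

With 10 rows and k=5, each round trains on 8 rows and tests on the 2 rows held out for that round, and the 5 test folds together cover every row exactly once.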

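So the two approaches give different results because the model is trained and evaluated on different random subsets of the data in each case.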
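One detail in the question's code is worth checking as well, since it may account for much of the gap. `rfc.fit(X, y)` is called right after `rfc.fit(X_train, y_train)`, and with the default settings a second `fit` rebuilds the forest from scratch. The model scored on `X_test` has therefore already seen those rows, so that number behaves like a training score. The sketch below reproduces the pattern on synthetic data; the data, sizes, seeds, and variable names are assumptions for illustration, not the Kaggle dataset. The target is pure noise, so an honest held-out score should be close to zero or negative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # synthetic features (assumption, not the Kaggle data)
y = rng.normal(size=500)       # target is pure noise, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same pattern as the question: the second fit replaces the first,
# so the model that gets scored was trained on the test rows too.
leaky_model = RandomForestRegressor(n_estimators=50, random_state=0)
leaky_model.fit(X_train, y_train)
leaky_model.fit(X, y)
leaky_score = leaky_model.score(X_test, y_test)

# Honest hold-out: fit on the training rows only, then score on unseen rows.
honest_model = RandomForestRegressor(n_estimators=50, random_state=0)
honest_model.fit(X_train, y_train)
honest_score = honest_model.score(X_test, y_test)

# cross_val_score clones and refits the estimator for every fold,
# so each fold is always scored on rows the model has not seen.
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)

print("score after refitting on all data:", leaky_score)
print("score with fit on train only:", honest_score)
print("mean 5-fold CV score:", cv_scores.mean())
```

If the same thing is happening with the real data, dropping the `rfc.fit(X, y)` line before scoring on `X_test` should bring the train/test split score much closer to the cross-validation score.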
