Test score NaN while trying to evaluate a decision tree regressor model
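[Posted]: 2021-08-24 13:16:36
[Question]: I am trying to evaluate the accuracy of a decision tree model using both the numerical and the categorical features of the Ames housing dataset. To preprocess the numerical features I used SimpleImputer and StandardScaler; for the categorical features I used a one-hot encoder. I tried to evaluate the model (a DecisionTreeRegressor) with 10-fold cross-validation, but the test scores come out as NaN. Here is my code: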
import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]
numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
data_numerical = data[numerical_features]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
categorical_columns = selector(dtype_include=object)(data)
numerical_columns = selector(dtype_exclude=object)(data)
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    (StandardScaler(), SimpleImputer(), numerical_columns),
)
model = make_pipeline(preprocessor, DecisionTreeRegressor())
cv_results = cross_validate(
    model, data, target, cv=10, return_estimator=True, n_jobs=2,
)
scores = cv_results["test_score"]
print(f"Accuracy score by cross-validation "
      f"search:\n{scores.mean():.3f} +/- {scores.std():.3f}")
This is the test score I get:
Accuracy score by cross-validation search:
nan +/- nan
To find the root cause, I passed error_score='raise' as an argument to cross_validate. The underlying error turned out to be:
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers
or all strings, or boolean mask is allowed
How can I fix this? Any help is much appreciated. Thanks :)
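The debugging step described above can be reproduced in isolation. The following is a minimal sketch, not the original code: it assumes a small synthetic DataFrame (the column names n1 and n2 are made up) and builds a ColumnTransformer by hand with a SimpleImputer in the slot where the column list belongs, mirroring the ('standardscaler', StandardScaler(), SimpleImputer()) entry in the model printout. With error_score="raise", cross_validate should re-raise the failure from the fit step instead of recording NaN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical columns, not the housing dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(size=(40, 2)), columns=["n1", "n2"])
y = rng.normal(size=40)

# Deliberately malformed: the third element should be a column
# specification, but here it is a SimpleImputer instance.
broken = ColumnTransformer(
    [("standardscaler", StandardScaler(), SimpleImputer())]
)
model = make_pipeline(broken, DecisionTreeRegressor())

caught = None
try:
    # error_score="raise" propagates the exception raised while fitting.
    cross_validate(model, X, y, cv=5, error_score="raise")
except ValueError as exc:
    caught = exc
    print(f"{type(exc).__name__}: {exc}")
```

The printed message names the actual problem (an invalid column specification), which the NaN scores alone do not reveal.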
Here is what my model looks like:
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(handle_unknown='ignore'),
['MSZoning', 'Street',
'Alley', 'LotShape',
'LandContour', 'Utilities',
'LotConfig', 'LandSlope',
'Neighborhood', 'Condition1',
'Condition2', 'BldgType',
'HouseStyle', 'RoofStyle',
'RoofMatl', 'Exterior1st',
'Exterior2nd', 'MasVnrType',
'ExterQual', 'ExterCond',
'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure',
'BsmtFinType1',
'BsmtFinType2', 'Heating',
'HeatingQC', 'CentralAir',
'Electrical', ...]),
('standardscaler',
StandardScaler(),
SimpleImputer())])),
('decisiontreeregressor', DecisionTreeRegressor())])
Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
dtypes: float64(3), int64(34), object(43)
memory usage: 912.6+ KB
Target:
0 208500
1 181500
2 223500
3 140000
4 250000
...
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, Length: 1460, dtype: int64
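[Comments]:
Could you provide an example of what model, data, and target look like when printed? It sounds like you are mixing data types in one of them.
Hi, I have added examples of the model, data, and target to the post. I could not put them here because I hit the character limit of the comments section.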
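[Answer 1]:
If one of your transformers consists of more than one estimator, as is the case here for the numerical columns with StandardScaler(), SimpleImputer(), you need to wrap them in a pipeline, for example: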
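import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(111)
data = pd.DataFrame(np.random.uniform(0,1,(100,3)),columns=['n1','n2','n3'])
data['c1'] = np.random.choice(['A','B',],100)
target = np.random.normal(0,1,100)
cat_columns = selector(dtype_include=object)(data)
num_columns = selector(dtype_exclude=object)(data)
# Chain the numerical steps into a single transformer
num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer())
])
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), cat_columns),
    (num_transformer, num_columns),
)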
A quick test on the dataset shows that it works:
preprocessor.fit_transform(data)[:2]
array([[ 0. , 1. , 0.42472149, -1.23160187, -0.1782728 ],
[ 0. , 1. , 0.95749076, -0.79751471, -1.12825996]])
Then run everything:
model = make_pipeline(preprocessor, DecisionTreeRegressor())
cv_results = cross_validate(
    model, data, target, cv=10, return_estimator=True, n_jobs=2,
)
scores = cv_results["test_score"]
array([-2.45981423, -7.88563769, -1.15523361, -0.56772717, -0.84663734,
-0.61938564, -3.1854688 , -1.44865232, -0.41933732, -3.13719368])
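Putting the pieces together, the same fix can be written with make_pipeline, as in the question's code. This is a minimal sketch under stated assumptions rather than the asker's exact setup: the data is synthetic, the column names n1, n2, and c1 are hypothetical, the column lists are spelled out explicitly instead of using a selector, and the imputer is placed before the scaler, which is a common ordering but not the one used in the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (hypothetical columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.uniform(size=(100, 2)), columns=["n1", "n2"])
data.loc[::10, "n1"] = np.nan  # inject some missing numerical values
data["c1"] = rng.choice(["A", "B"], size=100)
target = 3 * data["n2"] + rng.normal(scale=0.1, size=100)

numerical_columns = ["n1", "n2"]
categorical_columns = ["c1"]

# One transformer per column group: the two numerical steps are chained
# into a single pipeline so each tuple is (transformer, columns).
numerical_preprocessor = make_pipeline(SimpleImputer(), StandardScaler())
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    (numerical_preprocessor, numerical_columns),
)
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))

cv_results = cross_validate(model, data, target, cv=10)
scores = cv_results["test_score"]
print(f"R2 score by cross-validation: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that for a regressor the default score reported by cross_validate is the coefficient of determination (R2), not an accuracy, which is why the scores in the answer above can be negative.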