preProc = c("center", "scale") meaning in caret's package (R) and min-max normalization
Posted: 2020-09-24 21:26:52

Question: I would like to know how to use preProc in caret's train() function. I am running a neural network with neuralnet inside train(). The code comes from this question.
This is the actual code:
nn <- train(medv ~ .,
            data = df,
            method = "neuralnet",
            tuneGrid = grid,
            metric = "RMSE",
            preProc = c("center", "scale", "nzv"), # good idea to do this with neural nets - your error is due to non scaled data
            trControl = trainControl(
              method = "cv",
              number = 5,
              verboseIter = TRUE)
)
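The original data were not scaled, so the suggestion was to scale them before running the neural network.
However, the preProc argument contains three elements: center, scale, and nzv. I have trouble interpreting these values because I do not know why they are there. In addition, I would like to scale/normalize my data with min-max. This would be the code: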
maxs = apply(pk_dc_only$C, 2, max)
mins = apply(pk_dc_only$C, 2, min)
scaled = as.data.frame(scale(df, center = mins, scale = maxs - mins))
Is it possible to normalize my data with min-max scaling inside preProc?
If so, how do I undo the scaling when predicting?
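Answer 1: The options in c("center", "scale", "nzv") do the centering and scaling (plus a near-zero-variance filter). From the vignette:
method = "center" subtracts the mean of the predictor's data (again from the data in x) from the predictor values while method = "scale" divides by the standard deviation.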
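To make the two definitions concrete, here is a minimal base-R sketch. The toy vector x is made up for illustration, and this is not caret's internal code; it only shows that subtracting the mean and dividing by the standard deviation reproduces what base R's scale() returns.

```r
# Toy predictor values (hypothetical, for illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# "center": subtract the mean of the training data
centered <- x - mean(x)

# "scale": divide by the standard deviation of the training data
standardized <- centered / sd(x)

# Base R's scale() performs both steps by default
via_scale <- as.numeric(scale(x))

all.equal(standardized, via_scale)  # TRUE
mean(standardized)                  # effectively 0 (up to floating point)
sd(standardized)                    # 1 (up to floating point)
```

The mean and standard deviation are estimated on the training data and then reused unchanged on new samples, which is why the fitted preprocessing object has to be kept around for prediction.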
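nzv, in turn, excludes variables whose variance is close to zero, meaning they are almost constant and most likely not useful for prediction. To do min-max, there is an option: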
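The "range" transformation scales the data to be within rangeBounds. If new samples have values larger or smaller than those in the training set, values will be outside of this range.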
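As a rough sketch of what this means with bounds of [0, 1], the snippet below learns the minimum and maximum from training values and reuses them for new samples. It uses base R only; train_x, new_x, and the helper min_max are made-up names for illustration, not caret's implementation.

```r
# Hypothetical training values and new samples
train_x <- c(10, 20, 30, 40, 50)
new_x   <- c(5, 25, 60)

# Bounds are learned from the training data only
lo <- min(train_x)
hi <- max(train_x)

# Min-max scaling to [0, 1] using the training bounds
min_max <- function(v, lo, hi) (v - lo) / (hi - lo)

scaled_train <- min_max(train_x, lo, hi)
scaled_new   <- min_max(new_x, lo, hi)

range(scaled_train)  # 0 1
scaled_new           # -0.125 0.375 1.250
```

The new samples 5 and 60 lie outside the training range, so their scaled values fall below 0 and above 1, exactly as the quoted documentation warns.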
Let's try it below:
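library(mlbench)
data(BostonHousing)
library(caret)

idx = sample(nrow(BostonHousing), 400)
df = BostonHousing[idx, ]
df$chas = as.numeric(df$chas)  # factor -> numeric (1/2) so it can be range-scaled

# fit the min-max scaler on the training rows (medv included)
pre_mdl = preProcess(df, method = "range")

# NOTE: G is not defined in the original answer; this grid is only an assumed
# example (layer1, layer2, layer3 are caret's tuning parameters for "neuralnet")
G = expand.grid(layer1 = c(2, 4), layer2 = 0, layer3 = 0)

nn <- train(medv ~ ., data = predict(pre_mdl, df),
            method = "neuralnet", tuneGrid = G,
            metric = "RMSE", trControl = trainControl(
              method = "cv", number = 5, verboseIter = TRUE))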
nn$preProcess
Created from 400 samples and 13 variables
Pre-processing:
- ignored (0)
- re-scaling to [0, 1] (13)
summary(nn$finalModel$data)
      crim                zn             indus             chas
 Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.000821   1st Qu.:0.0000   1st Qu.:0.1646   1st Qu.:0.0000
 Median :0.002454   Median :0.0000   Median :0.2969   Median :0.0000
 Mean   :0.042130   Mean   :0.1309   Mean   :0.3804   Mean   :0.0625
 3rd Qu.:0.039150   3rd Qu.:0.2000   3rd Qu.:0.6466   3rd Qu.:0.0000
 Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
      nox               rm              age              dis
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
 1st Qu.:0.1276   1st Qu.:0.4470   1st Qu.:0.4032   1st Qu.:0.08522
 Median :0.2819   Median :0.5076   Median :0.7503   Median :0.20133
 Mean   :0.3363   Mean   :0.5232   Mean   :0.6647   Mean   :0.25146
 3rd Qu.:0.4918   3rd Qu.:0.5880   3rd Qu.:0.9361   3rd Qu.:0.38622
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
      rad              tax            ptratio             b
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1304   1st Qu.:0.1770   1st Qu.:0.5106   1st Qu.:0.9475
 Median :0.1739   Median :0.2729   Median :0.6862   Median :0.9861
 Mean   :0.3676   Mean   :0.4171   Mean   :0.6243   Mean   :0.8987
 3rd Qu.:1.0000   3rd Qu.:0.9141   3rd Qu.:0.8085   3rd Qu.:0.9983
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
     lstat           .outcome
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1492   1st Qu.:0.2683
 Median :0.2705   Median :0.3644
 Mean   :0.3069   Mean   :0.3902
 3rd Qu.:0.4220   3rd Qu.:0.4450
 Max.   :1.0000   Max.   :1.0000
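It is not quite clear what you mean by "undo the scaling when predicting". Perhaps you mean converting the predictions back to the original scale: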
test = BostonHousing[-idx,]
test$chas = as.numeric(test$chas)
test_medv = test$medv
test = predict(pre_mdl,test)
The ranges are stored in the preProcess model, under
pre_mdl$ranges
         crim  zn indus chas   nox    rm   age     dis rad tax ptratio      b
[1,]  0.00632   0  0.46    1 0.385 3.561   2.9  1.1691   1 187    12.6   0.32
[2,] 88.97620 100 27.74    2 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
     lstat medv
[1,]  1.73    5
[2,] 36.98   50
So we write a wrapper:
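convert_response = function(value, mdl, method, column) {
  # bounds = c(min, max) of the given column, as stored by preProcess
  bounds = mdl[[method]][, column]
  # invert min-max scaling: x = x_scaled * (max - min) + min
  value * diff(bounds) + min(bounds)
}

plot(test_medv, convert_response(predict(nn, test), pre_mdl, "ranges", "medv"),
     ylab = "predicted")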
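A quick way to sanity-check the wrapper without fitting a model is a round trip on made-up numbers. The sketch below uses base R only; fake_mdl is a hypothetical stand-in that mimics the ranges matrix of a preProcess object (row 1 holds the minimum, row 2 the maximum) and is not a real caret object.

```r
# Same wrapper as above, repeated so the snippet runs on its own
convert_response <- function(value, mdl, method, column) {
  bounds <- mdl[[method]][, column]
  value * diff(bounds) + min(bounds)
}

# Hypothetical stand-in for pre_mdl: a 2 x 1 matrix with min and max of medv
fake_mdl <- list(ranges = cbind(medv = c(5, 50)))

# Made-up responses on the original scale, then min-max scaled by hand
medv   <- c(5, 21.2, 36.5, 50)
scaled <- (medv - 5) / (50 - 5)

# Mapping the scaled values back should recover the original numbers
restored <- convert_response(scaled, fake_mdl, "ranges", "medv")
all.equal(restored, medv)  # TRUE
```

Because the wrapper reads the bounds by column name, the same function can be reused for any column stored in the ranges matrix, not just the response.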