Why does StandardScaler not attach metadata to the output column?
【Asked】: 2017-06-20 11:09:24
【Question】: I noticed that the ml StandardScaler does not attach metadata to the output column:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")

val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
  .fit(df)

val dft = plm.transform(df)
dft.schema("scaledFeatures").metadata
This gives:
res1: org.apache.spark.sql.types.Metadata = {}
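This example works with this dataset (just adjust the path in the code above).
Is there a specific reason for this? Is this feature likely to be added to Spark in the future? Any suggestions for a workaround that does not involve duplicating StandardScaler?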
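【Answer 1】: While discarding the metadata is probably not the most fortunate choice, scaling indexed categorical features makes no sense. The values returned by StringIndexer are just labels.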
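To illustrate why scaled indices carry no meaning, here is a minimal plain-Scala sketch that needs no Spark. The helper names `indexByFrequency` and `standardize` are hypothetical: the first only imitates StringIndexer's default frequency-descending ordering, and the second is a plain z-score, not StandardScaler's exact defaults.

```scala
object IndexScalingSketch {
  // Imitates StringIndexer's default ordering: the most frequent label gets 0.0.
  // Assumption: ties between equally frequent labels are broken arbitrarily here.
  def indexByFrequency(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .sortBy { case (_, occurrences) => -occurrences.size }
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap

  // Plain z-score using the sample standard deviation.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
    val std = math.sqrt(variance)
    xs.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    val makes = Seq("ford", "ford", "ford", "toyota", "toyota", "bmw")
    val index = indexByFrequency(makes)
    val scaled = standardize(makes.map(index))
    // The scaled values depend only on how often each make occurs,
    // not on any property of the cars themselves.
    makes.zip(scaled).distinct.foreach { case (make, z) =>
      println(f"$make%-7s $z%.3f")
    }
  }
}
```

The printed distances between makes come purely from label frequency, so a model that treats them as magnitudes is reading structure that is not in the data.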
If you want to scale the numeric features, that should be a separate stage:
val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

val finalAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)
Keeping in mind the concern raised at the beginning of this answer, you can also try to copy the metadata:
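// The $"..." column syntax needs spark.implicits._ (already in scope in spark-shell).
import spark.implicits._

// Re-alias the scaled column and attach the metadata of the unscaled input column.
val result = plm.transform(df).transform(df =>
  df.withColumn(
    "scaledFeatures",
    $"scaledFeatures".as(
      "scaledFeatures",
      df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata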
{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}
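Once the metadata has been copied, it can be read back as ML attributes instead of raw JSON. The following is a minimal sketch under some assumptions: Spark is on the classpath, a toy two-column DataFrame stands in for the cars dataset, and the object and method names (`CopyMetadataSketch`, `copiedAttributeNames`) are hypothetical.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CopyMetadataSketch {
  // Returns the attribute names found on the scaled column after copying
  // the metadata over from the unscaled column.
  def copiedAttributeNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Toy data (an assumption) standing in for the numeric columns of the dataset.
    val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("v0", "v1")

    val assembled = new VectorAssembler()
      .setInputCols(Array("v0", "v1"))
      .setOutputCol("featuresRaw")
      .transform(df)

    val scaled = new StandardScaler()
      .setInputCol("featuresRaw")
      .setOutputCol("scaledFeatures")
      .fit(assembled)
      .transform(assembled)

    // Same trick as in the answer: re-alias the column with the input's metadata.
    val withMeta = scaled.withColumn(
      "scaledFeatures",
      col("scaledFeatures").as("scaledFeatures", scaled.schema("featuresRaw").metadata))

    // Parse the column metadata into an AttributeGroup and collect the names.
    val group = AttributeGroup.fromStructField(withMeta.schema("scaledFeatures"))
    group.attributes.map(_.toSeq.flatMap(_.name)).getOrElse(Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("copy-metadata-sketch")
      .getOrCreate()
    println(copiedAttributeNames(spark).mkString(", "))
    spark.stop()
  }
}
```

Note that the copied attributes describe the unscaled column, so any value-dependent attribute fields (for example min or max, if present) would no longer match the scaled data; names and nominal values remain valid.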
【Comments】:
That's a good point that I hadn't considered. Thanks!