Why does StandardScaler not attach metadata to the output column?

Posted: 2017-06-20 11:09:24

【Question】:

I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")


val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
  .fit(df)

val dft = plm.transform(df)

dft.schema("scaledFeatures").metadata

This gives:

res1: org.apache.spark.sql.types.Metadata = {}

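This example works with this dataset (just adjust the path in the code above).

Is there a specific reason for this? Will this feature be added to Spark in the future? Any suggestions for a workaround that does not involve duplicating StandardScaler?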

【Answer 1】:

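Although discarding the metadata is arguably not the most fortunate choice, scaling indexed categorical features does not make sense. The values returned by StringIndexer are just labels.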
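To illustrate why scaled indices carry no meaning, here is a minimal plain-Scala sketch. It does not use Spark: `indexByFrequency` and `scaleToUnitStd` are hypothetical stand-ins that only mimic the default behaviour of StringIndexer (most frequent label gets index 0; the alphabetical tie-break is an assumption) and StandardScaler (divide by the sample standard deviation, no centering).

```scala
// Hypothetical stand-in for StringIndexer's default ordering:
// labels are ranked by descending frequency; ties are broken alphabetically (assumption).
def indexByFrequency(labels: Seq[String]): Map[String, Double] =
  labels.groupBy(identity).toSeq
    .sortBy { case (label, occurrences) => (-occurrences.size, label) }
    .map(_._1)
    .zipWithIndex
    .map { case (label, i) => label -> i.toDouble }
    .toMap

// Hypothetical stand-in for StandardScaler with withStd = true, withMean = false:
// every value is divided by the corrected sample standard deviation.
def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1))
  xs.map(_ / std)
}

val makes = Seq("ford", "ford", "ford", "toyota", "toyota", "bmw")
val index = indexByFrequency(makes)      // ford -> 0.0, toyota -> 1.0, bmw -> 2.0
val scaled = scaleToUnitStd(makes.map(index))

// Pair each make with its scaled index to inspect the result.
makes.zip(scaled).distinct.foreach(println)
```

After scaling, "bmw" ends up exactly twice as far from "ford" as "toyota" does, but only because it happens to be the rarest make in this sample. The distance reflects the index order, not any property of the cars.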

If you want to scale the numeric features, that should be a separate stage:

val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

val finalAssembler: VectorAssembler = new VectorAssembler() 
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)

Bearing in mind the concern raised at the beginning of this answer, you can also try copying the metadata:

val result = plm.transform(df).transform(df => 
  df.withColumn(
   "scaledFeatures", 
   $"scaledFeatures".as(
     "scaledFeatures", 
     df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata
{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}
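The raw JSON is hard to read, so a small helper can list the attributes instead. The following is only a sketch: `describeAttributes` is a hypothetical name, and the two-attribute field is built by hand so that the snippet does not need the cars data. It relies on `AttributeGroup` from `org.apache.spark.ml.attribute` and needs spark-mllib on the classpath.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructField

// Hypothetical helper: (index, name, kind) for every ML attribute stored in a
// vector column's metadata; returns an empty Seq when no attributes are attached.
def describeAttributes(field: StructField): Seq[(Int, String, String)] =
  AttributeGroup.fromStructField(field).attributes
    .map(_.toSeq)
    .getOrElse(Seq.empty)
    .map(a => (a.index.getOrElse(-1), a.name.getOrElse("?"), a.attrType.name))

// Hand-built stand-in for a column such as "featuresRaw" (illustrative values only).
val attrs: Array[Attribute] = Array(
  NumericAttribute.defaultAttr.withName("v0"),
  NominalAttribute.defaultAttr.withName("v7_IDX").withValues("ford", "chevrolet"))
val withMetadata: StructField = new AttributeGroup("featuresRaw", attrs).toStructField()

// A vector column without ML metadata, like the StandardScaler output in the question.
val withoutMetadata: StructField = StructField("scaledFeatures", SQLDataTypes.VectorType)

describeAttributes(withMetadata).foreach(println)
println(describeAttributes(withoutMetadata).isEmpty)
```

In a real session the same helper could be applied to `result.schema("scaledFeatures")` to confirm that the copied attributes are in place.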

【Comments】:

That's a good point that I hadn't considered. Thanks!