VarianceThreshold Function for Data Cleansing
[Title]: VarianceThreshold Function For Data Cleansing
[Posted]: 2021-11-11 19:51:45
[Question]: I want to use the following function to see how many features are selected at different variance thresholds.
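import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.sql.DataFrame

// Alias used for readability; a threshold is a plain Double.
type Threshold = Double

def varianceThreshold(df: DataFrame, thresholds: Seq[Threshold]): Seq[(Threshold, DataFrame)] =
  thresholds.map { threshold =>
    // Fit one selector per threshold against the "features" vector column.
    val selector = new VarianceThresholdSelector()
      .setVarianceThreshold(threshold)
      .setFeaturesCol("features")
      .setOutputCol("selectedFeatures")
    (threshold, selector.fit(df).transform(df))
  }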
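So far so good. I have a DataFrame that looks like this:
Now my question is: if col2 is the target variable, i.e. the value I am trying to predict, how can I group all the other columns so that I can pass them in as features? For example, I saw this example in the Spark documentation: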
import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
  (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
  (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
  (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
  (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
  (6, Vectors.dense(8.0, 9.0, 6.0, 0.0, 0.0, 0.0))
)

val df = spark.createDataset(data).toDF("id", "features")

val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(8.0)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

// Braces are required here: "$selector.getVarianceThreshold" would interpolate only `selector`.
println(s"Output: Features with variance lower than" +
  s" ${selector.getVarianceThreshold} are removed.")
result.show()
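To make the selection rule in the example above concrete, here is a minimal plain-Scala sketch (no Spark required) that recomputes the per-feature variance for the same six rows. It assumes the selector compares the unbiased sample variance (dividing by n - 1) against the threshold and keeps only features whose variance is strictly greater, which is how the Spark documentation describes it. The names `VarianceSketch`, `sampleVariance`, and `selectedIndices` are illustrative and not part of Spark.

```scala
object VarianceSketch {
  // Same six feature vectors as the documentation example, as plain Scala sequences.
  val rows: Seq[Seq[Double]] = Seq(
    Seq(6.0, 7.0, 0.0, 7.0, 6.0, 0.0),
    Seq(0.0, 9.0, 6.0, 0.0, 5.0, 9.0),
    Seq(0.0, 9.0, 3.0, 0.0, 5.0, 5.0),
    Seq(0.0, 9.0, 8.0, 5.0, 6.0, 4.0),
    Seq(8.0, 9.0, 6.0, 5.0, 4.0, 4.0),
    Seq(8.0, 9.0, 6.0, 0.0, 0.0, 0.0)
  )

  // Unbiased sample variance: squared deviations from the mean, divided by n - 1.
  def sampleVariance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
  }

  // One variance per feature: transpose turns rows into columns.
  val variances: Seq[Double] = rows.transpose.map(sampleVariance)

  // Indices of features whose variance is strictly greater than the threshold.
  def selectedIndices(threshold: Double): Seq[Int] =
    variances.zipWithIndex.collect { case (v, i) if v > threshold => i }

  def main(args: Array[String]): Unit = {
    variances.zipWithIndex.foreach { case (v, i) => println(f"feature $i: variance $v%.2f") }
    println(s"kept at threshold 8.0: ${selectedIndices(8.0).mkString(", ")}")
  }
}
```

With a threshold of 8.0, this keeps the features at indices 0, 2, 3, and 5, and drops the two low-variance features at indices 1 and 4.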
So what would featuresCol be in my case, or rather, how can I combine my individual columns into the vector column that featuresCol expects?
[Answer 1]: Here is what I did to get the result I wanted:
import org.apache.spark.ml.feature.{VarianceThresholdSelector, VectorAssembler}
import org.apache.spark.sql.DataFrame

type Threshold = Double

def varianceThreshold(df: DataFrame, thresholds: Seq[Threshold]): Seq[(Threshold, DataFrame)] = {
  // Combine every column except the first into a single vector column.
  // Note: `df.columns.tail` drops only the first column, whatever it is.
  val assembler = new VectorAssembler()
    .setInputCols(df.columns.tail)
    .setOutputCol("features")
  val output = assembler.transform(df)

  // Fit one selector per threshold on the assembled features.
  thresholds.map { threshold =>
    val selector = new VarianceThresholdSelector()
      .setVarianceThreshold(threshold)
      .setFeaturesCol("features")
      .setOutputCol("selectedFeatures")
    (threshold, selector.fit(output).transform(output))
  }
}
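Since the question treats col2 as the prediction target and asks how many features survive each threshold, a variation on the approach above may be useful. The sketch below excludes the label column by name instead of relying on column position, and reads the count from the fitted model's `selectedFeatures` index array. It assumes Spark 3.1 or later (where `VarianceThresholdSelector` is available) and that all feature columns are numeric. The helper `countSelected`, the object name, and the sample data with columns `col1` to `col4` are hypothetical, since the original DataFrame is not shown.

```scala
import org.apache.spark.ml.feature.{VarianceThresholdSelector, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectedFeatureCounts {
  // Assemble every column except `labelCol` into "features", then report
  // how many features each variance threshold keeps.
  def countSelected(df: DataFrame, labelCol: String, thresholds: Seq[Double]): Seq[(Double, Int)] = {
    val featureCols = df.columns.filter(_ != labelCol)
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(df)

    thresholds.map { threshold =>
      val model = new VarianceThresholdSelector()
        .setVarianceThreshold(threshold)
        .setFeaturesCol("features")
        .setOutputCol("selectedFeatures")
        .fit(assembled)
      // selectedFeatures holds the indices of the features that were kept.
      (threshold, model.selectedFeatures.length)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("variance-threshold-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: col2 is the prediction target, the rest are features.
    val df = Seq(
      (1.0, 0.0, 10.0, 5.0),
      (2.0, 1.0, 10.0, 9.0),
      (3.0, 0.0, 10.0, 1.0)
    ).toDF("col1", "col2", "col3", "col4")

    countSelected(df, "col2", Seq(0.0, 2.0)).foreach { case (threshold, n) =>
      println(s"threshold=$threshold -> $n feature(s) kept")
    }
    spark.stop()
  }
}
```

Filtering by name keeps the label out of the feature vector regardless of where it sits in the schema, whereas `df.columns.tail` would only work if the label happened to be the first column.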