Spark DataFrame CountVectorizedModel Error With DataType String


Posted: 2022-01-05 00:25:30

Question:

I have the following code that attempts to perform a simple operation: converting a sparse vector to a dense vector. Here is what I have so far:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector // ml, not mllib: transform() produces ml vectors
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Identify how many distinct values are in the OCEAN_PROXIMITY column
val distinctOceanProximities = dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()

// Use the distinct categories as the model's fixed vocabulary
val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(dfRaw)
  
// One output column per category, named after the category itself
val exprs = distinctOceanProximities.indices.map(i => $"features".getItem(i).alias(distinctOceanProximities(i)))
val vecToSeq = udf((v: Vector) => v.toArray)

val df2 = cvmDF.withColumn("features", vecToSeq($"sparseFeatures")).select(exprs:_*)
df2.show()

When I run this script, I get the following error:

java.lang.IllegalArgumentException: requirement failed: Column ocean_proximity must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type string.
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:63)
  at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema(CountVectorizer.scala:97)
  at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema$(CountVectorizer.scala:95)
  at org.apache.spark.ml.feature.CountVectorizerModel.validateAndTransformSchema(CountVectorizer.scala:272)
  at org.apache.spark.ml.feature.CountVectorizerModel.transformSchema(CountVectorizer.scala:338)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
  at org.apache.spark.ml.feature.CountVectorizerModel.transform(CountVectorizer.scala:306)
  ... 101 elided

I think it is expecting a sequence of strings, but I only have a single string. Any ideas on how to fix this?
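For context, `CountVectorizerModel` counts token occurrences inside an array column, which is why its schema check insists on `array<string>` and rejects a plain `string`. A minimal illustration of the two cases (hypothetical data, assuming a running `SparkSession` named `spark`):

```scala
import org.apache.spark.ml.feature.CountVectorizerModel
import spark.implicits._

// array<string> column: accepted by the schema check
val ok = Seq(Seq("NEAR BAY"), Seq("INLAND")).toDF("ocean_proximity")

// string column: rejected with the IllegalArgumentException above
val bad = Seq("NEAR BAY", "INLAND").toDF("ocean_proximity")

val model = new CountVectorizerModel(Array("NEAR BAY", "INLAND"))
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")

model.transform(ok).show()
// model.transform(bad) // throws: Column ocean_proximity must be of type ... array<string> ...
```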


Answer 1:

This turned out to be simple. All I had to do was convert the column from a string to an array of strings, like so:

val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
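Putting it together, the question's snippet can be patched like this. This is a sketch, not tested against the asker's data: `dfRaw` and the column names come from the question, the vocabulary is assumed to be the distinct categories themselves, and `Vector` must be imported from `org.apache.spark.ml.linalg` (not `mllib`) so the UDF's parameter type matches the vector type that `transform()` emits:

```scala
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector // ml, not mllib
import org.apache.spark.sql.functions.{array, col, udf}
import spark.implicits._

// Wrap the single string in a one-element array so the schema check passes
val oceanProximityAsArrayDF =
  dfRaw.withColumn("ocean_proximity", array(col("ocean_proximity")))

val distinctOceanProximities =
  dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()

// Use the distinct categories as the fixed vocabulary
val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(oceanProximityAsArrayDF)

// Densify the sparse vector, then expand it into one column per category
val vecToSeq = udf((v: Vector) => v.toArray)
val exprs = distinctOceanProximities.indices.map(i =>
  $"features".getItem(i).alias(distinctOceanProximities(i)))

val df2 = cvmDF.withColumn("features", vecToSeq($"sparseFeatures")).select(exprs: _*)
df2.show()
```

Since each row holds exactly one category, this amounts to a one-hot encoding; `StringIndexer` followed by `OneHotEncoder` (already imported in the question) would be the more conventional route for that.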

