Location of Scala: MatchError

Posted: 2017-07-07 18:50:18

【Question】:
ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 20)
scala.MatchError: [0.0,(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

I am seeing this error in my Scala program, where I am trying to classify movie reviews with a NaiveBayes classifier. The error appears when I try to train the NaiveBayes classifier. I cannot fix it because I do not know what data type the classifier expects. The NaiveBayes documentation says it takes an RDD of LabeledPoint entries, which is what I have. Any help would be appreciated. My full Scala code for this movie-review classification program is below.

PS: Please ignore any indentation errors that may appear in the code; it is indented correctly in my program file. Thanks in advance.
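
For reference, mllib's NaiveBayes.train(input: RDD[LabeledPoint], lambda, modelType) expects each training example as a LabeledPoint pairing a Double label with an mllib feature vector. A minimal sketch of that input type (the feature values here are purely illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Each training example: a Double label plus an mllib (not ml) feature vector.
    val example = LabeledPoint(1.0, Vectors.dense(0.69, 0.0, 0.28))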

    import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql._
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, PCA}
    import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg._


    //Reading the file from csv into dataframe object
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.option("header", "true").option("delimiter",",").option("inferSchema", "true").csv("movie-pang02.csv")


     //Tokenizing the data by splitting the text into words
     val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
     val wordsData = tokenizer.transform(df)


     //Hashing the data by converting the words into rawFeatures
     val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
     val featurizedData = hashingTF.transform(wordsData)


      //Applying Estimator on the data which converts the raw features into features by scaling each column
      val idf  = new IDF().setInputCol("rawFeatures").setOutputCol("features")
      val idfModel = idf.fit(featurizedData)
      val rescaledData = idfModel.transform(featurizedData)

      val coder: (String => Int) = (arg: String) => if (arg == "Pos") 1 else 0
      val sqlfunc = udf(coder)
      val new_set = rescaledData.withColumn("label", sqlfunc(col("class")))

      val EntireDataRdd = new_set.select("label","features").map { case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) }


    //Converted the data into RDD<LabeledPoint> format so as to input it into the inbuilt Naive Bayes classifier
    val labeled = EntireDataRdd.rdd
    val Array(trainingData, testData) = labeled.randomSplit(Array(0.7, 0.3), seed = 1234L)
    //Error in the following statement
    val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

    val predictionAndLabel = testData.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testData.count()
    val testErr = predictionAndLabel.filter(r => r._1 != r._2).count.toDouble / testData.count()

【Comments】:

【Answer 1】:

This is a painful (and not uncommon) trap - you are matching the contents of the Row against the wrong Vector class - it should be org.apache.spark.ml.linalg.Vector and not org.apache.spark.mllib.linalg.Vector... (yes - frustrating!)

Adding the correct imports before the mapping solves the problem:

import org.apache.spark.ml.linalg.Vector // and not org.apache.spark.mllib.linalg.Vector!
import org.apache.spark.mllib.linalg.Vectors // and not org.apache.spark.ml.linalg.Vectors!

val EntireDataRdd = new_set.select("label","features").map {
  case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}

【Discussion】:

"This is a painful (and not uncommon) trap." Couldn't have put it better. Although, I have tried this and here is the error I get: found: org.apache.spark.ml.linalg.Vector, required: org.apache.spark.mllib.linalg.Vector, at: val EntireDataRdd = new_set.select("label","features").map { case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) }

No need to feel forever indebted - accepting the answer is enough ;)

The error appears on this line: val EntireDataRdd = new_set.select("label","features").map { case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) }

Hmm... make sure you import Vector but not Vectors: you should match on org.apache.spark.ml.linalg.Vector, but create the dense vector with org.apache.spark.mllib.linalg.Vectors. Effectively, you are converting an ml vector into a dense mllib vector. I have edited the answer to clarify this...
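
Putting the discussion together, here is a minimal sketch of the corrected mapping, assuming (as in the question) an integer label column and an ml-vector features column; trainingRdd is an illustrative name:

    import org.apache.spark.sql.Row
    import org.apache.spark.ml.linalg.Vector      // the type the DataFrame column actually holds
    import org.apache.spark.mllib.linalg.Vectors  // the factory for the vector LabeledPoint needs
    import org.apache.spark.mllib.regression.LabeledPoint

    // Match on the ml Vector stored in the DataFrame, then rebuild it as a
    // dense mllib vector for LabeledPoint. Vectors.fromML(features), available
    // since Spark 2.0, is an alternative that preserves sparsity.
    val trainingRdd = new_set.select("label", "features").rdd.map {
      case Row(label: Int, features: Vector) =>
        LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }

Mapping over the underlying RDD rather than the Dataset sidesteps the implicit Encoder machinery, and the result can be fed straight to NaiveBayes.train.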
