Why does Naive Bayes not work in a Spark MLlib Pipeline like Logistic Regression?
【Posted】: 2017-05-08 18:14:10
【Question】: I am working on sentiment analysis of tweets using Spark and Scala. I have a working version that uses a logistic regression model, shown below:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val sqlContext = new SQLContext(sc)
// Sentiment140 training corpus
val trainFile = "s3://someBucket/training.1600000.processed.noemoticon.csv"
val swFile = "s3://someBucket/stopwords.txt"
val tr = sc.textFile(trainFile)
val stopwords: Array[String] = sc.textFile(swFile).flatMap(_.stripMargin.split("\\s+")).collect ++ Array("rt")
val parsed = tr.filter(_.contains("\",\"")).map(_.split("\",\"").map(_.replace("\"", ""))).filter(row => row.forall(_.nonEmpty)).map(row => (row(0).toDouble, row(5))).filter(row => row._1 != 2).map(row => (row._1 / 4, row._2))
val pDF = parsed.toDF("label","tweet")
val tokenizer = new RegexTokenizer().setGaps(false).setPattern("\\pL+").setInputCol("tweet").setOutputCol("words")
val filterer = new StopWordsRemover().setStopWords(stopwords).setCaseSensitive(false).setInputCol("words").setOutputCol("filtered")
val countVectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(50).setRegParam(0.2).setElasticNetParam(0.0)
val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, lr))
val lrModel = pipeline.fit(pDF)
// The model is now trained. Let's load some test data...
val testFile = "s3://someBucket/testdata.manual.2009.06.14.csv"
val te = sc.textFile(testFile)
val teparsed = te.filter(_.contains("\",\"")).map(_.split("\",\"").map(_.replace("\"", ""))).filter(row => row.forall(_.nonEmpty)).map(row => (row(0).toDouble, row(5))).filter(row => row._1 != 2).map(row => (row._1 / 4, row._2))
val teDF = teparsed.toDF("label","tweet")
val res = lrModel.transform(teDF)
val restup = res.select("label","prediction").rdd.map(r => (r(1).asInstanceOf[Double], r(0).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(restup)
metrics.areaUnderROC()
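With logistic regression, this returns a perfectly reasonable AUC. However, when I switch from logistic regression to val nb = new NaiveBayes(), I get the following error: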
found : org.apache.spark.mllib.classification.NaiveBayes
required: org.apache.spark.ml.PipelineStage
val pipeline = new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, nb))
According to the API documentation for the MLlib PipelineStage, both logistic regression and Naive Bayes are listed as subclasses. So why does LR work while NB does not?
【Answer 1】: It does not work because you are using the wrong class. With a Pipeline, use:
org.apache.spark.ml.NaiveBayes
and consult the documentation for the correct syntax.
【Comments】:
Ah. Pipelines do not work with the older .mllib package (which I was using with NB in some legacy code), but they do work with the .ml package (which I was using for the LR model). A slight correction to the above: it is org.apache.spark.ml.classification.NaiveBayes. I wonder why I sloppily also pulled in HashingTF from mllib rather than ml? Oh well. :)
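For illustration, the fix discussed in this thread might look roughly like the sketch below: the Naive Bayes estimator is imported from org.apache.spark.ml.classification, so it is a PipelineStage and can sit in the same Pipeline as the feature transformers. This is a minimal sketch, not the asker's final code. It assumes Spark 2.x or later (it uses SparkSession instead of the SQLContext in the question), and the object name NaiveBayesPipelineSketch, the helper buildPipeline, the toy tweets, and the use of the default English stop-word list (instead of the asker's stopwords file) are all hypothetical stand-ins.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes // DataFrame-based API, not mllib
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object NaiveBayesPipelineSketch {

  // Builds the same tokenizer -> stop-word filter -> count vectorizer chain as the
  // question, with ml.classification.NaiveBayes as the final stage.
  def buildPipeline(): Pipeline = {
    val tokenizer = new RegexTokenizer()
      .setGaps(false).setPattern("\\pL+")
      .setInputCol("tweet").setOutputCol("words")
    // Assumption: default English stop words instead of the asker's custom list.
    val filterer = new StopWordsRemover()
      .setCaseSensitive(false)
      .setInputCol("words").setOutputCol("filtered")
    val countVectorizer = new CountVectorizer()
      .setInputCol("filtered").setOutputCol("features")
    // Multinomial Naive Bayes expects non-negative features such as term counts.
    val nb = new NaiveBayes().setModelType("multinomial").setSmoothing(1.0)
    new Pipeline().setStages(Array(tokenizer, filterer, countVectorizer, nb))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nb-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data standing in for the Sentiment140 DataFrame (pDF).
    val train = Seq(
      (1.0, "I love this great phone"),
      (1.0, "what a wonderful happy day"),
      (0.0, "I hate this awful traffic"),
      (0.0, "what a terrible sad day")
    ).toDF("label", "tweet")

    val nbModel = buildPipeline().fit(train)
    nbModel.transform(train).select("label", "tweet", "prediction").show(false)

    spark.stop()
  }
}
```

With the asker's data, the DataFrame produced by transform should feed the existing BinaryClassificationMetrics code in the same way as the logistic regression output, since it also carries label and prediction columns.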