Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

Posted: 2016-02-26 16:22:04

【Question】

I am running Bernoulli Naive Bayes with this code:

val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

My question is how to get the probability of membership in class 0 (or class 1) and compute the AUC. I want to get a result similar to what I get with `LogisticRegressionWithSGD` or `SVMWithSGD`, where I use this code:

val numIterations = 100

val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC() 

Unfortunately, this code does not work for `NaiveBayes`.

【Comments】

OK, this is a two-in-one question. Which version of Spark are you using? And which probabilities do you want?

Spark 1.5.0. I want P(Y=0|X); with that I can compute the AUC, right?

Yes, it is a binary classification. I am using spark.mllib.

【Answer 1】

Regarding the probabilities for Bernoulli Naive Bayes, here is an example:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1", "1,1 1 0"))

// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

// labels 
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println

// For one specific vector; here, the first vector in parsedData
val testVector = parsedData.first.features
println(s"For vector $testVector => probability : ${model.predictProbabilities(testVector)}")
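For readers who want to see what `predictProbabilities` computes under the hood, here is a Spark-free Python sketch of Bernoulli naive Bayes with the same smoothing scheme as MLlib (class priors smoothed by λ per class, feature likelihoods by 2λ). The function names are made up for illustration. On the five dummy rows above with `lam = 3.0`, it reproduces the same posteriors as the trained model:

```python
import math

def train_bernoulli_nb(rows, lam=1.0):
    """Train a tiny Bernoulli naive Bayes model.

    rows: list of (label, [0/1 feature list]) pairs.
    Returns (log_prior, log_theta) dicts keyed by label, where
    log_theta[c][j] = log P(feature j = 1 | class c) with smoothing lam.
    """
    labels = sorted({y for y, _ in rows})
    n, d = len(rows), len(rows[0][1])
    log_prior, log_theta = {}, {}
    for c in labels:
        docs = [x for y, x in rows if y == c]
        nc = len(docs)
        log_prior[c] = math.log((nc + lam) / (n + len(labels) * lam))
        log_theta[c] = [
            math.log((sum(x[j] for x in docs) + lam) / (nc + 2 * lam))
            for j in range(d)
        ]
    return log_prior, log_theta

def predict_probabilities(x, log_prior, log_theta):
    """Return class posteriors P(y=c | x) for a binary feature vector x."""
    scores = {}
    for c, lp in log_prior.items():
        s = lp
        for j, xj in enumerate(x):
            # x_j = 1 contributes log(theta), x_j = 0 contributes log(1 - theta)
            s += log_theta[c][j] if xj else math.log(1.0 - math.exp(log_theta[c][j]))
        scores[c] = s
    # Normalize the log posteriors into probabilities (numerically stable)
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Dummy data mirroring the RDD above ("label,features")
rows = [(0.0, [1, 0, 0]), (1.0, [0, 1, 0]), (1.0, [0, 0, 1]),
        (0.0, [1, 0, 1]), (1.0, [1, 1, 0])]
prior, theta = train_bernoulli_nb(rows, lam=3.0)
print(predict_probabilities([1, 0, 0], prior, theta))
# {0.0: ~0.5973, 1.0: ~0.4027}
```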

As for the AUC:

// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
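One caveat: `model.predict` returns hard 0.0/1.0 labels, so the ROC built from them has a single operating point; for the usual ranking-based AUC you can pass the class-1 probability from `predictProbabilities` as the score instead. The AUC itself is just a rank statistic: the probability that a random positive example scores above a random negative one. A Spark-free Python sketch (the function name `auc_from_scores` is made up for illustration):

```python
def auc_from_scores(pairs):
    """Area under the ROC curve from (score, label) pairs.

    Counts, over all positive/negative pairs, how often the positive
    example outscores the negative one (ties count as 1/2).
    """
    pos = [s for s, y in pairs if y == 1.0]
    neg = [s for s, y in pairs if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Scores would be P(Y=1|X) taken from predictProbabilities
pairs = [(0.9, 1.0), (0.4, 1.0), (0.6, 0.0), (0.2, 0.0)]
print(auc_from_scores(pairs))  # 0.75
```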

Regarding the follow-up question from the chat:

val results = parsedData.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  for (i <- 0 until probs.size) yield (lp.label, labels(i), probs(i))
}.flatMap(identity)

results.take(10).foreach(println)

// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)

If you are only interested in the argmax class:

val results = training.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10) foreach println

// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)

Note: this works with Spark 1.5+.

Edit: (for PySpark users)

It seems some people have trouble getting probabilities with pyspark and mllib. That is expected: spark-mllib does not expose that functionality for pyspark.

So you need to use the spark-ml DataFrame-based API instead:

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes

df = spark.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])

nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)

model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction                            |probability                             |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0  |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0       |
# |[0.0,1.0]|0.0  |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0       |
# |[1.0,0.0]|1.0  |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0       |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
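The `rawPrediction` column holds the unnormalized class log posteriors, and `probability` is their normalized exponential (softmax). A quick Spark-free Python check against the first row of the table above:

```python
import math

def softmax(scores):
    """Normalize log-posterior scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# rawPrediction from the first row above (features [0.0, 0.0])
raw = [-1.4916548767777167, -2.420368128650429]
print(softmax(raw))  # ~[0.7168, 0.2832] — the probability column
```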

You then just need to select your probability column and compute your AUC.

For more on Naive Bayes in spark-ml, see the official documentation here.

【Comments】

Thanks a lot! I changed it slightly and can now get (label, P(y=0|x)): `val results = test.map { lp => val probs: Vector = model.predictProbabilities(lp.features); val MyList = List.range(0, (probs.size - 1), 2); for (i <- MyList) yield (lp.label, probs(i)) }.flatMap(identity)`
