Error "AttributeError: 'Py4JError' object has no attribute 'message'" when building a DecisionTreeModel
Posted: 2017-05-07 19:06:51
Question: I am working through Chapter 4 of O'Reilly's "Advanced Analytics with Spark". The book is written in Scala, and I am having trouble converting this code to Python.
Scala code
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression._

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  val featureVector = Vectors.dense(values.init)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}

val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()
cvData.cache()
testData.cache()

import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.tree._
import org.apache.spark.mllib.tree.model._
import org.apache.spark.rdd._

def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]):
    MulticlassMetrics = {
  val predictionsAndLabels = data.map(example =>
    (model.predict(example.features), example.label)
  )
  new MulticlassMetrics(predictionsAndLabels)
}

val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int,Int](), "gini", 4, 100)
val metrics = getMetrics(model, cvData)
metrics.confusionMatrix
My Python code
from pyspark.sql.functions import col, split
import pyspark.mllib.linalg as linal
import pyspark.mllib.regression as regre
import pyspark.mllib.evaluation as eva
import pyspark.mllib.tree as tree
import pyspark.rdd as rd

raw_data = sc.textFile("covtype.data")

def fstDecisionTree(line):
    values = list(map(float, line.split(",")))
    featureVector = linal.Vectors.dense(values[:-1])
    label = values[-1] - 1
    ret = regre.LabeledPoint(label, featureVector)
    return regre.LabeledPoint(label, featureVector)

data = raw_data.map(fstDecisionTree)
trainData, cvData, testData = data.randomSplit([0.8, 0.1, 0.1])
trainData.cache()
cvData.cache()
testData.cache()

def help_lam(model):
    def _help_lam(dataline):
        print(dataline)
        a = dataline.collect()
        return (model.predict(a[1]), a[0])
    return _help_lam

def getMetrics(model, data):
    print(type(model), type(data))
    predictionsAndLabels = data.map(help_lam(model))
    return eva.MulticlassMetrics(predictionsAndLabels)

n_targets = 7
max_depth = 4
max_bin_count = 100
model = tree.DecisionTree.trainClassifier(trainData, n_targets, {}, "gini", max_depth, max_bin_count)
metrics = getMetrics(model, cvData)
When I run this, the following error is raised as the map implicitly iterates over the data and calls the inner function _help_lam(dataline) defined inside help_lam(model):
AttributeError: 'Py4JError' object has no attribute 'message'
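Answer 1: I think the problem is in the model.predict function.
From pyspark mllib/tree.py:
Note: In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead.
What you can do is pass the feature vectors directly, like this: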
>>> rdd = sc.parallelize([[1.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
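The wording of the reported error also hints at a second, incidental issue: on Python 3, exception objects have no `message` attribute, so any code that reads `e.message` while handling the original `Py4JError` fails with an `AttributeError` and hides the underlying message. The sketch below reproduces that effect in plain Python. `FakePy4JError` and both helper functions are hypothetical stand-ins made up for illustration; the snippet uses neither py4j nor Spark, and the post does not show where `.message` is actually read.

```python
class FakePy4JError(Exception):
    """Hypothetical stand-in for a Py4J error (not the real py4j class)."""


def describe(exc):
    # Portable way to read an exception's text on Python 3.
    return str(exc)


def describe_python2_style(exc):
    # Python 2-era idiom; Python 3 exceptions have no .message attribute,
    # so this line raises AttributeError instead of returning the text.
    return exc.message


err = FakePy4JError("underlying JVM-side failure")
portable_text = describe(err)

try:
    describe_python2_style(err)
    legacy_failure = None
except AttributeError as attr_err:
    # The AttributeError replaces the original error in what the user sees.
    legacy_failure = str(attr_err)

print(portable_text)
print(legacy_failure)
```

If this is what happens here, the `AttributeError` is only a symptom; the underlying cause is still the call to `model.predict` inside the function passed to `map`.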
Edit:
An updated version of your getMetrics could be:
def getMetrics(model, data):
    labels = data.map(lambda d: d.label)
    features = data.map(lambda d: d.features)
    predictions = model.predict(features)
    predictionsAndLabels = predictions.zip(labels)
    return eva.MulticlassMetrics(predictionsAndLabels)
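To make the pairing step concrete, here is a minimal plain-Python sketch of what zipping predictions with labels produces and how a confusion matrix can be tallied from the pairs. It uses lists as stand-ins for the two RDDs, the label and prediction values are made up, and the counting logic is an illustration rather than Spark's own implementation of MulticlassMetrics.

```python
from collections import Counter

# Stand-ins for the two RDDs: plain lists kept in the same order.
labels = [0.0, 1.0, 2.0, 1.0, 0.0]        # made-up true labels
predictions = [0.0, 1.0, 1.0, 1.0, 2.0]   # made-up model output

# List equivalent of predictions.zip(labels): (prediction, label) pairs.
predictions_and_labels = list(zip(predictions, labels))

# Count how often each (actual, predicted) combination occurs.
counts = Counter((label, pred) for pred, label in predictions_and_labels)

classes = sorted(set(labels) | set(predictions))

# Rows are actual labels, columns are predicted labels.
confusion = [
    [counts[(actual, predicted)] for predicted in classes]
    for actual in classes
]

# Correct predictions sit on the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(classes))) / len(labels)

for row in confusion:
    print(row)
print(accuracy)
```

On real RDDs, zip assumes that both sides have the same number of partitions and the same number of elements in each partition, so the pairing relies on predictions and labels staying aligned with data.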