pyspark.ml: Type error when computing precision and recall

Posted: 2018-04-09 09:19:07

Question:

I am trying to compute precision, recall, and F1 for a classifier using pyspark.ml:

model = completePipeline.fit(training)
predictions = model.transform(test)

mm = MulticlassMetrics(predictions.select(["label", "prediction"]).rdd)


labels = sorted(predictions.select("prediction").rdd.distinct().map(lambda r: r[0]).collect())

for label in labels:
    print(label)
    print("Precision = %s" % mm.precision(label=label))
    print("Recall = %s" % mm.recall(label=label))
    print("F1 Score = %s" % mm.fMeasure(label=label))

metrics = pandas.DataFrame([(label, mm.precision(label=label), mm.recall(label=label), mm.fMeasure(label=label)) for label in labels],
                           columns=["Label", "Precision", "Recall", "F1"])

Schema of the resulting DataFrame predictions:

[('features', 'vector'), ('label', 'int'), ('rawPrediction', 'vector'), ('probability', 'vector'), ('prediction', 'double')]

Error message triggered by the call to mm.precision:

Traceback (most recent call last):
  File "ml_pipeline_factory_test", line 1, in <module>
  File "ml_pipeline_factory_test", line 92, in ml_pipeline_factory_test
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/evaluation.py", line 240, in precision
    return self.call("precision", float(label))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 146, in call
    return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o371.precision.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 22.0 failed 4 times, most recent failure: Lost task 7.3 in stage 22.0 (TID 153, dhbpdn12.de.t-internal.com, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 245, in main
    process()
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 240, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/session.py", line 677, in prepare
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1402, in verify_struct
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1415, in verify_default
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1310, in verify_acceptable_types
TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>


Answer 1:

As the error message states:

TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>

This is a typing problem. While `int` and `float` are usually interchangeable in Python, they are not in Java.
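The strictness can be reproduced without a cluster. The snippet below is a rough sketch, not pyspark's actual implementation, of the kind of check `pyspark.sql.types` performs: `DoubleType` accepts a Python `float` but rejects an `int` outright.

```python
def verify_double(field_name, value):
    # Sketch of the strict verifier: DoubleType takes float, never int,
    # because the value is handed to the JVM as a Java Double.
    if type(value) is not float:
        raise TypeError("field %s: DoubleType can not accept object %r in type %s"
                        % (field_name, value, type(value)))

verify_double("prediction", 0.0)  # passes silently
try:
    verify_double("label", 0)     # fails, just like the traceback above
except TypeError as e:
    print(e)
```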

The simplest solution is to cast the label field upstream:

predictions = (predictions
    .withColumn("label", predictions["label"].cast("double")))
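
If you prefer not to alter the DataFrame, an alternative (a hypothetical sketch, not from the original answer) is to cast both fields while building the RDD of (label, prediction) pairs; the mapping itself is plain Python and can be tried on bare tuples:

```python
# Hypothetical alternative: cast both fields to float when building the RDD,
# so neither column can trip the DoubleType check.
to_pair = lambda row: (float(row[0]), float(row[1]))

# On the cluster this would be used roughly as:
#   mm = MulticlassMetrics(
#       predictions.select("label", "prediction").rdd.map(to_pair))

rows = [(0, 0.0), (1, 1.0), (1, 0.0)]
print([to_pair(r) for r in rows])  # [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)]
```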

Comments:

- The problem is "label" (('label', 'int')), not "prediction". Please check whether the edit helps.
- Yes, that fixed the type error, thanks.
- Is it just me, or is it odd that evaluating a classifier expects the labels to be of double type?
- Yes, but it is standard Spark behavior, and it comes down to implementation simplicity. The original mllib API used LabeledPoint for both classification and regression models, which is why Double is used there. Moreover, all the low-level libraries used under the hood operate on floating-point numbers and do not support integers. The same thing happens elsewhere (there is always some BLAS underneath, and it does not support integers); it is just transparent there.
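
As a sanity check on what MulticlassMetrics returns, per-label precision, recall, and F1 can be recomputed by hand from the (label, prediction) pairs. This is a stand-alone sketch of the same definitions, not Spark code:

```python
def per_label_metrics(pairs, label):
    # Per-label counts over (label, prediction) pairs.
    tp = sum(1 for y, p in pairs if y == label and p == label)
    fp = sum(1 for y, p in pairs if y != label and p == label)
    fn = sum(1 for y, p in pairs if y == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pairs = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
print(per_label_metrics(pairs, 1.0))  # precision 2/3, recall 1.0, F1 0.8
```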
