PySpark pandas_udfs java.lang.IllegalArgumentException错误

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了PySpark pandas_udfs java.lang.IllegalArgumentException错误相关的知识,希望对你有一定的参考价值。

有人在Windows上运行的本地pyspark会话上使用pandas UDFs有经验吗?我已经在Linux上使用了它们,并取得了不错的效果,但是在Windows机器上却没有成功。

环境:

python==3.7
pyarrow==0.15
pyspark==2.3.4
pandas==0.24

java version "1.8.0_74"

示例脚本:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))


@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())


out_df = df.groupby("id").apply(subtract_mean).toPandas()
print(out_df.head())

# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+

运行了很长一段时间后(将toPandas阶段分成200个任务,每个任务耗时一秒),它返回如下错误:

Traceback (most recent call last):
  File "C:miniconda3envspandas_udflibsite-packagespysparksqldataframe.py", line 1953, in toPandas
    tables = self._collectAsArrow()
  File "C:miniconda3envspandas_udflibsite-packagespysparksqldataframe.py", line 2004, in _collectAsArrow
    sock_info = self._jdf.collectAsArrowToPython()
  File "C:miniconda3envspandas_udflibsite-packagespy4jjava_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:miniconda3envspandas_udflibsite-packagespysparksqlutils.py", line 63, in deco
    return f(*a, **kw)
  File "C:miniconda3envspandas_udflibsite-packagespy4jprotocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o62.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 69 in stage 3.0 failed 1 times, most recent failure: Lost task 69.0 in stage 3.0 (TID 201, localhost, executor driver): java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(Unknown Source)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.hasNext(ArrowConverters.scala:96)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.foreach(ArrowConverters.scala:94)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.to(ArrowConverters.scala:94)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toBuffer(ArrowConverters.scala:94)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toArray(ArrowConverters.scala:94)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
答案

您在java.lang.IllegalArgumentException中的pandas_udfpyarrow版本有关,而不与OS环境有关。有关详情,请参见此issue

您有两次行动:

  1. pyarrow降级到v.0.14,或
  2. 将环境变量ARROW_PRE_0_15_IPC_FORMAT=1添加到SPARK_HOME/conf/spark-env.sh

以上是关于PySpark pandas_udfs java.lang.IllegalArgumentException错误的主要内容,如果未能解决你的问题,请参考以下文章

PySpark。将 Dataframe 传递给 pandas_udf 并返回一个系列

如何在 pyspark.sql.functions.pandas_udf 和 pyspark.sql.functions.udf 之间进行选择?

如何在 Pyspark 中使用 @pandas_udf 返回多个数据帧?

为啥我的应用程序不能以 pandas_udf 和 PySpark+Flask 开头?

在pyspark的pandas_udf中使用外部库

在 PySpark 中使用 pandas_udf 时无法填充数组