spark submit pyspark script on yarn throwing maximum recursion depth exceeded

Posted: 2020-09-04 21:14:20


Question:

I can submit the org.apache.spark.examples.SparkPi example jar with spark-submit in yarn-cluster mode and it succeeds, but the pyspark snippet below fails with a maximum recursion depth exceeded error.
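(For reference, a typical jar submission for the SparkPi example in yarn-cluster mode looks roughly like the following; the examples jar path is illustrative and depends on the installation.)

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /path/to/spark-examples.jar 10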

spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 4 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="/usr/bin/python2.7" test.py --verbose

I added the PYSPARK_PYTHON environment variable as suggested in "Pyspark on yarn-cluster mode".
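(If the worker nodes also have the interpreter at /usr/bin/python2.7, the executors can be pointed at it as well; a sketch of the extra conf, with spark.executorEnv.PYSPARK_PYTHON being the only addition over the command above:)

spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 4 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="/usr/bin/python2.7" --conf spark.executorEnv.PYSPARK_PYTHON="/usr/bin/python2.7" test.py --verbose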

test.py

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc_new = SparkContext()
SQLContext = HiveContext(sc_new)  # HiveContext so the SQL runs against the Hive metastore
SQLContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
txt = SQLContext.sql("SELECT 1")
txt.show(2000000, False)  # request up to 2,000,000 rows, without truncating column values

How can I resolve this?

File "/hdfs/data_06/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123855/container_e59_1583989737267_1123855_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
  raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/hdfs/data_10/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123601/container_e59_1583989737267_1123601_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 626, in send_command
File "/hdfs/data_10/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123601/container_e59_1583989737267_1123601_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 749, in send_command
File "/usr/lib64/python2.7/logging/__init__.py", line 1182, in exception
  self.error(msg, *args, **kwargs)
File "/usr/lib64/python2.7/logging/__init__.py", line 1175, in error
  self._log(ERROR, msg, args, **kwargs)
File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log
  self.handle(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle
  self.callHandlers(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers
  hdlr.handle(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 749, in handle
  self.emit(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 879, in emit
  self.handleError(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 802, in handleError
  None, sys.stderr)
File "/usr/lib64/python2.7/traceback.py", line 125, in print_exception
  print_tb(tb, limit, file)
File "/usr/lib64/python2.7/traceback.py", line 69, in print_tb
  line = linecache.getline(filename, lineno, f.f_globals)
File "/usr/lib64/python2.7/linecache.py", line 14, in getline
  lines = getlines(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 40, in getlines
  return updatecache(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 128, in updatecache
  lines = fp.readlines()
RuntimeError: maximum recursion depth exceeded while calling a Python object
Running Spark version 1.6.0, Hive version 1.1.0, Hadoop version 2.6.0-cdh5.13.0.

Comments:

Answer 1:

By calling txt.show(2000000, False) you are asking py4j to shuttle data back and forth between the JVM and Python objects, and your result does not have anywhere near that many rows. I believe the maximum value you can pass to show() is around 2000. Why do you need to display 2,000,000 records when all you are doing is SELECT 1?
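A minimal sketch of that suggestion, keeping the number of rows requested from the JVM small (the row counts here are illustrative):

# Ask the JVM for a small, bounded number of rows instead of 2,000,000.
txt.show(20, False)
# If a larger sample were ever needed, cap it explicitly before displaying.
txt.limit(100).show(100, False)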

Comments:

That shouldn't be the cause of the failure, but I retried with 20 and still get the error. I have updated the error in the original post with more information from the beginning of the log. Thanks.
