Spark 独立集群轮胎访问本地 python.exe
Posted
技术标签:
【中文标题】Spark 独立集群轮胎访问本地 python.exe【英文标题】:Spark standalone cluster tires to access local python.exe 【发布时间】:2017-04-07 12:07:19 【问题描述】:在 spark 集群上执行 python 应用程序时遇到以下异常:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/04/07 10:57:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 2) / 2]17/04/07 10:57:07 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.2.113, executor 0): java.io.IOException: Cannot run program "C:\Users\<local-user>\Anaconda2\python.exe": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:120)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more
17/04/07 10:57:07 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<redacted-absolute-path>.py", line 49, in <module>
eco_rdd
File "C:\spark\python\pyspark\rdd.py", line 2139, in zipWithIndex
nums = self.mapPartitions(lambda it: [sum(1 for i in it)]).collect()
File "C:\spark\python\pyspark\rdd.py", line 809, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.2.113, executor 0): java.io.IOException: Cannot run program "C:\Users\<local-user>\Anaconda2\python.exe": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:120)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot run program "C:\Users\<local-user>\Anaconda2\python.exe": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:120)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more
集群(在同一网络中的远程 PC 上)以某种方式尝试访问本地 python(安装在执行驱动程序的本地工作站上):
Caused by: java.io.IOException: Cannot run program "C:\Users\<local-user>\Anaconda2\python.exe": CreateProcess error=2, The system cannot find the file specified
Spark 2.1.0
Spark 独立集群在 Windows 10 上运行
工作站在 Windows 7 上运行
连接到集群并使用spark-shell
(交互式)执行任务没有问题
连接到集群并使用pyspark
(交互式)执行任务没有问题
从pycharm运行直接导致上面的异常
使用spark-submit
执行会导致类似问题(=> 尝试访问我的本地 python)
正在使用 findspark
已设置 PYSPARK_PYTHON 和 PYSPARK_DRIVER_PYTHON 环境变量
集群和工作站上的 Python:Python 2.7.13 :: Anaconda custom(64 位)
提前感谢您的帮助
【问题讨论】:
只是一个疯狂的想法——您是否尝试在执行程序配置中强制使用PYSPARK_PYTHON
,即在您的spark-defaults.conf
中添加spark.executorEnv.PYSPARK_PYTHON C:\some\path\python.exe
(或通过--conf
在命令行中添加)?
我尝试以各种不同的方式设置该配置,但得到了同样的错误。在某些情况下,它尝试使用远程 python 实例执行本地驱动程序(=除此之外的另一个异常)。您必须在本地工作站上设置这些配置吗?为了确定,我尝试在(本地和远程)两者上都设置它。
两个环境变量都被spark-submit
实用程序使用;理论上,PYSPARK_PYTHON
的值应该由执行者使用,它应该指的是执行者机器上的 Python 安装;并且PYSPARK_DRIVER_PYTHON
的值(如果存在,否则默认为 PYSPARK_PYTHON 值)应该由驱动程序使用,它应该引用驱动程序机器上的 Python 安装 (可能与 spark-submit
机器不同-- 例如yarn-cluster
模式!)
请注意,Spark 2.1 文档提到了一种使用属性而不是 env 变量来选择 Python 的方法——也许这些属性也存在于 1.6 和 2.0 中,但没有记录在案,也许它是一个新功能...... @ 987654321@下spark.pyspark.python
等
感谢您的回答。我只是尝试添加spark.pyspark.driver.python
和spark.pyspark.python
属性。到目前为止没有任何变化。我在测试中主要使用该命令的变体:%SPARK_HOME%\bin\spark-submit --conf spark.pyspark.python=%PYSPARK_PYTHON% --conf spark.pyspark.driver.python=%PYSPARK_DRIVER_PYTHON% --conf spark.executorEnv.PYSPARK_PYTHON=%PYSPARK_PYTHON% --master spark://%SPARK_MASTER_HOST%:%SPARK_MASTER_PORT% --py-files some.zip <my-spark-application>.py
【参考方案1】:
所以我现在找到了解决问题的方法。对于我的设置,问题是 findspark 库。删除允许我在集群上运行我的所有任务。
如果您有类似的问题,但与 findspark 无关,我会向您指出上面 Samson Scharfrichter 的 cmets。
【讨论】:
以上是关于Spark 独立集群轮胎访问本地 python.exe的主要内容,如果未能解决你的问题,请参考以下文章