Can't get Spark to work on IPython Notebook in Windows

Posted: 2016-02-05 18:06:07

Question:

I have Spark installed on a Windows 10 machine and it runs fine from the PySpark console. Recently, though, I tried to configure IPython Notebook to use that Spark installation, so I did the following setup:

import os
import sys

# Point the notebook at the local Spark installation and make PySpark importable
os.environ['SPARK_HOME'] = "E:/Spark/spark-1.6.0-bin-hadoop2.6"
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/bin")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/pyspark")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files/Java/jdk1.8.0_51/bin")

This works for creating a SparkContext, and code like the following runs fine:

sc.parallelize([1, 2, 3])
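(The question does not show the cell that actually creates sc; a minimal sketch of what it presumably looks like, assuming plain local mode and an arbitrary app name, would be:)

# Assumed setup, not shown in the original question: create the SparkContext
# in local mode after the path configuration above.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("notebook")
sc = SparkContext(conf=conf)

print(sc.parallelize([1, 2, 3]).collect())   # [1, 2, 3]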

But when I write the following:

file = sc.textFile("E:/scripts.sql")
words = sc.count()

I get the following error:

Py4JJavaError Traceback (most recent call last)
<ipython-input-22-3c172daac960> in <module>()
 1 file = sc.textFile("E:/scripts.sql")
 ----> 2 file.count()

 E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in count(self)
 1002         3
 1003         """
 -> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 1005 
 1006     def stats(self):

 E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in sum(self)
 993         6.0
 994         """
 --> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
 996 
 997     def count(self):

 E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in fold(self, zeroValue, op)
 867         # zeroValue provided to each partition is unique from the one provided
 868         # to the final reduce call
 --> 869         vals = self.mapPartitions(func).collect()
 870         return reduce(op, vals, zeroValue)
 871 

 E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in collect(self)
 769         """
 770         with SCCallSiteSync(self.context) as css:
 --> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
 772         return list(_load_from_socket(port, self._jrdd_deserializer))
 773 

 E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
 811         answer = self.gateway_client.send_command(command)
 812         return_value = get_return_value(
 --> 813             answer, self.gateway_client, self.target_id, self.name)
 814 
 815         for temp_arg in temp_args:

 E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
 306                 raise Py4JJavaError(
 307                     "An error occurred while calling {0}{1}{2}.\n".
 --> 308                     format(target_id, ".", name), value)
 309             else:
 310                 raise Py4JError(

 Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost): org.apache.spark.SparkException: Python worker did not connect back in time
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
... 12 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
... 12 more

Please help me resolve this; I'm working on a project with a tight deadline.

Comments on the question:

Answer 1:

Try escaping the backslashes in the path:

file = sc.textFile("E:\\scripts.sql")
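(Equivalently, a raw string avoids having to escape each backslash:)

# Same path written as a raw string, so no escaping is needed
file = sc.textFile(r"E:\scripts.sql")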

Edit: added a second thing to look at:

Also, I noticed that you are calling:

words = sc.count()

count() is a method on the RDD, not on the SparkContext, so it should be called on file rather than sc. Try this; it works on my Windows 10 installation:

file = sc.textFile("E:/scripts.sql")
words = file.count()

Comments on this answer:

Already tried that, but to no avail. I think the relevant part is the error message Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException: Python worker did not connect back in time

@Maurer Still no way forward. I updated IPython Notebook to Jupyter, but to no avail. The error still starts with ... Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
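(The "Python worker did not connect back in time" failure means the JVM launched a Python worker process but never received a connection back from it. One thing sometimes worth checking, not covered in this thread, is whether Spark is launching the same Python interpreter that the notebook runs; a minimal sketch, with an assumed interpreter path that must be adjusted to the actual installation, is:)

import os

# Assumed example path: replace with the python.exe the notebook actually runs.
# These variables must be set before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = "C:/Python27/python.exe"
os.environ['PYSPARK_DRIVER_PYTHON'] = "C:/Python27/python.exe"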
