PySpark crashes when I run the `first` or `take` method on Windows 7

Posted: 2015-11-18 09:36:13

【Question】:

I just ran these commands:

>>> lines = sc.textFile("C:\Users\elqstux\Desktop\dtop.txt")

>>> lines.count()  # this works fine

>>> lines.first()  # this crashes
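
A side note on the path itself: in a non-raw Python string literal, backslashes can be interpreted as escape sequences (on Python 3, `"C:\Users\..."` is even a syntax error because of `\U`). A raw string or forward slashes avoid this; a minimal sketch using the same path:

    >>> lines = sc.textFile(r"C:\Users\elqstux\Desktop\dtop.txt")   # raw string
    >>> lines = sc.textFile("C:/Users/elqstux/Desktop/dtop.txt")    # forward slashes work too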

Here is the error report:

>>> lines.first()

    15/11/18 17:33:35 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
    15/11/18 17:33:35 INFO DAGScheduler: Got job 21 (runJob at PythonRDD.scala:393) with 1 output partitions
    15/11/18 17:33:35 INFO DAGScheduler: Final stage: ResultStage 21(runJob at PythonRDD.scala:393)
    15/11/18 17:33:35 INFO DAGScheduler: Parents of final stage: List()
    15/11/18 17:33:35 INFO DAGScheduler: Missing parents: List()
    15/11/18 17:33:35 INFO DAGScheduler: Submitting ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43), which has no missing parents
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=619446, maxMem=555755765
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 4.7 KB, free 529.4 MB)
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(3067) called with curMem=624270, maxMem=555755765
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 3.0 KB, free 529.4 MB)
    15/11/18 17:33:35 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on localhost:55487 (size: 3.0 KB, free: 529.9 MB)
    15/11/18 17:33:35 INFO SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:861
    15/11/18 17:33:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43)
    15/11/18 17:33:35 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
    15/11/18 17:33:35 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, localhost, PROCESS_LOCAL, 2148 bytes)
    15/11/18 17:33:35 INFO Executor: Running task 0.0 in stage 21.0 (TID 33)
    15/11/18 17:33:35 INFO HadoopRDD: Input split: file:/C:/Users/elqstux/Desktop/dtop.txt:0+112852
    15/11/18 17:33:36 INFO PythonRunner: Times: total = 629, boot = 626, init = 3, finish = 0
    15/11/18 17:33:36 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
    java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 ERROR PythonRunner: This may have been caused by a prior exception:
    java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 ERROR Executor: Exception in task 0.0 in stage 21.0 (TID 33)
    java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
    15/11/18 17:33:36 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

    15/11/18 17:33:36 ERROR TaskSetManager: Task 0 in stage 21.0 failed 1 times; aborting job
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Cancelling stage 21
    15/11/18 17:33:36 INFO DAGScheduler: ResultStage 21 (runJob at PythonRDD.scala:393) failed in 0.759 s
    15/11/18 17:33:36 INFO DAGScheduler: Job 21 failed: runJob at PythonRDD.scala:393, took 0.810138 s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1317, in first
        rs = self.take(1)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
        res = self.context.runJob(self, takeUpToNumLeft, p)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
        port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
      File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
        return f(*a, **kw)
      File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

    Driver stacktrace:
            at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
            at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
            at scala.Option.foreach(Option.scala:236)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
            at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
            at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:483)
            at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
            at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
            at py4j.Gateway.invoke(Gateway.java:259)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.GatewayConnection.run(GatewayConnection.java:207)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Connection reset by peer: socket write error
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
            at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

It also crashes when I run `take`. I can't find the cause; can anyone help me?
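
One thing worth ruling out first (an assumption on my part, not something these logs prove): a worker dying as soon as it starts exchanging data over the local socket is a common symptom of the workers being launched with a different Python than the driver. A minimal check from the same pyspark shell:

    >>> import sys, os
    >>> sys.version                       # the Python running the driver
    >>> os.environ.get("PYSPARK_PYTHON")  # interpreter used for the workers; None means the default "python" on PATH

If the two disagree, pinning the workers before starting pyspark may help (the interpreter path below is hypothetical):

    set PYSPARK_PYTHON=C:\Python27\python.exe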

【Comments】:

Does `count` return the value you expect? I've found that importing text files as RDDs can produce formats I didn't expect, usually one big string that I then have to format myself.

I have the same problem, but the error seems random: running `take` or `first` repeatedly sometimes succeeds, though for me it mostly fails. I'm running Python 3.5.

【Answer 1】:

I was stuck on the same problem for hours on Windows 7 with Spark 1.5.0 (Python 2.7.11). I only got past it by switching to Unix with exactly the same build. It's not an elegant solution, but I couldn't find any other way around the problem.
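
If switching operating systems is not an option, one more knob that may be worth trying (an untested guess on my part, not verified against this exact crash) is disabling Python worker reuse, so that every task gets a fresh worker process instead of talking to a possibly dead one:

    bin\pyspark --conf spark.python.worker.reuse=false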

【Comments】:
