Spark On Windows -- rdd.count() doesn't work [duplicate]

【Posted】: 2018-09-10 03:08:39 【Problem Description】:

I installed Spark on Windows mostly by following the instructions in the book "Frank Kane's Taming Big Data with Apache Spark and Python", which seem consistent with other instructions I have found online. The setup involves installing Java, Python, Scala, and Spark, and setting the environment variables and paths. I can run Java and Python. To run pyspark I have to run pyspark.cmd (I am using the Canopy Command Prompt), and that does launch Spark.
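
(As a sanity check, the environment that pyspark actually sees can be inspected from inside the shell. This is only a diagnostic sketch: JAVA_HOME, SPARK_HOME, and HADOOP_HOME are the variable names a typical Spark-on-Windows setup defines, and the names in your setup may differ.)

>>> import os, subprocess
>>> for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
...     print(var, "=", os.environ.get(var))   # None means the variable is not set
...
>>> subprocess.call(["java", "-version"])      # shows which JVM is first on PATH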

Then I ran: rdd = sc.textFile("README.md")

followed by

rdd.count()

But I got this error:

>>> rdd = sc.textFile("README.md")
>>> rdd.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\spark\python\pyspark\rdd.py", line 1073, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "C:\spark\python\pyspark\rdd.py", line 1064, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "C:\spark\python\pyspark\rdd.py", line 935, in fold
    vals = self.mapPartitions(func).collect()
  File "C:\spark\python\pyspark\rdd.py", line 834, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
        at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:844)

What am I doing wrong?

【Question Comments】:

【Solution 1】:

I think you need to put the README file into HDFS, or use the full path to the file with the "file://" prefix.
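
For example, a minimal sketch of that suggestion (assuming Spark was unpacked to C:\spark; substitute the directory that actually holds your README.md):

>>> rdd = sc.textFile("file:///C:/spark/README.md")   # full local path with the file:// scheme
>>> rdd.count()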

【Discussion】:
