pyspark: ImportError: No module named numpy
【Posted】2016-12-23 19:02:37 【Question】I am using pyspark, and the following code produces a result RDD:
import numpy
from pyspark.mllib.fpm import PrefixSpan  # import needed for PrefixSpan.train below
model = PrefixSpan.train(input_rdd,minSupport=0.1)
result = model.freqSequences().filter(lambda x: (x.freq >= 50)).filter(lambda x: (len(x.sequence) >=2) ).cache()
input_rdd looks fine when I check input_rdd.take(5). The code above creates an RDD named result of the following form:
PythonRDD[103] at RDD at PythonRDD.scala:48
I do have numpy installed, but whenever I try result.take(5) or result.count(), I keep getting the error below.
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-32-7e589dce550c> in <module>()
----> 1 result.take(5)
/usr/local/spark-latest/python/pyspark/rdd.py in take(self, num)
1308
1309 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1310 res = self.context.runJob(self, takeUpToNumLeft, p)
1311
1312 items += res
/usr/local/spark-latest/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
939 # SparkContext#runJob.
940 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 941 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
942 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
943
/usr/local/spark-latest/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark-latest/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling 012.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 85, ph-hdp-abc-dn07): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data/0/yarn/nm/usercache/abc-test/appcache/application_1482412711394_0011/container_e16_1482412711394_0011_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/data/0/yarn/nm/usercache/abc-test/appcache/application_1482412711394_0011/container_e16_1482412711394_0011_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/data/0/yarn/nm/usercache/abc-test/appcache/application_1482412711394_0011/container_e16_1482412711394_0011_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/data/0/yarn/nm/usercache/abc-test/appcache/application_1482412711394_0011/container_e16_1482412711394_0011_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/data/0/yarn/nm/usercache/abc-test/appcache/application_1482412711394_0011/container_e16_1482412711394_0011_01_000002/pyspark.zip/pyspark/mllib/__init__.py", line 28, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Does anyone know what I am missing? Thanks!
【Question comments】:
Could you show how you got to that RDD? The error is raised before the call, so Spark cannot show you the RDD's elements when you call it.
@DmitryPolonskiy: I have edited the question to show how the result RDD is generated. Please advise. Thanks!
What @user said may well be valid, but the fact that you hit the error when trying to collect result probably means there is a problem in the code you used to produce result.
【Answer 1】:
If the import does not fail on the driver side, it means that numpy is not accessible to the interpreters used by the executors. This can happen in a few cases:
- numpy is not installed at all (it is missing on the worker nodes).
- numpy is installed on the worker nodes, but the workers are misconfigured:
  - numpy is installed, but it is missing from the interpreter's path.
  - numpy is installed, but in a different environment/interpreter than the one the workers actually use.
A quick way to check which interpreter and which numpy installation each executor sees is sketched below.
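The following is a minimal diagnostic sketch (not part of the original answer). It assumes a live SparkContext named sc and reports, for each executor that runs a partition, which Python executable it uses and which numpy version (if any) it can import:

def probe(_):
    # Runs inside an executor process: report its interpreter path and numpy availability.
    import sys
    try:
        import numpy
        version = numpy.__version__
    except ImportError:
        version = None
    return [(sys.executable, version)]

# One task per default partition; distinct() collapses executors that report the same result.
print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
        .mapPartitions(probe)
        .distinct()
        .collect())

If any tuple comes back with None as the version, that executor's interpreter cannot import numpy.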
【Comments】:
【Answer 2】: I copied your code, and it looks like @user7337271 is right. This particular module needs numpy in order to work, as the first few lines of its source show. Here is the code I used to verify that the problem may indeed be that numpy is installed only on your master node.
import numpy
from pyspark.mllib.fpm import PrefixSpan
data = [[["a", "b"], ["c"]],[["a"], ["c", "b"], ["a", "b"]],[["a", "b"], ["e"]],[["f"]]]
rdd = sc.parallelize(data)
model = PrefixSpan.train(rdd, minSupport=0.1)
result = model.freqSequences().filter(lambda x: (x.freq >= 2)).filter(lambda x: (len(x.sequence) >=2) ).cache()
result.collect()
[FreqSequence(sequence=[[u'a'], [u'c']], freq=2)]
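If the check confirms that the executors run an interpreter without numpy, the direct fix is to install numpy on every worker node. Alternatively, you can point Spark at an interpreter that already has numpy. The sketch below is an assumption, not part of the original answers: the path shown is hypothetical and must exist on every node, the environment variable must be set before the SparkContext is created, and with spark-submit it is more commonly set in spark-env.sh or in the shell environment at submit time.

import os

# Assumption: this interpreter exists on every node and has numpy installed.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("prefixspan-with-numpy"))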
【Comments】: