Amazon EMR: Pyspark having strange dependency issues
Posted: 2016-01-31 21:59:35

I'm having trouble getting a pyspark job to run on an EMR cluster, so I logged into the master node and ran spark-submit directly there.
I have a python file that I submit to pyspark, and in this file I have:
import subprocess
from pyspark import SparkContext, SparkConf
import boto3
from boto3.s3.transfer import S3Transfer
import os, re
import tarfile
import time
...
When I try to run it in cluster mode, I get (taken from the logs, trimmed for brevity):
16/01/31 21:45:57 INFO spark.CacheManager: Partition rdd_2_0 not found, computing it
16/01/31 21:45:57 INFO spark.CacheManager: Partition rdd_1_0 not found, computing it
16/01/31 21:45:57 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Later on I get errors about being unable to import boto3 at all.
If I run in client mode, I just get the ImportError about boto3.s3.transfer:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-39-79.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
However, if I check pip freeze:
boto3==1.2.3
botocore==1.3.23
And if I open a pyspark shell on the master and do the following:
import boto3
client = boto3.client("s3")
It works fine.
Is there some kind of virtual environment thing going on here? I'm completely confused.
EDIT: Forgot to mention that I'm using the latest EMR release with Spark 1.6.0.
Also, this works fine in local mode on my own machine.
Comments:
Regardless of your problem, consider using Spark's own scripts to launch an EC2 cluster rather than EMR.
@member555 I'd really rather not; the whole idea behind this project is to provision it easily through EMR, which ought to be supported. Sure, I could do that, but then I'd have to use setup scripts, which is even more painful.

Answer 1:
Welp, silly me, I found the problem.
It turns out I had to pip install boto3; the EMR nodes don't come with it installed by default.
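That also explains why the pyspark shell on the master worked while the job failed: the shell import only runs on the driver, but any function used inside an RDD operation is pickled, shipped to the executors, and unpickled there, and unpickling needs boto3.s3.transfer to be importable on the worker. That is exactly where the traceback above blows up (pickle.loads in worker.py). Below is a hypothetical reduction of the failing pattern, not the original job; the bucket and keys are placeholders:

from pyspark import SparkContext

import boto3
from boto3.s3.transfer import S3Transfer  # succeeds on the driver; the master has boto3


def download(key):
    # Spark pickles this function and ships it to the executors. Because it
    # references boto3 and S3Transfer, unpickling on a worker has to import
    # boto3.s3.transfer there, raising the ImportError from the logs above
    # when the package is missing on that node.
    client = boto3.client("s3")
    S3Transfer(client).download_file("my-bucket", key, "/tmp/" + key)  # placeholder bucket
    return key


sc = SparkContext()
done = sc.parallelize(["a.tar.gz", "b.tar.gz"]).map(download).collect()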
This is one of those cases where the error was actually quite descriptive.
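For a repeatable fix, the install can be done at cluster-creation time with a bootstrap action, since bootstrap actions run on the master and on every core/task node before any Spark work starts. Here is a minimal sketch using boto3's EMR client; the script path, instance types, and release label are assumptions rather than details from the question, and the referenced bootstrap script would just contain something like sudo pip install boto3:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Hypothetical bootstrap script stored in S3; it installs boto3 on each node
# as the node starts, before any Spark work runs.
response = emr.run_job_flow(
    Name="pyspark-with-boto3",
    ReleaseLabel="emr-4.3.0",  # an EMR 4.x release that ships Spark 1.6.0
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-python-deps",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install_boto3.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

The same BootstrapActions block can be set from the EMR console or the AWS CLI instead; the point is simply that the pip install has to happen on every node, not just the master.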