Output file not saved to my bucket in AWS S3

Posted 2023-04-15

[Title]: output file not saved on my bucket, in AWS s3 [Posted]: 2017-01-24 15:54:11 [Question]:

I am trying to follow this tutorial from AWS, and I am at the quick example step. https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/

When I run this command:

aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE

my output file does not appear in my bucket, even though EMR reports the step as completed.

SparkWordCountApp   Completed   2017-01-24 16:35 (UTC+1)    10 seconds

This is the wordcount Python file:

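from __future__ import print_function
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    # Expect exactly two arguments: the input path and the output path.
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="WordCount")
    # Read the input file as an RDD of lines.
    text_file = sc.textFile(sys.argv[1])
    # Split each line into words, pair each word with 1, then sum the counts per word.
    counts = (text_file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    # Write the (word, count) pairs to the output location.
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()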

This is the log file from the cluster:

17/01/25 14:40:19 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/01/25 14:40:19 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (20480+2048 MB) is above the max threshold (11520 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
    at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:304)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:164)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'

I am using m3.xlarge instances.

[Comments]:

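What value is spark.executor.memory set to?

From the command line, it is 20g.

Yes, you already mentioned it; I missed that. Each m3.xlarge instance has only 15g, but the executor requests 20g+2g, and the YARN configuration allows at most 11.5g. Can you reduce it to 8g and try running it again?

@franklinsijo, I tried that. The Python file runs fine, but I still get no output file.

Has outputbucket already been created? And your input.txt is not empty, is it?

[Answer 1]: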

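Try making the output directory a subdirectory rather than the bucket root. I can't speak for the EMR S3 client, but I know that Hadoop S3A has had some rename()-related problems in the past when the destination was the root of a bucket. Otherwise, turn on logging and see what the com.aws modules print.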
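The suggestion above can be sketched as a small check applied to the output argument before the step is submitted. This is only an illustration: the helper names `is_bucket_root` and `with_output_prefix` and the default prefix `wordcount-output` are hypothetical, not part of Spark, EMR, or the tutorial.

```python
from urllib.parse import urlparse


def is_bucket_root(uri):
    """Return True if an s3://-style URI has no key prefix after the bucket name."""
    parsed = urlparse(uri)
    return parsed.path.strip("/") == ""


def with_output_prefix(uri, prefix="wordcount-output"):
    """Append a subdirectory when the URI is a bucket root; otherwise return it unchanged.

    The default prefix is an arbitrary, hypothetical name chosen for this sketch.
    """
    if is_bucket_root(uri):
        return uri.rstrip("/") + "/" + prefix + "/"
    return uri


if __name__ == "__main__":
    # The output location used in the question is a bucket root.
    print(with_output_prefix("s3://outputbucket/"))
```

With the command from the question, this would mean passing something like the printed URI as the last argument instead of `s3://outputbucket/`.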

[Discussion]:

I have added the log file to my question.
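The exception in the log file comes from a memory check that runs before the application is submitted to YARN. Here is a minimal sketch of that arithmetic, assuming the default executor memory overhead used by Spark releases of that period (10% of executor memory, with a 384 MB minimum), which matches the "20480+2048 MB" in the log. The function names are hypothetical.

```python
def yarn_container_request_mb(executor_memory_mb, overhead_percent=10, min_overhead_mb=384):
    """Return the total memory requested per executor container, in MB.

    Assumption: overhead defaults to 10% of executor memory, with a 384 MB floor.
    """
    overhead_mb = max(min_overhead_mb, executor_memory_mb * overhead_percent // 100)
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb, max_allocation_mb):
    """Return True if the container request is within the cluster's per-container maximum."""
    return yarn_container_request_mb(executor_memory_mb) <= max_allocation_mb


if __name__ == "__main__":
    max_mb = 11520  # per-container maximum reported in the log
    for label, mem_mb in (("20g", 20 * 1024), ("8g", 8 * 1024)):
        request = yarn_container_request_mb(mem_mb)
        print(label, request, "fits" if fits_in_yarn(mem_mb, max_mb) else "too large")
```

Under these assumptions, `--executor-memory 20g` asks for 22528 MB per container, which is above the 11520 MB limit, while 8g stays within it. This also explains why the step finished in 10 seconds without producing any output.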
