Apache Spark error unloading data to AWS S3 using Hadoop

I am using Apache Spark v2.3.1 and trying to unload data to AWS S3 after processing, with something like this:

data.write().parquet("s3a://" + bucketName + "/" + location);

The configuration seems fine:

        String region = System.getenv("AWS_REGION");
        String accessKeyId = System.getenv("AWS_ACCESS_KEY_ID");
        String secretAccessKey = System.getenv("AWS_SECRET_ACCESS_KEY");

        spark.sparkContext().hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsRegion", region);
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsAccessKeyId", accessKeyId);
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secretAccessKey);
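
As a side note, the documented S3A credential keys in hadoop-aws are fs.s3a.access.key and fs.s3a.secret.key; depending on the hadoop-aws version, the awsAccessKeyId-style names above may not be picked up. A minimal sketch using the documented keys (the endpoint line is an assumption about the bucket's region and only applies if your hadoop-aws version supports fs.s3a.endpoint):

        // Documented S3A credential keys (current hadoop-aws releases)
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", accessKeyId);
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", secretAccessKey);

        // Optional and version-dependent: pin the bucket's endpoint explicitly
        // (the region-based hostname is an assumption about this bucket)
        spark.sparkContext().hadoopConfiguration().set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com");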

%HADOOP_HOME% points to exactly the same version that Spark uses (v2.6.5) and has been added to the Path:

C:>hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  key                  manage keys via the KeyProvider
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME 

The Maven dependency matches as well:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>2.6.5</version>
    </dependency>
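
hadoop-aws also needs the AWS SDK it was built against on the classpath. If it is not pulled in transitively, something like the following may be needed as well (1.7.4 is an assumption about what the 2.6.x line was compiled against; check the hadoop-aws 2.6.5 POM for the exact version):

    <!-- Assumed SDK version for the Hadoop 2.6.x line; verify against the hadoop-aws POM -->
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.7.4</version>
    </dependency>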

But I still get the following error when writing. Any ideas?

Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method) ~[hadoop-common-2.6.5.jar:?]
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:557) ~[hadoop-common-2.6.5.jar:?]
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977) ~[hadoop-common-2.6.5.jar:?]
Answer

Yes, I was missing a step. Take the contents of https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.4/bin and put them into %HADOOP_HOME%\bin. Even though the versions do not match (v2.6.5 vs v2.6.4), this still seems to work.
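
For completeness, that bin directory also contains hadoop.dll, which provides the native NativeIO$Windows.access0 method from the stack trace. If you would rather not rely on the environment variable alone, a minimal sketch of setting the equivalent system property from code (C:\hadoop is a hypothetical path; it must be the parent of the bin folder holding winutils.exe and hadoop.dll, and must be set before the first Hadoop filesystem call):

    public final class WindowsHadoopSetup {
        // Minimal sketch: "C:\\hadoop" is a hypothetical directory whose bin
        // subfolder holds winutils.exe and hadoop.dll. Call this (or set the
        // HADOOP_HOME environment variable) before any Hadoop filesystem access.
        public static void configureHadoopHome() {
            System.setProperty("hadoop.home.dir", "C:\\hadoop");
        }
    }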
