Batching multiple files to Amazon S3 using the Java SDK
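Posted: 2015-08-09 03:59:58
Question:
I am trying to upload multiple files to Amazon S3 by appending them, so that they all end up under the same key. I have a list of file names and want to upload/append the files in that order. I am following this tutorial almost exactly, except that I first loop over each file and upload it in parts. Because the files are on HDFS (the paths are actually org.apache.hadoop.fs.Path), I use an input stream to send the file data. Below is some pseudocode (blocks taken verbatim from the tutorial are replaced by comments):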
// Create a list of UploadPartResponse objects. You get one of these for
// each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();

// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
        bk.getBucket(), bk.getKey());
InitiateMultipartUploadResult initResponse =
        s3Client.initiateMultipartUpload(initRequest);

try {
    int i = 1; // part number
    for (String file : files) {
        Path filePath = new Path(file);

        // Get the input stream and content length
        long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
        InputStream is = fss.get(branch).open(filePath);

        long filePosition = 0;
        while (filePosition < contentLength) {
            // create request
            // upload part and add response to our list
            i++;
        }
    }

    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new
            CompleteMultipartUploadRequest(bk.getBucket(),
                                           bk.getKey(),
                                           initResponse.getUploadId(),
                                           partETags);
    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
    //...
}
However, I get the following error:
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1109)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:741)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2617)
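If anyone knows what might be causing this error, I would greatly appreciate the help. Alternatively, if there is a better way to concatenate a bunch of files into a single S3 key, that would be great too. I tried using Java's built-in SequenceInputStream, but that did not work. For reference, the total size of all the files can be as large as 10-15 GB.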
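With 10-15 GB of data, it can help to plan the part layout before sending any request. S3 multipart uploads allow at most 10,000 parts, and every part except the last must be at least 5 MiB. The sketch below does not use the AWS SDK and is not taken from the question or the tutorial: PartPlanner, choosePartSize and planParts are hypothetical names, and treating the concatenated files as one contiguous byte range is an assumption about how the parts would be cut. It only illustrates the sizing arithmetic and makes no claim about the cause of the MalformedXML error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plans (offset, length) pairs for a multipart upload.
public class PartPlanner {
    // S3 limits: parts other than the last must be at least 5 MiB,
    // and one upload may contain at most 10,000 parts.
    static final long MIN_PART_SIZE = 5L * 1024 * 1024;
    static final int MAX_PARTS = 10_000;

    // Smallest part size that satisfies both limits for the given total.
    static long choosePartSize(long totalSize) {
        long neededForPartLimit = (totalSize + MAX_PARTS - 1) / MAX_PARTS;
        return Math.max(MIN_PART_SIZE, neededForPartLimit);
    }

    // Splits [0, totalSize) into consecutive parts; only the last may be short.
    static List<long[]> planParts(long totalSize, long partSize) {
        List<long[]> parts = new ArrayList<>();
        long offset = 0;
        while (offset < totalSize) {
            long length = Math.min(partSize, totalSize - offset);
            parts.add(new long[] {offset, length});
            offset += length;
        }
        return parts;
    }

    public static void main(String[] args) {
        long total = 15L * 1024 * 1024 * 1024; // 15 GiB, as in the question
        long partSize = choosePartSize(total);
        List<long[]> parts = planParts(total, partSize);
        System.out.println("part size: " + partSize + ", parts: " + parts.size());
    }
}
```

Part numbers would then be assigned as 1, 2, 3, ... in the order of the planned list, which is also the order in which the collected part ETags are passed to the completion request.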
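Answer 1:
I know this may be a bit late, but here is my contribution for what it's worth.
I managed to solve a similar problem using SequenceInputStream.
The trick is to compute the total size of the resulting file up front, and then supply an Enumeration<InputStream> to SequenceInputStream.
Here is some sample code that might help: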
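import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import com.amazonaws.services.s3.model.ObjectMetadata;

// getFiles(), getContentLength(), s3Client and LOG belong to the enclosing class (not shown).
public void combineFiles() {
    List<String> files = getFiles();
    // Total size of the combined object, used as the upload's content length.
    long totalFileSize = files.stream()
            .map(this::getContentLength)
            .reduce(0L, (f, s) -> f + s);
    try (InputStream partialFile = new SequenceInputStream(getInputStreamEnumeration(files))) {
        ObjectMetadata resultFileMetadata = new ObjectMetadata();
        resultFileMetadata.setContentLength(totalFileSize);
        s3Client.putObject("bucketName", "resultFilePath", partialFile, resultFileMetadata);
    } catch (IOException e) {
        LOG.error("An error occurred while combining files. ", e);
    }
}

private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
    return new Enumeration<InputStream>() {
        private Iterator<String> fileNamesIterator = files.iterator();

        @Override
        public boolean hasMoreElements() {
            return fileNamesIterator.hasNext();
        }

        @Override
        public InputStream nextElement() {
            // Each file is opened lazily, only when SequenceInputStream asks for it.
            try {
                return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
            } catch (FileNotFoundException e) {
                System.err.println(e.getMessage());
                throw new RuntimeException(e);
            }
        }
    };
}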
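The ordering and length bookkeeping behind this approach can be tried without S3 or the file system. The following is a minimal sketch, assuming in-memory byte arrays stand in for the files; ConcatDemo, concat, totalLength and readAll are hypothetical names used only for illustration. Unlike the lazy Enumeration above, Collections.enumeration wraps streams that are all created up front, which is harmless for byte arrays but would keep many file handles open with real files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo: concatenates byte arrays in list order via SequenceInputStream.
public class ConcatDemo {
    // Builds one stream that yields each chunk in order.
    static InputStream concat(List<byte[]> chunks) {
        List<InputStream> streams = new ArrayList<>();
        for (byte[] chunk : chunks) {
            streams.add(new ByteArrayInputStream(chunk));
        }
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    // Sum of all chunk sizes: the value that would be passed as the content length.
    static long totalLength(List<byte[]> chunks) {
        long total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        return total;
    }

    // Drains a stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> chunks = Arrays.asList(
                "first;".getBytes(StandardCharsets.UTF_8),
                "second;".getBytes(StandardCharsets.UTF_8),
                "third".getBytes(StandardCharsets.UTF_8));
        try (InputStream combined = concat(chunks)) {
            byte[] all = readAll(combined);
            System.out.println(new String(all, StandardCharsets.UTF_8));
            System.out.println(all.length == totalLength(chunks));
        }
    }
}
```

The check at the end compares the number of bytes actually read with the precomputed total, which is the same invariant the upload relies on when the content length is set ahead of time.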
Hope this helps!