How to speed up reading and writing base64-encoded gzipped large files in Java

Posted 2023-04-18

Asked: 2019-10-01 22:21:17

Question:

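The task is to compress/decompress very large data (> 2 GB) that cannot be held in a single String or byte array. My solution is to write the compressed/decompressed data to a file chunk by chunk. It works, but it is not fast enough.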

Compress: plain text file -> gzip -> base64 encode -> compressed file
Decompress: compressed file -> base64 decode -> gunzip -> plain text file

Test results on a laptop with 16 GB of memory:

Created compressed file, takes 571346 millis
Created decompressed file, takes 378441 millis

Code:

public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
       final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
       final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output);
       final BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {

    reader.lines().forEach(line -> {
      try {
        gzipOutput.write(line.getBytes());
        gzipOutput.write(System.getProperty("line.separator").getBytes());
      } catch (final IOException e) {
        e.printStackTrace();
      }
    });
  }
}

public static void decompress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
       final GzipCompressorInputStream gzipStream = new GzipCompressorInputStream(Base64.getDecoder().wrap(inputStream));
       final BufferedReader reader = new BufferedReader(new InputStreamReader(gzipStream))) {

    reader.lines().forEach(line -> {
      try {
        outputStream.write(line.getBytes());
        outputStream.write(System.getProperty("line.separator").getBytes());
      } catch (final IOException e) {
        e.printStackTrace();
      }
    });
  }
}

I also tried batching the writes when sending data to the file, but did not see much improvement.

// batch write
public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
       final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
       final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output);
       final BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {

    StringBuilder stringBuilder = new StringBuilder();
    final int chunkSize = Integer.MAX_VALUE / 1000;

    String line;
    int counter = 0;
    while ((line = reader.readLine()) != null) {
      counter++;
      stringBuilder.append(line).append(System.getProperty("line.separator"));
      if (counter >= chunkSize) {
        gzipOutput.write(stringBuilder.toString().getBytes());
        counter = 0;
        stringBuilder = new StringBuilder();
      }
    }

    if (counter > 0) {
      gzipOutput.write(stringBuilder.toString().getBytes());
    }
  }
}

Question

I am looking for advice on how to speed up the overall process. What might the bottlenecks be?

Update, 2019-10-02

I ran some more tests, and the results show that the base64 encoding is the bottleneck.

public static void compress(final InputStream inputStream, final Path outputFile) throws IOException {
  try (final OutputStream outputStream = new FileOutputStream(outputFile.toString());
       final OutputStream base64Output = Base64.getEncoder().wrap(outputStream);
       final GzipCompressorOutputStream gzipOutput = new GzipCompressorOutputStream(base64Output)) {

    final byte[] buffer = new byte[4096];
    int n = 0;
    while (-1 != (n = inputStream.read(buffer))) {
      gzipOutput.write(buffer, 0, n);
    }
  }
}

Test file: 2.2 GB, 21.5 million lines
Copy the file only: ~2 seconds
Gzip only: ~12 seconds
Gzip + base64: ~500 seconds

Comments:

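This is interesting; I did not expect the base64 part to be that slow.

@harold Exactly, the result is surprising! Another article on base64 performance, java-performance.info/base64-encoding-and-decoding-performance, shows that encoding 200 MB from byte[] to byte[] takes 0.45 seconds. At that rate, our 2.2 GB should take about 5 seconds. Something else must be slowing it down, and so far I have not figured out what.

I can see that Base64 would be slow if the stream writing into it wrote byte by byte (or in other small pieces), because the encoder likes to convert 3 bytes at a time, but the gzip stream does not even seem to do that (ultimately it goes through a PendingBuffer), so I still cannot explain it.

Answer 1: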

First: never rely on the default charset, because it is not portable.

String s = ...;
byte[] b = ...;
b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);

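Compressing text should not involve a Reader, because a Reader converts bytes in some charset to a String (which holds Unicode), only for the code to convert it back to bytes again. Moreover, each char of a String takes 2 bytes (UTF-16), whereas a basic ASCII symbol takes 1 byte.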
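To illustrate why the byte -> String -> byte detour is risky as well as slow, here is a minimal sketch. The class name CharsetRoundTrip and the helper roundTrip are hypothetical names made up for this example; the point is only that the round trip preserves the bytes when the charset matches the data, and silently changes them when it does not (which is what can happen with a platform default charset).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class CharsetRoundTrip {

    /** Decodes the bytes to a String with the given charset, then encodes them back. */
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8; the last letter needs two bytes.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        byte[] viaUtf8 = roundTrip(utf8, StandardCharsets.UTF_8);
        byte[] viaAscii = roundTrip(utf8, StandardCharsets.US_ASCII);

        // Matching charset: the bytes survive unchanged.
        System.out.println("UTF-8 round trip intact:    " + Arrays.equals(utf8, viaUtf8));
        // Mismatched charset: the non-ASCII bytes are replaced, so the data is corrupted.
        System.out.println("US-ASCII round trip intact: " + Arrays.equals(utf8, viaAscii));
    }
}
```

Copying raw bytes from the input stream to the gzip stream avoids both the conversion cost and this class of corruption.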

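Base64 converts binary data to an alphabet of 64 ASCII symbols and needs 4/3 of the space. Do not do that unless the data must be transmitted inside a text format such as XML.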
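The 4/3 factor can be checked directly with java.util.Base64. This is a small sketch; the class name Base64Overhead and the helper encodedLength are hypothetical names for this example. It uses the basic encoder, which pads the output to a multiple of 4 characters and adds no line breaks.

```java
import java.util.Base64;

// Hypothetical demo class, not part of the original thread.
class Base64Overhead {

    /** Returns how many bytes the basic Base64 encoder produces for rawLength input bytes. */
    static int encodedLength(int rawLength) {
        return Base64.getEncoder().encode(new byte[rawLength]).length;
    }

    public static void main(String[] args) {
        // Every group of 3 input bytes becomes 4 output characters.
        System.out.println(encodedLength(300)); // 400
        // A partial group is padded with '=' up to 4 characters.
        System.out.println(encodedLength(1));   // 4
        System.out.println(encodedLength(4));   // 8
    }
}
```

For the 2.2 GB test file this means the base64 step adds roughly a third to the size of the gzipped output, on top of the encoding time.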

Large files can be (de)compressed like this:

final int BUFFER_SIZE = 1024 * 64;
Path textFile = Paths.get(".... .txt");
Path gzFile = textFile.resolveSibling(textFile.getFileName().toString() + ".gz");

// Compress: java.util.zip.GZIPOutputStream takes the buffer size as its second argument.
try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzFile), BUFFER_SIZE)) {
    Files.copy(textFile, out);
}

// Decompress: this overload of Files.copy fails if textFile already exists.
try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile), BUFFER_SIZE)) {
    Files.copy(in, textFile);
}
The optional BUFFER_SIZE parameter is often left out, which can reduce performance.

Files.copy can take additional parameters to handle file conflicts, for example StandardCopyOption.REPLACE_EXISTING.
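If the base64 step really is required, the same byte-oriented approach might be extended as in the sketch below. The class name GzipBase64Files and its method names are hypothetical. The BufferedOutputStream between the Base64 encoder and the file is an assumption about where the time goes: in some JDK versions the encoder stream appears to pass its output on in very small writes, which would be expensive on an unbuffered FileOutputStream. The thread does not confirm this as the cause, so measure before relying on it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the asker's or the answerer's exact code.
class GzipBase64Files {
    static final int BUFFER_SIZE = 64 * 1024;

    /** plain file -> gzip -> base64 -> buffered file output */
    static void compress(Path textFile, Path encodedFile) throws IOException {
        try (OutputStream fileOut = new BufferedOutputStream(Files.newOutputStream(encodedFile), BUFFER_SIZE);
             OutputStream base64Out = Base64.getEncoder().wrap(fileOut);
             OutputStream gzipOut = new GZIPOutputStream(base64Out, BUFFER_SIZE)) {
            Files.copy(textFile, gzipOut); // raw bytes only, no Reader and no Strings
        }
    }

    /** buffered file input -> base64 decode -> gunzip -> plain file */
    static void decompress(Path encodedFile, Path textFile) throws IOException {
        try (InputStream fileIn = new BufferedInputStream(Files.newInputStream(encodedFile), BUFFER_SIZE);
             InputStream base64In = Base64.getDecoder().wrap(fileIn);
             InputStream gzipIn = new GZIPInputStream(base64In, BUFFER_SIZE)) {
            // REPLACE_EXISTING avoids a FileAlreadyExistsException when the target exists.
            Files.copy(gzipIn, textFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path plain = Files.createTempFile("plain", ".txt");
        Path encoded = Files.createTempFile("plain", ".gz.b64");
        Path restored = Files.createTempFile("restored", ".txt");

        byte[] sample = new byte[1_000_000];
        Arrays.fill(sample, (byte) 'x');
        Files.write(plain, sample);

        compress(plain, encoded);
        decompress(encoded, restored);
        System.out.println("round trip intact: " + Arrays.equals(sample, Files.readAllBytes(restored)));
    }
}
```

Timing the compress method with and without the BufferedOutputStream wrapper on the real 2.2 GB file would show whether the buffering assumption holds on a given JDK.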

Answer 2:

Large files will always take some time, but I see two significant opportunities:

(The two list items were lost in extraction; only the fragments "line", "string", and "line" survive.)

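For a faster stream-to-stream copy you can use, for example, IOUtils.copy(in, out) (it is also in Apache Commons, which it looks like you are already using), or implement a similar strategy yourself: read a block of data into a byte[] (a few KB, nothing tiny) and write it to the output stream, repeating until the input has been read completely.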
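The do-it-yourself strategy described above can be sketched as follows. The class name StreamCopy and the 8 KB buffer size are assumptions chosen for illustration; on Java 9 and later, InputStream.transferTo(OutputStream) provides the same kind of loop out of the box.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class, not part of the original thread.
class StreamCopy {

    /** Copies everything from in to out in 8 KB blocks and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        long total = 0;
        int n;
        // read() returns the number of bytes placed in the buffer, or -1 at end of input.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes actually read
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied); // 20000
    }
}
```

Passing a gzip output stream as out gives the compress direction, and a gzip input stream as in gives the decompress direction, without ever splitting the data into lines or Strings.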
