GZipStream on large data

Posted 2023-03-06

【Title】GZipStream on large data 【Asked】2012-05-16 15:29:53 【Question】:

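I am trying to compress large amounts of data, sometimes in the region of 100GB. When I run the routine I have written, the output file appears to come out exactly the same size as the input. Has anyone else had this issue with GZipStream?

My code is as follows: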

        byte[] buffer = BitConverter.GetBytes(StreamSize);
        FileStream LocalUnCompressedFS = File.OpenWrite(ldiFileName);
        LocalUnCompressedFS.Write(buffer, 0, buffer.Length);
        GZipStream LocalFS = new GZipStream(LocalUnCompressedFS, CompressionMode.Compress);
        buffer = new byte[WriteBlock];
        UInt64 WrittenBytes = 0;
        while (WrittenBytes + WriteBlock < StreamSize)
        {
            fromStream.Read(buffer, 0, (int)WriteBlock);
            LocalFS.Write(buffer, 0, (int)WriteBlock);
            WrittenBytes += WriteBlock;
            OnLDIFileProgress(WrittenBytes, StreamSize);
            if (Cancel)
                break;
        }
        if (!Cancel)
        {
            double bytesleft = StreamSize - WrittenBytes;
            fromStream.Read(buffer, 0, (int)bytesleft);
            LocalFS.Write(buffer, 0, (int)bytesleft);
            WrittenBytes += (uint)bytesleft;
            OnLDIFileProgress(WrittenBytes, StreamSize);
        }
        LocalFS.Close();
        fromStream.Close();

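StreamSize is an 8-byte UInt64 value that holds the size of the file. I write these 8 bytes raw at the start of the file so that I know the original file size. WriteBlock has the value 32kb (32768 bytes). fromStream is the stream to take the data from, in this case a FileStream. Will the 8 bytes in front of the compressed data cause a problem?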
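The layout described above (an 8-byte raw length, then the gzip data) can be tried on a small, highly compressible input first. The following is only a minimal sketch, not the asker's routine: the helper names `WriteLengthPrefixedGzip` and `ReadLengthPrefixedGzip` are made up for illustration, and it assumes .NET 5 or later (top-level statements). Unlike the loop in the question, it uses the value returned by `Stream.Read`, which may be smaller than the number of bytes requested.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper (not from the original post): writes the 8-byte raw
// length first, then the gzip-compressed bytes of `source`.
static void WriteLengthPrefixedGzip(Stream source, Stream destination,
                                    ulong streamSize, int blockSize = 32768)
{
    byte[] header = BitConverter.GetBytes(streamSize);
    destination.Write(header, 0, header.Length);

    // leaveOpen: true keeps `destination` usable after the GZipStream is
    // disposed; disposing is what finishes the gzip data.
    using (var gzip = new GZipStream(destination, CompressionMode.Compress, leaveOpen: true))
    {
        byte[] buffer = new byte[blockSize];
        ulong remaining = streamSize;
        while (remaining > 0)
        {
            int wanted = (int)Math.Min((ulong)buffer.Length, remaining);
            int read = source.Read(buffer, 0, wanted); // may return fewer bytes than requested
            if (read == 0)
                break; // source ended earlier than streamSize claimed
            gzip.Write(buffer, 0, read); // write only what was actually read
            remaining -= (ulong)read;
        }
    }
}

// Hypothetical counterpart: reads the 8-byte prefix, then decompresses the rest.
static byte[] ReadLengthPrefixedGzip(Stream input)
{
    byte[] header = new byte[8];
    int got = 0;
    while (got < header.Length)
    {
        int n = input.Read(header, got, header.Length - got);
        if (n == 0)
            throw new EndOfStreamException("length prefix is incomplete");
        got += n;
    }
    ulong expected = BitConverter.ToUInt64(header, 0);

    using var gzip = new GZipStream(input, CompressionMode.Decompress, leaveOpen: true);
    using var result = new MemoryStream();
    gzip.CopyTo(result);
    if ((ulong)result.Length != expected)
        throw new InvalidDataException("decompressed size does not match the prefix");
    return result.ToArray();
}

// Sanity check on data that is known to compress well (all zero bytes).
byte[] sample = new byte[1_000_000];
using var src = new MemoryStream(sample);
using var dst = new MemoryStream();
WriteLengthPrefixedGzip(src, dst, (ulong)sample.Length);
Console.WriteLine($"{sample.Length} bytes in, {dst.Length} bytes out");
```

If the output is not clearly smaller for input like this, the problem is in the routine. If it only fails to shrink on the real data, that data may simply be incompressible already.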

【Comments】:

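Does your code work for smaller files? Can you confirm that it compresses a smaller data set correctly, for example a text file that you know normally compresses well?

【Answer 1】: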

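I ran a compression test with the following code and it worked without problems on a 7GB and a 12GB file (both known beforehand to compress "well"). Does this version work for you?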

const string toCompress = @"input.file";
var buffer = new byte[1024*1024*64];

using(var compressing = new GZipStream(File.OpenWrite(@"output.gz"), CompressionMode.Compress))
using(var file = File.OpenRead(toCompress))
{
    var bytesRead = 0;
    while(bytesRead < buffer.Length)
    {
        bytesRead = file.Read(buffer, 0, buffer.Length);
        compressing.Write(buffer, 0, buffer.Length);
    }
}

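~~Have you checked the documentation?~~

The GZipStream class cannot decompress data that results in more than 8 GB of uncompressed data.

You may need to find a different library that supports your needs, or try to break your data up into chunks of <=8GB that can safely be "sewn" back together.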
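One possible reading of the "chunks sewn together" suggestion is to write every chunk as its own gzip member into the same file, which the gzip format (RFC 1952) allows. This is only a sketch under assumptions: `CompressInChunks` and `DecompressMember` are made-up names, the destination must be seekable so that member offsets can be recorded, and it assumes .NET 5 or later. Whether a single GZipStream reads through several members in one pass is something I would test on the target runtime, so the sketch decompresses each member separately.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper: writes `source` as a series of independent gzip
// members, each holding at most `chunkSize` uncompressed bytes.
// Returns the offset in `destination` at which each member starts.
static List<long> CompressInChunks(Stream source, Stream destination, long chunkSize)
{
    var offsets = new List<long>();
    byte[] buffer = new byte[81920];

    int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, chunkSize));
    while (read > 0)
    {
        offsets.Add(destination.Position);
        long written = 0;
        using (var gzip = new GZipStream(destination, CompressionMode.Compress, leaveOpen: true))
        {
            while (read > 0)
            {
                gzip.Write(buffer, 0, read);
                written += read;
                if (written >= chunkSize)
                    break; // this member is full
                read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, chunkSize - written));
            }
        }
        // A full member ended with data possibly left over: fetch the first
        // block of the next member (0 bytes means the source is exhausted).
        if (read > 0)
            read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, chunkSize));
    }
    return offsets;
}

// Hypothetical helper: decompresses the single member stored in file[start..end).
static byte[] DecompressMember(byte[] file, long start, long end)
{
    using var slice = new MemoryStream(file, (int)start, (int)(end - start));
    using var gzip = new GZipStream(slice, CompressionMode.Decompress);
    using var result = new MemoryStream();
    gzip.CopyTo(result);
    return result.ToArray();
}

// Small demonstration: 250,000 bytes split into members of up to 100,000 bytes.
byte[] data = new byte[250_000];
for (int i = 0; i < data.Length; i++)
    data[i] = (byte)(i % 251);

using var input = new MemoryStream(data);
using var packedStream = new MemoryStream();
List<long> offsets = CompressInChunks(input, packedStream, 100_000);
Console.WriteLine($"{offsets.Count} gzip members, {packedStream.Length} bytes total");
```

For real data the chunk size would be a few GB rather than 100,000 bytes, and the offsets would have to be stored somewhere (for example in a small index next to the file) so that each member can be located again.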

【Comments】:

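Hi Austin, thanks for your answer. My program does no decompressing, so I don't think this matters? Unless compression has an 8GB limit as well.

Hmm... what if you need more than that? Are there any other options? It seems odd that a stream would have this kind of limit.

It says decompress, the OP says compress.

@Skintkingle: How are you currently testing that your code works?

Hi Austin, I have not yet confirmed that the code compresses smaller data sets, although I can't see the code snippet 'not' working. One second, let me set StreamSize to something smaller so that it only takes part of the data, and see whether the compressed size comes out smaller.

【Answer 2】: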

Austin Salonen's code did not work for me (buggy, 4GB error).

Here is the proper way to do it:

using System;
using System.Collections.Generic;
using System.Text;

namespace CompressFile
{
    class Program
    {
        static void Main(string[] args)
        {
            string FileToCompress = @"D:\Program Files (x86)\msvc\wkhtmltopdf64\bin\wkhtmltox64.dll";
            FileToCompress = @"D:\Program Files (x86)\msvc\wkhtmltopdf32\bin\wkhtmltox32.dll";
            string CompressedFile = System.IO.Path.Combine(
                 System.IO.Path.GetDirectoryName(FileToCompress)
                ,System.IO.Path.GetFileName(FileToCompress) + ".gz"
            );

            CompressFile(FileToCompress, CompressedFile);
            // CompressFile_AllInOne(FileToCompress, CompressedFile);

            Console.WriteLine(Environment.NewLine);
            Console.WriteLine(" --- Press any key to continue --- ");
            Console.ReadKey();
        } // End Sub Main


        public static void CompressFile(string FileToCompress, string CompressedFile)
        {
            //byte[] buffer = new byte[1024 * 1024 * 64];
            byte[] buffer = new byte[1024 * 1024]; // 1MB

            using (System.IO.FileStream sourceFile = System.IO.File.OpenRead(FileToCompress))
            {
                using (System.IO.FileStream destinationFile = System.IO.File.Create(CompressedFile))
                {
                    using (System.IO.Compression.GZipStream output = new System.IO.Compression.GZipStream(destinationFile,
                        System.IO.Compression.CompressionMode.Compress))
                    {
                        // long rather than int: an int counter overflows once more than 2GB has been read
                        long bytesRead = 0;
                        while (bytesRead < sourceFile.Length)
                        {
                            // Read may return fewer bytes than requested, so write only ReadLength bytes
                            int ReadLength = sourceFile.Read(buffer, 0, buffer.Length);
                            output.Write(buffer, 0, ReadLength);
                            output.Flush();
                            bytesRead += ReadLength;
                        } // Whend

                        destinationFile.Flush();
                    } // End Using System.IO.Compression.GZipStream output

                    destinationFile.Close();
                } // End Using System.IO.FileStream destinationFile

                // Close the files.
                sourceFile.Close();
            } // End Using System.IO.FileStream sourceFile

        } // End Sub CompressFile


        public static void CompressFile_AllInOne(string FileToCompress, string CompressedFile)
        {
            using (System.IO.FileStream sourceFile = System.IO.File.OpenRead(FileToCompress))
            {
                using (System.IO.FileStream destinationFile = System.IO.File.Create(CompressedFile))
                {
                    // Loads the whole file into memory, so this variant is only usable for small files
                    byte[] buffer = new byte[sourceFile.Length];
                    sourceFile.Read(buffer, 0, buffer.Length);

                    using (System.IO.Compression.GZipStream output = new System.IO.Compression.GZipStream(destinationFile,
                        System.IO.Compression.CompressionMode.Compress))
                    {
                        output.Write(buffer, 0, buffer.Length);
                        output.Flush();
                        destinationFile.Flush();
                    } // End Using System.IO.Compression.GZipStream output

                    // Close the files.
                    destinationFile.Close();
                } // End Using System.IO.FileStream destinationFile

                sourceFile.Close();
            } // End Using System.IO.FileStream sourceFile

        } // End Sub CompressFile_AllInOne

    } // End Class Program

} // End Namespace CompressFile
