How to make Hadoop read text files whose names end in .gz

Posted 2023-03-21


Preface: this article was compiled by the editors of 小常识网 (cha138.com). It covers how to make Hadoop read text files whose names end in .gz, and we hope it is a useful reference.

#### Background

During a full index build, the search engine produces several gigabytes of intermediate XML files, and I needed to check whether a particular string appears in them. There are many of these files, each a few hundred megabytes, stored on HDFS. Their names end in .gz, but their content is plain text. To search them I wrote a MapReduce job that implements the Tool interface and extends Configured, so that custom arguments can be passed to the job; the string to grep for is passed in as one of those arguments. When I first ran the code, it failed with this exception:

    14/10/17 12:06:33 INFO mapred.JobClient: Task Id : attempt_201405252001_273614_m_000013_0, Status : FAILED
    java.io.IOException: incorrect header check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:81)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:193)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

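The message "incorrect header check" suggests that the decompressor did not find the header it expected. One way to confirm that a file named .gz does not actually contain gzip data is to look at its first two bytes: according to the gzip file format (RFC 1952), a gzip stream starts with the bytes 0x1f 0x8b. The following is a minimal sketch of that check; the class name `GzipSniffer` and the method `looksLikeGzip` are hypothetical and are not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: checks the gzip magic number only.
class GzipSniffer {

    /** Returns true if the given leading bytes start with the gzip magic 0x1f 0x8b. */
    public static boolean looksLikeGzip(byte[] head) {
        return head != null
                && head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        // Leading bytes of a plain XML file that merely carries a .gz name.
        byte[] plainXml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        // Leading bytes of a real gzip stream (magic number plus the deflate method byte).
        byte[] realGzip = {(byte) 0x1f, (byte) 0x8b, 0x08};

        System.out.println(looksLikeGzip(plainXml)); // false
        System.out.println(looksLikeGzip(realGzip)); // true
    }
}
```

If the first two bytes of one of the intermediate files are not 0x1f 0x8b, the file is not gzip data, whatever its name says.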
### Analysis

From the exception, my first guess was that because the file names end in .gz, Hadoop treats the files as compressed and tries to decompress them while reading. Since the content is actually plain text, decompression fails. A Google search did not turn up a direct answer, but it did lead to the relevant source code: [LineRecordReader.java](http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/lib/input/). So I tried reading the code to solve the problem. The class is simple and extends RecordReader. I did not see the next and readLine functions there, so I assumed they are implemented in a base class. I quickly found code that, judging by the names, deals with compression codecs:

    private CompressionCodecFactory compressionCodecs = null;
    ...
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);
    ...
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
    } else {
      ...
      in = new LineReader(fileIn, job);
    }

Here file is the path of the input file. The codec is obtained through CompressionCodecFactory.getCodec(file), and the exception is thrown when the file is read through that codec. So how can the MapReduce job be made to treat these .gz files as ordinary text files? The next step was the code of [CompressionCodecFactory.java](http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/io/compress/CompressionCodecFactory.java#CompressionCodecFactory.%3Cinit%3E%28org.apache.hadoop.conf.Configuration%29). The getCodec function looks like this:

    /**
     * Find the relevant compression codec for the given file based on its
     * filename suffix.
     * @param file the filename to check
     * @return the codec object
     */
    public CompressionCodec getCodec(Path file) {
      CompressionCodec result = null;
      if (codecs != null) {
        String filename = file.getName();
        String reversedFilename = new StringBuffer(filename).reverse().toString();
        SortedMap<String, CompressionCodec> subMap = codecs.headMap(reversedFilename);
        if (!subMap.isEmpty()) {
          String potentialSuffix = subMap.lastKey();
          if (reversedFilename.startsWith(potentialSuffix)) {
            result = codecs.get(potentialSuffix);
          }
        }
      }
      return result;
    }

So the codec is chosen purely by matching the file name suffix. Following the trail, here is where codecs is populated:

    /**
     * Find the codecs specified in the config value io.compression.codecs
     * and register them. Defaults to gzip and zip.
     */
    public CompressionCodecFactory(Configuration conf) {
      codecs = new TreeMap<String, CompressionCodec>();
      List<Class<? extends CompressionCodec>> codecClasses = getCodecClasses(conf);
      if (codecClasses == null) {
        addCodec(new GzipCodec());
        addCodec(new DefaultCodec());
      } else {
        Iterator<Class<? extends CompressionCodec>> itr = codecClasses.iterator();
        while (itr.hasNext()) {
          CompressionCodec codec = ReflectionUtils.newInstance(itr.next(), conf);
          addCodec(codec);
        }
      }
    }

It looks like GzipCodec and DefaultCodec are added by default when no codec configuration is found. Next, the getCodecClasses(conf) function:

    /**
     * Get the list of codecs listed in the configuration
     * @param conf the configuration to look in
     * @return a list of the Configuration classes or null if the attribute
     *         was not set
     */
    public static List<Class<? extends CompressionCodec>> getCodecClasses(Configuration conf) {
      String codecsString = conf.get("io.compression.codecs");
      if (codecsString != null) {
        List<Class<? extends CompressionCodec>> result =
            new ArrayList<Class<? extends CompressionCodec>>();
        StringTokenizer codecSplit = new StringTokenizer(codecsString, ",");
        while (codecSplit.hasMoreElements()) {
          String codecSubstring = codecSplit.nextToken();
          if (codecSubstring.length() != 0) {
            try {
              Class<?> cls = conf.getClassByName(codecSubstring);
              if (!CompressionCodec.class.isAssignableFrom(cls)) {
                throw new IllegalArgumentException("Class " + codecSubstring +
                                                   " is not a CompressionCodec");
              }
              result.add(cls.asSubclass(CompressionCodec.class));
            } catch (ClassNotFoundException ex) {
              throw new IllegalArgumentException("Compression codec " +
                                                 codecSubstring + " not found.", ex);
            }
          }
        }
        return result;
      } else {
        return null;
      }
    }

This function shows that the configuration key for codecs is **io.compression.codecs**. To avoid the defaults, getCodecClasses must return a non-null result. Setting io.compression.codecs to a value that names no classes should therefore work: the returned list is then empty, so no codec is registered.

### Solution

I tried it. A lone comma is a suitable value: it is non-null, and StringTokenizer produces no tokens from it. Passing **-Dio.compression.codecs=,** when running the MapReduce job fixed the problem:

    hadoop jar ./dumptools-0.1.jar ddump.tools.mr.Grep -Dio.compression.codecs=, "adgroupId=319356697" doc val

Reference answer A: You only need to read the file through HDFS. Although the file name ends in .gz, it is still a text file.
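If it really is a compressed file, you need to decompress it first: read it in, decompress it, and then read the result.

Reference answer B: Since it is a text file, the name does not matter; read it like any other file. For example, renaming test.txt to test, test.st, or test.avi does not change what the file is, and it can still be read the same way.

Reference answer C: What do you mean by "read"? The file itself is an archive. If you want to extract it:

    tar -zxvf filename.tar.gz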
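The two mechanisms described in the analysis, the reversed-suffix lookup in getCodec and the tokenizing of io.compression.codecs, can be tried out without a Hadoop installation. The sketch below is a simplified stand-in, not Hadoop code: the class `SuffixCodecLookup` is hypothetical, plain strings replace CompressionCodec objects, and the ".gz" and ".deflate" suffixes are assumed to be the ones registered by GzipCodec and DefaultCodec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for CompressionCodecFactory; codec names are plain strings.
class SuffixCodecLookup {
    // Key: reversed file suffix, value: codec name.
    private final SortedMap<String, String> codecs = new TreeMap<String, String>();

    public void addCodec(String suffix, String codecName) {
        codecs.put(new StringBuilder(suffix).reverse().toString(), codecName);
    }

    /** Same shape as getCodec: reverse the name, take headMap, test the last key. */
    public String getCodec(String filename) {
        String reversed = new StringBuilder(filename).reverse().toString();
        SortedMap<String, String> subMap = codecs.headMap(reversed);
        if (!subMap.isEmpty()) {
            String potentialSuffix = subMap.lastKey();
            if (reversed.startsWith(potentialSuffix)) {
                return codecs.get(potentialSuffix);
            }
        }
        return null; // no codec: the caller reads the file as plain text
    }

    /** Mirrors the tokenizing step of getCodecClasses, returning names instead of classes. */
    public static List<String> parseCodecNames(String codecsString) {
        if (codecsString == null) {
            return null; // key not set: the factory would fall back to its defaults
        }
        List<String> result = new ArrayList<String>();
        StringTokenizer split = new StringTokenizer(codecsString, ",");
        while (split.hasMoreElements()) {
            String name = split.nextToken();
            if (name.length() != 0) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SuffixCodecLookup defaults = new SuffixCodecLookup();
        defaults.addCodec(".gz", "gzip");         // assumed default suffix
        defaults.addCodec(".deflate", "deflate"); // assumed default suffix
        System.out.println(defaults.getCodec("part-00000.xml.gz")); // gzip

        SuffixCodecLookup noCodecs = new SuffixCodecLookup();
        System.out.println(noCodecs.getCodec("part-00000.xml.gz")); // null

        System.out.println(parseCodecNames(",")); // []
    }
}
```

With the default registrations a name ending in .gz resolves to the gzip codec. With an empty registry, which is what a value of "," produces, the lookup returns null and the file is read as plain text.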

How to convert a text file from DOS format to UNIX format

[Title]: How to convert a text file from DOS format to UNIX format [Posted]: 2021-12-28 00:45:51 [Question]:

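I am trying to write a C program that reads a text file and replaces \r\n with \n in the same file, converting the line endings from DOS to UNIX. I use fgetc and treat the file as binary. Thanks in advance.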

#include <stdio.h>

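int main()
{
    int ch; /* int rather than char, so that EOF can be detected */

    FILE *fptr = fopen("textfile.txt", "rb+");
    if (fptr == NULL)
    {
        printf("erro ficheiro \n");
        return 0;
    }

    /* The asker's original logic, kept as posted: it reads and writes
       through the same stream, which is the problem under discussion. */
    while((ch = fgetc(fptr)) != EOF) {
        if(ch == '\r') {
            fprintf(fptr,"%c", '\n');
        } else {
            fprintf(fptr,"%c", ch);
        }
    }

    fclose(fptr);
}
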
[Comments]:

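Windows (and historically DOS) uses both \r and \n at the end of a line, so you need to remove the \r rather than replace it. Also, don't try to do it in place; write a separate output file.

Binary?

Don't replace \r with \n: the file uses \r\n, so you would end up with \n\n.

You overwrite the next character rather than replacing the current one.

You can't read from the same file you are writing to like that. You need two file pointers, one for reading and one for writing, both opened as binary. Or you can write to stdout after making sure it is a binary stream.

Note that DOS (Windows) files normally have "\r\n" at the end of each line; you just need to avoid printing the '\r' characters. What to do when you encounter a '\r' that is not followed by '\n', or a '\n' that is not preceded by '\r', is anyone's guess. I would probably just map both to '\n'.

[Answer 1]: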

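If we assume the file uses a single-byte character set, then converting a text file from DOS to UNIX only requires ignoring every '\r' character.

We also assume the file size is smaller than the largest unsigned integer.

We make these assumptions to keep the example short.

Note that, as you asked, the example below overwrites the original file. Normally you should not do this, because you may lose the original contents if an error occurs.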

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

// Return a negative number on failure and 0 on success.
int main()
{
    const char* filename = "textfile.txt";

    // Get the file size. We assume the filesize is not bigger than UINT_MAX.
    struct stat info;
    if (stat(filename, &info) != 0)
        return -1;
    size_t filesize = (size_t)info.st_size;

    // An empty file has nothing to convert.
    if (filesize == 0)
        return 0;

    // Allocate memory for reading the file
    char* content = (char*)malloc(filesize);
    if (content == NULL)
        return -2;

    // Open the file for reading
    FILE* fptr = fopen(filename, "rb");
    if (fptr == NULL)
        return -3;

    // Read the file and close it - we assume the filesize is not bigger than UINT_MAX.
    size_t count = fread(content, filesize, 1, fptr);
    fclose(fptr);
    if (count != 1)
        return -4;

    // Remove all '\r' characters
    size_t newsize = 0;
    for (size_t i = 0; i < filesize; ++i) {
        char ch = content[i];
        if (ch != '\r') {
            content[newsize] = ch;
            ++newsize;
        }
    }

    // Test if we found any
    if (newsize != filesize) {
        // Open the file for writing and truncate it.
        fptr = fopen(filename, "wb");
        if (fptr == NULL)
            return -5;

        // Write the new output to the file. Note that if an error occurs,
        // then we will lose the original contents of the file.
        if (newsize > 0)
            count = fwrite(content, newsize, 1, fptr);
        fclose(fptr);
        if (newsize > 0 && count != 1)
            return -6;
    }

    // The error paths above return without freeing, which is harmless in a
    // short-lived console application; on success we free the buffer.
    free(content);

    // Success
    return 0;
} // main()

To remove only a '\r' that is followed by '\n', replace the loop with this one:

    // Remove all '\r' characters followed by a '\n' character
    size_t newsize = 0;
    for (size_t i = 0; i < filesize; ++i) {
        char ch = content[i];
        char ch2 = (i + 1 < filesize) ? content[i + 1] : 0;
        if (ch == '\r' && ch2 == '\n') {
            ch = '\n';
            ++i;
        }
        content[newsize++] = ch;
    }

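[Discussion]:

"We only need to ignore all the '\r' characters" -- debatable. This assumes that '\r' never appears except immediately before '\n'. For Windows text files in a single-byte encoding consistent with the C implementation's runtime character set, that is a relatively safe bet, but it is not technically correct. Also, reading the whole file into memory is poor form.

@JohnBollinger That depends on the file size and the default block size, and the question is about converting line endings from DOS to UNIX. If he wants several questions answered, he should create more questions.

@JohnBollinger Agreed on the removal of '\r'. The DOS text files I have come across with lone '\r' characters were usually meant for printing on old printers, where you would overstrike the same line twice, typically for underlining or bold text. I have also seen it used to underline text on monitors. But I have not seen it in the wild for 25 years. Added a snippet to show how to replace only a '\r' followed by '\n'.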

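The commenters raise two points: use separate input and output streams, and avoid loading the whole file into memory. The sketch below combines both, written in Java to match the Hadoop examples earlier on this page rather than in C. It copies one byte at a time and drops a '\r' only when a '\n' follows it. The class `CrlfToLf` and its methods are hypothetical names, and a single-byte character set is assumed, as in the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical streaming converter: constant memory, separate input and output.
class CrlfToLf {

    /** Copies in to out, dropping each '\r' that is immediately followed by '\n'. */
    public static void convert(InputStream in, OutputStream out) throws IOException {
        int prev = -1; // previous byte read, or -1 at the start
        int ch;
        while ((ch = in.read()) != -1) {
            if (prev == '\r' && ch != '\n') {
                out.write('\r'); // the held-back '\r' was a lone one: keep it
            }
            if (ch != '\r') {
                out.write(ch);   // a '\r' is held back until the next byte is known
            }
            prev = ch;
        }
        if (prev == '\r') {
            out.write('\r');     // lone '\r' at the very end of the input
        }
    }

    /** Convenience wrapper for in-memory data; used here only for the demonstration. */
    public static byte[] convertBytes(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            convert(new ByteArrayInputStream(input), out);
        } catch (IOException e) {
            throw new IllegalStateException(e); // in-memory streams should not fail
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] unix = convertBytes("one\r\ntwo\r\n".getBytes());
        System.out.println(unix.length); // 8, i.e. "one\ntwo\n"
    }
}
```

For real files, the same `convert` method could be given buffered file streams for a source file and a separate destination file, so the original stays intact if something goes wrong.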