C++ and reading large txt files

Posted 2023-03-11


【Title】: C++ and reading large txt files 【Posted】: 2021-12-05 00:19:15 【Question】:

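I have a lot of txt files, around 10GB in total. What should I use in my program to merge them into one file without duplicates? I want to make sure that every line in the output file is unique.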

I was thinking about building some kind of hash tree and using MPI. I want it to be efficient.

【Comments】:

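What counts as a "duplicate"? A whole file that is duplicated? Duplicate lines within a file? Repeated characters in a file?

What about cat *.txt | sort | uniq? (Honestly, 10 GB is not "big data" in my book, but that may differ from person to person; I guess the tag is a bit "ambiguous" :))

sort -u *.txt should be fine if you are on some kind of *nix. Or do you need the lines to stay in the same order as in one of the original files?

@SecurityBreach Sorting is a simple way to find duplicates. You sort, and then you only need to check whether consecutive lines are identical (this really is the algorithmic thinking that computer science teaches from the ground up! It's fun!).

【Answer 1】: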

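If you have no requirement to keep memory usage low, you can read all the lines from all the files into a std::set or a std::unordered_set. As the name implies, an unordered_set is not ordered in any particular way, while a set is (lexicographical sort order). I chose std::set here, but you can try std::unordered_set to see whether it speeds things up.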

Example:

#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

int cppmain(std::string_view program, std::vector<std::string_view> args) {
    if(args.empty()) {
        std::cerr << "USAGE: " << program << " files...\n";
        return 1;
    }

    std::set<std::string> result;   // to store all the unique lines

    // loop over all the filenames the user supplied
    for(auto& filename : args) {

        // try to open the file (the names come from argv, so data() is null-terminated)
        if(std::ifstream ifs(filename.data()); ifs) {
            std::string line;

            // read all lines and put them in the set:
            while(std::getline(ifs, line)) result.insert(line);
        } else {
            std::cerr << filename << ": " << std::strerror(errno) << '\n';
            return 1;
        }
    }

    for(auto line : result) {
        // ... manipulate the unique line here ...

        std::cout << line << '\n'; // and print the result
    }
    return 0;
}

int main(int argc, char* argv[]) {
    // build the argument vector from the range [argv + 1, argv + argc)
    return cppmain(argv[0], {argv + 1, argv + argc});
}

【Comments】:

【Answer 2】:

std::vector<std::string>

std::multimap

std::pair<uint32_t filenumber, size_t byte_start_of_line>

seek

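This only takes as much RAM as the longest line, plus enough RAM for the file names and file numbers plus overhead, plus the space for the map, which should be far smaller than the actual lines. Since 10GB is not really that much text, hash collisions are unlikely, so if you do not need certainty and only want a sufficiently high probability that every line ends up in your output, you might as well skip the "check the existing file" part.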
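
The steps of this answer survive only as fragments (std::vector<std::string>, std::multimap, std::pair<uint32_t filenumber, size_t byte_start_of_line>, seek), so the following is one possible reading rather than the answer's own code: keep the file names in a vector, map each line's hash to the place where that line was first seen, and seek back to that place to compare the actual text when a hash matches. The names read_line_at and merge_unique, the use of std::hash, and std::streamoff for the byte offset are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <map>
#include <ostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: re-read the line that starts at byte `offset` of `path`.
std::string read_line_at(const std::string& path, std::streamoff offset) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(offset);
    std::string line;
    std::getline(in, line);
    return line;
}

// Hypothetical helper: write each distinct line of `files` to `out` once.
// Only hash -> (file number, byte offset) entries are kept in memory,
// never the lines themselves. Returns the number of lines written.
std::size_t merge_unique(const std::vector<std::string>& files, std::ostream& out) {
    using Location = std::pair<std::uint32_t, std::streamoff>;
    std::multimap<std::size_t, Location> index;
    std::size_t written = 0;

    for (std::uint32_t n = 0; n < files.size(); ++n) {
        // binary mode keeps the byte offsets from tellg() reliable
        std::ifstream in(files[n], std::ios::binary);
        std::string line;
        for (;;) {
            const std::streamoff start = in.tellg();  // where this line begins
            if (!std::getline(in, line)) break;       // end of file or open failure

            const std::size_t h = std::hash<std::string>{}(line);
            bool seen = false;
            auto [first, last] = index.equal_range(h);
            // A matching hash is only a candidate: compare the real text
            // by seeking back to where the earlier line was read.
            for (auto it = first; it != last && !seen; ++it)
                seen = read_line_at(files[it->second.first], it->second.second) == line;

            if (!seen) {
                index.emplace(h, Location{n, start});
                out << line << '\n';
                ++written;
            }
        }
    }
    return written;
}
```

This sketch reopens the source file for every hash match, which keeps it short but is slow when there are many duplicates; keeping the input streams open would avoid that. Dropping the inner comparison loop and treating any hash match as a duplicate gives the probabilistic shortcut the answer mentions.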

【Comments】:
