Replacing a long list of words in a large text file

Posted 2023-03-11


【Title】Replace Long list Words in a big Text File 【Asked】2012-01-27 01:27:02 【Question】:

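I need a fast way to process a large text file.

I have two files: a large text file (~20 GB) and another text file containing a list of about 12 million combo words.

I want to find every combo word in the first text file and replace it with another combo word (the same combo word with underscores).

Example: "Computer Information" > replace with > "Computer_Information"

I use this code, but the performance is very poor (I am testing on an HP G7 server with 16 GB of RAM and 16 cores).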

using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

public partial class Form1 : Form
{
    HashSet<string> wordlist = new HashSet<string>();

    private void loadComboWords()
    {
        using (StreamReader ff = new StreamReader(txtComboWords.Text))
        {
            string line;
            while ((line = ff.ReadLine()) != null)
            {
                wordlist.Add(line);
            }
        }
    }

    private void replacewords(ref string str)
    {
        // Scans the whole word list for every line of the big file.
        foreach (string wd in wordlist)
        {
            //  ReplaceEx(ref str,wd,wd.Replace(" ","_"));
            if (str.IndexOf(wd) > -1)
                // Strings are immutable: the result of Replace must be assigned back.
                str = str.Replace(wd, wd.Replace(" ", "_"));
        }
    }

    private void button3_Click(object sender, EventArgs e)
    {
        string line;
        using (StreamReader fread = new StreamReader(txtFirstFile.Text))
        {
            string writefile = Path.GetFullPath(txtFirstFile.Text) + Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt";
            StreamWriter sw = new StreamWriter(writefile);
            long intPercent;
            label3.Text = "initialing";
            loadComboWords();

            while ((line = fread.ReadLine()) != null)
            {
                replacewords(ref line);
                sw.WriteLine(line);

                intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
                Application.DoEvents();
                label3.Text = intPercent.ToString();
            }
            sw.Close();
            fread.Close();
            label3.Text = "Finished";
        }
    }
}

Any ideas for getting this job done in a reasonable time?

Thanks

【Question comments】:

【Answer 1】:

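At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that would cause, for example, a lot of garbage collection.

I think the main issue is that you'll only be using one of those 16 cores: nothing is in place to share the load across the other 15.

I think the easiest way to do this is to split the large 20 GB file into 16 chunks, analyse the chunks in parallel, and then merge the chunks back together. The extra time spent splitting and reassembling the file should be small compared to the potential ~16x gain from scanning the 16 chunks in parallel.

In outline, one way to do this might be: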

    // Requires: using System.Collections.Generic; using System.Threading.Tasks;

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        // Split the file into chunks, and return a list of the filenames.
    }

    private void AnalyseChunk(string filename)
    {
        // Analyses the file and performs replacements, 
        // perhaps writing to the same filename with a different
        // file extension
    }

    private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
    {
        // Combines the rewritten chunks created by AnalyseChunk back into
        // one large file, outputFile.
    }

    public void AnalyseFile(string inputFile, string outputFile)
    {
        List<string> splitFileNames = SplitFileIntoChunks(inputFile);

        // Start one task per chunk so the chunks are processed in parallel.
        var tasks = new List<Task>();
        foreach (string chunkName in splitFileNames)
        {
            var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
            tasks.Add(task);
        }

        // Wait for every chunk to finish before merging.
        Task.WaitAll(tasks.ToArray());

        CreateOutputFileFromChunks(outputFile, splitFileNames);
    }
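
The splitting step is left as a stub above. A minimal sketch of one way it could look is shown here, assuming chunks must end on line boundaries so that no line (and no combo word) is cut in half. The `ChunkSplitter` class, the extra `chunkCount` parameter and the `.chunkN` file naming are illustrative assumptions, not part of the original answer, and chunk sizes are only approximated from character counts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkSplitter
{
    // Splits inputFile into at most chunkCount files, always breaking on a
    // line boundary, and returns the chunk file names in their original order.
    public static List<string> SplitFileIntoChunks(string inputFile, int chunkCount)
    {
        if (chunkCount < 1) throw new ArgumentOutOfRangeException("chunkCount");

        var names = new List<string>();
        // Approximate target size per chunk (bytes on disk vs. chars written).
        long targetSize = Math.Max(1, new FileInfo(inputFile).Length / chunkCount);

        using (var reader = new StreamReader(inputFile))
        {
            StreamWriter writer = null;
            long written = 0;
            try
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Open a new chunk at the start, or once the current one is
                    // full, but never create more than chunkCount chunks.
                    if (writer == null || (written >= targetSize && names.Count < chunkCount))
                    {
                        if (writer != null) writer.Dispose();
                        string name = inputFile + ".chunk" + names.Count; // hypothetical naming
                        names.Add(name);
                        writer = new StreamWriter(name);
                        written = 0;
                    }
                    writer.WriteLine(line);
                    written += line.Length + Environment.NewLine.Length;
                }
            }
            finally
            {
                if (writer != null) writer.Dispose();
            }
        }
        return names;
    }
}
```

Because the names are returned in order, concatenating the processed chunks in list order should preserve the original line order.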

A minor point: in your read loop, move the stream-length calculation out of the loop; you only need to get it once.
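
As a rough illustration of that point, the sketch below reads the length once before the loop. It also reports progress only every `reportEvery` lines, which is an additional assumption of this sketch rather than something stated in the answer; `ProgressExample`, `CopyWithProgress` and the `reportPercent` callback are hypothetical names.

```csharp
using System;
using System.IO;

static class ProgressExample
{
    // Copies inputFile to outputFile line by line, calling reportPercent
    // every reportEvery lines instead of after every single line.
    public static void CopyWithProgress(string inputFile, string outputFile,
                                        Action<long> reportPercent, int reportEvery = 10000)
    {
        using (var reader = new StreamReader(inputFile))
        using (var writer = new StreamWriter(outputFile))
        {
            // The stream length is read once, outside the loop.
            long totalLength = Math.Max(1, reader.BaseStream.Length);
            long lineCount = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line); // the word replacement would happen here

                if (++lineCount % reportEvery == 0)
                    reportPercent(reader.BaseStream.Position * 100 / totalLength);
            }
            reportPercent(100);
        }
    }
}
```

In the WinForms code from the question, the callback would be the place to update `label3.Text`.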

Edit: also, include @Pavel Gatilov's idea of inverting the logic of the inner loop and searching for each word of the line in the 12-million-entry list.
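
A minimal sketch of the inverted loop might look like the following. It assumes that words are separated by single spaces, that matching is exact and case-sensitive, and that no combo word has more than `maxWords` words; `PhraseReplacer`, `ReplacePhrases` and `maxWords` are hypothetical names. Instead of scanning 12 million entries per line, it does a handful of `HashSet` lookups per word position.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PhraseReplacer
{
    // Replaces every combo word found in 'phrases' by the same words joined
    // with underscores, trying the longest candidate at each position first.
    public static string ReplacePhrases(string line, HashSet<string> phrases, int maxWords)
    {
        string[] words = line.Split(' ');
        var result = new StringBuilder(line.Length);
        int i = 0;
        while (i < words.Length)
        {
            int matched = 1; // number of words consumed at position i

            // Slide a window of n consecutive words over the line.
            for (int n = Math.Min(maxWords, words.Length - i); n >= 2; n--)
            {
                string candidate = string.Join(" ", words, i, n);
                if (phrases.Contains(candidate))
                {
                    matched = n;
                    break;
                }
            }

            if (i > 0) result.Append(' ');
            result.Append(string.Join("_", words, i, matched));
            i += matched;
        }
        return result.ToString();
    }
}
```

For example, with the phrases "Computer Information" and "big text file" in the set, `ReplacePhrases("the Computer Information is in a big text file", phrases, 3)` yields "the Computer_Information is in a big_text_file". Punctuation attached to words is not handled in this sketch.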

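【Comments】:

One thing to watch out for with this type of parallel processing: it is very easy to become I/O bound, especially with spinning disks (as opposed to SSDs). Opening 16 files for reading and 16 files for writing means a lot of seeking.

@MarkPeters - agreed; OP, you might want to experiment with the number of threads to take this into account.

Thanks, I was thinking about multithreading, but per line (which could break the line order); your idea is a better solution.

@Mark - agreed, but my system has 15k RPM hard disks, which I think can handle this.

【Answer 2】: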

A few ideas:

StringBuilder

Update

Here is sample code that demonstrates the word-by-word index idea:

// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.IO; using System.Text;

static void ReplaceWords()
{
  string inputFileName = null;
  string outputFileName = null;

  // this dictionary maps each single word that can be found
  // in any keyphrase to a list of the keyphrases that contain it.
  IDictionary<string, IList<string>> singleWordMap = null;

  using (var source = new StreamReader(inputFileName))
  {
    using (var target = new StreamWriter(outputFileName))
    {
      string line;
      while ((line = source.ReadLine()) != null)
      {
        // first, we split each line into a single word - a unit of search
        var singleWords = SplitIntoWords(line);

        var result = new StringBuilder(line);
        // for each single word in the line
        foreach (var singleWord in singleWords)
        {
          // check if the word exists in any keyphrase we should replace
          // and if so, get the list of the related original keyphrases
          IList<string> interestingKeyPhrases;
          if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
            continue;

          Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);

          // then process each of the keyphrases
          foreach (var interestingKeyphrase in interestingKeyPhrases)
          {
            // and replace it in the processed line if it exists
            result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
          }
        }

        // now, save the processed line
        target.WriteLine(result);
      }
    }
  }
}

private static string GetTargetValue(string interestingKeyword)
{
  throw new NotImplementedException();
}

static IEnumerable<string> SplitIntoWords(string keyphrase)
{
  throw new NotImplementedException();
}

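The code shows the basic idea:

We split both the keyphrases and the processed lines into equivalent units that can be compared efficiently: words. We keep a dictionary that quickly gives us references to all the keyphrases containing a given word. Then we apply your original logic. However, we do not apply it to all 12 million keyphrases, but only to the small subset of keyphrases that share at least one word with the processed line.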
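
As an illustration of how such a dictionary could be filled, here is a minimal sketch. It assumes that keyphrases are split on single spaces and that each keyphrase appears only once in the input; `KeyphraseIndex` and `BuildSingleWordMap` are hypothetical names, and the result has the same shape as `singleWordMap` in the sample code. With 12 million keyphrases the map may need a substantial amount of memory, so this is a starting point rather than a tuned implementation.

```csharp
using System;
using System.Collections.Generic;

static class KeyphraseIndex
{
    // Maps every single word to the list of keyphrases that contain it.
    public static IDictionary<string, IList<string>> BuildSingleWordMap(IEnumerable<string> keyphrases)
    {
        var map = new Dictionary<string, IList<string>>();
        foreach (string keyphrase in keyphrases)
        {
            string[] words = keyphrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                IList<string> list;
                if (!map.TryGetValue(word, out list))
                {
                    list = new List<string>();
                    map[word] = list;
                }

                // A word repeated inside one keyphrase would otherwise add the
                // keyphrase twice; it can only be the most recently added entry.
                if (list.Count == 0 || list[list.Count - 1] != keyphrase)
                    list.Add(keyphrase);
            }
        }
        return map;
    }
}
```

For the keyphrases "Computer Information" and "Information Retrieval", the entry for "Information" would list both keyphrases, while "Computer" would list only the first.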

The rest of the implementation is left to you.

However, the code has several problems:

SplitIntoWords

GetTargetValue

StringBuilder

string

KeywordsIndex

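【Comments】:

Personally I would use a Dictionary so that each string could in theory carry its own unique ID, but that is entirely up to him.. Also, why create a Hash when a Hash is usually string, object; why not load the file into a List? Perhaps it is six of one, half a dozen of the other..

Oh yes, +1 for inverting the logic in the inner loop. Store the 12-million-word list in a hash set and then look up each line's words in that hash set. Nice one: and it would work well with the parallelization I described above.

For the process of checking every line against the list of 12 million entries, it might help to build an index for those entries and then do an optimized search (for example, build an int hash code for each entry, then look for words separated by a space, build the hash code for those two words and check it against the big list of hash codes), and also to replace only the space with an underscore (rather than doing a whole-string replace).

@DJKRAZE, @jCoder Are we all talking about the .NET HashSet<T> class? It is a hash table. It uses the built-in Object.GetHashCode() method to compute the hashes of all stored items. It is effectively equivalent to Dictionary<int, T> with keys computed by calling GetHashCode; the only difference is that it does this for you implicitly. The internals of the two classes are the same. So changing HashSet to Dictionary makes no sense; the current way is the simplest and fastest.

I used this inner loop to handle the compound-word problem; my compound words have no fixed number of words or length (any number of words and any length under 25 characters). Do I have to use a sliding window? I don't understand "make appropriate indexes"; if possible, please show me some sample code. Thanks.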
