In PowerShell, what's the most efficient way to split a large text file by record type?

Posted 2023-03-31

[Asked]: 2010-01-29 03:44:24

[Question]:

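I'm using PowerShell for some ETL work, reading compressed text files and splitting them out according to the first three characters of each line.

If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I need to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET StreamReader to read the compressed input files, and I'm wondering whether I need to use a StreamWriter to write the output files as well.

The naive version looks something like this: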

while (!$reader.EndOfStream) {
  $line = $reader.ReadLine();
  switch ($line.Substring(0,3)) {
    "001" {Add-Content "output001.txt" $line}
    "002" {Add-Content "output002.txt" $line}
    "003" {Add-Content "output003.txt" $line}
  }
}

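This looks like bad news: finding, opening, writing to, and closing a file once per line. The input files are huge monsters of 500MB and up.

Is there an idiomatic way to handle this efficiently with PowerShell constructs, or should I turn to the .NET StreamWriter?

Are there methods on a (New-Item "path" -type "file") object that I could use for this?

Edit for context:

I'm using the DotNetZip library to read the ZIP files as streams; hence StreamReader rather than Get-Content/gc. Sample code: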

[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll") 
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")

foreach ($entry in $zipfile) {
  $reader = new-object system.io.streamreader $entry.OpenReader();
  while (!$reader.EndOfStream) {
    $line = $reader.ReadLine();
    #do something here
  }
}

I should probably Dispose() of both $zipfile and $reader, but that's another question!
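For what it's worth, the disposal mentioned above is usually handled with try/finally. The following is only a sketch: a MemoryStream stands in for the stream that `$entry.OpenReader()` would return, and the sample data is made up. With DotNetZip, the same `finally` pattern would presumably wrap `$zipfile` as well.

```powershell
# Stand-in for $entry.OpenReader(): any System.IO.Stream behaves the same way here.
$bytes  = [System.Text.Encoding]::ASCII.GetBytes("001 first`n002 second`n")
$stream = New-Object System.IO.MemoryStream (,$bytes)   # unary comma passes the array as ONE argument
$reader = New-Object System.IO.StreamReader $stream

$lines = @()
try {
    while (-not $reader.EndOfStream) {
        $lines += $reader.ReadLine()   # "do something here" placeholder
    }
}
finally {
    # Runs even if the loop throws; disposing the reader also closes the underlying stream.
    $reader.Dispose()
}
```

Disposing the StreamReader closes the stream it wraps, so no separate call on `$stream` is needed.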

[Comments]:

[Answer 1]:

Reading

As for reading the file and parsing, I would go with the switch statement:

switch -file c:\temp\***.testfile2.txt -regex {
  "^001" {Add-Content c:\temp\***.testfile.001.txt $_}
  "^002" {Add-Content c:\temp\***.testfile.002.txt $_}
  "^003" {Add-Content c:\temp\***.testfile.003.txt $_}
}

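I think this is the better approach because

there is support for regex, you don't have to take a substring (which might be expensive), and the -file parameter is quite handy ;)

Writing

As for writing the output, I would test using a StreamWriter, but if the performance of Add-Content is good enough for you, I would stick with it.

Added: Keith suggested using the >> operator; however, it seems to be very slow. Besides that, it writes the output in Unicode, which doubles the file size.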
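To see the size effect for yourself, here is a small sketch. It uses `Out-File -Encoding unicode` to reproduce what `>>` does by default in Windows PowerShell (UTF-16); newer PowerShell versions default to UTF-8, so `>>` itself may behave differently there. The temp directory, file names, and sample record are made up for illustration.

```powershell
# Scratch directory (hypothetical location) so the demo does not touch real data.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$unicodePath = Join-Path $dir 'unicode.txt'
$asciiPath   = Join-Path $dir 'ascii.txt'

# Write the same 100 records once as UTF-16 ("unicode") and once as ASCII.
1..100 | ForEach-Object { '001 sample record' } | Out-File $unicodePath -Encoding unicode
1..100 | ForEach-Object { '001 sample record' } | Out-File $asciiPath   -Encoding ascii

$unicodeBytes = (Get-Item $unicodePath).Length
$asciiBytes   = (Get-Item $asciiPath).Length
'{0} bytes as UTF-16, {1} bytes as ASCII' -f $unicodeBytes, $asciiBytes
```

UTF-16 spends two bytes on every character of this ASCII-only data, which is where the doubling comes from.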

Look at my tests:

[1]: (measure-command {
>>     gc c:\temp\***.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c >> c:\temp\***.testfile.001.txt} `
>>             '002'{$c >> c:\temp\***.testfile.002.txt} `
>>             '003'{$c >> c:\temp\***.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>>     gc c:\temp\***.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\***.testfile.001.txt} `
>>             '002'{$c | Add-content c:\temp\***.testfile.002.txt} `
>>             '003'{$c | Add-content c:\temp\***.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923

The difference is huge.

Just for comparison:

[3]: (measure-command {
>>     $reader = new-object io.streamreader c:\temp\***.testfile2.txt
>>     while (!$reader.EndOfStream) {
>>         $line = $reader.ReadLine();
>>         switch ($line.substring(0,3)) {
>>             "001" {Add-Content c:\temp\***.testfile.001.txt $line}
>>             "002" {Add-Content c:\temp\***.testfile.002.txt $line}
>>             "003" {Add-Content c:\temp\***.testfile.003.txt $line}
>>             }
>>         }
>>     $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>>     switch -file c:\temp\***.testfile2.txt -regex {
>>         "^001" {Add-Content c:\temp\***.testfile.001.txt $_}
>>         "^002" {Add-Content c:\temp\***.testfile.002.txt $_}
>>         "^003" {Add-Content c:\temp\***.testfile.003.txt $_}
>>     }
>> }).TotalSeconds
8,6755565

Added: I was curious about the write performance, and I was a little surprised:

[8]: (measure-command {
>>     $sw1 = new-object io.streamwriter c:\temp\***.testfile.001.txt3b
>>     $sw2 = new-object io.streamwriter c:\temp\***.testfile.002.txt3b
>>     $sw3 = new-object io.streamwriter c:\temp\***.testfile.003.txt3b
>>     switch -file c:\temp\***.testfile2.txt -regex {
>>         "^001" {$sw1.WriteLine($_)}
>>         "^002" {$sw2.WriteLine($_)}
>>         "^003" {$sw3.WriteLine($_)}
>>     }
>>     $sw1.Close()
>>     $sw2.Close()
>>     $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315

That is about 80 times faster. Now you have to decide: if speed matters, use StreamWriter. If code clarity matters, use Add-Content.
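If you go the StreamWriter route and don't know the record types up front, one option is to keep one open writer per prefix in a hashtable. This is only a sketch, not something I have benchmarked: `Split-ByPrefix`, its parameters, and the `output<prefix>.txt` naming are hypothetical, and it assumes the prefixes are safe to use in file names and that `$OutputDirectory` is a full path (.NET does not follow the PowerShell current location).

```powershell
# Hypothetical helper: route each line to one StreamWriter per 3-character prefix.
function Split-ByPrefix {
    param(
        [System.IO.TextReader] $Reader,    # any line source, e.g. a StreamReader over a zip entry
        [string] $OutputDirectory          # full path; output<prefix>.txt files are created here
    )

    $writers = @{}    # prefix -> open StreamWriter, created on first use
    try {
        while ($null -ne ($line = $Reader.ReadLine())) {
            if ($line.Length -lt 3) { continue }    # assumption: skip lines too short to classify
            $key = $line.Substring(0, 3)
            if (-not $writers.ContainsKey($key)) {
                $path = Join-Path $OutputDirectory "output$key.txt"
                $writers[$key] = New-Object System.IO.StreamWriter $path
            }
            $writers[$key].WriteLine($line)
        }
    }
    finally {
        # Dispose every writer so buffered data is flushed even if an error occurred.
        foreach ($w in $writers.Values) { $w.Dispose() }
    }
}
```

Each output file is opened once and closed once, instead of once per line as with Add-Content. Because the parameter is a TextReader, the `$reader` from the question's DotNetZip loop could be passed in directly.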

Substring vs. regex

According to Keith, Substring is 20% faster. It depends, as always. However, in my case the results look like this:

[102]: (measure-command {
>>     gc c:\temp\***.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\***.testfile.001.s.txt} `
>>             '002'{$c | Add-content c:\temp\***.testfile.002.s.txt} `
>>             '003'{$c | Add-content c:\temp\***.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>>     gc c:\temp\***.testfile2.txt  | %{$c = $_; switch -regex ($_) {
>>             '^001'{$c | Add-content c:\temp\***.testfile.001.r.txt} `
>>             '^002'{$c | Add-content c:\temp\***.testfile.002.r.txt} `
>>             '^003'{$c | Add-content c:\temp\***.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681

So the difference is not significant, and for me regexes are more readable.
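The two measurements above include the cost of Add-Content, which probably dominates. To compare only the classification step, something like the following sketch could be used. The function names and the generated sample lines are made up, and the timings will vary by machine and PowerShell version, so no numbers are claimed here.

```powershell
# Hypothetical helpers: count records per type, once via Substring and once via -regex.
function Get-CountsBySubstring {
    param([string[]] $Lines)
    $counts = @{ '001' = 0; '002' = 0; '003' = 0 }
    foreach ($l in $Lines) {
        switch ($l.Substring(0, 3)) {
            '001' { $counts['001']++ }
            '002' { $counts['002']++ }
            '003' { $counts['003']++ }
        }
    }
    $counts
}

function Get-CountsByRegex {
    param([string[]] $Lines)
    $counts = @{ '001' = 0; '002' = 0; '003' = 0 }
    foreach ($l in $Lines) {
        switch -Regex ($l) {
            '^001' { $counts['001']++ }
            '^002' { $counts['002']++ }
            '^003' { $counts['003']++ }
        }
    }
    $counts
}

# In-memory sample data (made up): 20000 lines cycling through the three record types.
$lines = 1..20000 | ForEach-Object { '{0:000} payload' -f (($_ % 3) + 1) }

$substringMs = (Measure-Command { Get-CountsBySubstring $lines }).TotalMilliseconds
$regexMs     = (Measure-Command { Get-CountsByRegex $lines }).TotalMilliseconds
'Substring: {0:N1} ms, regex: {1:N1} ms' -f $substringMs, $regexMs
```

Keeping the data in memory removes file I/O from the comparison, so whatever gap remains is down to the matching itself.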

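[Comments]:

Actually, Substring is 20% faster. Nice catch on the speed of Add-Content vs. >>. In my tests, using Out-File -enc ascii seems to be on par with Add-Content. Interesting that using a StreamWriter is so much faster.

Yes, I was surprised too. I added some measurements for substring vs. regex. If you would like to compare the speed of StreamWriter, here is the code I used to generate the test file: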

1..5000 | % { $n = Get-Random -Min 1 -Max 4; $x=1..(Get-Random -Min 20 -Max 150) | % { ([char](Get-Random -Min 65 -Max 120)) }; $x = $x -join ""; '{0:000} {1}' -f $n,$x } | Add-Content C:\temp\***.testfile.txt

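(The line count is now 5000.)

Looks like it's StreamWriter for me. Thanks for this; I'm new to PowerShell, and all the examples specific to my task are really helpful. I can't use the switch -file construct, but it's good to know it's available for when I'm working with uncompressed files.

[Answer 2]: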

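Given the size of the input files, you definitely want to process one line at a time. I wouldn't expect re-opening and closing the output files to be too big a performance hit. It certainly makes it possible to implement this with the pipeline, even as a one-liner, which is really not too different from your implementation. I wrapped it here to get rid of the horizontal scrollbar: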

gc foo.log | %{switch ($_.Substring(0,3)) {
    '001'{$input | out-file output001.txt -enc ascii -append} `
    '002'{$input | out-file output002.txt -enc ascii -append} `
    '003'{$input | out-file output003.txt -enc ascii -append}}}

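[Comments]:

Keith, in the $_ >> output001.txt statement, the $_ variable doesn't come from the for-each but from the switch, so it contains only the substring.

I just need to hit the sack. It's late here and I'm just getting punchy. :-)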
