How to read a csv file in memory for fast processing?
Posted: 2021-07-29 23:57:23
Question: I am trying to import data from a csv file and read its contents with the System.IO.StreamReader class. However, this approach only returns the number of lines contained in the csv file. Could you tell me whether I am using it correctly?
function Update-ArchiWithImportData {
    # Import the contents of the Extract_AGRe_TWS_ALL_20200925-01.csv and Create ElementCsv object
    $ElementCsv=Get-Content "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
    $startTime = get-date
    Write-Verbose "Reading $ElementCsv..."
    $Reader= New-Object System.IO.StreamReader -Argument $ElementCsv
    while ( ($read=$Reader.ReadLine()) -ne $null) {
        #Loop through all the record in the CSV file
        $NewModifiedElement= Measure-Command { ForEach($Entry in $read) {
            if ($Entry."Script or expected file(s)" -ilike 'technical') {
                $Entry.Jobstream=$Entry.Jobstream.trimStart('PAXCL')
            }
            else {
                # Get the name of jobSet without extension .ksh or .bat
                $Entry.Jobstream=$Entry."Script or expected file(s)"
                # Write-Host $Entry.Jobstream
                # Write-Host $Entry.Jobstream.length
                $pos_last_point = $Entry.Jobstream.LastIndexOf(".")
                #Write-Host $pos_last_point
                $Entry.Jobstream = $Entry.Jobstream.Substring(0,$pos_last_point).trimStart('P')
            }
            $Entry
        } }
        # Export Extract_AGRe_TWS_ALL_20200925-01.csv in new Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file
        $NewModifiedElement | Export-Csv "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv" -NoTypeInformation -Encoding UTF8
        #End of the generation of the csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file, start of the jobstream grep search in the $env:USERPROFILE\Desktop\Archi\elements.csv file and the csv X file.
        Write-Host "End of the generation of the csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file"
        $GroupPath=$NewModifiedElement
        $stream = New-Object IO.StreamWriter($GroupPath,$true)
        $stream.WriteLine($read)
        $stream.Close()
        Write-Host -Object '✔' -ForegroundColor green
    }
    $Reader.ReadToEnd()
    "Elapsed time: $((get-date)-$startTime)"
    $Reader.Close()
    $ReadPS.Dispose()
    # Import the files
    $ElementDifference= Import-csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv | Select-Object -skip 1
    $ElementInLocalHostArchi=Import-csv $env:USERPROFILE\Desktop\Archi\elements.csv | Select-Object -skip 1
    $Output=@()
    Measure-Command {
        foreach ($row in $ElementDifference) {
            foreach ($Line in $ElementInLocalHostArchi) {
                if ($Line."Name" -contains $row."Jobstream") {
                    $Output +=$Line
                    Write-Host " $Output is found"
                    Continue
                }
                elseif ($Line."Name" -notcontains $row."Jobstream") {
                    $Line."Name"=$row."Jobstream"
                    $Line."Documentation"=$row."Jobstream Description"
                    #$Line."ID"
                    #$Line."Type"
                    #Jobstream;Jobstream Description;Op num;Job;Script or expected file(s);Server;user;location;Job Description;FIELD10
                    $Output +=$Line
                    Write-Host " $Output not found and was insert in new line"
                    continue
                }
            }
        }
    }
    $Output | Export-Csv "$env:USERPROFILE\Desktop\Archi\ElementChange.csv" -NoTypeInformation -Encoding UTF8
    $content1 = $NewModifiedElement
    $content2 = $ElementInLocalHostArchi
    $minCount=[Math]::Min($content1.Count,$content2.Count)
    $comparedLines = Compare-Object $content1 $content2 -IncludeEqual:$IncludeEqual -ExcludeDifferent:$ExcludeDifferent -SyncWindow 1 |
        Group-Object { $_.InputObject.ReadCount } | Sort-Object Name
    $comparedLines | ForEach-Object {
        $curr=$_
        switch ($_.Group[0].SideIndicator) {
            "==" { $right=$left=$curr.Group[0].InputObject;break }
            "=>" {
                $right,$left = $curr.Group[0].InputObject,$curr.Group[1].InputObject
                if ($curr.Count -eq 1 -and [int]$curr.Name -gt $minCount) {
                    $left="N/A"
                }
                break
            }
            "<=" {
                $right,$left = $curr.Group[1].InputObject,$curr.Group[0].InputObject
                if ($curr.Count -eq 1 -and [int]$curr.Name -gt $minCount) {
                    $right="N/A"
                }
                break
            }
        }
        New-Object PSObject -Property @{
            Line = $_.Name
            ($ReferenceObject | Split-Path -Leaf) = $left
            ($DifferenceObject | Split-Path -Leaf) = $right
        }
    } | Sort-Object { [int]$_.Line }
    # Import the files
    $Difference= Import-csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv -Header Jobstream | Select-Object "Jobstream"
    $Reference=Import-csv $env:USERPROFILE\Desktop\Archi\elements.csv -Header Name | Select-Object "Name"
    # Get the list of properties
    $props1 = $Difference | Get-Member -MemberType NoteProperty | Select-Object -expand Name | Sort-Object | ForEach-Object { "$_" }
    $props2 = $Reference | Get-Member -MemberType NoteProperty | Select-Object -expand Name | Sort-Object | ForEach-Object { "$_" }
    if(Compare-Object $props1 $props2) {
        # Check that properties match
        throw "Properties are not the same! [$props1] [$props2]"
    }
    else {
        # Pass properties list to Compare-Object
        "Checking $props1"
        Compare-Object $Difference $Reference -Property $props1
    }
}
Update-ArchiWithImportData # Call the function
Read-Host -Prompt "Press Enter to exit"
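On the other hand, if I do not use the System.IO.StreamReader read method in PowerShell, the file is generated, but only after several hours of processing. The input file is 29.9 MB and contains 309905 lines. Could you please help me solve this problem?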
Answer 1: I'm not entirely clear what these lines are meant to do:
$ElementCsv = Get-Content "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
...
$Reader = New-Object System.IO.StreamReader -Argument $ElementCsv
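but they are almost certainly not doing what you expect.
By default, Get-Content returns an array of strings, one string per line in the file, so you are passing an array of strings to the StreamReader constructor, and PowerShell is "helpfully" trying to find a constructor overload whose parameters match the $ElementCsv variable (i.e. one of these: https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.-ctor?view=net-5.0).
Because $ElementCsv is an array, PowerShell looks for a constructor with the same number of parameters as there are items in your array (i.e. 309905) and, unsurprisingly, cannot find one...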
This example might make it clearer:
# write a csv file
Set-Content -Path "c:\temp\myfile.csv" -Value @"
row1_firstname,row1_lastname,row1_address
row2_firstname,row2_lastname,row2_address
row3_firstname,row3_lastname,row3_address
row4_firstname,row4_lastname,row4_address
row5_firstname,row5_lastname,row5_address
row6_firstname,row6_lastname,row6_address
"@;
# get-content reads an array of strings - one entry per line in the file
$lines = Get-Content -Path "c:\temp\myfile.csv";
# it's definitely an array
write-host $lines.GetType().FullName
# System.Object[]
# and it's got 6 entries
write-host $lines.Length
# 6
# and these are the entries...
$lines | % { write-host ("[" + $_ + "]") };
# [row1_firstname,row1_lastname,row1_address]
# [row2_firstname,row2_lastname,row2_address]
# [row3_firstname,row3_lastname,row3_address]
# [row4_firstname,row4_lastname,row4_address]
# [row5_firstname,row5_lastname,row5_address]
# [row6_firstname,row6_lastname,row6_address]
# powershell will do some magic to try to find a constructor with 6 parameters,
# but there isn't one
$reader = new-object System.IO.StreamReader -Argument $lines;
# New-Object: Cannot find an overload for "StreamReader" and the argument count: "6".
What you probably want to look at is using Import-Csv, for example:
$entries = Import-Csv "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
# replace this:
# while ( ($read=$Reader.ReadLine()) -ne $null)
# with this:
foreach( $entry in $entries )
{
    #Loop through all the record in the CSV file
    # ... do stuff with this $entry ...
    $entry.Jobstream = $entry.Jobstream.trimStart('PAXCL')
    # ... etc ...
}
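As a rough, self-contained sketch of the Import-Csv approach: the sample file, its two rows, and the `$scriptName` variable below are made up for illustration, and only the column names are taken from the question. Note that `TrimStart('PAXCL')` removes any leading run of the characters P, A, X, C and L rather than the literal prefix, which may or may not be what the original script intends.

```powershell
# Hypothetical sample file in the temp directory; column names mirror the question's CSV
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'import-csv-sketch.csv'
Set-Content -Path $path -Value @(
    'Jobstream,Script or expected file(s)'
    'PAXCL_JOB1,technical'
    'JOB2,Pscript2.ksh'
)

# Import-Csv parses the header row and returns one object per data row
$entries = Import-Csv -Path $path

foreach ($entry in $entries) {
    # columns are exposed as properties; names containing spaces need quotes
    $scriptName = $entry.'Script or expected file(s)'
    if ($scriptName -ilike 'technical') {
        # TrimStart treats its argument as a set of characters, not a prefix
        $entry.Jobstream = $entry.Jobstream.TrimStart('PAXCL')
    }
    else {
        # take the script name without its .ksh / .bat extension
        $entry.Jobstream = [System.IO.Path]::GetFileNameWithoutExtension($scriptName)
    }
}

# emit the modified Jobstream values
$entries | ForEach-Object { $_.Jobstream }
```

Because every row is already an object, the result can be piped straight to Export-Csv without any manual line handling.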
Alternatively, you could drop the call to Get-Content entirely:
$ElementCsv = "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
...
$Reader = New-Object System.IO.StreamReader -Argument $ElementCsv
Now PowerShell is able to find a constructor that takes a single string parameter:
https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.-ctor?view=net-5.0#System_IO_StreamReader__ctor_System_String_.
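A minimal sketch of what a StreamReader loop over a path might look like once the constructor receives a single string. The temp file and the naive comma split are assumptions for illustration; a plain split does not handle quoted fields the way Import-Csv does.

```powershell
# Hypothetical sample file; any text file path would do
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'streamreader-sketch.csv'
Set-Content -Path $path -Value @('a,b', '1,2', '3,4')

$lineCount = 0
# a single string argument selects the StreamReader(string path) constructor
$reader = New-Object System.IO.StreamReader -ArgumentList $path
try {
    # ReadLine returns $null once the end of the file is reached
    while ($null -ne ($line = $reader.ReadLine())) {
        $fields = $line.Split(',')   # naive split; quoted fields are not handled
        $lineCount++
    }
}
finally {
    # release the file handle even if the loop throws
    $reader.Dispose()
}

$lineCount
```

Each call to ReadLine yields one raw string, so any per-column logic has to parse the line itself; that is the main trade-off against Import-Csv.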
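In any case, once you have this working I think you'll find a bunch of other problems in your script. For example, you seem to be using a mix of Import-Csv, Export-Csv, Get-Content and StreamReaders to read and write what look like csv files, but perhaps try to tackle one problem at a time :-).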
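One of those other problems may be the run time. This is an assumption, since the script was not profiled here, but the nested foreach loops compare every row of one file against every row of the other, and `$Output += ...` copies the array on each append. A hedged sketch of an alternative, using made-up in-memory rows in place of the two imported files, indexes one side in a hashtable first:

```powershell
# Hypothetical in-memory rows standing in for the two imported CSV files
$reference = @(
    [pscustomobject]@{ Name = 'JOB1'; Documentation = 'first' }
    [pscustomobject]@{ Name = 'JOB2'; Documentation = 'second' }
)
$difference = @(
    [pscustomobject]@{ Jobstream = 'JOB2' }
    [pscustomobject]@{ Jobstream = 'JOB9' }
)

# index the reference rows by name once, instead of rescanning them for every row
$byName = @{}
foreach ($line in $reference) { $byName[$line.Name] = $line }

# a generic list appends in place, unlike += on an array, which copies it each time
$found   = [System.Collections.Generic.List[object]]::new()
$missing = [System.Collections.Generic.List[object]]::new()

foreach ($row in $difference) {
    if ($byName.ContainsKey($row.Jobstream)) { $found.Add($byName[$row.Jobstream]) }
    else { $missing.Add($row) }
}
```

With this shape each row of the second file costs one hashtable lookup rather than a full pass over the first file.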
Comments:
Thank you for the explanation, I will do as you say.