How to read a csv file in memory for fast processing?

Posted: 2021-07-29 23:57:23

Question:

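I am trying to import data from a csv file and read its contents with System.IO.StreamReader. However, this approach only returns the number of lines contained in the csv file. Could you tell me whether I am using it correctly?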

function Update-ArchiWithImportData {
    # Import the contents of the Extract_AGRe_TWS_ALL_20200925-01.csv and Create ElementCsv object
    $ElementCsv=Get-Content "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
    $startTime = get-date
    Write-Verbose "Reading $ElementCsv..."
    $Reader= New-Object System.IO.StreamReader -Argument $ElementCsv

    while ( ($read=$Reader.ReadLine()) -ne $null) {

        #Loop through all the record in the CSV file
        $NewModifiedElement= Measure-Command { ForEach($Entry in $read) {

            if ($Entry."Script or expected file(s)" -ilike 'technical') {
                $Entry.Jobstream=$Entry.Jobstream.trimStart('PAXCL')
            }
            else {
                # Get the name of jobSet without extension .ksh or .bat
                $Entry.Jobstream=$Entry."Script or expected file(s)"
                # Write-Host $Entry.Jobstream
                # Write-Host $Entry.Jobstream.length
                $pos_last_point = $Entry.Jobstream.LastIndexOf(".")
                #Write-Host $pos_last_point
                $Entry.Jobstream = $Entry.Jobstream.Substring(0,$pos_last_point).trimStart('P')
            }
            $Entry
        } }
    }

    # Export Extract_AGRe_TWS_ALL_20200925-01.csv in new Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file
    $NewModifiedElement | Export-Csv "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv" -NoTypeInformation -Encoding UTF8
    #End of the generation of the csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file, start of the jobstream grep search in the $env:USERPROFILE\Desktop\Archi\elements.csv  file and the csv X file.
    Write-Host "End of the generation of the csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv file"
    $GroupPath=$NewModifiedElement
    $stream = New-Object IO.StreamWriter($GroupPath,$true)
    $stream.WriteLine($read)
    $stream.Close()
    Write-Host -Object '✔' -ForegroundColor green


    $Reader.ReadToEnd()
    "Elapsed time: $((get-date)-$startTime)"
    $Reader.Close()
    $ReadPS.Dispose()
    # Import the files
    $ElementDifference= Import-csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv  | Select-Object -skip 1
    $ElementInLocalHostArchi=Import-csv $env:USERPROFILE\Desktop\Archi\elements.csv  | Select-Object -skip 1
    $Output=@()
    Measure-Command {
        foreach ($row in $ElementDifference) {
            foreach ($Line in $ElementInLocalHostArchi) {
                if ($Line."Name" -contains $row."Jobstream") {
                    $Output +=$Line
                    Write-Host " $Output is found"
                    Continue
                }
                elseif ($Line."Name" -notcontains $row."Jobstream") {
                    $Line."Name"=$row."Jobstream"
                    $Line."Documentation"=$row."Jobstream Description"
                    #$Line."ID"
                    #$Line."Type"
                    #Jobstream;Jobstream Description;Op num;Job;Script or expected file(s);Server;user;location;Job Description;FIELD10
                    $Output +=$Line
                    Write-Host " $Output not found and was insert in new line"
                    continue
                }
            }
        }
    }

    $Output | Export-Csv "$env:USERPROFILE\Desktop\Archi\ElementChange.csv" -NoTypeInformation -Encoding UTF8
    $content1 = $NewModifiedElement
    $content2 = $ElementInLocalHostArchi
    $minCount=[Math]::Min($content1.Count,$content2.Count)
    $comparedLines = Compare-Object $content1 $content2 -IncludeEqual:$IncludeEqual -ExcludeDifferent:$ExcludeDifferent -SyncWindow 1 |
        Group-Object { $_.InputObject.ReadCount } | Sort-Object Name
    $comparedLines | ForEach-Object {
        $curr=$_
        switch ($_.Group[0].SideIndicator) {
            "==" { $right=$left=$curr.Group[0].InputObject;break }
            "=>" {
                $right,$left = $curr.Group[0].InputObject,$curr.Group[1].InputObject
                if ($curr.Count -eq 1 -and [int]$curr.Name -gt $minCount) {
                    $left="N/A"
                }
                break
            }
            "<=" {
                $right,$left = $curr.Group[1].InputObject,$curr.Group[0].InputObject
                if ($curr.Count -eq 1 -and [int]$curr.Name -gt $minCount) {
                    $right="N/A"
                }
                break
            }
        }
        New-Object PSObject -Property @{
            Line = $_.Name
            ($ReferenceObject | Split-Path -Leaf) = $left
            ($DifferenceObject | Split-Path -Leaf) = $right
        }
    } | Sort-Object { [int]$_.Line }
    # Import the files
    $Difference= Import-csv $env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-02.csv -Header Jobstream | Select-Object "Jobstream"
    $Reference=Import-csv $env:USERPROFILE\Desktop\Archi\elements.csv  -Header Name | Select-Object "Name"
    # Get the list of properties
    $props1 = $Difference | Get-Member -MemberType NoteProperty | Select-Object -expand Name | Sort-Object | ForEach-Object { "$_" }
    $props2 = $Reference | Get-Member -MemberType NoteProperty | Select-Object -expand Name | Sort-Object | ForEach-Object { "$_" }

    if(Compare-Object $props1 $props2) {

        # Check that properties match

        throw "Properties are not the same! [$props1] [$props2]"

    } else {

        # Pass properties list to Compare-Object

        "Checking $props1"

        Compare-Object $Difference $Reference -Property $props1

    }
}
Update-ArchiWithImportData  # Call the function
Read-Host -Prompt "Press Enter to exit"

   

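On the other hand, if I don't use the System.IO.StreamReader read method, the output file is generated, but only after several hours of processing. The input file is 29.9 MB and contains 309905 lines. Could you please help me solve this problem?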


Answer 1:

I'm not entirely clear what these lines are meant to do:

$ElementCsv = Get-Content "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
...
$Reader = New-Object System.IO.StreamReader -Argument $ElementCsv

but they are almost certainly not doing what you expect.

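By default, Get-Content returns an array of strings, one string per line in the file, so you are passing an array of strings to the StreamReader constructor, and PowerShell is "helpfully" trying to find a constructor overload whose parameters match the $ElementCsv variable (i.e. one of these: https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.-ctor?view=net-5.0).

Because $ElementCsv is an array, PowerShell treats each item as a separate constructor argument. It therefore looks for a constructor with as many parameters as there are items in your array (i.e. 309905) and, unsurprisingly, cannot find one...

This example might make it clearer: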

# write a csv file
Set-Content -Path "c:\temp\myfile.csv" -Value @"
row1_firstname,row1_lastname,row1_address
row2_firstname,row2_lastname,row2_address
row3_firstname,row3_lastname,row3_address
row4_firstname,row4_lastname,row4_address
row5_firstname,row5_lastname,row5_address
row6_firstname,row6_lastname,row6_address
"@;

# get-content reads an array of strings - one entry per line in the file
$lines = Get-Content -Path "c:\temp\myfile.csv";

# it's definitely an array
write-host $lines.GetType().FullName
# System.Object[]

# and it's got 6 entries
write-host $lines.Length
# 6

# and these are the entries...
$lines | % { write-host ("[" + $_ + "]") };
# [row1_firstname,row1_lastname,row1_address]
# [row2_firstname,row2_lastname,row2_address]
# [row3_firstname,row3_lastname,row3_address]
# [row4_firstname,row4_lastname,row4_address]
# [row5_firstname,row5_lastname,row5_address]
# [row6_firstname,row6_lastname,row6_address]

# powershell will do some magic to try to find a constructor with 6 parameters,
# but there isn't one
$reader = new-object System.IO.StreamReader -Argument $lines;
# New-Object: Cannot find an overload for "StreamReader" and the argument count: "6".

What you probably want to look at is using Import-Csv, for example:

$entries = Import-Csv "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"

# replace this:
# while ( ($read=$Reader.ReadLine()) -ne $null) {

# with this:
foreach( $entry in $entries )
{

    #Loop through all the record in the CSV file

    # ... do stuff with this $entry ...
    $entry.Jobstream = $entry.Jobstream.trimStart('PAXCL')
    # ... etc ...

}
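
Put together, the Import-Csv approach might look roughly like the sketch below. It is only an illustration, not a drop-in replacement: the sample rows and temp-file paths are made up, and it assumes a comma-delimited file (the header comment in your script suggests the real file may be semicolon-delimited, in which case you would add -Delimiter ';'). Unlike your original, it skips the Substring call when the script name has no dot, because Substring(0,-1) would throw.

```powershell
# Hypothetical sample data standing in for the real extract.
$inPath  = Join-Path ([System.IO.Path]::GetTempPath()) 'jobstream-sample-in.csv'
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'jobstream-sample-out.csv'
Set-Content -Path $inPath -Value @(
    'Jobstream,Script or expected file(s)'
    'PAXCL_JOB1,technical'
    'JOB2,Pbackup.ksh'
)

# Import-Csv parses each row into an object with one property per column.
$modified = foreach ($entry in Import-Csv -Path $inPath) {
    $script = $entry.'Script or expected file(s)'
    if ($script -ilike 'technical') {
        $entry.Jobstream = $entry.Jobstream.TrimStart('PAXCL')
    }
    else {
        # Use the script name without its extension (.ksh, .bat, ...).
        $dot = $script.LastIndexOf('.')
        if ($dot -gt 0) { $script = $script.Substring(0, $dot) }
        $entry.Jobstream = $script.TrimStart('P')
    }
    $entry   # emit the modified row
}

$modified | Export-Csv -Path $outPath -NoTypeInformation -Encoding UTF8
```

Note that TrimStart('PAXCL') removes any leading run of the characters P, A, X, C and L rather than the literal prefix "PAXCL", so check that this is really what you want for your job names.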


Alternatively, you could drop the call to Get-Content entirely:

$ElementCsv = "$env:USERPROFILE\Desktop\Archi\Extract_AGRe_TWS_ALL_20200925-xxxx.csv"
...
$Reader = New-Object System.IO.StreamReader -Argument $ElementCsv

PowerShell can now find a constructor that takes a single string parameter:

https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.-ctor?view=net-5.0#System_IO_StreamReader__ctor_System_String_.
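
With a path passed in, a read loop might look like the following sketch. The sample file and the variable names are hypothetical. Keep in mind that ReadLine gives you plain strings, not objects with column properties, so something like $Entry."Script or expected file(s)" will not work on them; the naive Split(',') here also does not handle quoted fields, which Import-Csv does.

```powershell
# Hypothetical sample file standing in for the real extract.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'streamreader-sample.csv'
Set-Content -Path $path -Value @('Jobstream,Server', 'JOB1,srv01', 'JOB2,srv02')

$lineCount = 0
$reader = New-Object System.IO.StreamReader -ArgumentList $path
try {
    # ReadLine returns $null once the end of the file is reached.
    while ($null -ne ($line = $reader.ReadLine())) {
        $lineCount++
        # $line is raw text; split it yourself if you need the fields.
        $fields = $line.Split(',')
    }
}
finally {
    $reader.Dispose()   # always release the file handle
}
"Read $lineCount lines"
```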

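Anyway, once you've got that working, I think you'll find a bunch of other problems in your script. For example, you seem to be mixing Import-Csv, Export-Csv, Get-Content and StreamReaders to read and write files that look like csv files, but maybe try solving one problem at a time :-).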
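
One of those other problems is probably what makes the version without StreamReader take hours: the nested foreach loops compare every row of one file with every row of the other, and each `$Output +=` copies the whole array. A common remedy is to index one file in a hashtable first, as in the sketch below. It assumes the goal is to find which Jobstream values already exist as a Name in elements.csv, which may not match your intent exactly; the in-memory sample rows and the variable names are hypothetical.

```powershell
# Hypothetical in-memory rows standing in for the two imported csv files.
$elements = @(
    [pscustomobject]@{ Name = 'JOB1'; Documentation = 'first' }
    [pscustomobject]@{ Name = 'JOB3'; Documentation = 'third' }
)
$jobstreams = @(
    [pscustomobject]@{ Jobstream = 'JOB1'; 'Jobstream Description' = 'desc 1' }
    [pscustomobject]@{ Jobstream = 'JOB2'; 'Jobstream Description' = 'desc 2' }
)

# Build the index once: Name -> element row.
$byName = @{}
foreach ($element in $elements) { $byName[$element.Name] = $element }

# One hashtable lookup per row instead of an inner loop over every element.
$found   = [System.Collections.Generic.List[object]]::new()
$missing = [System.Collections.Generic.List[object]]::new()
foreach ($row in $jobstreams) {
    if ($byName.ContainsKey($row.Jobstream)) { $found.Add($byName[$row.Jobstream]) }
    else                                     { $missing.Add($row) }
}
```

A hashtable created with @{} compares string keys case-insensitively, and the generic List avoids the repeated array copying that += causes.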

Comments:

Thanks for the explanation, I'll do as you say.
