Compare 2 huge CSV files and print the differences to another CSV file using Perl

Posted 2023-03-24


Asked: 2020-09-18 17:52:25

Question:

I have 2 CSV files with multiple fields (about 30 fields) that are huge in size (about 4GB).

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

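I want to compare these two files and report the differences into another CSV file. In the example above, the details for employees Vivek and Karthick appear on different line numbers, but the record data is still the same, so they should be treated as matching. The record for employee Vinoth should be treated as a mismatch because the address differs.

The output diff.csv file can contain the mismatched records from File1 and File2, as shown below.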

Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"

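So far I have written the code below. After this point I am confused about which option to choose: binary search or some other efficient way. Could you help me?

My approach
1. Load File2 into memory as a hash of hashes.
2. Read File1 line by line and match each record against the hash of hashes in memory.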

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
     binary => 1, auto_diag => 1, sep_char => ',', eol => $/  );
my $csv2 =
  Text::CSV_XS->new(
     binary => 1, auto_diag => 1, sep_char => ',', eol => $/  );
my $csvout =
  Text::CSV_XS->new(
     binary => 1, auto_diag => 1, sep_char => ',', eol => $/  );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}
while (1) {

    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with File2, which is loaded as a hash of hashes in Perl. Need help here.

    }
}
print "End of Program" . "\n";

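Comments:

If you only need to dump the file diff into a new file, you can use the diff command: `diff file2 file1 > file3`. You will get this: Vinoth,12,2548.245,"140,North Street,India" 4a4 > Karthick,10,10.245,"140,North Street,India"

@Dragos Trif, that doesn't work because the files aren't sorted the same way. You should have realized this when you got output different from what was requested... (Even if the files were sorted identically, using diff would have problems.)

Re "I am confused about which option to choose: binary search or some other efficient way", binary search requires sorted input, and you don't appear to have sorted input. (If you did, you could use the same approach as a merge sort, needing only one line of each file in memory at a time.) If you can get the data sorted from the source, do so. Otherwise, consider using a database.

I often use Beyond Compare. It handles CSV files and can do exactly what you want. I don't know how well it copes with files of the size in question.

I need to develop this program myself in Perl.

Answer 1: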

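Storing this amount of data in a database is the most correct way to handle this kind of task. At minimum SQLite is recommended, but other databases such as MariaDB, MySQL, or PostgreSQL will work just as well.

The following code demonstrates how to achieve the desired output without special modules, but it does not account for possibly messy input data. The script will report records as different even when the difference is only an extra space.

Unless you specify the output option, output goes to the console by default.

Note: the whole of file #1 is read into memory, so please be patient; processing large files may take a while.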
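One way to soften the memory cost mentioned in the note is to keep only a fixed-size digest of each record rather than the record text. The following is a minimal sketch under assumptions, not part of the original answer: the helper names `digest_counts` and `unmatched_records` are hypothetical, records are assumed to occupy exactly one line, and the handles are assumed to be opened without an encoding layer, because `Digest::MD5::md5` expects bytes. Perl's per-key hash overhead remains, so the saving only matters when records are long.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: map each record's 16-byte MD5 digest to the number
# of times that record occurs, instead of storing the full record text.
sub digest_counts {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        $seen{ md5($line) }++;
    }
    return \%seen;
}

# Hypothetical helper: return the records read from $fh that have no
# remaining counterpart in the digest table built from the other file.
sub unmatched_records {
    my ( $fh, $seen ) = @_;
    my @diff;
    while ( my $line = <$fh> ) {
        chomp $line;
        my $key = md5($line);
        if   ( $seen->{$key} ) { $seen->{$key}-- }       # matched once
        else                   { push @diff, $line }     # no counterpart
    }
    return \@diff;
}
```

This reports only the records of the second file that are missing from the first; finding the records that exist only in the first file would take a second pass with the roles of the files swapped. Two different records could in principle share a digest, which this sketch does not guard against.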

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;    # needed for the --debug dump below

my %opt;
my @args = (
            'file1|f1=s',
            'file2|f2=s',
            'output|o=s',
            'debug|d',
            'help|?',
            'man|m'
        );

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1},$opt{file2});

sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        # key is the first field, data is the rest of the line
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
}

sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't open $opt{output}";

        say $fh $msg;

        close $fh;
    } else {
        say $msg;
    }
}

sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;
    }

    close $fh;

    return \%hash;
}
__END__

=head1 NAME

comp_cvs - compares two CSV files and stores the differences

=head1 SYNOPSIS

 comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

 Options:
    -f1,--file1 input CSV filename #1
    -f2,--file2 input CSV filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CSV filename #1

=item B<-f2,--file2>

Input CSV filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Prints a brief help message and exits.

=item B<-m,--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes it to B<output> with the purpose of achieving some goal.

=head1 EXIT STATUS

The section describes B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

The section describes B<ENVIRONMENT VARIABLES> utilized in the program

=head1 FILES

The section describes B<FILES> which used for program's configuration

=head1 EXAMPLES

The section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

The section provides information how to report bugs

=head1 AUTHOR

The section describing the author and his contact information

=head1 ACKNOWLEDGMENT

The section to give credits people in some way related to the code

=head1 SEE ALSO

The section describing related information - reference to other programs, blogs, website, ...

=head1 HISTORY

The section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut

Output for the test files

file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
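The caveat in this answer, that a stray space makes two records compare as different, could be addressed by normalizing each record before comparing it. The sketch below is an assumption-laden illustration rather than part of the answer: it relies on the `Text::CSV_XS` module already used in the question, the helper name `canonical_record` is hypothetical, and it assumes that every record fits on one line.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# allow_whitespace discards blanks around the separator (outside quotes).
my $csv = Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, allow_whitespace => 1 } );

# Hypothetical helper: parse one record and rebuild it in a canonical
# form, so records that differ only in padding compare as equal.
sub canonical_record {
    my ($line) = @_;
    chomp $line;
    $csv->parse($line) or return;               # undef if unparsable
    $csv->combine( $csv->fields ) or return;
    return $csv->string;
}
```

Comparing `canonical_record($line1)` with `canonical_record($line2)` instead of the raw lines would then ignore padding around fields while still detecting a changed address.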

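Comments:

I can't use any database because we are not allowed to. Is it possible to optimize this code for comparing large files by using CPAN modules?

Try BerkeleyDB or SQLite.

This is not a good solution for a situation with two 4GB files. You don't want to read everything into Perl. In general, programs that rely on marshalling all the data into in-memory data structures eventually cause problems. Beyond that, these answers would be better if you removed all the extraneous parts and focused on the key concept. Things like say_msg just waste space and don't even use resources well. Pick an output handle, defaulting to STDOUT, and keep it. Don't reopen the file every time. But first, don't include distractions.

@briandfoy - what is the best optimized, high-performance approach to this problem in Perl? I can't use a database for this.

"Best", "optimized" and "high-performance" only make sense in context. I know nothing about the requirements, the resources, or the actual data files. No single answer covers every situation. I started typing some hints but then decided to do something else. I think you should clarify your current run time and memory usage, and what your targets are.

Answer 2: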

#!/usr/bin/env perl

use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;

my %POS;
my %chars;

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";
# Skip the header line of each file and remember where the data begins
<$FILEA>;
$POS{FILEA}=tell $FILEA;
<$FILEB>;
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

# Scan for first character of the records involved
while (<$FILEA>) {
    $chars{substr($_,0,1)}++;
    };
while (<$FILEB>) {
    $chars{substr($_,0,1)}--;
    };
# So what characters do we need to deal with?
warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
my @chars=sort keys %chars;

my %_h;
# For each of the characters in our character set
for my $char (@chars) {
    warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
    # Beginning of data sections
    seek $FILEA,$POS{FILEA},0;
    seek $FILEB,$POS{FILEB},0;
    %_h=();
    my $pos=tell $FILEA;
    while (<$FILEA>) {
        # for each record save the lengthAndMD5 as the key and its start as the value
        $_h{lengthAndMD5(\$_)}=$pos
            if (substr($_,0,1) eq $char);
        # advance $pos for every record, including the skipped ones,
        # so that it always holds the start of the next record
        $pos=tell $FILEA;
        };
    my $_s;
    while (<$FILEB>) {
        next
            unless (substr($_,0,1) eq $char);
        if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
            print $InBoth $_;
            delete $_h{$_s};
            }
        else { # (Not in FILEA) It's only in FILEB
            print $OnlyInB $_;
            }
        };
    # only in FILEA
    warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
    for my $key (keys %_h) { # Only in FILEA
        seek $FILEA,delete $_h{$key},0;
        print $OnlyInA scalar <$FILEA>;
        };
    # Should be empty
    warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
    };

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";
exit;

# Key for a record: its length in hex followed by its MD5 hex digest
sub lengthAndMD5 {
    return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
    };

__END__
