Compare 2 huge CSV files and print the differences to another CSV file using Perl
【Posted】: 2020-09-18 17:52:25
【Question】: I have 2 CSV files with multiple fields (about 30 fields each), and they are large (about 4 GB).
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
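I want to compare these two files and report the differences in another CSV file. In the example above, the details for employees Vivek and Karthick appear on different line numbers, but the record data is the same, so they should be treated as matching. The record for employee Vinoth should be treated as a mismatch because the address does not match.
The output diff.csv file can contain the mismatched records from File1 and File2, as shown below.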
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
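So far I have written the code below. Beyond this point, I am confused about which option to choose: a binary search or some other efficient approach. Can you help me?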
My approach
1. Load File2 into memory as a hash of hashes.
2. Read File1 line by line and match each line against the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );

my $f1line   = undef;
my $f2line   = undef;
my $header1  = undef;
my $f1empty  = 'false';
my $f2empty  = 'false';
my $reccount = 0;

# Load File2 into memory, keyed by EmployeeName
my $hash_ref = hashify( "$file2", 'EmployeeName' );

# Read (and skip) the header line of File1
if ( $f1empty eq 'false' ) {
    $f1line = $csv1->getline($fh1);
}

while (1) {
    if ( $f1empty eq 'false' ) {
        $f1line = $csv1->getline($fh1);
    }
    if ( !defined $f1line ) {
        $f1empty = 'true';
    }

    if ( $f1empty eq 'true' ) {
        last;
    }
    else {
        ## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
    }
}

print "End of Program" . "\n";
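【Comments】:
If you only need to dump the diff of the files into a new file, you can use the diff command `diff file2 file1 > file3`. You will get this: Vinoth,12,2548.245,"140,North Street,India" 4a4 > Karthick,10,10.245,"140,North Street,India"

@Dragos Trif, that doesn't work because the files aren't sorted the same way. You should have realized this when you got output that differs from the requested output... (Using diff would be problematic even if the files were sorted identically.)

Re "I am confused about which option to choose: a binary search or some other efficient approach": binary search requires sorted input, and you don't appear to have sorted input. (If you did, you could use the same approach as a merge sort, which needs only one line of each file in memory at a time.) If you can get the data sorted at the source, do that. Otherwise, consider using a database.

I often use Beyond Compare. It handles CSV files and can do exactly what you want. I don't know how well it copes with files of this size.

I need to develop this program in Perl myself.

【Answer 1】: Storing this amount of data in a database is the most appropriate way to handle this kind of task. Using at least SQLite is recommended, but other databases such as MariaDB, MySQL, or PostgreSQL would work fine too.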
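The following code demonstrates how to produce the desired output without any special modules, but it makes no allowance for messy input data. The script reports two records as different even if the only difference is an extra space.
Output goes to the console window by default unless you specify the output option.
Note: the whole of file #1 is read into memory, so please be patient; processing large files can take a while.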
use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;    # needed for the --debug dump below

my %opt;
my @args = (
    'file1|f1=s',
    'file2|f2=s',
    'output|o=s',
    'debug|d',
    'help|?',
    'man|m'
);

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

# Start with a fresh output file, because say_msg() appends to it
unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1},$opt{file2});

# Read file #2 line by line and look each record up in the hash built from file #1
sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        # key = first field, data = rest of the line
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
}

# Print a message to the output file if one was given, otherwise to the console
sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't open $opt{output}";
        say $fh $msg;
        close $fh;
    } else {
        say $msg;
    }
}

# Load a whole file into a hash: first field => rest of the line
sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;
    }

    close $fh;

    return \%hash;
}
__END__

=head1 NAME

comp_cvs - compares two CSV files and stores the differences

=head1 SYNOPSIS

 comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt

 Options:
    -f1,--file1 input CSV filename #1
    -f2,--file2 input CSV filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CSV filename #1

=item B<-f2,--file2>

Input CSV filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Prints debug information.

=item B<-?,--help>

Prints a brief help message and exits.

=item B<-m,--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes it to B<output> with the purpose of achieving some goal.

=head1 EXIT STATUS

This section describes the B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

This section describes the B<ENVIRONMENT VARIABLES> used by the program

=head1 FILES

This section describes the B<FILES> used for the program's configuration

=head1 EXAMPLES

This section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

This section provides information on how to report bugs

=head1 AUTHOR

This section describes the author and his contact information

=head1 ACKNOWLEDGMENT

This section gives credit to people related to the code in some way

=head1 SEE ALSO

This section describes related information - references to other programs, blogs, websites, ...

=head1 HISTORY

This section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut
Output for the test files
file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
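The answer notes that a single stray space is enough for two records to be reported as different. One way to soften that, sketched here under assumptions, is to compare a normalized form of each record instead of the raw line. The helper name `normalize_record` is hypothetical, and the core module Text::ParseWords is used only as a stand-in: it treats apostrophes and backslashes specially and does not understand CSV's doubled-quote escapes or embedded newlines, so for real data the Text::CSV_XS parser that the question already loads is the safer choice.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);    # core module, not a full CSV parser

# Hypothetical helper: split one CSV line into fields, trim every field,
# and join the fields with a separator that should not occur in the data.
sub normalize_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # drop the line ending
    my @fields = parse_line( ',', 0, $line );    # 0 = remove the quotes
    for (@fields) {
        $_ = '' unless defined;                  # a trailing empty field is undef
        s/^\s+|\s+$//g;                          # trim surrounding whitespace
    }
    return join "\x1F", @fields;                 # ASCII unit separator
}

# Sample rows: the second one has extra spaces around some fields.
my $clean = normalize_record(qq{Vinoth,12,2548.245,"140,North Street,India"\n});
my $messy = normalize_record(qq{Vinoth , 12,2548.245,"140,North Street,India" \n});

print $clean eq $messy ? "same record\n" : "different records\n";
```

Storing and comparing the normalized string in place of the raw remainder of the line would leave the rest of the script unchanged. Whitespace inside a quoted field is left alone, so an address that differs by an inner space still counts as different.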
【Discussion】:
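I can't use any database because we aren't allowed to. Is it possible to optimize this code for comparing large files by using CPAN modules?

Try using BerkeleyDB or SQLite.

For a case with two 4 GB files, this isn't a good solution. You don't want to read everything into Perl. In general, programs that depend on marshalling all the data into in-memory data structures end up causing problems. Beyond that, these answers would be better if you removed all the extraneous parts and focused on the key concepts. Things like say_msg just waste space and don't even use resources well. Choose one output handle, defaulting to STDOUT, and keep it. Don't reopen the file every time. But above all, don't include the distractions in the first place.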
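The advice about the output handle can be sketched as follows. This is only an illustration of the comment's suggestion, not the answerer's code: the helper name `open_report` and the sample messages are assumptions, and the output path is taken from the first command-line argument purely for the demo.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the report handle once, defaulting to STDOUT.
sub open_report {
    my ($path) = @_;
    return \*STDOUT unless defined $path and length $path;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Couldn't open '$path' for writing: $!";
    return $fh;
}

my $out = open_report( $ARGV[0] );    # no argument: write to STDOUT

# Every message goes through the same handle; nothing is reopened.
print {$out} "file1.cvs Vinoth is missing\n";
print {$out} "file1.cvs Vivek is missing\n";

# Close only a real file, so STDOUT stays usable afterwards.
unless ( $out == \*STDOUT ) {
    close $out or die "Couldn't close report: $!";
}
```

Opening the file once with `>` also makes a separate `unlink` of a stale output file unnecessary, because the file is truncated when it is opened.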
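@briandfoy - What would be the best optimized, high-performance steps or approach for this problem in Perl? I can't use a database for this.

"Best", "optimized", and "high performance" only make sense in context. I know nothing about the requirements, the resources, or the actual data files. There is no single answer that covers every situation. I started typing some hints but then decided to do something else. I think you should clarify the current run time and memory usage, and what your targets are.

【Answer 2】: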
#!/usr/bin/env perl
use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;

my %POS;
my %chars;

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";

# Skip the header line of each file and remember where the data starts
<$FILEA>;
$POS{FILEA}=tell $FILEA;
<$FILEB>;
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

# Scan for first character of the records involved
while (<$FILEA>) {
    $chars{substr($_,0,1)}++;
};
while (<$FILEB>) {
    $chars{substr($_,0,1)}--;
};
# So what characters do we need to deal with?
warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';

my @chars=sort keys %chars;
my %_h;
# For each of the characters in our character set
for my $char (@chars) {
    warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
    # Beginning of data sections
    seek $FILEA,$POS{FILEA},0;
    seek $FILEB,$POS{FILEB},0;
    %_h=();
    my $pos=tell $FILEA;
    while (<$FILEA>) {
        next
            unless (substr($_,0,1) eq $char);
        # for each record save the lengthAndMD5 as the key and its start as the value
        $_h{lengthAndMD5(\$_)}=$pos;
    }
    continue {
        # Runs for skipped lines too, so $pos always marks the start of the next line
        $pos=tell $FILEA;
    };
    my $_s;
    while (<$FILEB>) {
        next
            unless (substr($_,0,1) eq $char);
        if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
            print $InBoth $_;
            delete $_h{$_s};
        }
        else { # (Not in FILEA) It's only in FILEB
            print $OnlyInB $_;
        };
    };
    # only in FILEA
    warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
    for my $key (keys %_h) { # Only in FILEA
        seek $FILEA,delete $_h{$key},0;
        print {$OnlyInA} scalar <$FILEA>;
    };
    # Should be empty
    warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
};

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";
exit;

# Build a compact key for a record: its length plus its MD5 digest
sub lengthAndMD5 {
    return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
};

__END__
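Separately from the two answers, a comment on the question points out that if both files were sorted the same way, they could be compared the way a merge sort merges its inputs, holding only one line of each file in memory at a time. A minimal sketch of that idea follows. It assumes that the header lines have been removed and that both files are sorted as whole lines in plain byte order, which is the order Perl's `cmp` uses without `use locale`. The helper name `merge_diff` and the callback interface are hypothetical, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: walk two handles whose lines are sorted in the same
# bytewise order and call $report->($tag, $record) for every unmatched line.
sub merge_diff {
    my ( $fh1, $fh2, $report ) = @_;
    my $l1 = <$fh1>;
    my $l2 = <$fh2>;
    while ( defined $l1 and defined $l2 ) {
        chomp( my $r1 = $l1 );
        chomp( my $r2 = $l2 );
        my $order = $r1 cmp $r2;
        if ( $order == 0 ) {        # same record in both files
            $l1 = <$fh1>;
            $l2 = <$fh2>;
        }
        elsif ( $order < 0 ) {      # record exists only in file 1
            $report->( 'F1', $r1 );
            $l1 = <$fh1>;
        }
        else {                      # record exists only in file 2
            $report->( 'F2', $r2 );
            $l2 = <$fh2>;
        }
    }
    # Whatever is left in either file has no partner.
    for ( ; defined $l1; $l1 = <$fh1> ) { chomp $l1; $report->( 'F1', $l1 ); }
    for ( ; defined $l2; $l2 = <$fh2> ) { chomp $l2; $report->( 'F2', $l2 ); }
    return;
}

# Demo on the question's data rows (header removed, rows already sorted).
my $rows1 = <<'END';
Karthick,10,10.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
END
my $rows2 = <<'END';
Karthick,10,10.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,USA"
Vivek,40,2548.245,"140,North Street,India"
END

open my $in1, '<', \$rows1 or die "Can't open in-memory file 1: $!";
open my $in2, '<', \$rows2 or die "Can't open in-memory file 2: $!";
merge_diff( $in1, $in2, sub { print "$_[0],$_[1]\n" } );
```

On this sample the callback prints the two Vinoth rows, tagged F1 and F2, which matches the diff.csv layout requested in the question. The input files in the question are not sorted, so this only applies after an external sort step has produced sorted copies; if the sort tool uses a locale-specific collation, the order may not agree with `cmp`.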