Perl and parsing messy text
Posted: 2010-11-05 22:21:02
Question:
I have the following text:
Instructor First Number Students Who Number Students Who
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
AGEC 101 01 Larry Van Tassell 36 1 20 15 1
AGEC 278 01 Larry Makus 21 1 2 6 8 5
AGEC 278 02 Larry Makus 18 5 10 2 1
AGEC 278 03 Larry Makus 17 1 2 7 5 2 1
AGEC 301 01 Christopher McIntosh 18 9 4 5
AGEC 356 01 Joseph Guenthner 23 15 6 2
AGEC 361 01 Ruby Stroschein 11 4 1 6
AGEC 411 01 Robert Haggerty 11 6 4 1
AGEC 413 01 Robert Spear 12 3 4 5 2 1
AGEC 415 01 Larry Van Tassell 11 10 1
AGEC 526 01 Scott Matulich 7 2 5
AGEC 527 01 Stephen Cooke 5 3 2
AGED 180 01 Lori Moore 23 1 14 5 1 3
AGED 351 01 Lou Riesenberg 11 4 6 1
AMST 301 01 Walter Hesford 26 14 8 3 1
ANTH 100 01 Mark Warner 104 15 31 31 21 8 12
ANTH 220 01 Fumiyasu Arakawa 138 4 48 53 19 10 8
ANTH 230 01 Robert Sappington 28 1 7 9 9 2 1
ANTH 251 01 Donald Tyler 36 1 10 14 8 1 3
ANTH 420 01 Laura Putsche 12 3 4 2 2
ANTH 422 01 Rodney Frey 13 11 2
ANTH 427 02 Virginia Babcock 13 1 2 6 4 1
ANTH 462 01 Laura Putsche 33 3 8 20 3 1
ARBC 101 01 Anisah El-Mansouri 14 1 8 5 1
ARCH 151 01 Randall Teal 150 8 72 40 13 6 19
ARCH 253 01 Roman Montoto 23 1 9 10 2 1
ARCH 253 02 Randall Teal 22 2 9 11 2
ARCH 253 03 Xiao Hu 23 2 11 12
ARCH 353 01 Matthew Brehm 16 7 7 1
ARCH 353 02 Dillon Ellefson 16 4 11 1
ARCH 353 03 Xiao Hu 10 4 6
ARCH 385 01 Anne Marshall 68 5 29 22 11 2 4
ARCH 404 04 Matthew Brehm 10 1 5 3 1
ARCH 453 01 Roman Montoto 10 5 4 1
ARCH 453 02 Anne Marshall 13 6 5 1
ARCH 463 01 Phillip Mead 63 1 26 31 5 1
ARCH 465 01 Kenneth Carper 51 1 8 26 12 3
ARCH 483 01 D. Reese 71 2 27 35 8
ARCH 504 02 Randall Teal 15 9 6
ARCH 504 03 Kevin Van Den Wymelenberg 6 3 1 1
ARCH 504 04 Frank Jacobus 12 1 8 4
ARCH 510 02 D. Reese 13 9 4
ARCH 510 04 Robert Thornton 9 7 1
ARCH 510 05 Roman Montoto 11 2 7 4
ARCH 553 01 Bruce Haglund 14 12 2
I have this code (a sub) that takes each line and is supposed to produce a list of the relevant fields:
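sub GetData {
    my $non_nor_line = shift;
    # The first three fixed-width fields, then everything else.
    my ( $subj, $crs, $sec, $rest ) = unpack "a6 a6 a6 a*", $non_nor_line;
    my $name = undef;
    my $upk_short = q{A3A2A3A2 A3A2 A3AA5 A6};
    # Everything before the first digit is taken to be the instructor name.
    $rest =~ m/(.+?)\d/;
    $name = $1;
    # Quote the name so characters such as "." are matched literally.
    $rest =~ s/\Q$name\E//;
    $rest =~ s/^\s+//;
    $rest =~ s/\s+$//;
    my @rest_data = unpack( $upk_short, $rest );
    print $_ . "\n" foreach (@rest_data);
}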
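I can't figure out how to get the data out of $rest. I've tried many unpack variants to no avail, and I need to store the values in a list. Ignore 'upk_short'; it isn't correct. I tried plenty of others, but the lines seem to be too dynamic.
Update: it would be great if someone could find a way to normalize the text, by which I mean align everything, so that I can parse it Tom's way.
Any ideas?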
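Comments:
Hold on, I'll show you how to get these unpack things right the first time, every time.
I've been reading about unpack for hours. I do know (I guess) how to use it and what each of A, W, etc. means, but I still can't find a way to parse this.
There you go. Please let me know whether this helps.
First, thank you very much. Second, it does what it needs to do; the only problem I face now is alignment (my data isn't normalized, as you said), and I can't find a way to normalize it. The data was extracted from a PDF using: pdftotext --nopgbrk --layout. I also tried expand, and even sed to replace \t+ with a single space, no luck.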
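Oh man, I hate scraping the output of pdf2text converters; it's always dirty. I guess what you might do is separate the numbers from everything else. The first part can be split on \h+. Then you would be more careful with the numbers part, perhaps splitting on \h{1,3} or so. That would get you some empty fields. I'm just not sure you can count on that. If you can't, you would have to collect adjacent lines with repeating starting positions and infer the columns from that sample. What a mess.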
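The first step suggested in that comment, separating the numbers from everything else, can be sketched as follows. This is only an illustration, not code from the thread: the regex assumes that instructor names never contain digits, and the sample line is one of the whitespace-collapsed lines from the question.

```perl
use strict;
use warnings;

# Sample line copied from the question; its whitespace is already collapsed.
my $line = "ARCH 504 03 Kevin Van Den Wymelenberg 6 3 1 1";

# Three leading tokens, then the shortest run of non-digits (the name),
# then everything from the first digit onward (the numeric columns).
my ( $subject, $course, $section, $name, $numbers ) =
    $line =~ /^(\S+)\h+(\S+)\h+(\S+)\h+(\D+?)\h+(\d.*)$/
    or die "unrecognized line: $line\n";

# Splitting on whitespace gives the numbers, but not their columns.
my @counts = split ' ', $numbers;

print "$subject $course $section | $name | @counts\n";
```

This recovers the name and the list of numbers, but it cannot say which of the seven numeric columns each number belongs to, nor where the first name ends and the last name begins. For that, the original column positions are still needed.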
Answer 1:
#!/usr/bin/env perl
use strict;
use warnings;

# Turn a list of 1-based field start columns into an unpack template.
sub cut2fmt {
    my @positions = @_;
    my $template  = "";
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);

my @keys = qw{
    subject course section
    instructor_first_name instructor_last_name
    completed_the_class dropped_the_class
    grade_A grade_B
    grade_C grade_D
    grade_F
};

our @All_Records;

while (<DATA>) {
    # Skip everything from line 1 through the line of "|" markers.
    next if 1 .. /^\s*\|/;
    my %rec;
    @rec{@keys} = unpack($fmt, $_);
    # Blank grade columns count as zero.
    for my $key (grep { /^grade_[A-F]$/ } @keys) {
        $rec{$key} ||= 0;
    }
    push @All_Records, \%rec;
}

for my $rec (@All_Records) {
    for my $key (@keys) {
        print "$key: $rec->{$key}\n";
    }
    print "\n";
}

__END__
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
         1         2         3         4         5         6         7         8         9         0         1         2         3         4
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
        |      |         |                  |                      |                     |                     |    |    |    |   |
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
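The first thing you have to do is normalize your data. Your columns are inconsistent, and I can't tell you why that is. Perhaps you have tabs that need to be run through expand -8 or something similar. I have included only the data that shares the same alignment.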
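If tabs do turn out to be the cause, the same expansion can be done inside Perl. The following is a minimal sketch using the core Text::Tabs module; the tab-separated sample line is hypothetical and is not taken from the real file.

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

$tabstop = 8;      # same tab width as `expand -8`

# Hypothetical tab-separated input line, not taken from the real file.
my $raw   = "ACCT\t201\t01";
my $fixed = expand($raw);

# Each tab is replaced by enough spaces to reach the next multiple of 8.
print "[$fixed]\n";
```

After expansion, every field starts at a fixed column, which is what a fixed-width unpack template needs.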
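To get the unpack format right every time, you just need to draw a numbered ruler beneath the header, as I have done. Put a | mark where each field starts. Note down the column number of each mark and pass those numbers to the included cut2fmt() function. It converts them into a pack/unpack template.
That's all there is to it.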
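As a worked example of the ruler technique, the sketch below shows the template that cut2fmt() produces for the positions used in this answer, and applies it to one record. The record is rebuilt here with sprintf at those column positions, which is an assumption about how the original file was aligned; the field values are those of the first ACCT 201 line.

```perl
use strict;
use warnings;

# Same logic as cut2fmt() in the answer: each position is the
# 1-based column at which a new field starts.
sub cut2fmt {
    my @positions = @_;
    my $template  = "";
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    return $template . "A*";
}

my $fmt = cut2fmt(9, 16, 26, 45, 68, 90, 112, 117, 122, 127, 131);
print "$fmt\n";    # A8 A7 A10 A19 A23 A22 A22 A5 A5 A5 A4 A*

# Rebuild one record at those columns (assumed alignment), then unpack it.
my @widths = (8, 7, 10, 19, 23, 22, 22, 5, 5, 5, 4);
my @values = ("ACCT", "201", "01", "Karin", "Hatheway Dial",
              56, 6, 19, 9, 16, 2, 5);
my $line = join "",
    map { sprintf "%-*s", $widths[$_] // 1, $values[$_] } 0 .. $#values;

my @fields = unpack $fmt, $line;
print join("|", @fields), "\n";
```

Each width in the template is simply the difference between two consecutive start columns, and the trailing A* takes whatever is left on the line. Because the A format strips trailing spaces, a blank column comes back as an empty string rather than shifting the later fields.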
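I would tell you where these nuggets come from, but I just hate pushy self-promoters, so it would be hypocritical of me to stoop that low. I won't do it. If somebody wants to advertise, fine, let them buy ad space from the site. Then those of us who hate junk ads can block them with our ad blockers. Otherwise I can't stomach it.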
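Comments:
What's with the our? Does it have any additional benefit over my in this context?
Hi, Greg. Zaid, the our is because it's a global; that's why its name is capitalized. The same goes for $fmt, really. I try to limit the scope of my variables, give them short names, and so on. Things at file scope may be different, but it's hard to demonstrate that kind of consistency with small examples.
Answer 2: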
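Your data format looks odd; are there tabs in it? It looks like three groups of records, each with a different layout. Is that right?
If the data is at fixed positions, it should be possible to unpack all twelve columns at once. If there are three types of layout, I would use a regular expression to decide which layout applies to the current line, and then use the appropriate template for that group of records.
Since some of the twelve data columns may be blank, and some records have numbers in unusual positions, it may not be possible to assign some values to the correct column.
Edit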
#!/usr/bin/perl
use strict;
use warnings;

my @heading = qw(Subject Course Section Firstname Lastname
                 Completed Dropped A B C D F);

# Use position of Instructors Last Name as a guide to line layout.
my %template = (45 => "A8 A7 A10 A19 A23 A22 A22 A5 A5 A5 A4 A4",
                33 => "A7 A6 A5 A14 A14 A6 A5 A5 A5 A5 A5 A5",
                30 => "A7 A6 A5 A11 A22 A6 A4 A6 A5 A2 A2 A2");

while (<DATA>) {
    # Only lines that start with a four-letter subject code are records.
    next unless /^[A-Z]{4} /;
    chomp;
    GetData($_);
}

sub GetData {
    my $line = shift;
    for my $lastname_position (keys %template) {
        # A space followed by a capital letter marks the start of the last name.
        if (substr($line, $lastname_position-2, 2) =~ / [A-Z]/) {
            my @values = unpack($template{$lastname_position}, $line);
            my $column = 0;
            for my $value (@values) {
                print "$heading[$column] = '$value'\n";
                $column++;
            }
            print "\n";
            last;
        }
    }
}

__DATA__
Instructor First Number Students Who Number Students Who
Subject Course Section Instructor Last Name A B C D F
Name Completed the Class Dropped the Class
ACCT 201 01 Karin Hatheway Dial 56 6 19 9 16 2 5
ACCT 202 01 Karin Hatheway Dial 69 11 37 14 7 2 6
ACCT 205 01 Darryl Woolley 20 1 3 7 6 1 3
ACCT 205 02 Darryl Woolley 28 1 6 7 13 2
ACCT 205 03 Darryl Woolley 42 5 4 13 21 1 3
ACCT 205 04 Darryl Woolley 23 1 9 5 8 1
ACCT 205 05 Darryl Woolley 30 2 11 7 9 2 1
ACCT 205 06 Darryl Woolley 25 3 8 9 6 1 1
ACCT 275 01 Darryl Woolley 33 2 7 15 9 1 1
ACCT 310 01 Marla Kraut 16 1 1 6 7 2
ACCT 310 02 Marla Kraut 64 5 43 15 1
ACCT 310 03 Marla Kraut 72 3 11 47 10 3 1
ACCT 311 01 Karin Hatheway Dial 45 13 20 11 1
ACCT 311 02 Karin Hatheway Dial 25 10 12 3
ACCT 315 01 Jason Porter 26 6 5 8 6 1
ACCT 315 02 Jason Porter 29 1 6 10 5 7 1
ACCT 414 01 Teresa Gordon 22 1 6 6 9 1
ACCT 483 01 Glen Utzman 26 1 7 13 6
ACCT 486 01 Teresa Gordon 33 13 14 6
ACCT 492 01 Jason Wills 23 5 8 9 1
ACCT 515 01 Jeffrey Harkins 15 7 6 1
ACCT 561 01 Jason Porter 18 1 10 7 1
ADOL 526 13 Charles Gagel 21 2 19 1 1
ADOL 573 13 Martha Yopp 28 16 3 1
ADOL 574 01 Laura Holyoke 16 12 3 1
ADOL 574 11 Laura Holyoke 9 1 8 1
ADOL 574 13 Laura Holyoke 15 10 4 1
ADOL 600 13 Roger Scott 19 4 1
AERO 101 01 William Beauter 11 8 2 1
AERO 103 01 Sarah Babbitt 15 7 6 1 1
AERO 411 01 Sarah Babbitt 11 6 4 1
AERO 413 01 Sarah Babbitt 12 8 3 1
AGEC 101 01 Larry Van Tassell 36 1 20 15 1
AGEC 278 01 Larry Makus 21 1 2 6 8 5
AGEC 278 02 Larry Makus 18 5 10 2 1
AGEC 278 03 Larry Makus 17 1 2 7 5 2 1
AGEC 301 01 Christopher McIntosh 18 9 4 5
AGEC 356 01 Joseph Guenthner 23 15 6 2
AGEC 361 01 Ruby Stroschein 11 4 1 6
AGEC 411 01 Robert Haggerty 11 6 4 1
AGEC 413 01 Robert Spear 12 3 4 5 2 1
AGEC 415 01 Larry Van Tassell 11 10 1
AGEC 526 01 Scott Matulich 7 2 5
AGEC 527 01 Stephen Cooke 5 3 2
AGED 180 01 Lori Moore 23 1 14 5 1 3
AGED 351 01 Lou Riesenberg 11 4 6 1
AMST 301 01 Walter Hesford 26 14 8 3 1
ANTH 100 01 Mark Warner 104 15 31 31 21 8 12
ANTH 220 01 Fumiyasu Arakawa 138 4 48 53 19 10 8
ANTH 230 01 Robert Sappington 28 1 7 9 9 2 1
ANTH 251 01 Donald Tyler 36 1 10 14 8 1 3
ANTH 420 01 Laura Putsche 12 3 4 2 2
ANTH 422 01 Rodney Frey 13 11 2
ANTH 427 02 Virginia Babcock 13 1 2 6 4 1
ANTH 462 01 Laura Putsche 33 3 8 20 3 1
ARBC 101 01 Anisah El-Mansouri 14 1 8 5 1
ARCH 151 01 Randall Teal 150 8 72 40 13 6 19
ARCH 253 01 Roman Montoto 23 1 9 10 2 1
ARCH 253 02 Randall Teal 22 2 9 11 2
ARCH 253 03 Xiao Hu 23 2 11 12
ARCH 353 01 Matthew Brehm 16 7 7 1
ARCH 353 02 Dillon Ellefson 16 4 11 1
ARCH 353 03 Xiao Hu 10 4 6
ARCH 385 01 Anne Marshall 68 5 29 22 11 2 4
ARCH 404 04 Matthew Brehm 10 1 5 3 1
ARCH 453 01 Roman Montoto 10 5 4 1
ARCH 453 02 Anne Marshall 13 6 5 1
ARCH 463 01 Phillip Mead 63 1 26 31 5 1
ARCH 465 01 Kenneth Carper 51 1 8 26 12 3
ARCH 483 01 D. Reese 71 2 27 35 8
ARCH 504 02 Randall Teal 15 9 6
ARCH 504 03 Kevin Van Den Wymelenberg 6 3 1 1
ARCH 504 04 Frank Jacobus 12 1 8 4
ARCH 510 02 D. Reese 13 9 4
ARCH 510 04 Robert Thornton 9 7 1
ARCH 510 05 Roman Montoto 11 2 7 4
ARCH 553 01 Bruce Haglund 14 12 2
Output
Subject = 'ACCT'
Course = '201'
Section = '01'
Firstname = 'Karin'
Lastname = 'Hatheway Dial'
Completed = '56'
Dropped = '6'
A = '19'
B = '9'
C = '16'
D = '2'
F = '5'
...
Subject = 'AGEC'
Course = '101'
Section = '01'
Firstname = 'Larry'
Lastname = 'Van Tassell'
Completed = '36'
Dropped = '1'
A = '20'
B = '15'
C = ''
D = '1'
F = ''
...
Subject = 'ARCH'
Course = '553'
Section = '01'
Firstname = 'Bruce'
Lastname = 'Haglund'
Completed = '14'
Dropped = ''
A = '12'
B = '2'
C = ''
D = ''
F = ''
But the data really does need to be cleaner.
Comments:
No, unpack is the right function, not substr.
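This is an exact copy and paste from my text file. I tried many unpack combinations without success. Actually, as you can see, I managed to extract everything the way I want up to the "...Completed the Class" column; the remaining columns may be blank, which makes my life harder.
Actually, if there were some anchor instead of the empty fields, I could do it with unpack. Thanks.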
@tchrist (re substr) Answer amended accordingly.