How do I count the characters, words, and lines in a file, using Perl?

Posted 2023-03-24

【Title】How do I count the characters, words, and lines in a file, using Perl? 【Posted】2010-10-21 09:24:12 【Question】:

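What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?

【Comments】:

Presumably without making a system call out to the 'wc' command?

It's not homework, though I admit it might look like it.

【Answer 1】:

Here's the Perl code. Counting words can be somewhat subjective, but I just say a word is any string of characters that isn't whitespace.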

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");
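One caveat with splitting on an explicit /\s+/ is that a line with leading whitespace yields an extra empty first field. The sketch below compares it with split ' ' (the pattern a bare split uses); the sample line and variable names are assumptions made for illustration, not part of the answer.

```perl
use strict;
use warnings;

# A sample line with leading whitespace (hypothetical input).
my $line = "  the quick brown fox\n";

# An explicit /\s+/ keeps the empty field in front of the leading whitespace.
my @explicit = split /\s+/, $line;

# split ' ' skips leading whitespace first, much like awk does.
my @awk_like = split ' ', $line;

printf "split /\\s+/ : %d fields\n", scalar @explicit;
printf "split ' '   : %d fields\n", scalar @awk_like;
```

On this input the first form reports 5 fields and the second reports 4, so the second matches the intuitive word count.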

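【Discussion】:

Very concise, yet not obfuscated.

For the word count, you want scalar(split); this splits on /\s+/ and drops leading empty fields, much like awk does.

As a note related to glenn's, you can say "length;" rather than "length $_;" because Perl uses $_ by default. Using the default is even more beneficial with split(), though, because it even has a default regular expression.

@Paul Tomblin: Here, are you happy now: perl -ne 'END{print"$. $c $w\n"}$c+=length;$w+=split'

@Brad Gilbert: Why not go all the way: open my $fh, "

【Answer 2】:

A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider a four-word string starting with "The " that has extra spaces between some of its words: a delimiter of a single whitespace character gives a word count of six, not four. So, try: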

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

Using \W+ as the delimiter stops punctuation (among other things) from being counted as words.
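A side effect of \W+ is that hyphenated words and contractions are split apart. An alternative is to match the words you do want with m//g in list context instead of splitting on what you don't want. The following is only a sketch: the pattern in $word_re is one hypothetical definition of a "word", and count_words is a helper name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical definition of a "word": runs of word characters, optionally
# joined by single hyphens or apostrophes (so "nit-picking" is one word).
my $word_re = qr/\w+(?:[-']\w+)*/;

sub count_words {
    my ($text) = @_;
    # m//g in list context returns every match; "() =" turns that into a count.
    my $count = () = $text =~ /$word_re/g;
    return $count;
}

my $sample  = "Stop nit-picking, it's fine.";
my @pieces  = split /\W+/, $sample;     # splitting on non-word characters
my $matched = count_words($sample);     # matching whole words

printf "split=%d match=%d\n", scalar @pieces, $matched;
```

For this sample, splitting yields 6 pieces ("nit", "picking", "it", and "s" are separate), while matching yields 4 words.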

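【Discussion】:

Using \W will split "nit-picking" into two words. I don't know whether that is the correct behaviour, but I've always thought of a hyphenated word as one word rather than two.

This is one of those "you pays your money and you makes your choice" things. Personally, I usually roll my own regex to match whatever definition of "word" I need at the time.

Many times split may not be much help, because it is a negative match. A normal regex that matches the characters you do want is generally a better idea. You can of course use m/.../g and call it in list context to do the same thing.

This counts only code points, not characters (= graphemes). It also forgets to set the encoding.

@tchrist - yes, I knew that but didn't think it worth adding. You are probably right, though. As for the other one, good point: I hadn't thought about that. I am going to leave it as it is because it was written so long ago, and given the other answers it doesn't seem worth changing.

【Answer 3】:

The Word Count tool counts the characters, words, and lines in a text file.

【Discussion】:

【Answer 4】: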

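Here. Try this Unicode-savvy version of the wc program.

It skips non-file arguments (pipes, directories, sockets, etc.).

It assumes UTF-8 text.

It counts any Unicode whitespace as a word separator.

It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.

It also works with files that have been compressed in a variety of formats.

It gives counts of paragraphs, lines, words, graphemes, characters, and bytes.

It understands all Unicode linebreak sequences.

It warns about corrupted text files with linebreak errors.

Here's an example of running it: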

   Paras    Lines    Words   Graphs    Chars    Bytes File
       2     2270    82249   504169   504333   528663 /tmp/ap
       1     2404    11163    63164    63164    66336 /tmp/b3
    uwc: missing linebreak at end of corrupted textfiile /tmp/bad
      1*       2*        4       19       19       19 /tmp/bad
       1       14       52      273      273      293 /tmp/es
      57      383     1369    11997    11997    12001 /tmp/funny
       1   657068  3175429 31205970 31209138 32633834 /tmp/lw
       1        1        4       27       27       27 /tmp/nf.cp1252
       1        1        4       27       27       34 /tmp/nf.euc-jp
       1        1        4       27       27       27 /tmp/nf.latin1
       1        1        4       27       27       27 /tmp/nf.macroman
       1        1        4       27       27       54 /tmp/nf.ucs2
       1        1        4       27       27       56 /tmp/nf.utf16
       1        1        4       27       27       54 /tmp/nf.utf16be
       1        1        4       27       27       54 /tmp/nf.utf16le
       1        1        4       27       27      112 /tmp/nf.utf32
       1        1        4       27       27      108 /tmp/nf.utf32be
       1        1        4       27       27      108 /tmp/nf.utf32le
       1        1        4       27       27       39 /tmp/nf.utf7
       1        1        4       27       27       31 /tmp/nf.utf8
       1    26906   101528   635841   636026   661202 /tmp/o2
     131      346     1370     9590     9590     4486 /tmp/perl5122delta.pod.gz
     291      814     3941    25318    25318     9878 /tmp/perl51310delta.pod.bz2
       1     2551     5345   132655   132655   133178 /tmp/tailsort-pl.utf8
       1       89      334     1784     1784     2094 /tmp/til
       1        4       18       88       88      106 /tmp/w
     276     1736     5773    53782    53782    53804 /tmp/www
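The Graphs, Chars, and Bytes columns differ whenever the text contains combining marks or non-ASCII characters. The short sketch below is not taken from the program itself; it uses a sample string chosen for illustration to show how the three counts diverge.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "e" followed by COMBINING ACUTE ACCENT (U+0301):
# one visible character, but two code points.
my $text = "e\x{301}";

my $chars  = length $text;                    # code points
my $graphs = () = $text =~ /\X/g;             # extended grapheme clusters
my $bytes  = length encode('UTF-8', $text);   # bytes once encoded as UTF-8

print "graphs=$graphs chars=$chars bytes=$bytes\n";
```

This prints graphs=1 chars=2 bytes=3, the same ordering (graphemes ≤ characters ≤ bytes) seen in the uncompressed UTF-8 rows of the table.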

Here you go:

#!/usr/bin/env perl 
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################

use 5.10.0;

use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];

use Carp;

$SIG{__WARN__}  = sub {
    confess("FATALIZED WARNING: @_")  unless $^S;
};

$SIG{__DIE__}  = sub {
    confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

process_input(\&countem);

sub countem {
    local $_ = shift;   # was "my $_"; lexical $_ no longer compiles on Perl 5.24+
    my $file = shift;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount,
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+ }x };
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R     }x };
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;
        #$grafcount = eval { @lines = split m{ \R     }x };
        yuck "error splitting lines: $@" if $@;

        $paracount = eval { @paras = split m{ \R{2,} }x };
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        }
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    }
    my $mask = "%8s " x 6 . "%s\n";
    printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) }
                                $paracount, $linecount,
                                $wordcount, $grafcount,
                                $charcount, $bytecount,
                       ), $file;
}

sub show_undef {
    my $value = shift;
    return defined($value)
             ? $value
             : "undef";
}

END {
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}

sub process_input {

    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle

        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        }
    }

    return $path;
}
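【Discussion】:

【Answer 5】:

I stumbled upon this while searching for a character-count solution. Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks to newt's solution.

First, there is a built-in line-count variable anyway, so I just used that. I guess it is probably a bit more efficient. As written, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() was done (implicit split, see: Why does Perl complain "Use of implicit split to @_ is deprecated"?), so I modified it. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count for input files that contain non-ASCII characters.

Here's the code: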

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";

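【Discussion】:

【Answer 6】:

There is the Perl Power Tools project, whose goal is to reconstruct all the Unix bin utilities, primarily for people on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX compliant.

It gets a little ridiculous when you look at the GNU-compliant implementation of true.

【Discussion】:

Most of that fancy "true" implementation is POD. Still, ridiculous.

Schwern: I've been reimplementing quite a few of the PPT tools to make them Unicode-smart. I most recently finished cat -v/od -c, expand, fmt, grep, look, rev, sort, and wc. All are improved compared with the originals.

【Answer 7】:

Non-serious answer: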

system("wc foo");

【Discussion】:

ITYM: my ($lines, $words, $chars) = split(' ', `wc foo`);

【Answer 8】:

Reading the file in fixed-size chunks may be more efficient than reading it line by line. The wc binary does this.

#!/usr/bin/env perl

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;    # Perl's loop-control keyword is "next", not "continue"
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;    # list context counts every whitespace run
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;

    print "\t$lines\t$words\t$chars\t$file\n";
}
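Note that sysread returns a byte count, so $chars above is really a byte total, and a multibyte UTF-8 character can straddle two blocks. One possible way to count decoded characters while still reading in blocks is the buffering idiom from the Encode documentation, sketched below. The helper name count_chars_in_blocks, the tiny block size, and the in-memory sample data are assumptions for the demo, and the sketch makes no attempt to recover from genuinely malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count characters in a UTF-8 byte stream that is read in fixed-size blocks.
sub count_chars_in_blocks {
    my ($fh, $block_size) = @_;
    my ($buffer, $chars) = ('', 0);
    # Append each new block after any bytes left over from the previous one.
    while (read($fh, $buffer, $block_size, length $buffer)) {
        # FB_QUIET decodes as much as it can and leaves the undecoded rest
        # (for example an incomplete trailing character) in $buffer.
        $chars += length decode('UTF-8', $buffer, Encode::FB_QUIET);
    }
    return $chars;
}

# Demo: four 3-byte characters read 4 bytes at a time, so that
# characters are deliberately split across block boundaries.
my $bytes = "\xe2\x80\x99" x 4;
open(my $fh, '<:raw', \$bytes) or die "Could not open in-memory file: $!";
my $total = count_chars_in_blocks($fh, 4);
close $fh;

print "$total\n";
```

The demo prints 4 even though no block boundary coincides with a character boundary until the end of the data.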

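【Discussion】:

I'm not sure this buys you anything. Under the hood, perl's <> operator uses buffered IO. All you've done here is rewrite something that is built in with something that has to be interpreted.

Yep. At least with my install of 5.8.8, Perl buffers 4096 bytes at a time, and there is no performance benefit to doing this manually; if anything, as you suspected, it is actually worse. I like to remind people to think low-level, though :)

So how do you handle UTF-8 characters that are split across a block boundary, hm?

【Answer 9】:

To be able to count CHARS rather than bytes, consider the following (try it with Chinese or Cyrillic letters and a file saved as utf8):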

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;
use bytes;
print length $txt,$/;
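The same bytes-versus-characters difference can be seen without a file on disk. The sketch below uses a three-byte UTF-8 sequence plus a newline as sample data (an assumption chosen for the demo) and compares the length of the raw bytes with the length of the decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+2019 encoded as UTF-8 (three bytes), followed by a newline.
my $raw = "\xe2\x80\x99\n";

my $byte_len = length $raw;                    # length of the undecoded bytes
my $char_len = length decode('UTF-8', $raw);   # length after decoding

print "bytes=$byte_len chars=$char_len\n";
```

This prints bytes=4 chars=2: without decoding, length sees four separate bytes; after decoding it sees one character plus the newline.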

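【Discussion】:

Perl uses the system locale by default. If your system is modern, the system locale will be a UTF-8 encoding, so Perl IO will be UTF-8 by default. If it is not, you should probably use the system locale rather than forcing UTF-8 mode...

Wrong, ephemient. Perl uses the system locale by default, but prints characters 128-255 as "?" for backward compatibility. To print proper UTF-8, you should say binmode($fh, ":utf8"); before using the filehandle. In this case "use utf8;" is useless: it tells Perl that the source code may be in UTF-8, which is unnecessary unless you have variable names like $áccent or $ümlats.

@Chris My Perl 5.8 and 5.10 are both documented as having -C SDL as the default, and perl -e 'print "\xe2\x81\x89\n"' produces "⁉" as expected, not "???" as you would expect.

I think those three hex values combine into a single UTF-8 character. UTF-8 characters will print in Perl, just not the ones in 128-255. Trying any of those three hex codes alone on my machine gives me "?", whereas preceding it with binmode(STDOUT, ":utf8"); gives me "â" for \xe2 and non-printing characters for the other two. As far as I know, I have no default set for "-C".

echo $'\xe2\x80\x99' | perl -ne'print length,$/' outputs 4, while echo $'\xe2\x80\x99' | perl -CSDL -ne'print length,$/' outputs 2, so I must have been misremembering, and Chris is correct.

【Answer 10】:

This may be helpful to Perl beginners. I tried to simulate the word-count feature of MS Word and added one feature that wc on Linux does not show.

It reports the number of lines, the number of words, the number of characters with spaces, and the number of characters without spaces (wc does not give this last count in its output, but Microsoft Word shows it).

Here is the URL: Counting words, characters and lines in a file

【Discussion】: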
