如何快速计算字符串中连续单个字符的最大数量？

Posted 2023-03-24

技术标签:

【中文标题】如何快速计算字符串中连续单个字符的最大数量？【英文标题】：How can I quickly count the maximum number of consecutive single characters in a string? 【发布时间】：2021-07-22 22:04:31 【问题描述】：

我有一个类似的字符串：但更长

my $a = "000000001111111111000000011111111111111111111111111111111";

我正在计算“1”的数量：

my $total_1_available = $a =~ tr/1//;

而且效果非常好，而且速度非常快。

但是，我还希望（以快速方式）计算连续连续 1 的总数。 MAX COUNT 个连续的“1”。

在上面的例子中，它会返回计数：

11111111111111111111111111111111

因为这是连续的最大值。

所以，我最终得到了 TOTAL_COUNT 和 TOTAL_CONSECUTIVE_COUNT。

我让它与一个 REGEXP 一起工作，它基本上替换了 1，然后计算被替换的内容并循环......这实际上完全没问题并且工作......但它并不“感觉”正确。

理想情况下，我根本不想替换字符串，因为我正在寻找最大连续计数。

但是，我知道在 Perl 中这可能不是最快或最干净的方式。

您能教我一个更好的方法并增加我的学习吗？

按照要求，这是我当前的代码：

 my $a= "0110011001101111";
 my $total_1_available = $a =~ tr/1//;
 print "Total number of 1's = $total_1_available\n";

 my $max_c = 0;
 while ( $a=~s/(1+)/ / ) 
   $max_c = length($1) if length($1) > $max_c;
 
 print "Consecutive count   = $max_c\n";

最终代码：

use strict;
use warnings;
use Benchmark ':all';
use String::Random;

## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.

## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass       = String::Random->new;
my $string     = $pass->randregex($test_regex);

cmpthese(-1, 
    org  => sub  my $max = 0; while ($string =~ /(1+)/g)  my $len = length($1); if ($max < $len)  $max = $len   ,
    hack => sub  my $match = ""; while ($string =~ /($match1+)/g)  $match = $1;  length $match 
);

#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!

【问题讨论】：

我的字符串只有 1 和 0。是的。 tr 给了我需要的总计数，这很好。如果有意义的话，我还希望在整个字符串中连续获取 1 的最大数量。即 001110011 将返回 3，其中 tr 给出 5。你不应该使用$a（或$b），因为它是sort()使用的预定义变量。在您的一个长字符串上尝试我的新解决方案，并使用您的原始解决方案检查结果，看看它是如何工作的。我会检查性能并报告。一些有趣的选择。不要在珍珠中寻找“更清洁”的方式！如果它是正确的、快速的和可读的，那么就可以了。如果您认为一切都应该干净整洁，那么请使用 c# 之类的东西。它必须在那里！ 【参考方案1】：

使用动态正则表达式可以显着提高速度。我们可以使用一个变量来存储最大长度的字符串，然后搜索一个那么长的字符串，再加上一个或多个。理论是我们只需要寻找比我们已有的更长的字符串。

我使用了一个看起来像这样的解决方案

sub hack 
    my $match = "";                        # original search string
    while ($string =~ /($match1+)/g)    # search for $match plus 1 or more 1s
        $match = $1;                       # when found, change to new match
    
    length $match;                         # return max length

并将其与OP描述的原始方法进行比较，结果如下

use strict;
use warnings;
use Benchmark ':all';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;

cmpthese(-1, 
    org  => sub  my $max = 0; while ($string =~ /(1+)/g)  my $len = length($1); if ($max < $len)  $max = $len   ,
    hack => sub  my $match = ""; while ($string =~ /($match1+)/g)  $match = $1;  length $match 
);

输出：

       Rate    org   hack
org  7.31/s     --   -99%
hack 1372/s 18669%     --

这似乎是惊人的高，快了 19000%。这让我觉得我犯了一个错误，但我想不出那会是什么。也许我在正则表达式机器内部遗漏了一些东西，但这将是对原始解决方案的相当大的改进。

【讨论】：

是的，perl 总是有更好的方法，这就是我非常喜欢它的原因。我会回来报告的。对于使用的输入，您的解决方案执行四次循环。 org 有大约 250,000 次通行证。这些通道中的每一个都包括一个缓慢的正则表达式引擎启动 @TLP Re "这似乎是惊人的高"，等你阅读我的answer 并发现你的解决方案比我的 (naïve) XS 快 33%解决方案！ @TLP 我知道有更好的方法，你找到了。确实学到了很多。 @Mark Arnold 不客气。这是一个鼓舞人心的问题。【参考方案2】：

对于短字符串，以下方法比之前提出的所有解决方案都要快：

use List::Util qw( max );

max 0, map length, split /[^1]+/, $s

          Rate  hack  sort   org   max  mxsp    xs
hack   76879/s    --  -12%  -34%  -37%  -48%  -98%   <-- TLP
sort   87664/s   14%    --  -24%  -28%  -41%  -98%   <-- Jim Davis
org   115660/s   50%   32%    --   -6%  -22%  -98%   <-- OP
max   122504/s   59%   40%    6%    --  -17%  -98%   <-- Jim Davis
mxsp  147867/s   92%   69%   28%   21%    --  -97%   <-- ikegami (above)
xs   4950278/s 6339% 5547% 4180% 3941% 3248%    --   <-- ikegami (below)

基准代码：

use Benchmark qw( cmpthese );

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' );

cmpthese(-3, 
    org  => sub  my $max = 0; while ($string =~ /(1+)/g)  my $len = length($1); if ($max < $len)  $max = $len;   ,
    hack => sub  my $match = ""; while ($string =~ /($match1+)/g)  $match = $1;  my $max = length($match); ,
    sort => sub  my $max = ( sort  $b <=> $a  $string =~ /(1+)/g )[0]; ,
    max  => sub  my $max = max 0, map length, $string =~ /(1+)/g; ,
    mxsp => sub  my $max = max 0, map length, split /[^1]+/, $string; ,
    xs   => sub  my $max = longuest_ones_count($string); ,
);

也就是说，最快的解决方案将涉及 XS。以下是我的尝试：

IV longuest_ones_count(SV* sv) 
   IV max = 0;
   IV count = 0;

   // This code works whether the string is upgraded or downgraded.
   STRLEN len;
   char *s = SvPV(sv, len);
   while (len--) 
      if (*(s++) == '1') 
         ++count;
      
      else if (count) 
         if (max < count)
            max = count;

         count = 0;
      
   

   if (max < count)
      max = count;

   return max;

一种使用方式：

use 5.014;
use warnings;

use Inline C => <<'__';

...above code here...

__


say "$_: ", longuest_ones_count($_)
   for qw(
      0
      11111
      011111
      111110
      01110111110
      01111101110
   );

您已经看到这在短字符串中击败了其他解决方案。但是你没有短字符串。对于长字符串，这还不如 TLP 的版本快！！！

与上述相同的基准，但使用

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' ) x 10_000;

       Rate   sort    org    max   mxsp     xs   hack
sort 8.61/s     --   -25%   -31%   -44%   -99%   -99%
org  11.6/s    34%     --    -8%   -24%   -99%   -99%
max  12.5/s    46%     9%     --   -18%   -99%   -99%
mxsp 15.3/s    77%    32%    22%     --   -99%   -99%  <-- ikegami (Perl)
xs   1031/s 11870%  8822%  8118%  6653%     --   -25%  <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797%  8855%    33%     --  <-- TLP

哇，正则表达式引擎很棒！使用 XS 显然可以击败它（通过消除编译模式所需的时间），但有什么意义呢？

【讨论】：

对所有解决方案的好评。对我来说，我的排序解决方案太慢了，我什至无法将它包含在具有更长字符串的基准测试中。我猜是硬件差异。另外，OP 提到大部分输入将为零，这将与拆分解决方案协同工作。可以和xs结合吗？我相信它可以进一步加速。 :) 已回答的解决方案代表了速度的巨大提升以及原始工作示例的简单性。优化和技术实现之间总是存在平衡。谢谢大家。 @TLP 实际上，大多数零有助于您的“破解”解决方案。 @MarkArnold 是的，这就是答案最后一段的意思【参考方案3】：

我可能会这样做：

use List::Util 'max';

my $string = '01011101100000111111001';

my $longest_run = max( 0, map  length  $string =~ /(1+)/g );

获取每个匹配组 1 的长度并选择最大的。插入了一个 0，所以如果没有 undef，你就不会得到。

$ perl -MList::Util=max \
  -E 'say $_, " ", max(0, map  length  /(1+)/g) for @ARGV' \
  0 1 00010110 011101111110100110

0 0
1 1
00010110 2
011101111110100110 6

编辑：@TLP 的评论让我很好奇，因为我喜欢 sort 解决方案。

#!/usr/bin/env perl

use v5.16;
use warnings;

use Benchmark ':all';
use List::Util 'max';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100';

cmpthese(1_000_000, 
    sort => sub  my $x =    (  sort  $b <=> $a  $string =~ /(1+)/g)[0] ,
    max  => sub  my $x = max(0, map  length     $string =~ /(1+)/g)    ,
);

导致：

        Rate sort  max
sort 84890/s   --  -9%
max  93023/s  10%   --

也许更长/更短的测试字符串会产生不同的结果？

【讨论】：

由于这是一个性能问题，最好将max() 与常规sort() 进行比较。有趣！我用一个 10k 的字符串尝试了 50k 次迭代，得到了sort 724/s, max 926/s (+28%)。我希望sort 进行更多比较，但max 只通过一次——尽管它对每个都进行了length。嗯！我得到原始解决方案（仅使用 1 个length()）比 max 快 19%，比 sort 快 61%。在你的字符串上使用x 1_000。如果您可以动态搜索比当前最大值更长的字符串，那么最终的 hack 将是，例如/(1$max,/g。我想你可以通过改变你正在寻找的字符串来做，但是你不能使用/g修饰符，但必须破解存储匹配的索引，也许$+[0]。【参考方案4】：

Perl 允许您动态创建哈希，您可以使用它来进行计数。

使用该字母循环遍历 $a 的每个字符以增加散列的内容。在循环结束时，您将获得一个哈希，其中包含每个字母的键和包含每个字母的计数的值。

foreach $letter (split //, $a) 
    if $letter eq $last  
        $consecutive_count$letter++
     else 
        if ($consecutive_count$letter > $consecutive_runs$letter)
             $consecutive_runs$letter = $consecutive_count$letter;
             $consecutive_count$letter = 0;
        
       
    $counts$letter++;
    $last = $letter;    


foreach my $key (keys %counts) 
    print "$key occured $counts$letter times";
    print "longest consecutive run for $key was $consecutive_runs$key";

【讨论】：

这似乎并没有解决如何计算连续1的最大数量的问题。是的，我更新了它以添加连续的看看 B::DeParse ，它是一个内部模块，可以导出代码的 perls 内部优化表示的代码版本。这可以帮助您了解您的代码是如何运行的，并且通常可以向您展示一个轻微的错误是如何被误解的，或者代码是如何被忽略的。 perldoc.perl.org/B::Deparse

以上是关于如何快速计算字符串中连续单个字符的最大数量？的主要内容，如果未能解决你的问题，请参考以下文章