What is the proper way to get the first separator position for a separator RegExp in Perl

Posted 2023-03-24

【Asked】: 2021-07-16 14:40:54 【Question】:

I am trying to separate a leading text token from a line of text, using perl 5.20 under Debian Linux. For example:

 RED_123-DASH.DOT: PARAM1 PARAM2 ... MANY MORE

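Since the token, which is terminated by a colon or \s, may be followed by a large number of parameters, I only want the position of the first separator. Based on the solution proposed in the Stack Overflow post from Leon Timmermans and on the $-[0] variant, I used the following code for testing: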

#!/usr/bin/env perl
# Tokens.pl ---  Lead Token Test

use warnings;
use strict;

# Separator pattern for the token
my $TSEP = qr/^\s*[\w\-\.]+(:|\s)/;

# Example text with all ID elements inside 
my $TEXT = '  RED_123-DASH.DOT: PARAM1 PARAM2 ...';

# Subroutine for the test
# $text   - the text to strip the token from
# $global - use the global flag or not
#
# Output:
#  TKN:    Token    if position found
#  SKIP:   Original if no position found 

sub stripToken($$) {
    my ($text, $global) = @_ ;
    my $test = 0;
    $test = $text =~ /$TSEP/g if $global;
    $test = $text =~ /$TSEP/  if not $global;
    if ( $test ) {
        my $pos = pos($text);
        return "SKIP: $text" if not $pos;
        return "TKN:  ". substr($text, 0, $pos-1);
    } else {
        return "ORG:  $text";
    }
}

# Test with global flag OK!
print "WITH.GFLAG: ",stripToken($TEXT, 1), "\n";

# Test without global flag NOT OK!
print "NONE.GFLAG: ",stripToken($TEXT, 0), "\n";

For the case where you only want the first match position, Leon proposed a match_positions subroutine that does not use the /.../g flag.

https://***.com/a/87410/3338646

sub match_positions {
    my ($regex, $string) = @_;
    return if not $string =~ /($regex)/; # <----------- pos without /.../g
    return (pos($string) - length $1, pos($string));
}

sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/g) { # <----------- pos with /.../g
        push @ret, [pos($string) - length $1, pos($string)];
    }
    return @ret;
}

But this does not work.

/usr/bin/env perl "Tokens.pl"
WITH.GFLAG: TKN:    RED_123-DASH.DOT
NONE.GFLAG: SKIP:   RED_123-DASH.DOT: PARAM1 PARAM2 ...

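The only variant that works is the one with the /.../g flag.

What is wrong with the /.../ expression?

【Comments】:

What do you mean by "this does not work"? Do you expect the same result with and without /g? pos only works with /g, but since you have not told us what you are trying to do, I cannot tell whether that is your problem.

@Dada Leon avoids the /.../g flag for the first-position test (see the edit) and still gets the positions; in my code this does not work. As you can see, the routine bails out because $pos is empty and returns the "SKIP" text. See ... if $global and ... if not $global in the subroutine, and stripToken(...,1) and stripToken(...,0) for the test.

I don't understand what you are saying. Also, you still have not explained what you are trying to do. If the code works with /g, why do you want a version without /g? And if you want a version without /g, why are you trying to use pos, even though it cannot possibly work?

@Dada I don't understand what you are commenting on. Please see the title, the part about the first separator position, and the solution proposed by Leon.

【Answer 1】: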

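Lesson learned: always test a solution before using it! Although that answer is highly ranked, the code proposed for match_positions in https://***.com/a/87410/3338646 does not work, because pos($string) is only set by a match that uses the /g flag. After a plain /.../ match on a fresh string it is undefined.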

# This proposed routine does not work!
sub match_positions_A {
    my ($regex, $string) = @_;
    return if not $string =~ /($regex)/; # <----------- match without /.../g
    return (pos($string) - length $1, pos($string)); # <----- does not work: pos is undef
}
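The difference can be checked in isolation. The following is a minimal sketch, not code from the thread: the helper name pos_after_match and the sample string are made up for illustration. It runs the same pattern with and without /g on a fresh copy of the string and reports what pos() holds afterwards.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original posts): returns pos() after
# matching /\w+/ against a fresh copy of $string, with or without /g.
sub pos_after_match {
    my ($string, $use_g) = @_;    # fresh copy, so pos() starts out undefined
    if ($use_g) {
        $string =~ /\w+/g or return;   # scalar-context /g match records the end offset
    } else {
        $string =~ /\w+/ or return;    # plain match: pos() is left untouched
    }
    return pos($string);
}

my $with_g    = pos_after_match('  RED: PARAM1', 1);
my $without_g = pos_after_match('  RED: PARAM1', 0);

print "with /g:    ", (defined $with_g    ? $with_g    : 'undef'), "\n";
print "without /g: ", (defined $without_g ? $without_g : 'undef'), "\n";
```

On this input the /g call should report the offset just past "RED", while the plain call reports undef, which is why the SKIP branch is taken in the question.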

The correct code is given in a later answer to the same question, https://***.com/a/87504/3338646.

# This proposed routine works
sub match_positions_B {
    my ($regex, $string) = @_;
    return if not $string =~ /$regex/; # <----------- match without /.../g
    return ($-[0], $+[0]);             # start and end offset of the whole match
}
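To make the meaning of the two values concrete, here is a small sketch under my own assumptions: the helper name first_match_span, the pattern and the sample string are illustrative only. $-[0] is the offset where the last successful match starts and $+[0] is the offset just past its end, so the matched text has length $+[0] - $-[0].

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: start and end offsets of the first match, no /g needed.
sub first_match_span {
    my ($regex, $string) = @_;
    return unless $string =~ $regex;
    return ($-[0], $+[0]);   # @- and @+ are set by every successful match
}

my $line = '  RED_123-DASH.DOT: PARAM1';
my ($start, $end) = first_match_span(qr/[\w\-.]+/, $line);

# substr takes (offset, length), so the length is end minus start.
my $matched = substr($line, $start, $end - $start);
print "start=$start end=$end matched='$matched'\n";
```

Note that substr expects a length as its third argument, not an end offset; the two only coincide when the match starts at offset 0.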

So one possible correct solution to the problem is:

# Subroutine for the test
# $text   - the text to strip the token from
# $global - use the global flag or not
sub stripToken($$) {
    my ($text, $global) = @_ ;
    my $test = 0;
    $test = $text =~ /$TSEP/g if $global;
    $test = $text =~ /$TSEP/  if not $global;
    if ( $test ) {
        my ($offs, $pos) = ($-[0], $+[0]); # <----------- the difference
        return "SKIP: $text" if not $pos;
        # substr takes a length, not an end offset; the -1 drops the separator
        return "TKN:  ". substr($text, $offs, $pos - $offs - 1);
    } else {
        return "ORG: $text";
    }
}

Result:

/usr/bin/env perl "Tokens.pl"
WITH.GFLAG: TKN:    RED_123-DASH.DOT
NONE.GFLAG: TKN:    RED_123-DASH.DOT
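Since $TSEP already captures the separator in group 1, its position can also be read directly from $-[1] instead of being derived from the end of the whole match. This is only a sketch of that variant, not part of the original answer: the helper name lead_token is hypothetical, and unlike stripToken it strips the leading whitespace from the token.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same separator pattern as in the question; group 1 captures the separator.
my $TSEP = qr/^\s*[\w\-\.]+(:|\s)/;

# Hypothetical helper: returns (token, separator offset), or an empty list
# if the line does not start with a token followed by a separator.
sub lead_token {
    my ($text) = @_;
    return unless $text =~ $TSEP;
    my $sep_pos = $-[1];                      # offset where group 1 starts
    my $token   = substr($text, 0, $sep_pos); # everything before the separator
    $token =~ s/^\s+//;                       # assumption: leading blanks are unwanted
    return ($token, $sep_pos);
}

my ($token, $sep_pos) = lead_token('  RED_123-DASH.DOT: PARAM1 PARAM2 ...');
print "token='$token' separator at offset $sep_pos\n";
```

This avoids the $pos-1 arithmetic, which silently assumes that the separator is exactly one character long.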
