Python Regex: How to select lines between two patterns

【Title】: Python Regex: How to select lines between two patterns 【Posted】: 2021-10-12 03:01:19 【Question】:

Consider a typical live-chat export like the following:

Peter (08:16): 
Hi 
What's up? 
;-D

Anji Juo (09:13): 
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?

Anji Juo (19:13): 
I don't know where it is. 

Anji Juo (19:14): 
Do you by any chance know where I can catch a taxi ?
????????????

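To convert this raw text file into a DataFrame, I need to write a regular expression that identifies the column fields (time, name, message) and then extracts the corresponding values. The table below shows the desired result.

See https://regex101.com/r/X3ubqF/1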

Index(time)     Name        Message
08:16           Peter       Hi 
                            What's up? 
                            ;-D
09:13           Anji Juo    Hey, I'm using WhatsApp!
11:17           Peter       Could you please tell me where is the feedback?
19:13           Anji Juo    I don't know where it is. 
19:14           Anji Juo    Do you by any chance know where I can catch a taxi ?
                            ????????????

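The regular expression r"(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\)" extracts the values for the time and name columns correctly, but I don't know how to match and extract the message that each sender wrote at each time index.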
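As a quick check of what the question's pattern already captures, here is a minimal sketch that runs it over a shortened copy of the chat. The variable names `header`, `sample`, and `headers` are illustrative and not part of the original post.

```python
import re

# The pattern from the question: a lazy name followed by an H:MM or HH:MM
# time in parentheses.
header = re.compile(r"(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\)")

# Shortened copy of the chat data (assumption: same layout as the full file).
sample = (
    "Peter (08:16): \nHi \nWhat's up? \n;-D\n\n"
    "Anji Juo (09:13): \nHey, I'm using WhatsApp!\n"
)

# Collect (time, name) pairs in the column order the question asks for.
headers = [(m.group("Index"), m.group("Name")) for m in header.finditer(sample)]
print(headers)
```

Because the pattern is not anchored to the start of a line, a message body that happens to contain something like "(12:30)" would also match. Anchoring with `^` and `re.M`, as the answers do, is one way to avoid that.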

【Answer 1】:

You can parse the string with the re module (regex101):

import re

import pandas as pd

s = """
Peter (08:16): 
Hi 
What's up? 
;-D

Anji Juo (09:13): 
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?

Anji Juo (19:13): 
I don't know where it is. 

Anji Juo (19:14): 
Do you by any chance know where I can catch a taxi ?
???
"""


all_data = []
for name, time, message in re.findall(
    r"^\s*(.*?)\s+\(([^)]+)\):\s*(.*?)(?:\n\n|\Z)", s, flags=re.M | re.S
):
    # The groups are captured as (name, time, message); reorder them
    # so they line up with the column labels below.
    all_data.append((time, name, message))

df = pd.DataFrame(all_data, columns=["Index(time)", "Name", "Message"])
print(df)

Prints (exact column widths depend on the pandas display settings):

  Index(time)      Name                                                      Message
0       08:16     Peter                                        Hi \nWhat's up? \n;-D
1       09:13  Anji Juo                                     Hey, I'm using WhatsApp!
2       11:17     Peter              Could you please tell me where is the feedback?
3       19:13  Anji Juo                                   I don't know where it is. 
4       19:14  Anji Juo  Do you by any chance know where I can catch a taxi ?\n???\n
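The table in the question puts each message line on its own row and leaves the time and name blank on continuation lines. A minimal sketch of that expansion, reusing the pattern from this answer on a shortened sample, could look like this. The names `chat`, `block`, and `rows` are illustrative, and stripping whitespace from each line is an assumption about the desired output.

```python
import re

# Shortened copy of the chat data (assumption: same layout as the full file).
chat = (
    "Peter (08:16): \nHi \nWhat's up? \n;-D\n\n"
    "Anji Juo (09:13): \nHey, I'm using WhatsApp!\n"
)

# Same pattern as in the answer: name, time in parentheses, then the
# message up to the next blank line or the end of the string.
block = re.compile(r"^\s*(.*?)\s+\(([^)]+)\):\s*(.*?)(?:\n\n|\Z)", flags=re.M | re.S)

rows = []
for name, time, message in block.findall(chat):
    # The first line carries time and name; continuation lines leave them blank.
    for i, line in enumerate(message.strip().splitlines()):
        if i == 0:
            rows.append((time, name, line.strip()))
        else:
            rows.append(("", "", line.strip()))

for row in rows:
    print(row)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=["Index(time)", "Name", "Message"])` in the same way as `all_data` above.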

【Answer 2】:

Use

(?m)^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):\s*(?P<Quote>.*(?:\n(?!\n).*)*)

See the regex proof.

Python code

import re

s = "Peter (08:16): \nHi \nWhat's up? \n;-D\n\nAnji Juo (09:13): \nHey, I'm using WhatsApp!\n\nPeter (11:17):\nCould you please tell me where is the feedback?\n\nAnji Juo (19:13): \nI don't know where it is. \n\nAnji Juo (19:14): \nDo you by any chance know where I can catch a taxi ?\n???\n"

regex = r"^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):\s*(?P<Quote>.*(?:\n(?!\n).*)*)"

print(re.findall(regex, s, re.M))

Result

[('Peter', '08:16', "Hi \nWhat's up? \n;-D"), ('Anji Juo', '09:13', "Hey, I'm using WhatsApp!"), ('Peter', '11:17', 'Could you please tell me where is the feedback?'), ('Anji Juo', '19:13', "I don't know where it is. "), ('Anji Juo', '19:14', 'Do you by any chance know where I can catch a taxi ?\n???\n')]
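Since the pattern uses named groups, one option for turning the matches into records is `re.finditer` together with `Match.groupdict()`. The sketch below uses a shortened sample. The name `records` and the final `strip()` call (which drops the trailing newline visible in the last tuple above) are additions, not part of the original answer.

```python
import re

# Shortened copy of the chat data (assumption: same layout as the full string).
s = (
    "Peter (08:16): \nHi \nWhat's up? \n;-D\n\n"
    "Anji Juo (09:13): \nHey, I'm using WhatsApp!\n"
)

regex = re.compile(
    r"^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):"
    r"\s*(?P<Quote>.*(?:\n(?!\n).*)*)",
    re.M,
)

# One dict per match, keyed by the group names used in the pattern.
records = [m.groupdict() for m in regex.finditer(s)]
for record in records:
    # Drop leading/trailing whitespace around the whole message.
    record["Quote"] = record["Quote"].strip()

for record in records:
    print(record)
```

A list of dicts like this can be handed to `pd.DataFrame(records)`, which takes the column names from the dict keys.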

Explanation

--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line) (case-
                           sensitive) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  (?P<user>                   group and capture to "user" group:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of "user" group
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \(                       '('
--------------------------------------------------------------------------------
  (?P<hhmm>                group and capture to "hhmm" group:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      \d                       digits (0-9)
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [01]                     any character of: '0', '1'
--------------------------------------------------------------------------------
      \d                       digits (0-9)
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      2                        '2'
--------------------------------------------------------------------------------
      [0-3]                    any character of: '0' to '3'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    :                        ':'
--------------------------------------------------------------------------------
    [0-5]                    any character of: '0' to '5'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of "hhmm" group
--------------------------------------------------------------------------------
  \)                       ')'
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?P<Quote>                 group and capture to "Quote" group:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        \n                       '\n' (newline)
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of "Quote" group
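The part of the pattern that keeps a message together is `.*(?:\n(?!\n).*)*`: it consumes one line, then keeps adding a newline plus the following line for as long as that newline is not immediately followed by another newline. A small isolated sketch, using made-up sample text:

```python
import re

# Made-up sample: two lines, a blank line, then another block.
text = "line one\nline two\n\nnext block\n"

# Consume a line, then repeat "newline + line" while the newline is not
# directly followed by a second newline (that is, not a blank line).
block = re.compile(r".*(?:\n(?!\n).*)*")

first = block.match(text).group()
print(repr(first))
```

The match stops before the blank line, so `first` holds only the first two lines. This is why each message ends at the empty line that separates it from the next sender.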
