Python Regex: How to select lines between two patterns
【Posted】: 2021-10-12 03:01:19 【Question】: Consider a typical live-chat transcript such as the following:
Peter (08:16):
Hi
What's up?
;-D

Anji Juo (09:13):
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?

Anji Juo (19:13):
I don't know where it is.

Anji Juo (19:14):
Do you by any chance know where I can catch a taxi ?
????????????
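To convert this raw text file into a DataFrame, I need to write a regex that identifies the column names and then extracts the corresponding values.
See https://regex101.com/r/X3ubqF/1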
Index(time)  Name      Message
08:16        Peter     Hi
                       What's up?
                       ;-D
09:13        Anji Juo  Hey, I'm using WhatsApp!
11:17        Peter     Could you please tell me where is the feedback?
19:13        Anji Juo  I don't know where it is.
19:14        Anji Juo  Do you by any chance know where I can catch a taxi ?
                       ????????????
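The regex r"(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\)"
extracts the values of the time and name columns perfectly, but I don't know how to highlight and extract the message that each sender wrote at each time index.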
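One way to read "lines between two patterns" literally is to treat every header match as a boundary and slice out the text between the end of one header and the start of the next. The sketch below assumes a shortened copy of the transcript and extends the question's pattern with a line anchor and the trailing "):"; the names `chat`, `header`, and `rows` are illustrative and not part of the original post.

```python
import re

# Shortened sample transcript (an assumption: a blank line separates messages).
chat = (
    "Peter (08:16):\nHi\nWhat's up?\n;-D\n\n"
    "Anji Juo (09:13):\nHey, I'm using WhatsApp!\n\n"
    "Peter (11:17):\nCould you please tell me where is the feedback?\n"
)

# The question's header pattern, anchored to a whole line and followed by "):".
header = re.compile(
    r"^(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):[ \t]*$",
    re.M,
)

matches = list(header.finditer(chat))
rows = []
# Pair each header with the one that follows it (None after the last header).
for current, following in zip(matches, matches[1:] + [None]):
    end = following.start() if following else len(chat)
    # The message is whatever lies between this header and the next one.
    message = chat[current.end():end].strip()
    rows.append((current.group("Index"), current.group("Name"), message))

for row in rows:
    print(row)
```

Because the message is taken by slicing rather than by matching, this variant does not depend on how many blank lines separate two messages.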
【Answer 1】: You can parse the string with the re module (regex101):
import re
import pandas as pd

# Sample chat; a blank line separates one message from the next.
s = """
Peter (08:16):
Hi
What's up?
;-D

Anji Juo (09:13):
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?

Anji Juo (19:13):
I don't know where it is.

Anji Juo (19:14):
Do you by any chance know where I can catch a taxi ?
???
"""

# Each match is a (name, time, message) tuple; a message ends at a blank line
# or at the end of the string.
all_data = []
for part in re.findall(
    r"^\s*(.*?)\s+\(([^)]+)\):\s*(.*?)(?:\n\n|\Z)", s, flags=re.M | re.S
):
    all_data.append(part)

# Column names follow the capture-group order: name first, then time.
df = pd.DataFrame(all_data, columns=["Name", "Index(time)", "Message"])
print(df)
Prints (column alignment condensed):
   Name      Index(time)  Message
0  Peter     08:16        Hi\nWhat's up?\n;-D
1  Anji Juo  09:13        Hey, I'm using WhatsApp!
2  Peter     11:17        Could you please tell me where is the feedback?
3  Anji Juo  19:13        I don't know where it is.
4  Anji Juo  19:14        Do you by any chance know where I can catch a taxi ?\n???\n
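The regex in this answer captures the name before the time, whereas the table requested in the question lists the time first, and the last message keeps its trailing newline. A minimal sketch of reordering and trimming the tuples before building a DataFrame follows; the shortened sample string and the names `pattern` and `rows` are assumptions made for illustration.

```python
import re

# Shortened sample (assumption: messages are separated by a blank line).
s = """
Peter (08:16):
Hi
What's up?
;-D

Anji Juo (19:14):
Do you by any chance know where I can catch a taxi ?
???
"""

# Same pattern as in the answer: groups are (name, time, message).
pattern = re.compile(r"^\s*(.*?)\s+\(([^)]+)\):\s*(.*?)(?:\n\n|\Z)", re.M | re.S)

# Reorder each tuple to (time, name, message) and trim surrounding whitespace.
rows = [
    (time, name, message.strip())
    for name, time, message in pattern.findall(s)
]

for row in rows:
    print(row)
```

The resulting list can then be handed to `pd.DataFrame(rows, columns=["Index(time)", "Name", "Message"])`, so the column labels line up with the data.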
【Answer 2】: Use
(?m)^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):\s*(?P<Quote>.*(?:\n(?!\n).*)*)
See the regex proof.
Python code:
import re
s = "Peter (08:16): \nHi \nWhat's up? \n;-D\n\nAnji Juo (09:13): \nHey, I'm using WhatsApp!\n\nPeter (11:17):\nCould you please tell me where is the feedback?\n\nAnji Juo (19:13): \nI don't know where it is. \n\nAnji Juo (19:14): \nDo you by any chance know where I can catch a taxi ?\n???\n"
regex = r"^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):\s*(?P<Quote>.*(?:\n(?!\n).*)*)"
print(re.findall(regex, s, re.M))
Result:
[('Peter', '08:16', "Hi \nWhat's up? \n;-D"), ('Anji Juo', '09:13', "Hey, I'm using WhatsApp!"), ('Peter', '11:17', 'Could you please tell me where is the feedback?'), ('Anji Juo', '19:13', "I don't know where it is. "), ('Anji Juo', '19:14', 'Do you by any chance know where I can catch a taxi ?\n???\n')]
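Since the pattern uses named groups, the matches can also be collected as dictionaries keyed by group name rather than as positional tuples. The following is a sketch on a shortened input; the variable `records` and the step that splits each message into lines are illustrative additions, not part of the original answer.

```python
import re

# Shortened input in the same shape as the string above.
s = (
    "Peter (08:16): \nHi \nWhat's up? \n;-D\n\n"
    "Anji Juo (09:13): \nHey, I'm using WhatsApp!\n"
)

regex = re.compile(
    r"^(?P<user>.*?)\s*\((?P<hhmm>(?:\d|[01]\d|2[0-3]):[0-5]\d)\):"
    r"\s*(?P<Quote>.*(?:\n(?!\n).*)*)",
    re.M,
)

# groupdict() returns one {"user": ..., "hhmm": ..., "Quote": ...} dict per match.
records = [m.groupdict() for m in regex.finditer(s)]

# Optional clean-up: turn each multi-line message into a list of trimmed lines.
for record in records:
    record["Quote"] = [line.strip() for line in record["Quote"].splitlines()]

for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(records)`, which takes the column names from the keys.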
Explanation
--------------------------------------------------------------------------------
(?m) set flags for this block (with ^ and $
matching start and end of line) (case-
sensitive) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
(?P<user> group and capture to "user" group:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of "user" group
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?P<hhmm> group and capture to "hhmm" group:
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[01] any character of: '0', '1'
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
2 '2'
--------------------------------------------------------------------------------
[0-3] any character of: '0' to '3'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
[0-5] any character of: '0' to '5'
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of "hhmm" group
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?P<Quote> group and capture to "Quote" group:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
) end of "Quote" group
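The part of the pattern that keeps a message from running into the next one is the negative lookahead in `\n(?!\n)`: a newline is consumed only when it is not immediately followed by a second newline. A small sketch on a made-up string (`text` is purely illustrative) shows the difference it makes:

```python
import re

text = "line 1\nline 2\n\nline 3"

# With the lookahead, matching stops in front of the blank line.
with_lookahead = re.match(r".*(?:\n(?!\n).*)*", text).group()

# Without it, the repetition walks across the blank line to the end.
without_lookahead = re.match(r".*(?:\n.*)*", text).group()

print(repr(with_lookahead))
print(repr(without_lookahead))
```

The first match covers only "line 1" and "line 2", while the second covers the whole string.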