Regex for blank string
【Title】: regex for blank string 【Posted】: 2021-12-22 02:41:53 【Question】: I have a string:
s=
"(2021-06-29T10:53:42.647Z) [Denis]: hi
(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
(2021-06-29T11:58:29.053Z) [Nicholas]:
(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"
I want to extract the message text from it. The expected output is:
comments=['hi','TA FOR SHOWING','how are you bane',' ','#END_REMOTE#','VAL 01JUL2021','##ENDED AT 08:07 GMT##']
What I tried:
comments=re.findall(r']:\s+(.*?)\n',s)
The regex works for the other lines, but I cannot get the blank message back as ''.
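【Comments】:
You have to exclude ] from the match, like ]:\s+([^]\n]*)$
Could you provide the code you use to process the text? The string literal you provided does not compile.
I noticed that you have not accepted an answer on any of your questions. Could you review them, and if a posted answer worked, see What should I do when someone answers my question?
@Thefourthbird I have done that... I will definitely do it for the others.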
【Answer 1】:
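You can exclude ] inside the capture group instead, and if you also want to match the value on the last line, you can assert the end of the string (or, with re.MULTILINE, the end of each line) with $ instead of matching a mandatory newline \n.
Note that \s can match a newline, and the negated character class [^]]* can match a newline as well.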
]:\s+([^]]*)$
import re
regex = r"]:\s+([^]]*)$"
s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, s, re.MULTILINE))
Output
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
If you do not want the match to cross line boundaries:
]:[^\S\n]+([^]\n]*)$
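The difference between the two patterns only shows up when a message runs over more than one line. The following is a minimal sketch on a made-up sample (the "still typing" continuation line and the 10:55:00.000Z timestamp are hypothetical and do not come from the question):

```python
import re

# Hypothetical log: the first message continues on a second line.
s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "still typing\n"
     "(2021-06-29T10:54:53.693Z) [Nicholas]: \n"
     "(2021-06-29T10:55:00.000Z) [Nicholas]: bye")

# \s and [^]] may both match "\n", so the capture can span lines.
crossing = re.findall(r"]:\s+([^]]*)$", s, re.MULTILINE)

# [^\S\n] is "whitespace except newline"; [^]\n] also stops at a newline.
single_line = re.findall(r"]:[^\S\n]+([^]\n]*)$", s, re.MULTILINE)

print(crossing)     # ['hi\nstill typing', '', 'bye']
print(single_line)  # ['hi', '', 'bye']
```

Both variants still return '' for the blank message; which one is appropriate depends on whether continuation lines should count as part of the message.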
【Answer 2】: You can capture everything after the colon in capture group 1; re.findall returns those captures as a list.
re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)
Then loop over the list and replace every empty element with a single space.
>>> import re
>>>
>>> s= """
... (2021-06-29T10:53:42.647Z) [Denis]: hi
... (2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
... (2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
... (2021-06-29T11:58:29.053Z) [Nicholas]:
... (2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
... (2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
... (2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##
... """
>>>
>>> talk = [re.sub('^$', ' ', w) for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)]
>>> print(talk)
['hi', 'TA FOR SHOWING', 'how are you bane', ' ', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
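The post-processing step does not need a second regex. As a small sketch of an alternative (not the answerer's code), an empty string is falsy in Python, so `w or " "` substitutes a space for it. The sample below is a shortened copy of the question's data and assumes the blank line carries one trailing space:

```python
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# Same pattern as above: the text after ": ", with trailing blanks trimmed.
found = re.findall(r"(?m):[ \t]+(.*?)[ \t]*$", s)

# Replace each empty capture with a single space.
talk = [w or " " for w in found]
print(talk)  # ['hi', 'how are you bane', ' ', '##ENDED AT 08:07 GMT##']
```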
【Answer 3】: Is this what you want?
comments = re.findall(r']:\s(.*?)\n',s)
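If the whitespace after : is always a single space, then \s+ should be \s. \s+ means one or more whitespace characters, and since \s also matches a newline, on the blank line it runs past the line break and the capture picks up the next line instead.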
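To make the effect visible, here is a short sketch that runs both patterns on three lines taken from the question (assuming the blank line holds one trailing space and every line ends with a newline):

```python
import re

s = ("(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n")

# \s+ consumes " \n" on the blank line, so group 1 starts on the next line.
greedy = re.findall(r"]:\s+(.*?)\n", s)

# \s consumes only the single space, so group 1 is empty on the blank line.
single = re.findall(r"]:\s(.*?)\n", s)

print(greedy)  # ['how are you bane', '(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#']
print(single)  # ['how are you bane', '', '#END_REMOTE#']
```

Both patterns require a trailing \n, so a final line without a newline is not matched by either of them.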
【Answer 4】: With the samples you have shown, please try the following regex.
^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$
Explanation: a detailed breakdown of the regex above.
^\(\d{4}-\d{2}-\d{2} ##Matching from the start of the line: ( followed by 4 digits-2 digits-2 digits.
T(?:\d{2}:){2} ##Matching T followed by a non-capturing group which matches 2 digits followed by a colon, 2 times.
\d{2}\.\d{3}Z\)\s+ ##Matching 2 digits followed by a dot, 3 digits, Z and ), followed by space(s).
\[[^]]*\]:\s+ ##Matching a literal [ up to the first occurrence of ], followed by ], a colon and space(s).
([^)]*)$ ##Creating the 1st capturing group, which matches any characters other than ) up to the end of the line.
Using Python 3.x:
import re
regex = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, varVal, re.MULTILINE))
The output for the samples shown by the OP is as follows:
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
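As an optional variation, the same pattern can be compiled with re.VERBOSE so that the explanation sits next to each part. This is only a sketch. The name LINE is made up for illustration, and the quantifiers are written with braces ({4}, {2}, {3}), as the breakdown describes:

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
LINE = re.compile(
    r"""
    ^\(\d{4}-\d{2}-\d{2}    # opening parenthesis, then the date
    T(?:\d{2}:){2}          # T, then hours and minutes, each followed by a colon
    \d{2}\.\d{3}Z\)\s+      # seconds, dot, milliseconds, Z, closing parenthesis
    \[[^]]*\]:\s+           # the bracketed name, a colon, whitespace
    ([^)]*)$                # group 1: the message, up to the end of the line
    """,
    re.MULTILINE | re.VERBOSE,
)

var_val = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
           "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
           "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

messages = LINE.findall(var_val)
print(messages)  # ['hi', '', '##ENDED AT 08:07 GMT##']
```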