Regex for blank string

【Title】regex for blank string 【Posted】2021-12-22 02:41:53 【Question】:

I have a string:

s=

"(2021-06-29T10:53:42.647Z) [Denis]: hi
(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane 
(2021-06-29T11:58:29.053Z) [Nicholas]: 
(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"

I want to extract the text from it. The expected output is:

comments=['hi','TA FOR SHOWING','how are you bane',' ','#END_REMOTE#','VAL 01JUL2021','##ENDED AT 08:07 GMT##'] 

What I tried:

comments=re.findall(r']:\s+(.*?)\n',s) 

The regex works fine otherwise, but I cannot get the blank text as ''.

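【Comments】:

You have to exclude matching ] in the group: ]:\s+([^]\n]*)$

Can you provide the code you use to process the text? The string literal you provided does not compile.

I noticed that you have not accepted answers to any of your questions. Can you review them? If a posted answer worked, please see What should I do when someone answers my question?

@Thefourthbird I have done so... I will certainly do it for the others.

【Answer 1】: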

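You can exclude matching ] inside the capture group instead. If you also want to match the value on the last line, assert the end of the string with $ rather than matching a mandatory newline \n.

Note that \s can match a newline, and the negated character class [^]]* can match a newline as well.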

]:\s+([^]]*)$


import re

regex = r"]:\s+([^]]*)$"

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
    "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

print(re.findall(regex, s, re.MULTILINE))

Output

['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##'] 
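As a rough illustration of why the pattern from the question misbehaves, the sketch below runs both patterns on the same sample. Because \s+ can consume the newline after the blank message, the lazy group ends up capturing the whole next line, and the final line is skipped since no \n follows it. The variable names `attempt` and `fixed` are illustrative and not part of the original answer.

```python
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# Pattern from the question: requires a trailing "\n" and lets \s+ eat newlines.
attempt = re.findall(r"]:\s+(.*?)\n", s)

# Pattern from this answer: asserts the end of a line instead of consuming "\n".
fixed = re.findall(r"]:\s+([^]]*)$", s, re.MULTILINE)

print(len(attempt), attempt)  # 5 items: the blank message swallowed the next line
print(len(fixed), fixed)      # 7 items, with '' for the blank message
```

With the sample above, the blank entry is lost in `attempt` rather than returned as an empty string, which matches the symptom described in the question.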

If you do not want the match to cross lines:

]:[^\S\n]+([^]\n]*)$
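The difference between the two patterns only shows up when a message spans more than one line. The sketch below uses a made-up two-message log (the `(t1)` and `(t2)` timestamps and the names are placeholders, not data from the question) to show that the first pattern keeps the continuation line while the second stops at the end of the first line.

```python
import re

# Hypothetical input: the first message continues on a second line.
text = ("(t1) [A]: first line\n"
        "second line\n"
        "(t2) [B]: bye")

crossing = re.findall(r"]:\s+([^]]*)$", text, re.MULTILINE)
single_line = re.findall(r"]:[^\S\n]+([^]\n]*)$", text, re.MULTILINE)

print(crossing)     # the first capture includes the continuation line
print(single_line)  # the first capture stops at the first newline
```

Which behavior is preferable depends on whether continuation lines should count as part of the message.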

【Comments】:

【Answer 2】:

You can capture everything after the colon in capture group 1; re.findall returns the captures as a list.

re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s) 

Then loop over the list and replace every empty element with a single space.

>>> import re
>>>
>>> s= """
... (2021-06-29T10:53:42.647Z) [Denis]: hi
... (2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
... (2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
... (2021-06-29T11:58:29.053Z) [Nicholas]:
... (2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
... (2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
... (2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##
... """
>>>
>>> talk = [re.sub('^$', ' ', w) for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)]
>>> print(talk)
['hi', 'TA FOR SHOWING', 'how are you bane', ' ', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
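A possible variation on the session above, offered as a sketch rather than as the answerer's method: since an empty string is falsy in Python, `w or " "` can stand in for the `re.sub('^$', ' ', w)` call. The sample is written with explicit "\n" escapes so that the trailing space after the blank message's colon, which the pattern needs, stays visible.

```python
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# Same pattern as in the answer: the colon must be followed by a space or tab,
# so the colons inside the timestamps and in "08:07" never start a match.
raw = re.findall(r"(?m):[ \t]+(.*?)[ \t]*$", s)

# An empty capture is falsy, so "or" substitutes a single space for it.
talk = [w or " " for w in raw]
print(talk)
```

Note that this pattern also trims trailing spaces, so 'how are you bane ' comes back without its final space.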

【Comments】:

【Answer 3】:

Is this what you want?

comments = re.findall(r']:\s(.*?)\n',s)

If the whitespace after : is always a single space, then \s+ should be \s. \s+ means one or more whitespace characters.
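A quick check of this suggestion on the sample, as a sketch: with a single \s the blank message does come back as '', but the pattern still requires a trailing \n, so a last line without one is not returned. The second call, which swaps \n for $ under re.MULTILINE, is an added assumption and not part of this answer.

```python
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# Pattern from this answer: exactly one whitespace character, then up to "\n".
with_newline = re.findall(r"]:\s(.*?)\n", s)

# Assumed tweak: end-of-line anchor instead of a literal newline.
with_anchor = re.findall(r"]:\s(.*?)$", s, re.MULTILINE)

print(len(with_newline))  # 6: the final line has no trailing newline
print(len(with_anchor))   # 7
```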

【Comments】:

【Answer 4】:

With the samples you have shown, please try the following regex.

^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$

Explanation: a detailed breakdown of the regex above.

^\(\d{4}-\d{2}-\d{2}    ## Match the start of the line, a literal (, then 4 digits-2 digits-2 digits.
T(?:\d{2}:){2}          ## Match T, then a non-capturing group of 2 digits plus a colon, repeated 2 times.
\d{2}\.\d{3}Z\)\s+      ## Match 2 digits, a dot, 3 digits, Z and ), then whitespace.
\[[^]]*\]:\s+           ## Match a literal [ up to the first ], then ] and a colon, then whitespace.
([^)]*)$                ## Capturing group 1: any characters other than ), up to the end of the line.

With Python 3:

import re
regex = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
    "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

print(re.findall(regex, varVal, re.MULTILINE))

Output for the OP's shown samples:

['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
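The same pattern can also be written with re.VERBOSE so that the explanation lives next to each piece. This is only a sketch of that layout; the name `LINE_RE` is made up for illustration.

```python
import re

# Whitespace and "#" comments inside the pattern are ignored under re.VERBOSE.
LINE_RE = re.compile(r"""
    ^\(\d{4}-\d{2}-\d{2}   # start of line, opening paren, date as YYYY-MM-DD
    T(?:\d{2}:){2}\d{2}    # T, then time as HH:MM:SS
    \.\d{3}Z\)             # milliseconds, Z, closing paren
    \s+\[[^]]*\]:          # whitespace, then the bracketed name and a colon
    \s+([^)]*)$            # whitespace, then the message as group 1
""", re.MULTILINE | re.VERBOSE)

varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
          "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
          "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
          "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
          "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
          "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
          "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

messages = LINE_RE.findall(varVal)
print(messages)
```

One caveat worth keeping in mind: because group 1 excludes ), a message that itself contains a closing parenthesis would not be matched by this pattern.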

【Comments】:
