Regular Expressions: Zero-Width Assertions

Posted 2020-11-27


[toc]

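1. Introduction to zero-width assertions

A zero-width assertion tests whether the text immediately before or after the current position matches a pattern. The asserted text is not consumed and does not become part of the match result; only the position is matched.

1.1 Use cases

Exclusion search: find lines that do not contain a given string.
Inclusion search: find lines that do contain a given string.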
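To make "zero-width" concrete, here is a minimal sketch that compares a pattern with a lookahead against the same pattern without one. The sample string is made up for illustration and is not from the original article.

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "price: 100USD"

# The lookahead (?=USD) checks that "USD" follows but does not consume it.
with_assertion = re.search(r"\d+(?=USD)", text)
# Without the assertion, "USD" becomes part of the match.
without_assertion = re.search(r"\d+USD", text)

print(with_assertion.group(), with_assertion.span())        # 100 (7, 10)
print(without_assertion.group(), without_assertion.span())  # 100USD (7, 13)
```

Both searches succeed at the same starting position; only the span of the result differs, which is the practical meaning of "the asserted text is not extracted".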

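2. Types of assertions

2.1 Positive lookahead

A positive lookahead, written (?=exp), matches the position immediately before exp. It succeeds only if exp matches starting at the current position, and it consumes no characters.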

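import re

text = 'abcgwcab'
pattern = r'bc(?=gw)'
result = re.search(pattern, text)
print(result.group())

# Output
bc

Explanation: the engine matches bc and then asserts that gw comes next. In "abcgwcab" the assertion holds, so the match succeeds. Because gw is only checked and not consumed, the result is bc, and any further matching continues from the position right after bc.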

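Example:

pattern = r'bc(?=gw)ca'
# No match: after bc the lookahead confirms that gw follows but does not consume it, so ca has to match at the position where gw starts, which fails.

pattern = r'bc(?=gw)gwca'
# Match succeeds. Output
bcgwca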
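The failure of bc(?=gw)ca is easier to see by inspecting where the match ends. The following sketch reuses the article's string and prints the end offset reported by the match object.

```python
import re

text = "abcgwcab"

m = re.search(r"bc(?=gw)", text)
# The match ends right after "bc"; "gw" has been checked but not consumed.
print(m.end())          # 3
print(text[m.end():])   # gwcab

# The next pattern element therefore has to match starting at "gw".
print(re.search(r"bc(?=gw)ca", text))             # None
print(re.search(r"bc(?=gw)gwca", text).group())   # bcgwca
```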

2.2 Negative lookahead

A negative lookahead, written (?!exp), matches a position that is not immediately followed by exp.

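import re

text = 'abcgwcab'
pattern = r'bc(?!ww)gw'
result = re.search(pattern, text)
print(result.group())

# Output
bcgw

Explanation: the engine matches bc and asserts that the next characters are not ww. They are gw, so the assertion holds, and the engine continues matching gw from the position right after bc.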
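A negative lookahead is not the same as a negated character class, because the lookahead also succeeds when nothing follows at all. A small sketch, using a made-up string that is not from the original article:

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "gwa gwb gw"

# (?!b) succeeds when the next character is not "b", including at the end of the string.
lookahead_hits = re.findall(r"gw(?!b)", text)
# [^b] must consume one real character, so the trailing "gw" cannot match.
class_hits = re.findall(r"gw[^b]", text)

print(lookahead_hits)   # ['gw', 'gw']
print(class_hits)       # ['gwa']
```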

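2.3 Positive lookbehind

A positive lookbehind, written (?<=exp), matches the position immediately after exp. It succeeds only if exp matches the text that ends at the current position.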

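import re

text = 'abcgwcab'
pattern = r'(?<=gw)ca'
result = re.search(pattern, text)
print(result.group())

# Output
ca

Explanation: at the position where ca starts, the engine asserts that the text immediately before it is gw. The assertion holds, so ca is matched; gw itself is not part of the result.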

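Example:

import re

text = 'abcgwcab'
# gw is consumed first; the lookbehind then confirms that the two characters just matched are gw
pattern = r'gw(?<=gw)cab'
result = re.search(pattern, text)
print(result.group())

# Output
gwcab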
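A common use of a positive lookbehind is to pull out a value that follows a known key. The sketch below uses a shortened, hypothetical key=value string. It also shows a restriction of Python's built-in re module: a lookbehind pattern must have a fixed width.

```python
import re

# Hypothetical key=value text, shortened for this illustration.
text = "traceid=ac85 spanid=8a0a"

# (?<=spanid=) anchors the match right after "spanid=" without including it.
value = re.search(r"(?<=spanid=)\w+", text).group()
print(value)   # 8a0a

# Python's re module rejects variable-width lookbehind patterns at compile time.
fixed_width_required = False
try:
    re.compile(r"(?<=span\w*=)\w+")
except re.error as exc:
    fixed_width_required = True
    print("not allowed:", exc)
```

When the prefix has a variable length, a capturing group such as span\w*=(\w+) is a simple alternative.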

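2.4 Negative lookbehind

A negative lookbehind, written (?<!exp), matches a position that is not immediately preceded by exp. For example, in (?<!exp)gw the assertion fails if gw is directly preceded by exp and succeeds otherwise.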

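import re

text = 'abcgwcab'
pattern = r'(?<!bc)gw'
result = re.search(pattern, text)
# re.search returns None when nothing matches, so print the result itself;
# calling result.group() here would raise AttributeError
print(result)

# Output
None

Explanation: the engine looks for gw at a position that is not directly preceded by bc. The only gw in the string follows bc, so the assertion fails and there is no match. If gw were preceded by anything else, the assertion would hold and gw would be matched.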

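Example:

import re

text = 'abcgwcab'
pattern = r'gw(?<!bc)ca'
result = re.search(pattern, text)
print(result.group())

# Output
gwca

# The engine matches gw, then asserts that the two characters just before the
# current position are not bc. They are gw, so the assertion holds, and ca is
# matched next. The result is gwca.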
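Because re.search returns None when an assertion rules out every candidate position, calling .group() directly can raise AttributeError. The sketch below wraps the call in a small helper; first_match is a hypothetical name introduced here, not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match.

    Hypothetical helper for illustration; not part of the re module.
    """
    m = re.search(pattern, text)
    return m.group() if m else None

text = "abcgwcab"
print(first_match(r"(?<!bc)gw", text))   # None: the only "gw" follows "bc"
print(first_match(r"(?<!ab)gw", text))   # gw: "gw" is preceded by "bc", not "ab"
```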

3. Exclusion search

3.1 Find strings that do not start with `baidu`

Source text

baidu.com
sina.com.cn

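Code

import re

source_str = 'baidu.com\nsina.com.cn'
str_list = source_str.split('\n')
print(str_list)
# ['baidu.com', 'sina.com.cn']

pattern = r'^(?!baidu).*$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
sina.com.cn

Explanation: ^(?!baidu).*$ anchors at the start of the line and uses the negative lookahead (?!baidu) to require that the line does not begin with baidu. The trailing .*$ then matches the rest of the line.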
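The same filter can be applied to the whole text at once instead of splitting it into lines first. This is a sketch of one alternative, assuming the input is a single newline-separated string; it relies on the re.MULTILINE flag so that ^ and $ match at every line boundary.

```python
import re

source_str = "baidu.com\nsina.com.cn"

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so no explicit split is needed.
matches = re.findall(r"^(?!baidu).*$", source_str, flags=re.MULTILINE)
print(matches)   # ['sina.com.cn']
```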

3.2 Find strings that do not end with `com`

Source text

baidu.com
sina.com.cn
www.educ.org
www.hao.cc
www.redhat.com

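Code

import re

source_str = (
    'baidu.com\n'
    'sina.com.cn\n'
    'www.educ.org\n'
    'www.hao.cc\n'
    'www.redhat.com'
)
str_list = source_str.split('\n')
print(str_list)
# ['baidu.com', 'sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.redhat.com']

pattern = r'^.*?(?<!com)$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
sina.com.cn
www.educ.org
www.hao.cc

Explanation: in ^.*?(?<!com)$, ^ anchors at the start of the line. .*? is a lazy (non-greedy) quantifier: it matches as few characters as possible and extends only as far as the rest of the pattern requires. (?<!com) is a negative lookbehind that requires the text immediately before this position not to be com, and $ anchors at the end of the line. Together, (?<!com)$ matches the end of a string that is not immediately preceded by com.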
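As a sanity check, the regex result can be compared with plain string methods. The sketch below also tries the shorter pattern (?<!com)$, which selects the same lines when only a yes/no answer is needed; the variable names are chosen for this illustration.

```python
import re

hosts = ["baidu.com", "sina.com.cn", "www.educ.org", "www.hao.cc", "www.redhat.com"]

# The pattern from the text above.
by_regex = [h for h in hosts if re.search(r"^.*?(?<!com)$", h)]
# Only the assertion and the end anchor; enough to decide whether a line qualifies.
by_short = [h for h in hosts if re.search(r"(?<!com)$", h)]
# Plain string method for comparison.
by_method = [h for h in hosts if not h.endswith("com")]

print(by_regex)   # ['sina.com.cn', 'www.educ.org', 'www.hao.cc']
print(by_regex == by_short == by_method)   # True
```

For a fixed suffix like this, str.endswith is usually the simpler choice; the lookbehind becomes useful when the condition is one part of a larger pattern.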

3.3 Find lines that do not contain `world`

Source text

I hope the world will be peaceful
Thepeoplestheworldoverlovepeace
Imissyoueveryday
Aroundtheworldin80Days
I usually eat eggs at breakfast

Code

import re

source_str = (
    'I hope the world will be peaceful\n'
    'Thepeoplestheworldoverlovepeace\n'
    'Imissyoueveryday\n'
    'Aroundtheworldin80Days\n'
    'I usually eat eggs at breakfast'
)
str_list = source_str.split('\n')
print(str_list)
# ['I hope the world will be peaceful', 'Thepeoplestheworldoverlovepeace', 'Imissyoueveryday', 'Aroundtheworldin80Days', 'I usually eat eggs at breakfast']

pattern = r'^(?!.*world).*$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
Imissyoueveryday
I usually eat eggs at breakfast

Explanation: ^ anchors at the start of the line. (?!.*world) asserts that, looking ahead from the start of the line, no run of characters followed by world exists; in other words, world appears nowhere in the line. The trailing .*$ then matches the whole line.
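Because assertions consume nothing, several of them can be stacked at the same position to combine an inclusion search with an exclusion search. The following sketch, which goes slightly beyond the original example, keeps lines that contain day and do not contain world.

```python
import re

lines = [
    "I hope the world will be peaceful",
    "Thepeoplestheworldoverlovepeace",
    "Imissyoueveryday",
    "Aroundtheworldin80Days",
    "I usually eat eggs at breakfast",
]

# (?=.*day) requires "day" somewhere in the line; (?!.*world) forbids "world".
# Both are checked at the start of the line before .*$ consumes it.
pattern = re.compile(r"^(?=.*day)(?!.*world).*$")
selected = [line for line in lines if pattern.search(line)]
print(selected)   # ['Imissyoueveryday']
```

The match is case-sensitive, so "Days" in the fourth line does not count as day.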

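4. Hands-on practice

4.1 Log matching (1)

Filter the [ERROR] entries out of a log file. There are two kinds of error entries: those that carry an _eMsg parameter and those that do not.

The requirement is to keep all error entries except the lines with _eMsg=400.

Source text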

[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out

[ERROR][2020-04-02T10:30:17.353+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=0f05e854e38984b3f1f20||spanid=8a980083||cspanid=||serviceName=||errormsg=Handle request failed||_eMsg=400 Bad Request||_eTrace=sprin.web.Exception$BadRequest: 400 Bad Request

[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil] http get error

Code

import re

source_str = (
    '[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out\n'
    '[ERROR][2020-04-02T10:30:17.353+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=0f05e854e38984b3f1f20||spanid=8a980083||cspanid=||serviceName=||errormsg=Handle request failed||_eMsg=400 Bad Request||_eTrace=sprin.web.Exception$BadRequest: 400 Bad Request\n'
    '[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil]'
)
str_list = source_str.split('\n')
# print(str_list)

# The brackets are escaped: an unescaped [ERROR] would be a character class
# matching a single E, R or O, which never matches a line starting with "[".
pattern = r'(^\[ERROR\].*?_eMsg(?!=400).*$)|^\[ERROR\](?!.*_eMsg).*'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out
[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil]

Explanation: (^\[ERROR\].*?_eMsg(?!=400).*$) matches a literal [ERROR] at the start of the line. .*? is lazy and advances only as far as needed to reach _eMsg. _eMsg(?!=400) then asserts that _eMsg is not followed by =400; if it is, this branch fails.

After the alternation operator |, ^\[ERROR\](?!.*_eMsg).* matches [ERROR] at the start of the line and then requires that _eMsg appears nowhere in the rest of the line.

The second branch exists to match error entries that have no _eMsg field.
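The stated requirement (all [ERROR] lines except those with _eMsg=400) can also be written with a single negative lookahead. This is an alternative sketch, not the article's pattern; keep_line is a hypothetical helper name, and the log lines are shortened stand-ins for the real entries.

```python
import re

# One branch: a literal "[ERROR]" at the start, and no "_eMsg=400" anywhere after it.
ERROR_NOT_400 = re.compile(r"\[ERROR\](?!.*_eMsg=400)")

def keep_line(line):
    """Hypothetical helper: True for [ERROR] lines that lack _eMsg=400."""
    # re.match anchors at the start of the string, so no ^ is needed.
    return ERROR_NOT_400.match(line) is not None

# Shortened, made-up log lines that mirror the three cases above, plus a non-error line.
sample = [
    "[ERROR][t1] request failed||_eMsg=Read timed out",
    "[ERROR][t2] request failed||_eMsg=400 Bad Request",
    "[ERROR][t3] http get error",
    "[INFO][t4] ok",
]
kept = [line for line in sample if keep_line(line)]
print(kept)
```

Compiling the pattern once is a reasonable choice when the same filter runs over many log lines.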
