Regular Expressions in Python: the re Module

Posted 2020-09-28


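Sometimes we need to find strings by a fuzzy pattern rather than by an exact value, and that is what regular expressions are for.

To use regular expressions in Python, import the re module: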

import re

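1. Common regular-expression syntax

-- Single characters

.	any single character except a newline
a|b	the character a or the character b
[afg]	one character that is a, f, or g
[0-4]	one character in the range 0-4
[a-f]	one character in the range a-f
[^a]	one character that is not a
\s	one whitespace character (space, tab, newline, and so on)
\S	one non-whitespace character
\d	[0-9], i.e. any digit from 0 to 9
\D	[^0-9], i.e. any character that is not a digit
\w	[0-9a-zA-Z_], i.e. a letter, a digit, or the underscore
\W	[^0-9a-zA-Z_]
\b	matches a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string). For example, "er\b" matches the "er" in "never" but not the "er" in "verb"
\B	matches a position that is not a word boundary. "er\B" matches the "er" in "verb" but not the "er" in "never"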
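A minimal sketch of the character classes listed above. The sample string and variable names are made up for illustration, and re.findall and re.search are used only as a convenient way to show what each class matches.

```python
import re

text = "id_42 cost: 3.5 USD"  # hypothetical sample input

# \d matches one digit; findall returns every non-overlapping match.
digits = re.findall(r"\d", text)

# \w matches letters, digits and the underscore, so "id_42" stays in one piece.
words = re.findall(r"\w+", text)

# [^...] negates a set: characters that are neither word characters nor whitespace.
punct = re.findall(r"[^\w\s]+", text)

# "er\b" requires "er" to sit at the end of a word.
ends_with_er = [w for w in ("never", "verb") if re.search(r"er\b", w)]

print(digits)        # ['4', '2', '3', '5']
print(words)         # ['id_42', 'cost', '3', '5', 'USD']
print(punct)         # [':', '.']
print(ends_with_er)  # ['never']
```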

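-- Repetition

*	repeat 0 or more times
+	repeat 1 or more times
?	repeat 0 or 1 time
{m}	repeat exactly m times; e.g. [01]{2} matches 00, 11, 01, or 10
{m,n}	repeat m to n times; e.g. a{1,3} matches a, aa, or aaa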
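The quantifiers above can be checked with a small helper. This is only a sketch: the helper name matches and the test strings are assumptions, and re.fullmatch is used so that a string counts only when the whole of it fits the pattern.

```python
import re

def matches(pattern, s):
    """Hypothetical helper: True only if the entire string fits the pattern."""
    return re.fullmatch(pattern, s) is not None

star = [s for s in ("", "a", "aaa") if matches(r"a*", s)]            # 0 or more
plus = [s for s in ("", "a", "aaa") if matches(r"a+", s)]            # 1 or more
optional = [s for s in ("color", "colour", "colouur") if matches(r"colou?r", s)]
exact = [s for s in ("00", "01", "012") if matches(r"[01]{2}", s)]   # exactly 2
ranged = [s for s in ("a", "aaa", "aaaa") if matches(r"a{1,3}", s)]  # 1 to 3

print(star)      # ['', 'a', 'aaa']
print(plus)      # ['a', 'aaa']
print(optional)  # ['color', 'colour']
print(exact)     # ['00', '01']
print(ranged)    # ['a', 'aaa']
```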

-- Position

^	start of the string
$	end of the string (or just before a trailing newline)
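A short sketch of the two anchors, using a made-up three-line string. The re.M flag, which makes ^ and $ work per line, is included only for contrast.

```python
import re

log = "error: disk full\nwarning: low memory\nerror: timeout"  # hypothetical input

# By default ^ matches only at the very start of the string
# and $ only at the very end.
first_word = re.findall(r"^\w+", log)
last_word = re.findall(r"\w+$", log)

# With re.M the anchors also match at the start and end of every line.
line_starts = re.findall(r"^\w+", log, re.M)
line_ends = re.findall(r"\w+$", log, re.M)

print(first_word)   # ['error']
print(last_word)    # ['timeout']
print(line_starts)  # ['error', 'warning', 'error']
print(line_ends)    # ['full', 'memory', 'timeout']
```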

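-- Controlling what is returned (groups)

To pull specific pieces out of a match, wrap the corresponding parts of the regular expression in parentheses. For example:

m = re.search(r"output_(\d{4}).*(\d{4})", "output_1986a.txt1233")

The pattern contains two (\d{4}) groups, so the match yields 1986 and 1233, available as m.group(1) and m.group(2) respectively.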
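The example above can be run as follows. This is a sketch that reuses the same pattern and string; the variable names are illustrative, and the last part shows what happens when a group number that does not exist is requested.

```python
import re

m = re.search(r"output_(\d{4}).*(\d{4})", "output_1986a.txt1233")

whole = m.group(0)   # the entire matched text
first = m.group(1)   # contents of the first pair of parentheses
second = m.group(2)  # contents of the second pair
both = m.groups()    # all groups at once, as a tuple

# The pattern has only two groups, so asking for a third raises IndexError.
try:
    m.group(3)
    has_third = True
except IndexError:
    has_third = False

print(whole)      # output_1986a.txt1233
print(both)       # ('1986', '1233')
print(has_third)  # False
```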

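2. Common methods in re

Python supports regular expressions through the re module. The usual workflow is to compile the regular-expression string into a Pattern object, apply the Pattern to text to obtain a Match object, and then read the information you need from the Match.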
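The three steps of that workflow might look like the sketch below. The date pattern, the sample sentence, and the variable names are assumptions chosen for illustration.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Step 2: apply the Pattern to text; the result is a Match object or None.
match = date_pattern.search("released on 2020-09-28, updated later")

# Step 3: read the information out of the Match object.
if match is not None:
    year, month, day = match.groups()
    start, end = match.span()
    print(year, month, day)  # 2020 09 28
    print(start, end)        # 12 22
```

Checking for None before touching the Match matters, because calling a method such as group() on None raises AttributeError.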

1) The compile method

>>> help(re.compile)

Help on function compile in module re:

compile(pattern, flags=0)

Compile a regular expression pattern, returning a pattern object.

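As shown above, compile returns a Pattern object. The second argument, flags, sets the matching mode; several flags can be combined with the bitwise OR operator "|", and flags can also be written inside the pattern string itself (for example "(?i)"). A Pattern object cannot be instantiated directly; it can only be obtained from compile. The matching modes are:

1). re.I (re.IGNORECASE): ignore case

2). re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line

3). re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline

4). re.L (re.LOCALE): makes \w \W \b \B and case-insensitive matching depend on the current locale (in Python 3 this flag is only allowed with bytes patterns)

5). re.U (re.UNICODE): makes the predefined classes \w \W \b \B \s \S \d \D follow Unicode character properties (already the default for str patterns in Python 3)

6). re.X (re.VERBOSE): verbose mode; the pattern may span several lines, whitespace is ignored, and comments are allowed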
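A small sketch of how the flags above can be combined, either with "|" or inline in the pattern string. The sample text and variable names are made up.

```python
import re

text = "Python\npython\nPYTHON"  # hypothetical input

# Combine flags with bitwise OR: ignore case, and let ^ and $ match on every line.
combined = re.compile(r"^python$", re.I | re.M)
all_lines = combined.findall(text)

# The same two flags written inline at the start of the pattern string.
inline = re.compile(r"(?im)^python$")
same_result = inline.findall(text)

# re.S lets "." match a newline, which it does not do by default.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"a.b", "a\nb", re.S)

print(all_lines)          # ['Python', 'python', 'PYTHON']
print(same_result)        # ['Python', 'python', 'PYTHON']
print(without_s is None)  # True
print(with_s is None)     # False
```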

For example:

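import re

# The next two calls are equivalent (both return None here,
# because the string does not start with 're').
pattern = re.compile(r're')
pattern.match('This is re module of python')
re.match(r're', 'This is re module of python')

# The next two patterns are equivalent.
pattern1 = re.compile(r"""\d + # integer part
                          \.   # decimal point
                          \d * # fractional part""", re.X)
pattern2 = re.compile(r'\d+\.\d*')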

2) The match method

>>> help(re.match)

Help on function match in module re:

match(pattern, string, flags=0)

Try to apply the pattern at the start of the string, returning

a match object, or None if no match was found.

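match tries to match the pattern at the start of the string. It returns a Match object on success and None on failure. The flags argument takes the same matching modes described for compile. group is a method of the Match object that returns what a given group matched; if the pattern uses groups to pick out parts of the string, group returns the string matched by each one.

>>> match = re.match(r'This', 'This is re module of python')
>>> print(match)
<re.Match object; span=(0, 4), match='This'>
>>> print(match.group())
This
>>> match = re.match(r'python', 'This is re module of python')
>>> print(match)
None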
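Because match returns None on failure, code that uses it usually checks the result before calling group. A minimal sketch, with a hypothetical helper named first_word:

```python
import re

def first_word(text):
    """Return the leading run of word characters, or None if text does not start with one."""
    m = re.match(r"\w+", text)
    return m.group() if m else None

print(first_word("This is re module of python"))  # This
print(first_word(" leading space"))               # None
```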

3) The search method

>>> help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)

Scan through string looking for a match to the pattern, returning

a match object, or None if no match was found.

search() scans the whole string for a match, whereas match() only tries at the start of the string. The pattern below contains two groups, the contents of the two pairs of parentheses, so calling match.group(3) raises an IndexError because no such group exists. If the groups are given names, groupdict() is available as well:

>>> match = re.search(r'(?P<first>\bt\w+)\W+(?P<second>\w+)', 'This is test for python group')
>>> print(match)
<re.Match object; span=(8, 16), match='test for'>
>>> print(match.group())
test for
>>> print(match.group(0))
test for
>>> print(match.group(1))
test
>>> print(match.group(2))
for
>>> print(match.groupdict())
{'first': 'test', 'second': 'for'}
>>> print(match.groupdict()['first'])
test
>>> print(match.groupdict()['second'])
for
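The difference between match and search can be seen side by side. This sketch reuses the sentence from the earlier examples; the variable names are illustrative.

```python
import re

text = "This is re module of python"

at_start = re.match(r"python", text)    # only tries position 0
anywhere = re.search(r"python", text)   # scans the whole string
anchored = re.search(r"^python", text)  # ^ pins search to the start again

print(at_start)         # None
print(anywhere.span())  # (21, 27)
print(anchored)         # None
```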

4) The split method

>>> help(re.split)

Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)

Split the source string by the occurrences of the pattern,

returning a list containing the resulting substrings.

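split cuts the given string at every place the pattern matches and returns the pieces as a list; maxsplit is the maximum number of splits to perform (0, the default, means no limit).

>>> results = re.split(r'\d+', 'fasdf12fasdf4fasf1fasdf123')
>>> type(results)
<class 'list'>
>>> print(results)
['fasdf', 'fasdf', 'fasf', 'fasdf', '']
>>> results = re.split(r'-', '2013-11-12')
>>> print(results)
['2013', '11', '12']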
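A brief sketch of the maxsplit parameter, plus two variations that go slightly beyond the text above: splitting on a set of separators, and wrapping the pattern in parentheses so that the separators are kept in the result. The input strings are made up.

```python
import re

# maxsplit=1 stops after the first split; the rest stays in one piece.
limited = re.split(r"-", "2013-11-12", maxsplit=1)

# A character set splits on any run of commas, semicolons or whitespace.
fields = re.split(r"[,;\s]+", "a, b;c   d")

# With a capturing group, the matched separators are included in the list.
with_separators = re.split(r"(\d+)", "ab12cd3ef")

print(limited)          # ['2013', '11-12']
print(fields)           # ['a', 'b', 'c', 'd']
print(with_separators)  # ['ab', '12', 'cd', '3', 'ef']
```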

5) The findall method

>>> help(re.findall)

Help on function findall in module re:

findall(pattern, string, flags=0)

Return a list of all non-overlapping matches in the string.

If one or more groups are present in the pattern, return a

list of groups; this will be a list of tuples if the pattern

has more than one group.

Empty matches are included in the result.

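findall returns a list of all matches. If the pattern contains no groups, each item is the full matched string; if it contains one group, each item is the string matched by that group; if it contains more than one group, each item is a tuple of the group strings.

>>> results = re.findall(r'\bt\w+\W+\w+', 'this is test for python findall')
>>> results
['this is', 'test for']
>>> results = re.findall(r'(\bt\w+)\W+(\w+)', 'this is test for python findall')
>>> results
[('this', 'is'), ('test', 'for')]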
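The effect of groups on the shape of the findall result can be compared directly. A minimal sketch with a made-up input string:

```python
import re

text = "width=30, height=45"  # hypothetical input

no_group = re.findall(r"\w+=\d+", text)        # full matches
one_group = re.findall(r"\w+=(\d+)", text)     # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)  # one tuple per match

# A list of 2-tuples converts straight into a dict.
as_dict = dict(two_groups)

print(no_group)    # ['width=30', 'height=45']
print(one_group)   # ['30', '45']
print(two_groups)  # [('width', '30'), ('height', '45')]
print(as_dict)     # {'width': '30', 'height': '45'}
```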

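6) The sub and subn methods

sub(pattern, repl, string, count=0, flags=0)

subn(pattern, repl, string, count=0, flags=0)

sub finds every match of the pattern in string and replaces it with repl. count is the maximum number of replacements; if it is omitted, all matches are replaced. The return value is the string after replacement. repl can be a string or a function. When it is a function, it must accept a single argument, the Match object for the current match (the same kind of object that match and search return), and must return the replacement string. When the replacement has to be computed from the matched text, the function form tends to be clearer.

>>> print(re.sub(r'(\w+) (\w+)', r'\2 \1', 'i say, hello world!'))
say i, world hello!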
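When repl is a function, it is called once per match with the Match object. The sketch below uses a hypothetical callback named double and a made-up sentence to show the idea.

```python
import re

def double(match):
    """Replacement callback: receives the Match object, returns the replacement string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 12 pears")

print(doubled)  # 6 apples and 24 pears
```

A plain replacement string could not do this, since the new text depends on a calculation over what was matched.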

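subn works almost exactly like sub. The difference is the return value: sub returns the string after replacement, while subn returns a two-element tuple whose first element is the string after replacement and whose second element is the number of replacements made. If count is given, that number never exceeds count.

>>> print(re.subn(r'(\w+) (\w+)', r'\2 \1', 'i say, hello world!'))
('say i, world hello!', 2)
>>> print(re.subn(r'(\w+) (\w+)', r'\2 \1', 'i say, hello world!', count=1))
('say i, hello world!', 1)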
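The tuple returned by subn is convenient to unpack. A minimal sketch that collapses runs of whitespace, with a made-up input string:

```python
import re

messy = "a  b \t c\n d"  # hypothetical input with three runs of whitespace

# Replace every run of whitespace with a single space and count the replacements.
cleaned, replaced = re.subn(r"\s+", " ", messy)

# With count=2 only the first two runs are replaced.
partial, partial_count = re.subn(r"\s+", " ", messy, count=2)

print(repr(cleaned), replaced)       # 'a b c d' 3
print(repr(partial), partial_count)  # 'a b c\n d' 2
```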
