Python - Using regex to find multiple matches and print them out [duplicate]

Posted 2023-02-23


【Title】: Python - Using regex to find multiple matches and print them out [duplicate] 【Posted】: 2011-12-05 05:11:12 【Question】:

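I need to extract the contents of forms from an HTML source file. After some searching I found a good approach, but it only prints the first match. How can I loop through and output the contents of every form, not just the first one?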

import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
# re.search stops at the first match, so only one form is returned
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print(matchObj.group(1))
# Output: Form 1
# I need it to output the content of every form it finds, not just the first one...

【Comments】:

You really don't want to parse HTML with regular expressions: ***.com/questions/1732348/…

Please refer to this: ***.com/questions/3873361/…

【Answer 1】:

Do not use regular expressions to parse HTML.

But if you need to find all regex matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
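
To see why the answer passes re.DOTALL, here is a minimal sketch under an assumed input: the `html` string below is hypothetical and puts a newline inside the second form. It also shows that the result of findall is an ordinary list that can be indexed.

```python
import re

# Hypothetical input: the second form's content spans two lines.
html = "<form>Form 1</form> text <form>Form\n2</form>"
pattern = r"<form>(.*?)</form>"

# Without re.DOTALL, '.' stops at the newline, so the second form is missed.
without_flag = re.findall(pattern, html)
# With re.DOTALL (alias re.S), '.' also matches the newline.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)  # ['Form 1']
print(with_flag)     # ['Form 1', 'Form\n2']
print(with_flag[0])  # findall returns a plain list, so indexing works
```

Because the pattern has exactly one capture group, findall returns the captured text of each match rather than the full `<form>...</form>` span.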

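【Comments】:

What does re.S do?

It makes the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. (docs.python.org/2/library/re.html#re.S)

Oh, I see. I did go to that page but couldn't make sense of the docs, because there was nothing under re.S. Now I know how to read them: re.S and re.DOTALL are the same... Thanks!

You're welcome! re.DOTALL is clearer, so I've updated the answer.

This is the best approach. Just to confirm: since findall returns a plain list, you access the results with matches[0], matches[1], and so on.

【Answer 2】: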

Instead of re.search, use re.findall; it returns all matches in a list. Alternatively, you can use re.finditer (my preferred option), which returns an iterator that you can use to loop over every match found.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
# finditer yields one match object per form; group(1) is the captured content
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    print(match.group(1))
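
One reason to prefer finditer over findall is that each item is a full match object, so positions are available as well as text. The sketch below reuses the question's string; the `results` list and its tuple layout are illustrative choices, not part of the original answer.

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

results = []
for match in re.finditer(r'<form>(.*?)</form>', line, re.S):
    # group(1) is the captured content; start(1)/end(1) give its position in line.
    results.append((match.group(1), match.start(1), match.end(1)))

for content, start, end in results:
    # Slicing the original string with the reported offsets gives the same text.
    print(content, start, end, line[start:end] == content)
```

finditer is also lazy, so on a very large input it avoids building the whole result list up front.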

【Comments】:

What does re.S do?

re.finditer is exactly what I needed! Thanks!

@Pinocchio The docs say that re.S is the same as re.DOTALL:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

(Posting this because I believe people like me often come to ***.com to find answers quickly.)

【Answer 3】:

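Regular expressions are the wrong tool for this job. Since you are using Python, you have an excellent library available for extracting parts of an HTML document: BeautifulSoup.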
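
BeautifulSoup is a third-party package, so as a dependency-free stand-in, here is a rough sketch of the same parser-based idea using the standard library's html.parser module instead. The `FormTextCollector` class is a hypothetical helper written for this illustration, and the sample string adds an assumed `class="x"` attribute to show a case the literal `<form>` regex would miss.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one string per <form> encountered
        self._in_form = False  # True while the parser is inside a <form>

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self.forms.append("")

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current form.
        if self._in_form:
            self.forms[-1] += data


line = 'bla bla bla<form>Form 1</form> some text...<form class="x">Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)  # ['Form 1', 'Form 2']
```

The parser matches on the tag name, so attributes on the opening tag do not matter here, whereas the pattern `<form>(.*?)</form>` would skip the second form in this input.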

【Comments】:

Oh, I didn't know that. I only discovered Python yesterday. :)
