Using BeautifulSoup to find an HTML tag that contains certain text

Posted 2023-02-16

【Original title】: Using BeautifulSoup to find a HTML tag that contains certain text
【Posted】: 2010-10-26 08:16:48
【Question】:

I am trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

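I am able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I would want all the h2 elements returned, not the text matches.

Ideas?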

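【Comments on the question】:

Actually, according to the BeautifulSoup documentation, the h2 restriction is ignored: "If you use text, then any values you give for name and the keyword arguments are ignored."

@Rabarberski I'm not sure how things were in 2010, but by 2012 I found that using text (or string, which replaced it) does not ignore any of the other restrictions.

【Answer 1】: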

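# BeautifulSoup 3 import (Python 2 era); with Beautiful Soup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# A text-only search returns NavigableString objects; .parent is the enclosing tag.
for elem in soup(text=re.compile(r' #\S{11}')):
    print(elem.parent)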

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
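For readers on Python 3, here is a minimal sketch of the same idea with Beautiful Soup 4. It assumes the `bs4` package is installed and uses the bundled "html.parser" backend; the variable names are only illustrative. Because a string-only search does not look at tag names, the sketch filters the parents down to h2 afterwards.

```python
import re

from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html_text, "html.parser")

# A space, a '#', then eleven non-whitespace characters.
pattern = re.compile(r" #\S{11}")

# A string-only search yields NavigableString objects, not tags.
matches = soup.find_all(string=pattern)

# .parent on each matched string is the tag that encloses it.
parents = [s.parent for s in matches]

# The string-only search ignores tag names, so keep only the h2 parents.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Each item in `h2_parents` is a regular tag object, so it can serve as the starting point for further traversal of the tree.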

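【Comments】:

Thanks! It was confusing that it returned what looked like a list of unicode strings. Thanks for your help.

.parent is great! I never would have thought of that. Thanks @nosklo. +1

If you want to iterate over the search output right away, then for is perfect. Otherwise, how about a list comprehension: [elem.parent for elem in soup(text=re.compile(r' #\S{11}'))]

@sotangochips Yes, at first it looks like it returns a plain unicode string, but it is actually a NavigableString, which has .parent. I had to use PyCharm's debugger to realize it is not a plain string.

【Answer 2】: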

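When text= is used as a criterion, BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.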

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> 'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
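The session above is BeautifulSoup 3 on Python 2. As a rough sketch of the same distinction under Beautiful Soup 4 (assuming `bs4` is installed, and with a shortened, hypothetical sample document), a string-only search gives back a NavigableString while a tag-name search gives back a Tag, and `.parent` connects the two:

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened sample document, written without whitespace between the headings.
html_text = "<h2>this is cool #12345678901</h2><h2>this is nothing</h2>"
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

text_hit = soup.find(string=pattern)  # string-only search
tag_hit = soup.find("h2")             # tag-name search

print(type(text_hit).__name__)
print(type(tag_hit).__name__)

# The matched string lives inside the first h2, so its parent is that same Tag object.
print(text_hit.parent is tag_hit)

# From the parent tag, ordinary tree navigation is available.
next_heading = text_hit.parent.find_next_sibling("h2")
print(next_heading.get_text())
```

Checking `isinstance(result, Tag)` is a quick way to tell which kind of object a search returned without opening a debugger.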

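【Comments】:

For me, soup.find('h2', text=pattern) gives the tag directly, with no need to call .parent.

The documentation also says that you can combine the string argument (text in previous versions) with arguments that find tags. In that case, BeautifulSoup returns the tags.

【Answer 3】: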

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

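import re
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))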

returns [<h2> this is cool #12345678901 </h2>].
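One caveat worth sketching, as far as I understand bs4's behaviour: when a tag name and a string filter are combined, the pattern is compared against each tag's `.string`, which is None when the tag has more than one child. A heading with nested markup may therefore be skipped. The example below uses a hypothetical sample document and a hypothetical helper, `h2_with_matching_text`, that matches on the heading's full text instead.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: the second heading wraps part of its text in <b>.
html_text = (
    "<h2>plain #12345678901</h2>"
    "<h2>nested <b>bold</b> #12345678901</h2>"
    "<h2>no match here</h2>"
)
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Tag name plus string filter: bs4 returns Tag objects and checks the
# pattern against each tag's .string.
combined = soup.find_all("h2", string=pattern)


def h2_with_matching_text(tag):
    """Hypothetical helper: match an h2 on its full text, nested tags included."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


# find_all also accepts a function that is called with each tag.
by_full_text = soup.find_all(h2_with_matching_text)

print(combined)
print(by_full_text)
```

Both calls return tags rather than strings, so no `.parent` step is needed; the function-based filter is simply more forgiving about markup inside the heading.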
