BeautifulSoup: excluding content within certain tags

Posted 2023-02-23

【Title】BeautifulSoup exclude content within a certain tag(s) 【Asked】2015-02-20 23:55:35 【Question】:

I have the following expression to find the text in a paragraph:

soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"}).text

How can I exclude all of the text inside the <a> tag? Something like "in <p> but not in <a>"?

【Answer 1】:

Find all text nodes inside the p tag and join them, keeping only those whose parent is not an a tag:

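# Attribute filters must be passed as a dict
p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Keep only the text nodes whose direct parent is not an <a> tag
print(''.join(text for text in p.find_all(text=True)
              if text.parent.name != "a"))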

Demo (note that "link text" is not printed in the second output):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <td id="overview-top">
...     <p itemprop="description">
...         text1
...         <a href="google.com">link text</a>
...         text2
...     </p>
... </td>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})
>>> print(p.text)

        text1
        link text
        text2
>>>
>>> print(''.join(text for text in p.find_all(text=True) if text.parent.name != "a"))

        text1

        text2
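
The parent check above looks only at the immediate parent of each text node, so link text wrapped in another tag (for example <a><b>link text</b></a>) would still be printed. Below is a minimal sketch of one way to handle that case: it walks up the ancestors of each text node until it reaches the starting tag. The helper name text_excluding and its excluded parameter are hypothetical and not part of BeautifulSoup. The sketch also assumes Beautiful Soup 4.4.0 or later, where string=True is the newer spelling of text=True.

```python
from bs4 import BeautifulSoup


def text_excluding(tag, excluded=("a",)):
    """Join the text nodes under `tag`, skipping any nested inside an excluded tag.

    Hypothetical helper for illustration; not part of the BeautifulSoup API.
    """
    parts = []
    for node in tag.find_all(string=True):
        skip = False
        # Walk up from the text node towards `tag`, looking for an excluded ancestor.
        for ancestor in node.parents:
            if ancestor is tag:
                break
            if ancestor.name in excluded:
                skip = True
                break
        if not skip:
            parts.append(node)
    return "".join(parts)


# Sample markup (an assumption for this sketch): the link text is wrapped in <b>.
html = '<p itemprop="description">text1 <a href="google.com"><b>link text</b></a> text2</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", {"itemprop": "description"})
result = text_excluding(p)
print(result)
```

Passing several names, e.g. excluded=("a", "span"), would skip text under any of those tags.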

【Answer 2】:

Using lxml:

import lxml.html as LH

data = """
<td id="overview-top">
    <p itemprop="description">
        text1
        <a href="google.com">link text</a>
        text2
    </p>
</td>
"""

root = LH.fromstring(data)
print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]/text()')))

yields

        text1

        text2

To also get the text of the child tags of <p>, use a double forward slash, //text(), instead of a single forward slash:

print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]//text()')))

yields

        text1
        link text
        text2
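
The two queries above are all-or-nothing: /text() drops the text of every child tag, while //text() keeps all of it. To keep text from other child tags and drop only what sits inside <a>, one option is to combine //text() with an XPath predicate on the ancestor axis. This is a minimal sketch; the sample markup is an assumption made for illustration. Note that not(ancestor::a) also excludes everything if the <p> itself is nested inside a link.

```python
import lxml.html as LH

# Sample markup (an assumption for this sketch): the link text is wrapped in <b>.
data = '<p itemprop="description">text1 <a href="google.com"><b>link text</b></a> text2</p>'
root = LH.fromstring(data)

# //text() selects text at any depth; the predicate drops nodes that have an <a> ancestor.
texts = root.xpath('//p[@itemprop="description"]//text()[not(ancestor::a)]')
result = ''.join(texts)
print(result)
```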
