beautifulsoup .get_text() is not specific enough for my HTML parsing

Posted 2023-02-23


【Title】beautifulsoup .get_text() is not specific enough for my HTML parsing 【Posted】2015-10-06 09:07:24 【Question】:

Given the HTML below, I want to output only the h1's own text, without "Details about", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

What I want:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

This is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

This is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

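Note: I don't want to simply truncate the string, because I would like this code to be reusable. Ideally, the code would strip out any text that is wrapped in a span.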


【Answer 1】:

One solution is to check whether each child of the h1 contains an HTML tag, and skip it if it does:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; find() returns a tag only if the child contains markup
        if BeautifulSoup(str(content), "html.parser").find():
            continue

        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Skip nested tags such as the <span>; keep only plain text nodes
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
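
Since the question asks for something reusable, the same idea can be wrapped in a small helper. This is only a sketch: the function name `own_text` is made up for illustration, and it relies on `find_all(string=True, recursive=False)`, which returns the text nodes that are direct children of a tag (comment nodes, if any, would be matched too).

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return only the text sitting directly inside `tag`, ignoring nested tags.

    `own_text` is a hypothetical helper name, not part of Beautiful Soup.
    """
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside the <span> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")

for h1 in soup.find_all("h1", attrs={"itemprop": "name"}):
    print(own_text(h1))
```

Because the helper filters by node type rather than by tag name, it drops text wrapped in any child element, not just span.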


【Answer 2】:

You can use extract() to remove all the span tags before reading the text:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> inside the h1 from the tree
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
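
One thing to keep in mind is that extract() modifies the parsed tree, so the span is gone for any later code that uses the same soup. If that matters, a possible variation is to work on a copy of the tag. The sketch below assumes a Beautiful Soup version in which `copy.copy()` on a tag produces a detached copy (supported in recent bs4 releases); the helper name `text_without` is hypothetical.

```python
import copy

from bs4 import BeautifulSoup


def text_without(tag, name="span"):
    """Return the text of `tag` with all `name` descendants removed.

    `text_without` is a hypothetical helper; the original tree is left intact.
    """
    clone = copy.copy(tag)  # detached copy, so the original soup is not changed
    for child in clone.find_all(name):
        child.extract()
    return clone.get_text().strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")

for h1 in soup.find_all("h1", attrs={"itemprop": "name"}):
    print(text_without(h1))
```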

