Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

Posted 2023-05-07

Published: 2020-06-28 21:07:48

Question:

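I have been searching Google for quite a while, but I cannot find a solution to this problem. Apologies if this is a duplicate.

I am trying to remove all HTML tags from a snippet, but I don't want to use get_text(), because there may be other tags, such as img, that I want to use later. BeautifulSoup does not behave as I expected: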

from bs4 import BeautifulSoup

html = """
<div>
<div class="somewhat">
    <div class="not quite">
    </div>
    <div class="here">
    <blockquote>
        <span>
            <a href = "sth.jpg"><br />content<br /></a>
        </span>
    </blockquote>
    </div>
    <div class="not here either">
    </div>
</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}): # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):    # find all the "here" divs
        for inp in y.find_all("blockquote"):         # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")            # replace br tags
            for link in inp('a'):
                inp.a.unwrap()                       # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()                    # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()              # <----- should unwrap blockquote
            la_lista.append(inp)

print(la_lista)

The result is:

[<blockquote>


content


</blockquote>]

Any ideas?

Answer 1:

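Each item yielded by y.find_all("blockquote") is a bs4.element.Tag, so inp is the blockquote itself. Calling inp('blockquote') is shorthand for inp.find_all('blockquote'), which searches only the descendants of inp. It therefore never matches inp itself, and the body of that last loop never runs.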
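A minimal sketch of why that loop never fires. The one-line HTML string is a made-up stand-in for the markup in the question, not the asker's data:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the question's markup.
soup = BeautifulSoup("<blockquote><span>content</span></blockquote>", "html.parser")
inp = soup.find("blockquote")

# Calling a tag is shorthand for find_all(), which looks at descendants only.
span_matches = inp("span")              # the nested span is a descendant
self_matches = inp("blockquote")        # inp is not a descendant of itself

print(len(span_matches))   # 1
print(len(self_matches))   # 0
```

Because the second search comes back empty, a `for block in inp('blockquote')` loop has nothing to iterate over.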

The fix is to delete:

            for block in inp('blockquote'):
                inp.blockquote.unwrap()

and replace:

la_lista.append(inp)

with:

la_lista.append(inp.decode_contents())
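Putting the two changes together, a compact version might look like the sketch below. The shortened HTML, the added img tag, and the variable names are assumptions for illustration. It also searches for blockquote directly instead of walking the nested div loops from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same nesting as in the question, plus an img to keep.
html = (
    '<div class="here"><blockquote><span>'
    '<a href="sth.jpg"><br/>content<br/></a><img src="pic.png"/>'
    '</span></blockquote></div>'
)

soup = BeautifulSoup(html, "html.parser")
results = []
for quote in soup.find_all("blockquote"):
    # Iterate over the matched tags themselves rather than quote.br / quote.a.
    for br in quote.find_all("br"):
        br.replace_with("\n")            # turn each <br> into a newline
    for tag in quote.find_all(["a", "span"]):
        tag.unwrap()                     # drop the wrapper, keep its children
    # decode_contents() renders the children only, without the outer <blockquote>.
    results.append(quote.decode_contents())

print(results)
```

The list now holds strings rather than Tag objects. If the remaining tags (such as img) need further processing, each string can be parsed again with BeautifulSoup.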

This answer is based on the answer to "BeautifulSoup innerhtml".
