BeautifulSoup: getting the text that is not wrapped in a child tag

Posted 2021-04-11


I have some HTML like this:

<p><span class="map-sub-title">abc</span>123</p>

I am using BeautifulSoup, and this is my code:

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html, "lxml")
p = soup1.text

The result is 'abc123'.

But I want the result to be '123', not 'abc123'.

Answer

Even when a tag contains more than one thing, you can still look at just its strings. Use the .strings generator:

>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.p.strings)
['abc', '123']
>>> list(soup1.p.strings)[1]
'123'
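
Indexing with [1] only works while the markup has exactly this shape. The sketch below shows why, using a made-up variation of the question's markup with extra whitespace; html.parser is used here only so the snippet does not depend on lxml. In this variation the plain .strings generator also yields whitespace-only strings, while .stripped_strings skips them and strips the rest:

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the question's markup, with extra whitespace.
html = '<p> <span class="map-sub-title">abc</span> 123 </p>'
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.p.strings)             # includes the whitespace-only string
clean = list(soup.p.stripped_strings)  # whitespace-only strings dropped, rest stripped

print(raw)
print(clean)
print(clean[-1])  # last string inside the <p>
```

Here raw[1] is 'abc' rather than '123', so taking the last stripped string is a little more forgiving than a fixed index, though it still assumes the wanted text comes last.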

Another answer

You can use decompose() to remove the span tag and then read the text you want.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")

for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()

print(soup.text)

Another answer

You can also use extract() to remove the unwanted tag and then read the text from what is left.

from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()

print(soup1.text)
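
Both decompose() and extract() modify the parsed tree in place. The practical difference is that extract() returns the element it removed, whereas decompose() destroys it. A minimal sketch of that, assuming the same markup as the question (html.parser is used only to avoid the lxml dependency; the variable name removed is just for illustration):

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the <span> from the tree and returns it.
removed = soup.p.span.extract()

print(removed.text)  # text of the detached <span>
print(soup.p.text)   # what is left inside the <p>
```

If the removed text is still needed later, extract() keeps it available; if not, either method does the job.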

Another answer

Every answer in this thread looks acceptable, but I'll point out one more approach:

soup.find("span", {'class':'map-sub-title'}).next_sibling

next_sibling lets you navigate between elements that share the same parent, in this case the p tag.
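
The one-liner raises AttributeError if find() returns None, and next_sibling can be another tag rather than a string. A defensive sketch, assuming the question's markup; text_after is a hypothetical helper, not part of Beautiful Soup, and html.parser is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")


def text_after(tag):
    """Return the string directly after `tag`, or None if there is none.

    Hypothetical helper for illustration only.
    """
    sibling = tag.next_sibling if tag is not None else None
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


result = text_after(soup.find("span", {"class": "map-sub-title"}))
missing = text_after(soup.find("span", {"class": "no-such-class"}))

print(result)
print(missing)
```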

Another answer

One way is to use contents on the parent tag (in this case, <p>).

If you know the position of the string, you can index it directly:

>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'

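If you want a general solution that does not depend on knowing the position, you can check whether each item in contents is a NavigableString, like this: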

>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']

With the second approach you get all the text that is a direct child of the <p> tag. For completeness, here is one more example:

>>> html = '''
... <p>
...     I want
...     <span class="map-sub-title">abc</span>
...     foo
...     <span class="map-sub-title">abc2</span>
...     text
...     <span class="map-sub-title">abc3</span>
...     only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
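
The same direct-children filter can probably be written without the isinstance check by asking find_all() for strings only and turning recursion off. This is a sketch under assumptions: it uses the string argument (called text in older Beautiful Soup releases), a shortened made-up version of the markup above, and html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example markup.
html = '<p>I want <span class="map-sub-title">abc</span> this <span>x</span> only</p>'
soup = BeautifulSoup(html, "html.parser")

# string=True matches text nodes; recursive=False limits the search
# to direct children of <p>, so text inside the spans is skipped.
direct = soup.find("p").find_all(string=True, recursive=False)
text = " ".join(s.strip() for s in direct)

print(text)
```

Like the contents version, this also picks up comments that sit directly inside the tag, since they are string nodes too.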
