BS4: Getting text in tag

Posted 2023-02-23

【Title】BS4: Getting text in tag 【Posted】2014-10-04 18:53:17 【Question】:

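I'm using Beautiful Soup. I have a tag like this:

<li><a href="example"> s.r.o., <small>small</small></a></li>

I want to get only the text that sits directly in the anchor <a> tag, without any of the text from the <small> tag in the output; i.e. "s.r.o.,"

I tried find('li').text[0], but it doesn't work.

Is there a command in BS4 that can do this?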

【Question discussion】:

【Answer 1】:

One option is to take the first element from the contents of the a element:

>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
 s.r.o.,

Another way is to find the small tag and take its previous sibling:

>>> print(soup.find('small').previous_sibling)
 s.r.o.,

Well, there are also various alternative/crazy options:

>>> print(next(soup.find('a').descendants))
 s.r.o., 
>>> print(next(iter(soup.find('a'))))
 s.r.o., 
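As a rough sketch of why the attempt in the question fails and how the ideas above generalize: `.text` is a single plain string covering every descendant, so `[0]` picks one character, not one text node. The helper name `direct_text` below is hypothetical (it is not part of bs4); it relies on `find_all(string=True, recursive=False)`, which, assuming a reasonably recent bs4 release, returns only the text nodes that are immediate children of the tag.

```python
from bs4 import BeautifulSoup

data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, 'html.parser')

# .text is one plain str covering every descendant, <small> included,
# so indexing it yields a single character rather than a text node.
li_text = soup.find('li').text
first_char = li_text[0]

def direct_text(tag):
    """Join the text nodes that are immediate children of `tag` (hypothetical helper)."""
    pieces = tag.find_all(string=True, recursive=False)
    return ''.join(pieces).strip()

anchor_text = direct_text(soup.find('a'))
print(repr(li_text), repr(first_char), repr(anchor_text))
```

Unlike `contents[0]`, this also picks up text that appears after a nested tag, which may or may not be what you want.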

【Discussion】:

【Answer 2】:

Use .children

print(next(soup.find('a').children))   # Python 3: use the built-in next() instead of .next()
# s.r.o.,

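【Discussion】:

Thanks, but as far as I know split() with no arguments uses ' ' as the separator, which is useful in this case, but in some cases the text contains spaces and commas, so it won't work. Or am I wrong?

You're right, I'll take a look when I get back.

【Answer 3】:

If you want to loop over and print the contents of all the anchor tags in an HTML string or web page (for a live page, fetch the HTML first, for example with urlopen from urllib.request), this works: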

from bs4 import BeautifulSoup
data = '<li><a href="example">s.r.o., <small>small</small></a></li> <li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')    # calling the soup is shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])    # .contents lists the tag's children; [0] is the leading text node

Output:

s.r.o.,  
2nd
3rd

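a_tag is a list (a bs4 ResultSet) holding every anchor tag; collecting them in one list makes it possible to process them as a group when more than one <a> tag is present.

>>> print(a_tag)
[<a href="example">s.r.o., <small>small</small></a>, <a href="example">2nd</a>, <a href="example">3rd</a>]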
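One caveat with `tag.contents[0]` in a loop: it raises IndexError on an empty anchor and returns a Tag rather than text when the anchor starts with a nested element. A minimal sketch of a more defensive loop follows; the helper name `leading_text` and the extra sample anchors are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<li><a href="example">s.r.o., <small>small</small></a></li>'
        '<li><a href="example"><small>only small</small></a></li>'
        '<li><a href="example"></a></li>')
soup = BeautifulSoup(html, 'html.parser')

def leading_text(tag):
    """Return the first child if it is a text node, else None (hypothetical helper)."""
    first = next(iter(tag.contents), None)   # None when the tag has no children
    if isinstance(first, NavigableString):
        return str(first).strip()
    return None

results = [leading_text(a) for a in soup.find_all('a')]
print(results)
```

Returning None for the awkward cases is only one possible choice; skipping those anchors or falling back to an empty string would work equally well.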

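【Discussion】:

【Answer 4】:

According to the documentation, a tag's text can be retrieved through its string attribute. Removing the <small> child first leaves the anchor with a single text node, so .string returns it: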

from bs4 import BeautifulSoup
soup = BeautifulSoup('<li><a href="example"> s.r.o., <small>small</small></a></li>', 'html.parser')
res = soup.find('a')
res.small.decompose()    # removes <small> from the tree (this modifies soup)
print(res.string)
# s.r.o.,

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
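Since decompose() modifies the parsed tree, here is a sketch of a non-destructive variant, assuming the wanted text is the first text node inside the anchor. It uses the `stripped_strings` generator, which walks all descendant text nodes in document order with surrounding whitespace removed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(html, 'html.parser')
anchor = soup.find('a')

# Take only the first whitespace-stripped text node; None if the tag has no text.
first_text = next(anchor.stripped_strings, None)

# The <small> element is still in the tree afterwards.
still_there = anchor.small is not None
print(first_text, still_there)
```

Note that if the anchor began with the <small> element, this would return the text of <small> instead, because stripped_strings does not distinguish direct children from deeper descendants.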

【Discussion】: