Extract content within a tag with BeautifulSoup

Posted 2023-02-23


【Title】: Extract content within a tag with BeautifulSoup 【Posted】: 2011-08-25 08:36:24 【Question】:

I want to extract the content Hello world. Note that the page also contains multiple <table> elements and similar <td colspan="2"> cells:

<table border="0" cellspacing="2" >
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
...

I tried the following:

hello = soup.find(text='Name: ')
hello.findPreviousSiblings

But it returned nothing.

I am also having trouble extracting My home address:

<td><b>Address:</b></td>

<td>My home address</td>

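I am using the same approach to search for text="Address: ", but how do I navigate down to the next line and extract the content of the <td>?

【Question comments】:

【Answer 1】:

Use the following code to extract the text content of an HTML tag with Python and BeautifulSoup: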

from bs4 import BeautifulSoup

s = '<td>Example information</td>'  # your raw HTML
soup = BeautifulSoup(s, 'html.parser')  # parse the HTML with BeautifulSoup
td = soup.find('td')  # tag of interest: <td>Example information</td>
td.text  # 'Example information' (plain text with the markup removed)

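【Comments】:

Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you have made.

I decided to use .text because the user wants to extract plain text from the HTML. After parsing the HTML with the Beautiful Soup Python library, the user can locate the tag or HTML element of interest by "id", "class", or any other identifier. To get the plain text inside any selected tag, the user can then use .text on that tag, as described above.

【Answer 2】: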

from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag):
    return ''.join([i.decode() if type(i) is Tag else i for i in tag.contents])
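A minimal usage sketch for the helper above, assuming bs4 is installed and using the built-in `html.parser`; the sample markup is taken from the question, and the variable names `inner_html` and `plain_text` are illustrative only. It contrasts the helper, which keeps nested markup, with `get_text()`, which strips it. The helper is restated with `isinstance` so the block runs on its own.

```python
from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag) -> str:
    # Nested tags are serialized back to markup; plain strings are kept as text.
    return ''.join(
        child.decode() if isinstance(child, Tag) else str(child)
        for child in tag.contents
    )

soup = BeautifulSoup('<td colspan="2"><b>Name: </b>Hello world</td>', 'html.parser')
td = soup.find('td')

inner_html = get_tag_html(td)  # markup inside <td>, including the <b> tag
plain_text = td.get_text()     # text only, with all tags removed

print(inner_html)
print(plain_text)
```

If only the inner markup is needed, bs4's built-in `Tag.decode_contents()` is likely a simpler route to a similar result.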

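【Comments】:

Hello, and welcome to SO! While this code may answer the question, providing additional context about how and/or why it solves the problem would improve the answer's long-term value. Please read the tour and How do I write a good answer?

【Answer 3】:

The contents attribute works for extracting text from <tag>text</tag>. Note that it returns a list of the tag's children rather than a single string.

<td>My home address</td> example:

from bs4 import BeautifulSoup

s = '<td>My home address</td>'
soup = BeautifulSoup(s, 'html.parser')
td = soup.find('td')  # <td>My home address</td>
td.contents  # ['My home address'] (a list of the tag's children)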

<td><b>Address:</b></td> example:

from bs4 import BeautifulSoup

s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s, 'html.parser')
b = soup.find('td').find('b')  # <b>Address:</b>
b.contents  # ['Address:']
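For the cell in the question, which mixes a `<b>` label with trailing text, `contents` has more than one entry. The sketch below assumes bs4 with the built-in `html.parser` and compares `contents`, `.string`, and `get_text()` on that cell; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td colspan="2"><b>Name: </b>Hello world</td>', 'html.parser')
td = soup.find('td')

children = td.contents            # two children: the <b> tag and the trailing string
last_text = str(td.contents[-1])  # the text node that follows </b>
only_string = td.string           # None, because <td> has more than one child
all_text = td.get_text()          # all text in the cell, label included

print(children, last_text, only_string, all_text, sep='\n')
```

Indexing into `contents` depends on the exact layout of the cell, so it is worth checking the list length before relying on a fixed position.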

【Comments】:

【Answer 4】:

Use next instead:

>>> s = '<table border="0" cellspacing="2" ><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'

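next and previous let you move through the document's elements in the order the parser processed them, whereas the sibling methods navigate the parse tree.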
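The same idea in current bs4 can be sketched as follows, assuming the built-in `html.parser` and a small test table assembled from the question's two fragments (the combined table is an assumption, not the asker's real page). In bs4, `next_element` is the current name for `.next`, and `string=` is the newer spelling of the `text=` argument. The sketch also shows why the original attempt found nothing: the 'Name: ' string sits alone inside `<b>`, so it has no siblings of its own, and the wanted text is the next sibling of the enclosing `<b>`.

```python
from bs4 import BeautifulSoup

# Assumed test document built from the question's fragments.
html = """
<table>
  <tr><td colspan="2"><b>Name: </b>Hello world</td></tr>
  <tr><td><b>Address:</b></td><td>My home address</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

name_label = soup.find(string='Name: ')
name_by_order = name_label.next_element        # next node in document order
name_by_tree = name_label.parent.next_sibling  # sibling of the enclosing <b>

# For the address, jump from the label to the next <td> in document order.
address_label = soup.find(string='Address:')
address = address_label.find_next('td').get_text(strip=True)

print(name_by_order, name_by_tree, address, sep='\n')
```

`find_next('td')` skips over any whitespace strings between cells, which makes it less fragile than stepping through siblings one at a time.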

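【Comments】:

It returns nothing. hello = soup.find(text='Name: ') hello.next

Does "Name: " appear anywhere else in the document?

Sorry for the multiple comments; I didn't know that the return key actually posts the comment. I was wondering whether there is a better way to do this, in case similar "Name: " text appears elsewhere.

You can check hello.parent.parent.name or hello.parent.parent.attrs, or anything else you can lock onto.

Would you mind giving a short example that illustrates parent.parent.name?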
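One way the parent check suggested above might look is sketched below, assuming bs4 with the built-in `html.parser`. The extra `<p>` paragraph containing a second "Name: " label is a made-up decoy added for illustration, and the `colspan` test is just one possible attribute to lock onto.

```python
from bs4 import BeautifulSoup

# Assumed test document: the <p> line is a hypothetical decoy with the same label.
html = """
<p><b>Name: </b>Not this one</p>
<table><tr><td colspan="2"><b>Name: </b>Hello world</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

matches = []
for label in soup.find_all(string='Name: '):
    cell = label.parent.parent  # label -> <b> -> enclosing element
    # Keep only labels whose grandparent is the <td colspan="2"> cell.
    if cell.name == 'td' and cell.attrs.get('colspan') == '2':
        matches.append(str(label.next_element))

print(matches)
```

The same filter could use any other distinguishing feature of the surrounding markup, such as a class on the table.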