Python BeautifulSoup: extracting text between elements

Posted 2023-02-23

【Title】: Python BeautifulSoup extract text between element 【Posted】: 2013-05-25 23:39:58 【Question】:

I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

This is what I tried:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

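But I get all the text from all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

【Comments】:

I was also looking for this, so that I could get the post's string for use elsewhere. I found this simple approach: if the soup is disposable, you can remove the tags with soup.html.unwrap() and soup.body.unwrap(), so that print(soup) gives you everything except those tags.

【Answer 1】:

Use .children instead: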

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT
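
For readers on Python 3 with the bs4 package, the same idea might be sketched as below. The helper name own_text and the choice of the built-in "html.parser" are assumptions made for illustration; the stray </br> from the question is left out so the result does not depend on how a parser repairs it.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

HTML = """<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
</body>
</html>"""


def own_text(tag):
    """Join the strings that are direct children of `tag`, skipping comments.

    `own_text` is a hypothetical helper name, not a BeautifulSoup API.
    """
    return "".join(
        str(child)
        for child in tag.children
        # Comment is a subclass of NavigableString, so it must be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ).strip()


# "html.parser" is an assumed parser choice; other parsers may build a slightly different tree.
soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(own_text(hit))
```

Stripping after the join also discards the whitespace-only strings that sit between the child tags.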

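【Comments】:

@CristianCiupitu: Of course, you are right, I was not paying attention here. Updating.

This is the only solution that does not depend on the order of the text or on its position relative to specific other elements. Instead it extracts all the text from the specified tag/element while ignoring the text (or other content) of its child tags/elements.

Thanks! It is awkward, but it works and solved my problem (I am not the OP, but I had a similar need).

【Answer 2】:

You can use .contents:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):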
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT
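
To see where the index 6 comes from, one can list every direct child together with its position. This is only a sketch, assuming Python 3, the bs4 package, and the built-in "html.parser"; the HTML is the <td> from the question without the stray </br>, and a different parser or slightly different markup would shift the indices.

```python
from bs4 import BeautifulSoup

HTML = """<td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>"""

# "html.parser" is an assumed parser choice for this illustration.
soup = BeautifulSoup(HTML, "html.parser")
hit = soup.find(attrs={"class": "MYCLASS"})

# Whitespace between tags is a child too, which is why the text lands at index 6.
for index, child in enumerate(hit.contents):
    print(index, type(child).__name__, repr(str(child)))

target = hit.contents[6].strip()
print(target)
```

Because the position depends on every node that precedes the text, this approach only suits markup whose layout is fixed.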

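【Comments】:

Thanks, but the text is not always in the same place. Will it work anyway?

@ɥɔǝnqɹǝƃloɥ Alas, no. Perhaps use one of the other answers.

What does the number 6 mean?

@User .contents returns a list, and we take the 7th element of that list (index 6), which is the text.

【Answer 3】:

Read up on how to navigate through the parse tree in BeautifulSoup. The parse tree has Tags and NavigableStrings (the latter because this is text). An example: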

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

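To move down the parse tree you have contents and string.

contents is an ordered list of the Tag and NavigableString objects contained within a page element.

If a tag has only one child node, and that child node is a string, the child node is made available as tag.string as well as tag.contents[0].

For the example above, that means you can get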

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

With several child nodes you can have, for example,

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

So here you can work with contents and pick out the content at the index you want.

You can also iterate over a tag, which is a shortcut. For example,

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
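
The snippets above use the old BeautifulSoup 3 import and Python 2 syntax. A rough equivalent for Python 3 and the bs4 package could look like this; it is a sketch under the assumption that "html.parser" is used, and the <p> tags are closed explicitly so the result does not rely on parser repair.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with closing </p> and </body> tags added (an assumption for this sketch).
DOC = (
    "<html><head><title>Page title</title></head>"
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    "</body></html>"
)

soup = BeautifulSoup(DOC, "html.parser")

print(soup.b.string)       # the single string child of the first <b>
print(soup.b.contents[0])  # the same node, reached through .contents
print(soup.p.contents)     # strings and tags mixed, in document order
print(soup.p.string)       # None, because the first <p> has more than one child

for child in soup.body:    # iterating a tag walks its direct children
    print(child)
```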

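【Comments】:

hit.string is None and hit.contents[0] is u'\n', so please provide an answer for the example in the question.

"So here you can work with contents and pick out the content at the index you want" is the answer to the question.

【Answer 4】:

The BeautifulSoup documentation provides an example of removing objects from a document with the extract method. In the following example the goal is to remove all comments from the document:

Removing elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document: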

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
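
Applied to the HTML from the question, the same technique might look like the sketch below. It assumes Python 3 and bs4 4.4 or later (for the string= argument) with the built-in "html.parser". Removing the comment alone does not isolate the wanted text, since get_text() still returns the text of the nested tags.

```python
from bs4 import BeautifulSoup, Comment

HTML = """<td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>"""

# "html.parser" is an assumed parser choice for this illustration.
soup = BeautifulSoup(HTML, "html.parser")

# Collect every Comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
for comment in comments:
    comment.extract()

remaining = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments), len(remaining))

# The nested <a> and <p> text is still part of the cell's text.
print(soup.td.get_text())
```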

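【Comments】:

【Answer 5】:

Short answer: soup.findAll('p')[0].next.next

Real answer: you need an invariant reference point from which you can reach your target.

You mentioned in a comment on Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree along that invariant path.

For example, in the HTML you provided in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds the paragraph elements, soup.findAll('p')[0] will be the first paragraph element.

In this case you could use soup.find('p') instead, but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something similar.

The next attribute is the next parsed element in the tree, children included. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns your target in the HTML provided.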
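
In bs4 the same walk is usually spelled with next_element. The following is a minimal sketch, assuming Python 3, the bs4 package, and the built-in "html.parser"; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

HTML = """<td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>"""

# "html.parser" is an assumed parser choice for this illustration.
soup = BeautifulSoup(HTML, "html.parser")

first_p = soup.find_all("p")[0]   # the invariant reference point
inside = first_p.next_element     # the string inside the paragraph
after = inside.next_element       # the first node parsed after the paragraph

print(repr(str(inside)))
print(after.strip())
```

The two hops only work because the first paragraph contains exactly one string; an empty paragraph would need a single hop.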

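【Comments】:

【Answer 6】:

Using your own soup object:

soup.p.next_sibling.strip()

soup.p grabs the <p> directly (this depends on it being the first <p> in the parse tree)*

next_sibling is then used on the Tag object that soup.p returns, because the wanted text sits at the same level of the parse tree as that <p>

.strip() removes the leading and trailing whitespace

*Otherwise, just find the element using your choice of filter(s).

In the interpreter, this looks like: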

In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
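
When the wanted <p> is not the first one in the tree, a filter can select the anchor before next_sibling is applied. This is a sketch assuming Python 3, bs4 4.4 or later (for the string= argument), and the built-in "html.parser"; using the paragraph text "something" as the anchor is an assumption about what stays constant in the real page.

```python
from bs4 import BeautifulSoup

HTML = """<td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>"""

# "html.parser" is an assumed parser choice for this illustration.
soup = BeautifulSoup(HTML, "html.parser")

# Narrow the search to the cell first, then to the paragraph whose text is exactly "something".
cell = soup.find("td", class_="MYCLASS")
anchor = cell.find("p", string="something")

# The text node right after the anchor paragraph is its next sibling.
target = anchor.next_sibling.strip()
print(target)
```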

【Comments】:

Could you add some explanatory text on how this answers the question?

【Answer 7】:

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
  hit = hit.text.strip()
  print hit

This will print: THIS IS MY TEXT. Try this.

【Comments】: