What am I doing wrong? Parsing HTML using lxml

Posted 2023-03-05

【Posted】: 2015-02-19 09:11:15 【Question】:

I'm trying to parse a web page with lxml, and I'm having trouble getting back all the text elements inside a div. Here is what I have so far...

import requests
from lxml import html
page = requests.get("https://www.goodeggs.com/sfbay/missionheirloom/seasonal-chicken-stew-16oz/53c68de974e06f020000073f",verify=False)
tree = html.fromstring(page.text)
foo = tree.xpath('//section[@class="product-description"]/div[@class="description-body"]/text()')

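So far, foo comes back as an empty list []. Other pages return some content, but not everything that sits inside tags within the <div>. Still other pages return everything, because the content is at the top level of the div.

How can I get all the text content inside that div? Thanks!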

【Answer 1】:

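The text sits inside two <p> tags, so each part of it is in a p.text rather than in div.text. The XPath step /text() selects only text nodes that are direct children of the <div>, which is why the list comes back empty. However, you can extract all the text from all descendants of the <div> by calling the text_content method instead of using XPath text():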

import requests
import lxml.html as LH
url = ("https://www.goodeggs.com/sfbay/missionheirloom/" 
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)

path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    print(div.text_content())

which yields

We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
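The difference can be reproduced offline. The following is a minimal sketch, assuming a made-up HTML snippet (the `snippet` string and its paragraph text are hypothetical stand-ins, not the real page) that has the same section/div/p nesting as described above:

```python
from lxml import html

# Hypothetical stand-in for the real page: same nesting, invented text.
snippet = (
    '<section class="product-description">'
    '<div class="description-body">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div></section>'
)

root = html.fromstring(snippet)
path = '//section[@class="product-description"]/div[@class="description-body"]'
div = root.xpath(path)[0]

# text() selects only text nodes that are direct children of the div.
# Here all the text lives inside the <p> children, so nothing matches.
direct_text = div.xpath('text()')

# text_content() gathers the text of the div and all its descendants.
all_text = div.text_content()

print(direct_text)
print(all_text)
```

Note that text_content() concatenates the paragraphs with no separator, which is also why "love it!" runs straight into "Mission:" in the output above.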

PS. dfsq has already suggested using the XPath ...//text(). That works too, but unlike text_content, it returns the text fragments as separate items:

In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']

In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']
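If the list form is what you have, the pieces can be glued back together. A small sketch using the same toy markup as above; the variable names are only illustrative, and the XPath string() function is shown as one more way to get a single string:

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')
a = root.xpath('//a')[0]

# Text nodes under <a>, in document order, as separate items.
pieces = a.xpath('.//text()')

# Joining the pieces with an empty string gives one string.
joined = ''.join(pieces)

# The XPath string() function returns the element's string value,
# i.e. the concatenated text of all its descendants.
via_string = a.xpath('string()')

# Optional: collapse runs of whitespace into single spaces.
normalized = ' '.join(joined.split())

print(joined == a.text_content() == via_string)
print(normalized)
```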

【Comments】:

Yes, this is exactly what I needed! Thanks. The //text() approach works, but getting the pieces back as a list doesn't suit my use case.

【Answer 2】:

I think the XPath expression should be:

//section[@class="product-description"]/div[@class="description-body"]//text()

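UPD. As @unutbu pointed out, the expression above returns the text nodes as a list, so you have to iterate over them. If you need the whole text content as a single string, see unutbu's answer for other options.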
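Iterating over the text nodes might look like the sketch below. It assumes a hypothetical snippet (invented text, same class names as in the question) and simply strips each node and drops the whitespace-only ones, so the paragraphs can be kept on separate lines:

```python
from lxml import html

# Hypothetical snippet; the real page content is not reproduced here.
snippet = (
    '<section class="product-description">'
    '<div class="description-body">'
    '<p>First paragraph. </p>\n<p> Second paragraph.</p>'
    '</div></section>'
)

root = html.fromstring(snippet)
expr = ('//section[@class="product-description"]'
        '/div[@class="description-body"]//text()')

# //text() returns every descendant text node, including whitespace-only ones.
nodes = root.xpath(expr)

# Strip each node and skip the ones that are empty after stripping.
lines = [t.strip() for t in nodes if t.strip()]

# Join with a newline so the paragraphs stay visually separate.
text = '\n'.join(lines)
print(text)
```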
