Beautifulsoup + HTML...how to ignore a few h3 classes

Posted 2023-02-23

Asked: 2020-09-20 02:11:47

Question:

I have the following code:

from bs4 import BeautifulSoup
import requests
import re

source = requests.get('https://tienda.mimo.com.ar/mimo/junior/ropa-para-ninas.html').text

soup = BeautifulSoup(source, 'lxml')

for name_product, old_price, special_price in zip(soup.select('h3', class_='titprod'), 
                                                  soup.select('span[id^="old-price"]'),
                                                  soup.select('span[id^="product-price"]')):
    print(f'Name: {name_product.text.strip()} |  Old price = {old_price.text.strip()} | Discounted price = {special_price.text.strip()}')

Output:

Name: Para acceder a la promoción seleccione el banco y la tarjeta de crédito que corresponda |  Old price = $ 295 | Discounted price = $ 236
Name: ¡Gracias por suscribirte al newsletter! |  Old price = $ 990 | Discounted price = $ 743
Name: Elegí por talle |  Old price = $ 2.300 | Discounted price = $ 1.725
Name: TAPABOCAS |  Old price = $ 1.550 | Discounted price = $ 1.163
Name: REMERA JR TOWN |  Old price = $ 2.990 | Discounted price = $ 2.243
Name: CAMISOLA NENA DELFI |  Old price = $ 1.990 | Discounted price = $ 1.493

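As you can see, instead of taking the correct product names, it picks up the text of the headings at the top of the page (Name: Para acceder a la promoción seleccione el banco y la tarjeta de crédito que corresponda...), which use the same CSS selector (titprod). I don't know how to go deeper into the LI class (marked with the black box) to get the correct product name (marked with the red box). As a result, the lists are misaligned, and the prices on each row do not correspond to the product name.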

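Comments:

I can't access the site, but try changing soup.select('h3', class_='titprod') to soup.select('h3.titprod'). The select() method has no class_= parameter.

It worked perfectly @AndrejKesely ... just for learning purposes, why was it picking up the ones I didn't want?

I'll write an answer with an explanation.

Please don't post images of code, data, or tracebacks. Copy and paste it as text, then format it as code (select it and type ctrl-k)... Discourage screenshots of code and/or errors

Answer 1: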

I had to change

soup.select('h3', class_='titprod'),

to

soup.select('h3.titprod')

as @AndrejKesely suggested in the comments, and it works perfectly.

Answer 2:

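The problem is that you are using the select() method incorrectly. To select all <h3> elements with class="titprod", you need to write soup.select('h3.titprod').

The class_= parameter belongs to the .find() and .find_all() functions (which do not use CSS selectors).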
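To make the difference concrete, here is a minimal sketch on a small inline document. The markup is a made-up stand-in, not the real page, and it uses the built-in html.parser so that lxml is not required. It compares a bare 'h3' selector, the CSS form 'h3.titprod', and the find_all() form with class_=.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page):
# two unrelated <h3> headings followed by two product titles.
html = """
<h3>Promo banner</h3>
<h3 class="other">Newsletter</h3>
<ul>
  <li><h3 class="titprod">TAPABOCAS</h3></li>
  <li><h3 class="titprod">REMERA JR TOWN</h3></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare 'h3' selector matches every <h3>, so unrelated headings come along.
all_h3 = [h.get_text(strip=True) for h in soup.select("h3")]

# CSS selector form: tag name, a dot, then the class name.
via_select = [h.get_text(strip=True) for h in soup.select("h3.titprod")]

# find_all() form: here class_= is a real keyword argument.
via_find_all = [h.get_text(strip=True)
                for h in soup.find_all("h3", class_="titprod")]

print(all_h3)
print(via_select)
print(via_find_all)
```

Both filtered forms should return the same two product titles, while the bare selector also returns the two headings at the top.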

See .find_all() in the bs4 documentation.

See the CSS selectors examples in the bs4 documentation.
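On the part of the question about going deeper into the LI elements: one possible approach, sketched below, is to iterate over each product container and look up the name and both prices inside it, rather than zipping three page-wide lists that can drift out of step. The li.item class, the product names, and the prices are all hypothetical placeholders, since the real page structure is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): a stray heading that shares the
# titprod class, followed by two product containers.
html = """
<h3 class="titprod">Promo banner</h3>
<ul>
  <li class="item">
    <h3 class="titprod">Product A</h3>
    <span id="old-price-1">$ 100</span>
    <span id="product-price-1">$ 80</span>
  </li>
  <li class="item">
    <h3 class="titprod">Product B</h3>
    <span id="old-price-2">$ 200</span>
    <span id="product-price-2">$ 150</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("li.item"):
    # Search only inside this container, so name and prices stay paired.
    name = item.select_one("h3.titprod")
    old = item.select_one('span[id^="old-price"]')
    new = item.select_one('span[id^="product-price"]')
    # Skip containers that are missing any of the three pieces.
    if name is None or old is None or new is None:
        continue
    rows.append((name.get_text(strip=True),
                 old.get_text(strip=True),
                 new.get_text(strip=True)))

for name, old, new in rows:
    print(f"Name: {name} | Old price = {old} | Discounted price = {new}")
```

Because the stray heading sits outside any li.item container, it is never visited, even though it carries the same class.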
