Scraping <li> from an array of <ul> when some of the <ul> have no <li> children

Posted 2023-03-05


【Title】: Scraping <li> from an array of <ul> when some of the <ul> have no <li> children 【Published】: 2021-09-01 12:04:19 【Question】:

I am trying to web scrape data, some of which sits in <li> elements. I think the problem is that some of the <ul> parents have no <li> children.

A sample of the HTML is below:

<div class="tab-pane predefined-carrier-DPDUK ">
    <img src="https://assets.easypost.com/assets/images/carriers/dpd-logo.c4b107116e903920a5794e69e1990827.svg" >
    <ul>
        <li>Parcel</li>
        <li>Pallet</li>
        <li>ExpressPak</li>
        <li>FreightParcel</li>
        <li>Freight</li>
    </ul>
</div>
<div class="tab-pane predefined-carrier-ChinaEMS ">
    <img src="https://assets.easypost.com/assets/images/carriers/china-ems-logo-ca.0c938786bd8d8f141e8fa9337a3362a4.png" >
    <p>No predefined packages for EMS.</p>
    <ul></ul>
</div>
<div class="tab-pane predefined-carrier-Estafeta ">
    <img src="https://assets.easypost.com/assets/images/carriers/estafeta-logo-ca.886242ba90c68a1d68f0e4e5a3a14419.png" >
    <ul>
        <li>ENVELOPE</li>
        <li>PARCEL</li>
    </ul>
</div>

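So some of the <ul> return nothing, i.e. they have no <li> children. I have come up with a couple of "solutions".

This one should loop through each <ul>, but it always fails in the second try block, so no <li> are returned: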

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content

soup= BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class":lambda L: L and L.startswith("tab-pane predefined-carrier")})

for item in all:
    print("\n".join([img['alt'] for img in item.find_all('img', alt=True)]))   
    
    try:
        print(item.find("p").text)
    except:
        print("HAS PACKAGES")
    
    try:
        for ul in all.find_all("ul"):
            for litag in ultag.find_all("li"):
                print(litag.text)
    except:
        print("has no list items")
    
    print("")

The result set looks like this:

DPD UK
HAS PACKAGES
has no list items

EMS
No predefined packages for EMS.
has no list items

Estafeta
HAS PACKAGES
has no list items

The second solution does return the <li>, but I can't work out how to get each <li> printed on a new line:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content

soup= BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class":lambda L: L and L.startswith("tab-pane predefined-carrier")})

for item in all:
    print("\n".join([img['alt'] for img in item.find_all('img', alt=True)]))   
    
    try:
        print(item.find("p").text)
    except:
        print("HAS PACKAGES")
    
    try:
        print(item.find_all("ul")[0].text)
    except:
        pass
    print("")

The result set looks something like this:

DPD UK
HAS PACKAGES
ParcelPalletExpressPakFreightParcelFreight

EMS
No predefined packages for EMS.


Estafeta
HAS PACKAGES
ENVELOPEPARCEL

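Hopefully someone can put me on the right path. TIA

【Comments】:

It looks like the first one has a typo; it should be for ul in item.find_all("ul"):

Thanks for the reply. I tried that, but the data returned then looks like this: > DPD > No predefined packages for DPD. > Contact support > Contact sales > > DPD UK > HAS PACKAGES > Contact support > Contact sales > > EMS > No predefined packages for EMS. > Talk to support > Contact sales

【Answer 1】: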

This might help you. I extract the <ul> and check whether it contains any <li> tags. Only then do I print the contents of those <li> tags, each on a new line.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content

soup= BeautifulSoup(c, "html.parser")

div_tag = soup.find_all("div", {"class":lambda L: L and L.startswith("tab-pane predefined-carrier")})

for i in div_tag:
    uls = i.find('ul')
    # Checks if <ul> is empty. That is no <li> tags present
    if uls.text != '':
        # Gets all the <li> from <ul>
        li_tags = uls.find_all('li')
        # print the text of <li> one after other in new line
        for item in li_tags:
            print(item.text)
        print('\n')

Sample Output:

Parcel
Pallet
ExpressPak
FreightParcel
Freight

ENVELOPE
PARCEL

Parcel
Satchel
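
The same idea can be tried without hitting the network. The following is only a minimal sketch, not the answerer's code: it runs against a trimmed copy of the HTML sample from the question, the function name list_items_per_div is made up for illustration, and the last <div> (a carrier with no <ul> at all) is a hypothetical addition to show the case where find returns None. It tests for <li> tags directly rather than comparing the text with an empty string, on the assumption that an "empty" <ul> may still contain whitespace.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML sample from the question. The last <div> is a
# hypothetical addition: a carrier block that has no <ul> element at all.
SAMPLE_HTML = """
<div class="tab-pane predefined-carrier-DPDUK ">
    <ul>
        <li>Parcel</li>
        <li>Pallet</li>
    </ul>
</div>
<div class="tab-pane predefined-carrier-ChinaEMS ">
    <p>No predefined packages for EMS.</p>
    <ul></ul>
</div>
<div class="tab-pane predefined-carrier-Estafeta ">
    <ul>
        <li>ENVELOPE</li>
        <li>PARCEL</li>
    </ul>
</div>
<div class="tab-pane predefined-carrier-NoList ">
    <p>Hypothetical carrier without a ul element.</p>
</div>
"""


def list_items_per_div(html):
    """Return one list of <li> texts for each carrier <div> that has any."""
    soup = BeautifulSoup(html, "html.parser")
    # Match any <div> that carries a class starting with the carrier prefix.
    divs = soup.find_all(
        "div", class_=lambda c: c and c.startswith("predefined-carrier-")
    )
    groups = []
    for div in divs:
        ul = div.find("ul")
        if ul is None:
            continue  # this <div> has no <ul> at all
        items = [li.get_text(strip=True) for li in ul.find_all("li")]
        if items:  # an empty <ul> gives an empty list and is skipped
            groups.append(items)
    return groups


# Print each <li> on its own line, with a blank line between carriers.
for group in list_items_per_div(SAMPLE_HTML):
    print("\n".join(group))
    print()
```

Returning plain lists instead of printing inside the loop keeps the extraction separate from the output format, which makes it easier to reuse inside a larger loop.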

【Comments】:

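Great, thank you! It works fine on its own, but when I plug it into my for item in all: block I get AttributeError: 'int' object has no attribute 'text'. I changed the first line of the code to for i in item:, and that is when I get the error above. If I change it to for i in all:, I get all the list items printed for every item in the all variable.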
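
A likely cause of that error: iterating over a single tag (for i in item:) yields its child nodes, which include plain text strings. Beautiful Soup's text nodes are str subclasses, so calling .find('ul') on one of them runs the ordinary string search and returns an integer, which has no .text attribute. Inside an existing per-div loop no second loop is needed; the <ul> can be looked up on the current item directly. The sketch below assumes the question's loop structure; the function name describe_carriers is made up for illustration, and the alt attributes in the sample are an assumption based on the carrier names shown in the question's output (the HTML sample in the question does not show them).

```python
from bs4 import BeautifulSoup

# Trimmed sample. The alt attributes are assumed from the carrier names in
# the question's printed output; the shortened src values are placeholders.
SAMPLE_HTML = """
<div class="tab-pane predefined-carrier-DPDUK ">
    <img src="dpd-logo.svg" alt="DPD UK">
    <ul>
        <li>Parcel</li>
        <li>Pallet</li>
    </ul>
</div>
<div class="tab-pane predefined-carrier-ChinaEMS ">
    <img src="china-ems-logo.png" alt="EMS">
    <p>No predefined packages for EMS.</p>
    <ul></ul>
</div>
"""


def describe_carriers(html):
    """Return the output lines for every carrier <div>, one string per line."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all(
        "div", class_=lambda c: c and c.startswith("predefined-carrier-")
    )
    lines = []
    for item in divs:
        # Carrier name, if the logo has an alt attribute.
        img = item.find("img", alt=True)
        if img is not None:
            lines.append(img["alt"])
        # The <p> note only exists for carriers without predefined packages.
        note = item.find("p")
        lines.append(note.get_text(strip=True) if note else "HAS PACKAGES")
        # Look up the <ul> on the current item; do not iterate over its children.
        ul = item.find("ul")
        if ul is not None:
            lines.extend(li.get_text(strip=True) for li in ul.find_all("li"))
        lines.append("")  # blank line between carriers
    return lines


print("\n".join(describe_carriers(SAMPLE_HTML)))
```

Explicit None checks replace the bare try/except blocks from the question, so a genuine bug such as a misspelled variable name raises an error instead of being reported as "has no list items".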
