Scraping <li> from an array of <ul> when some of the <ul> have no <li> children
【Posted】: 2021-09-01 12:04:19
【Question】:
I am trying to scrape data from a web page, some of which is in <li> elements. I think the problem is that some of the <ul> parents have no <li> children.
A sample of the HTML follows:
<div class="tab-pane predefined-carrier-DPDUK ">
    <img src="https://assets.easypost.com/assets/images/carriers/dpd-logo.c4b107116e903920a5794e69e1990827.svg" >
    <ul>
        <li>Parcel</li>
        <li>Pallet</li>
        <li>ExpressPak</li>
        <li>FreightParcel</li>
        <li>Freight</li>
    </ul>
</div>
<div class="tab-pane predefined-carrier-ChinaEMS ">
    <img src="https://assets.easypost.com/assets/images/carriers/china-ems-logo-ca.0c938786bd8d8f141e8fa9337a3362a4.png" >
    <p>No predefined packages for EMS.</p>
    <ul></ul>
</div>
<div class="tab-pane predefined-carrier-Estafeta ">
    <img src="https://assets.easypost.com/assets/images/carriers/estafeta-logo-ca.886242ba90c68a1d68f0e4e5a3a14419.png" >
    <ul>
        <li>ENVELOPE</li>
        <li>PARCEL</li>
    </ul>
</div>
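So some of the <ul> elements return no results, i.e. they have no <li> children. I have come up with a couple of "solutions".
This one is supposed to iterate over each <ul>, but it always fails at the second try, so no <li> is returned: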
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": lambda L: L and L.startswith("tab-pane predefined-carrier")})
for item in all:
    print("\n".join([img['alt'] for img in item.find_all('img', alt=True)]))
    try:
        print(item.find("p").text)
    except:
        print("HAS PACKAGES")
    try:
        for ul in all.find_all("ul"):
            for litag in ultag.find_all("li"):
                print(litag.text)
    except:
        print("has no list items")
    print("")
The output looks like this:
DPD UK
HAS PACKAGES
has no list items
EMS
No predefined packages for EMS.
has no list items
Estafeta
HAS PACKAGES
has no list items
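A likely reason every carrier prints "has no list items" in the run above: `all` is the whole result list rather than a single tag, so `all.find_all("ul")` raises an AttributeError (and `ultag` is never defined), and the bare `except:` hides that. The sketch below is only an illustration under assumptions: it uses an inline copy of the sample HTML instead of the live page, and the names `html`, `divs`, `errors` and `counts` are made up for the example. It catches the specific exception so the real cause becomes visible.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page (assumption: same structure as the sample).
html = """
<div class="tab-pane predefined-carrier-DPDUK ">
  <ul><li>Parcel</li><li>Pallet</li></ul>
</div>
<div class="tab-pane predefined-carrier-ChinaEMS ">
  <p>No predefined packages for EMS.</p>
  <ul></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="tab-pane")

errors = []
for item in divs:
    try:
        # Same mistake as the first attempt: find_all is called on the
        # result list instead of on the current item.
        for ul in divs.find_all("ul"):
            pass
    except AttributeError as exc:
        # Catching the specific exception shows what actually went wrong.
        errors.append(type(exc).__name__)
        print("error:", exc)

# Calling find_all on the current item works, even for an empty <ul>:
# it just returns an empty list.
counts = [len(item.find_all("li")) for item in divs]
print(counts)
```

Because `find_all` returns an empty list when nothing matches, the empty `<ul>` needs no exception handling at all once the loop uses `item`.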
The second solution does return the <li> items, but I can't figure out how to print each <li> on a new line:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": lambda L: L and L.startswith("tab-pane predefined-carrier")})
for item in all:
    print("\n".join([img['alt'] for img in item.find_all('img', alt=True)]))
    try:
        print(item.find("p").text)
    except:
        print("HAS PACKAGES")
    try:
        print(item.find_all("ul")[0].text)
    except:
        pass
    print("")
The output looks something like this:
DPD UK
HAS PACKAGES
ParcelPalletExpressPakFreightParcelFreight
EMS
No predefined packages for EMS.
Estafeta
HAS PACKAGES
ENVELOPEPARCEL
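The run-together output above comes from taking `.text` of the whole `<ul>`, which concatenates its strings with no separator. One way to put each item on its own line, sketched here against an inline snippet (the `html` string and variable names are assumptions for illustration), is `get_text` with a separator, or joining the text of each `<li>` explicitly.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one populated <ul> and one empty <ul>.
html = "<ul><li>Parcel</li><li>Pallet</li><li>ExpressPak</li></ul><ul></ul>"
soup = BeautifulSoup(html, "html.parser")
first_ul, empty_ul = soup.find_all("ul")

# Option 1: let BeautifulSoup insert a newline between text nodes.
via_get_text = first_ul.get_text(separator="\n", strip=True)

# Option 2: join the text of each <li> explicitly.
via_join = "\n".join(li.get_text(strip=True) for li in first_ul.find_all("li"))

print(via_get_text)

# An empty <ul> simply yields an empty string, so no try/except is needed.
empty_text = empty_ul.get_text(separator="\n", strip=True)
```

Option 2 is a little more explicit about which elements contribute text, which may matter if a `<ul>` ever contains anything besides `<li>` items.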
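Hopefully someone can put me on the right path, TIA
【Comments】:
It looks like the first one has a typo; it should be for ul in item.find_all("ul"):
Thanks for the reply. I tried that, but then the data returned looks like this: > DPD > No predefined packages for DPD. > Contact Support > Contact Sales > > DPD UK > HAS PACKAGES > Contact Support > Contact Sales > > EMS > No predefined packages for EMS. > Talk to Support > Contact Sales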
【Answer 1】:
This might help you. I extract the <ul> and check whether it has any <li> tags. Only then do I print the contents of those <li> tags, each on a new line.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.easypost.com/docs/api#parcels", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c, "html.parser")
div_tag = soup.find_all("div", {"class": lambda L: L and L.startswith("tab-pane predefined-carrier")})
for i in div_tag:
    uls = i.find('ul')
    # Checks if <ul> is empty. That is no <li> tags present
    if uls.text != '':
        # Gets all the <li> from <ul>
        li_tags = uls.find_all('li')
        # print the text of <li> one after other in new line
        for item in li_tags:
            print(item.text)
        print('\n')
Sample Output:
Parcel
Pallet
ExpressPak
FreightParcel
Freight
ENVELOPE
PARCEL
Parcel
Satchel
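Since `find_all` returns an empty list when nothing matches, another way to handle the empty `<ul>` case is to ask each `<div>` for its `<li>` descendants directly, with no emptiness check. The sketch below is one possible arrangement and not the answer's own method: the helper name `packages_by_carrier`, the `PREFIX` constant, and the inline `html` are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page (assumption: same structure as the sample).
html = """
<div class="tab-pane predefined-carrier-DPDUK "><ul><li>Parcel</li><li>Pallet</li></ul></div>
<div class="tab-pane predefined-carrier-ChinaEMS "><p>No predefined packages for EMS.</p><ul></ul></div>
<div class="tab-pane predefined-carrier-Estafeta "><ul><li>ENVELOPE</li><li>PARCEL</li></ul></div>
"""

PREFIX = "predefined-carrier-"  # hypothetical constant for this sketch


def packages_by_carrier(markup):
    """Map each carrier name to the list of its <li> texts (possibly empty)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for div in soup.find_all("div", class_="tab-pane"):
        # class is multi-valued; pick the value that carries the carrier name.
        carrier = next(
            (c[len(PREFIX):] for c in div.get("class", []) if c.startswith(PREFIX)),
            None,
        )
        if carrier is None:
            continue  # not a carrier block
        # find_all returns [] for an empty or missing <ul>, so no guard is needed.
        result[carrier] = [li.get_text(strip=True) for li in div.find_all("li")]
    return result


packages = packages_by_carrier(html)
for carrier, items in packages.items():
    print(carrier)
    print("\n".join(items) if items else "(no predefined packages)")
    print()
```

Keeping the carrier name next to its items also makes it easy to combine this with the per-carrier headings printed in the question's loop.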
【Comments】:
Great, thanks! It works fine standalone, but when I plug it into my for item in all: code block, I get AttributeError: 'int' object has no attribute 'text'
I changed the first line of your code to for i in item:, which is when I get the error above. If I change it to for i in all:, I get all the <li> elements listed for every item in the all variable.
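A plausible explanation for the `'int' object has no attribute 'text'` error mentioned in the comment above: iterating over a single tag (`for i in item:`) yields its direct children, and those include whitespace text nodes. Text nodes are string objects, so `.find('ul')` on one of them is presumably the ordinary string search, which returns an integer. The sketch below (inline `html` and variable names are assumptions) shows the mixed children and keeps only real tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup with line breaks, as on a real page.
html = """<div class="tab-pane predefined-carrier-Estafeta ">
<ul>
<li>ENVELOPE</li>
<li>PARCEL</li>
</ul>
</div>"""

item = BeautifulSoup(html, "html.parser").find("div")

# Iterating a tag walks its direct children: text nodes and tags mixed.
kinds = [type(child).__name__ for child in item]
print(kinds)

# Keep only real tags before calling tag methods on them.
tag_children = [child for child in item if isinstance(child, Tag)]
li_texts = [
    li.get_text(strip=True)
    for child in tag_children
    for li in child.find_all("li")
]
print(li_texts)
```

Inside an existing `for item in all:` loop it is usually simpler to skip the inner iteration entirely and call `item.find_all("li")` on the current item.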