Scrape links from <a> tags inside a div using Python BeautifulSoup
【Title】: Scrape links from <a> tags inside a div using Python BeautifulSoup 【Posted】: 2022-01-22 23:12:03 【Question】: I want to scrape the links from the <a> tags that sit inside the div tags.
Here is my code:
from bs4 import BeautifulSoup
import requests

ProductUrl = {}
url = "https://megapc.tn/shop/ORDINATEURS/PC%20GAMER?selection=true"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:77.0) Gecko/20190101 Firefox/77.0"}
req = requests.get(url, headers=header)
soup = BeautifulSoup(req.content, 'lxml')
# find title of product
showName = soup.find_all('p', {'class': 'title-prod'})
# print(showName)
# find price of product
showPrice = soup.find_all('div', {'class': 'new-price'})
# print(showPrice)
# find link of product
for urlItem in soup.select("div.card a"):
    print(urlItem)
This is the result I want:
https://megapc.tn/shop/product/ORDINATEURS/PC%20GAMER/GX-7---RYZEN--3-1200---GTX-1650-D6-OC---8-GB
https://megapc.tn/shop/product/ORDINATEURS/PC%20GAMER/GX-8---i3-10105F---GTX-1650-D6-OC---8GB
https://megapc.tn/shop/product/ORDINATEURS/FULL%20SETUP/GX-9---RYZEN-3-1200---GT-1030-OC---8GB
https://megapc.tn/shop/product/ORDINATEURS/FULL%20SETUP/GX-10---i3-10105F---GT-1030-AERO-OC---8GB
https://megapc.tn/shop/product/ORDINATEURS/PC%20GAMER/pc-gamer-GX-11-GTX-1650-OC-8GB
https://megapc.tn/shop/product/ORDINATEURS/PC%20GAMER/pc-gamer-GX-12-10400F-BOX-GTX-1650-D6-OC
...
Is there any possible solution?
【Answer 1】: Try this code:
from lxml import html  # required for html.fromstring

# Reuses `req` and `soup` from the question's code
tree = html.fromstring(req.content)
linksItem = []
links = []
showlink = soup.find_all('div', {'class': 'card'})
for i in showlink:
    linksItem.append(i.find_all('a')[0])
lenLinks = len(linksItem)
# Get element using XPath (XPath positions start at 1, not 0)
for i in range(1, lenLinks + 1):
    link = tree.xpath(f'/html/body/app-root/app-content-layout/div/div/div/div/main/app-shop/app-produits-par-sous-categ/section/div/div/div/div/div[2]/div/div/div/div/div/div/div[{i}]/div/div/a/@href')
    if link:
        links.append('https://megapc.tn' + link[0])
for url in links:
    print(url)
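A shorter alternative to the absolute XPath, offered only as a sketch: BeautifulSoup's CSS selector `div.card a[href]` can pick out the anchors directly, and `urljoin` turns relative hrefs into absolute URLs. The `SAMPLE_HTML` markup and the `extract_card_links` helper are hypothetical stand-ins, not taken from the real page, and the approach only works if the product cards are actually present in the HTML that `requests` receives.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://megapc.tn"

# Hypothetical markup standing in for the real page; the actual structure may differ.
SAMPLE_HTML = """
<div class="card"><a href="/shop/product/ORDINATEURS/PC%20GAMER/pc-gamer-GX-11-GTX-1650-OC-8GB">GX 11</a></div>
<div class="card"><a>anchor without href</a></div>
<div class="other"><a href="/ignored">not in a card</a></div>
"""


def extract_card_links(markup, base_url=BASE_URL):
    """Return absolute URLs for every <a href=...> found inside a div.card."""
    soup = BeautifulSoup(markup, "html.parser")
    # a[href] skips anchors that have no href attribute
    return [urljoin(base_url, a["href"]) for a in soup.select("div.card a[href]")]


for link in extract_card_links(SAMPLE_HTML):
    print(link)
```

With the question's code, the same helper could be called as `extract_card_links(req.content)` in place of the loop that prints whole tags.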
【Answer 2】: Call the API directly, which avoids straining the backend server.
import requests
from pprint import pp  # pprint.pp requires Python 3.8+


def main(url):
    with requests.Session() as req:
        # Same filter payload the site sends for ORDINATEURS > PC GAMER
        data = {
            "brand": [],
            "categorie": {
                "titre": "ORDINATEURS"
            },
            "filscateg": {
                "titre": "PC GAMER"
            },
            "pageNumber": 0,
            "price": {
                "$gte": 0,
                "$lte": 20000
            },
            "query": 'null',
            "recordByPage": 12,
            "valeurAttribute1": []
        }
        r = req.post(url, json=data)
        for i in r.json():
            pp(i)
            break  # show only the first product


main('https://apiclient.mega-pc.net/produit/byPaginationNew')
Output:
{'prixEnPromo': 1850,
 '_id': '61b85eaee7a18e14c185d20f',
 'title_fr': 'GX 8 | i3-10105F | GTX 1650 D6 OC | 8GB',
 'notreSelection': True,
 'price': 2050,
 'devis': False,
 'stock': 10,
 'new': False,
 'sale': True,
 'lien': 'GX-8---i3-10105F---GTX-1650-D6-OC---8GB',
 'attributes': [{'cle': 'PROCESSEUR',
                 'valeur': 'intel core i3-10105F ',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac025c'},
                {'cle': 'FRÉQUENCE PROCESSEUR',
                 'valeur': '4.40 GHz',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac025d'},
                {'cle': 'CHIPSET GRAPHIQUE',
                 'valeur': 'GTX 1650 D6 OC',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac025e'},
                {'cle': 'TAILLE MÉMOIRE VIDÉO',
                 'valeur': '4 GB',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac025f'},
                {'cle': 'CARTE MÈRE',
                 'valeur': 'GIGABYTE H410M S2H V3',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0260'},
                {'cle': 'BARETTE MÉMOIRE',
                 'valeur': '8GB DDR4 3000 MHZ',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0261'},
                {'cle': 'NOMBRE DE BARRETTES MÉMOIRE',
                 'valeur': '1 BARRETTE MEMOIRE',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0262'},
                {'cle': 'TYPE DE STOCKAGE',
                 'valeur': 'SSD',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0263'},
                {'cle': 'CAPACITÉ DE STOCKAGE',
                 'valeur': '256 GB',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0264'},
                {'cle': "BLOC D'ALIMENTATION",
                 'valeur': 'AEROCOOL LUX 550W 80+ BRONZE',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0265'},
                {'cle': 'BOITIER',
                 'valeur': 'WHITE SHARK CASE GCC-2103 PANZER / 1 FAN RGB',
                 'enable': True,
                 '_id': '619cb2e196e44e766cac0266'}],
 'enArrivage': False,
 'discount': 9.75609756097561,
 'commande48H': False,
 'title': 'GX 8 | i3-10105F | GTX 1650 D6 OC | 8GB',
 'marque': {'_id': '5fc5ffc00c10517079547a46',
            'titre': 'CONFIG PC INTEL',
            'description': 'CONFIG PC Intel',
            '__v': 0,
            'urlPhoto': '/uploads/marque/1606811592727.webp'},
 'filscateg': {'_id': '5ea23237a4815052c4d1a415',
               'titre': 'PC GAMER',
               'categorie': '5e907aa91c9a7315fc2fc033',
               '__v': 0,
               'urlPhoto2': '/uploads/souscateg/1623073866853.webp',
               'order': 0,
               'descriptionSEO': '<p>Achat PC Gamer Tunisie, PC de bureau '
                                 'Gamer sur mesure. Ordinateur Gamer '
                                 'Processeur intel, Ryzen, Carte graphique '
                                 'RTX. Pc gamer tunisie 1000 dt Prix.</p>',
               'titreSEO': 'PC Gamer Tunisie - Achat PC Gamer sur mesure - '
                           'Intel | RYZEN -MEGA PC',
               'create_date': '2020-04-24T00:26:31.000Z',
               'update_date': '2021-12-15T10:41:08.763Z',
               'visible': True},
 'gallerie': {'_id': '61b8ca88a14a1b547a12db30',
              'titre': 'GX 8 | I3-10105F | GTX 1650 D6 OC | 8GB',
              'urlPhoto': ['/uploads/gallerie/1640088610360.webp'],
              'update_date': '2021-12-21T12:10:11.853Z',
              'create_date': '2021-12-14T16:47:04.398Z',
              '__v': 2},
 'categorie': {'_id': '5e907aa91c9a7315fc2fc033',
               'order': 0,
               'titre': 'ORDINATEURS',
               '__v': 0,
               'description': '<h1>pc gamer tunisie<br></h1><p>Retrouvez le '
                              'meilleur <strong>Pc GAMER</strong> en Tunisie '
                              'sur Megapc.tn . Puissance de calcul, <a '
                              'href="https://megapc.tn/shop/COMPOSANTS/CARTE%20GRAPHIQUE">carte '
                              'graphique</a>, ou mémoire vive, sélectionnez le '
                              '<strong>PC Gaming</strong> adapté à vos '
                              'besoins.</p><p>Du <strong>PC de bureau</strong> '
                              "sur mesure à l'ordinateur portable gamer, nos "
                              'experts alimentent régulièrement les gammes '
                              "d'ordinateurs en nouveautés pour satisfaire aux "
                              'exigences des logiciels & derniers jeux '
                              'Vidéo.</p><p>Trouvez le PC Gamer de vos rêves '
                              'chez MEGA PC. Config PC Gamer sur mesure; PC '
                              'Gamer fixe, PC gaming complet!</p>',
               'urlPhoto': '/uploads/categorie/1596623778969.jpg',
               'urlPhoto2': '/uploads/categorie/1620205162725.webp',
               'create_date': '2020-04-10T13:54:49.000Z',
               'update_date': '2021-12-16T11:33:51.893Z',
               'visible': True},
 'nFilsCategs': ['PC GAMER', 'Pc En PROMO']}
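The record above seems to contain everything needed to rebuild the product links the question asks for. The sketch below assumes the page URL follows the pattern /shop/product/<categorie titre>/<filscateg titre>/<lien>, which is inferred from the desired URLs in the question and is not documented anywhere in this thread. The `product_url` helper and the trimmed `sample` record are illustrative names, not part of the API.

```python
from urllib.parse import quote

# Assumed prefix, taken from the URLs listed in the question
BASE_URL = "https://megapc.tn/shop/product"


def product_url(item):
    """Build a product page URL from one API record (assumed URL pattern)."""
    parts = [item["categorie"]["titre"], item["filscateg"]["titre"], item["lien"]]
    # quote() percent-encodes spaces, so "PC GAMER" becomes "PC%20GAMER"
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


# Trimmed copy of the record printed by the answer's code
sample = {
    "lien": "GX-8---i3-10105F---GTX-1650-D6-OC---8GB",
    "categorie": {"titre": "ORDINATEURS"},
    "filscateg": {"titre": "PC GAMER"},
}
print(product_url(sample))
```

Inside the answer's loop, calling `product_url(i)` on each record instead of `pp(i)` would print one link per product, provided the assumed pattern holds for every item.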
【Comments】:
It really works, but I only want the links using BeautifulSoup, not the API. Can you help me?