Scraping HTML links with Python

Posted 2023-02-15


【Title】: Scrape html links Python 【Published】: 2022-01-04 20:13:03 【Question】:

Hi everyone, I'm trying to get all the href links using Python:

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

#Collecting links on rappel.gouv
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('','').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']

    return links

soup = get_url(url)
print(extract(soup))

I should get 10 HTML links like these:

https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4572/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4573/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4575/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4569/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4565/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4568/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4570/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4567/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne

When I put a print inside the loop, as below, it actually works:

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('','').strip()
        links = url + item.find('a', {'class': 'product-link'})['href']
        print(links)

    return

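But my plan is to take all the links I get from this request and put them into a loop, so that I can fetch the data from each of these 10 pages and store it in a database (which means there are more lines of code to write after def extract(soup)).

I have tried to understand this through many tutorials, but I end up with either a single HTML link or None.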


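【Answer 1】:

You just need to build a list of links. In your code, the variable links is overwritten on every pass through the loop, so only the last value is returned. Try this: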

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    links = []
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('','').strip()
        links.append(url + item.find('a', {'class': 'product-link'})['href'])

    return links

Then print each link in the main code, after the function:

soup = get_url(url)
linklist = extract(soup)
for url in linklist:
    print(url)
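
To see why the original version returned a single link, here is a minimal sketch that runs without any network access. The hrefs list is made-up sample data standing in for the values BeautifulSoup would return, and last_link_only and all_links are hypothetical names used only for this illustration.

```python
base = 'https://rappel.conso.gouv.fr'

# Sample hrefs standing in for item.find('a', ...)['href'] (assumed data)
hrefs = [
    '/fiche-rappel/4571/Interne',
    '/fiche-rappel/4572/Interne',
    '/fiche-rappel/4573/Interne',
]

def last_link_only(hrefs):
    # Plain assignment: each iteration replaces the previous value,
    # so only the final link survives the loop
    for href in hrefs:
        link = base + href
    return link

def all_links(hrefs):
    # Appending to a list keeps every value
    links = []
    for href in hrefs:
        links.append(base + href)
    return links

print(last_link_only(hrefs))
print(len(all_links(hrefs)))
```

The first function mirrors the question's code and the second mirrors the fix above; the only difference is assignment versus append.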

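【Comments】:

Thanks :) But I did that as well, and I get a result like this: ['https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne', ... 'https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne']. What I'm wondering is: suppose I name this output url_data = extract(soup) and pass it on with requests.get(url_data), then use bs4 to extract the data for each page. Do you think that will work? I'm worried about an error like requests.exceptions.InvalidSchema: No connection adapters were found for "['rappel.conso.gouv.fr']"

You can access the links in the list by index: soup = get_url(url); linklist = extract(soup); print(linklist[0]); print(linklist[1]). Of course, you can also loop over the list: for url in linklist: print(url)

Thank you so much!! And thanks to everyone else as well :)

One more thing: if you need to keep the starting URL in the variable url, it is better to use a different variable name in the last loop :)

【Answer 2】: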

Your links variable is being overwritten inside the for loop.

You can create an empty list before the loop and then append the URL on each iteration.

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

#Collecting links on rappel.gouv
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class': 'product-content'})
    links = []
    for item in results:
        item.find('a', {'class': 'product-link'}).text.replace('','').strip()
        links.append(url + item.find('a', {'class': 'product-link'})['href'])

    return links

soup = get_url(url)
print(extract(soup))
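
Since extract() now returns a list, each element has to be requested on its own: requests.get() expects a single URL string, not a list. The sketch below shows one way to loop over the list. It is an illustration under assumptions, not part of the answer: fetch_all and fake_fetch are hypothetical names, and fake_fetch is a stand-in so the example runs offline.

```python
def fetch_all(urls, fetch):
    """Call fetch(url) once per URL and return {url: result}."""
    if isinstance(urls, str):
        # A lone string would otherwise be iterated character by character
        urls = [urls]
    pages = {}
    for page_url in urls:
        pages[page_url] = fetch(page_url)
    return pages

# Stand-in fetcher (assumption) so the sketch needs no network access.
# With requests installed, one could pass instead:
#   lambda u: requests.get(u, headers=headers).text
def fake_fetch(page_url):
    record_id = page_url.rsplit('/', 2)[-2]
    return '<html>' + record_id + '</html>'

linklist = [
    'https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne',
    'https://rappel.conso.gouv.fr/fiche-rappel/4572/Interne',
]

pages = fetch_all(linklist, fake_fetch)
for page_url, html in pages.items():
    print(page_url, html)
```

Passing the fetcher in as an argument keeps the loop testable without hitting the site; each returned page can then be handed to BeautifulSoup one at a time.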

【Comments】:

Yes, I did that too, and I get a result like this: ['https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne', ... 'https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne']. But my question is: suppose I name this output url_data = extract(soup) and then pass it on with requests.get(url_data), and after that use bs4 to extract the data for each page. Do you think that will work? I'm afraid of an error like requests.exceptions.InvalidSchema: No connection adapters were found for "['https://rappel.conso.gouv.fr']"

【Answer 3】:

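To use the links from the page to iterate over each product detail page, collect the links in a list and return that list from the function.

Try to name your functions after what they return: get_url() is really more of a get_soup(), and so on.

Example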

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

def get_soup(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_product_urls(url):
    links = [url+x['href'] for x in get_soup(url).select('a.product-link')]
    return links

def extract_product_details(url):
    soup = get_soup(url)
    items = {}

    for x in soup.select('.product-desc li'):
        content = x.get_text('|', strip=True).split('|')
        items[content[0]]=content[1]

    return items

data = []

for link in extract_product_urls(url):
    data.append(extract_product_details(link))

data

Output

[{'Réf. Fiche\xa0:': '2021-11-0273',
  '№ de Version\xa0:': '1',
  'Origine de la fiche\xa0:': 'PLACE DU MARCHE PLACE DU MARCHE',
  'Nature juridique du rappel\xa0:': 'Volontaire',
  'Catégorie de produit': 'Alimentation',
  'Sous-catégorie de produit': 'Lait et produits laitiers',
  'Nom de la marque du produit': 'Toupargel',
  'Noms des modèles ou références': 'BATONNETS GEANTS VANILLE AMANDES',
  'Identification des produits': 'GTIN',
  'Conditionnements': '292G',
  'Date début/Fin de commercialisation': 'Du\r\n                            11/07/2019\r\n                            au\r\n                            18/09/2021',
  'Température de conservation': 'Produit à conserver au congélateur',
  'Marque de salubrité': 'EMB 35360C',
  'Zone géographique de vente': 'France entière',
  'Distributeurs': 'PLACE DU MARCHE',
  'Motif du rappel': 'Nous tenons à vous informer, que suite à une alerte européenne concernant la présence potentielle d’oxyde d’éthylène à une teneur supérieure à la limite autorisée, et comme un grand nombre d’acteurs de la distribution, nous devons procéder au rappel',
  'Risques encourus par le consommateur': 'Autres contaminants chimiques',
  'Conduite à tenir par le consommateur': 'Ne plus consommer',
  'Numéro de contact': '0805805910',
  'Modalités de compensation': 'Remboursement',
  'Date de fin de la procédure de rappel': 'samedi 26 février 2022'},
 {'Réf. Fiche\xa0:': '2021-11-0274',
  '№ de Version\xa0:': '1',
  'Origine de la fiche\xa0:': 'PLACE DU MARCHE PLACE DU MARCHE',
  'Nature juridique du rappel\xa0:': 'Volontaire',
  'Catégorie de produit': 'Alimentation',
  'Sous-catégorie de produit': 'Lait et produits laitiers',
  'Nom de la marque du produit': 'Toupargel',
  'Noms des modèles ou références': 'CREME GLACEE NOUGAT',
  'Identification des produits': 'GTIN',
  'Conditionnements': '469G',
  'Date début/Fin de commercialisation': 'Du\r\n                            28/06/2019\r\n                            au\r\n                            10/10/2021',
  'Température de conservation': 'Produit à conserver au congélateur',
  'Marque de salubrité': 'EMB 35360C',
  'Zone géographique de vente': 'France entière',
  'Distributeurs': 'PLACE DU MARCHE',
  'Motif du rappel': 'Nous tenons à vous informer, que suite à une alerte européenne concernant la présence potentielle d’oxyde d’éthylène à une teneur supérieure à la limite autorisée, et comme un grand nombre d’acteurs de la distribution, nous devons procéder au rappel',
  'Risques encourus par le consommateur': 'Autres contaminants chimiques',
  'Conduite à tenir par le consommateur': 'Ne plus consommer',
  'Numéro de contact': '0805805910',
  'Modalités de compensation': 'Remboursement',
  'Date de fin de la procédure de rappel': 'samedi 26 février 2022'},...]
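
The scraped records still carry non-breaking spaces ('\xa0') and colons in some keys, and runs of whitespace in some values. A minimal cleanup sketch using only the standard library follows. clean_record is a hypothetical helper, not part of the answer, and the sample record copies two entries from the output.

```python
import re

def clean_record(record):
    """Return a copy with tidied keys and whitespace-collapsed values."""
    cleaned = {}
    for key, value in record.items():
        # Drop non-breaking spaces and a trailing colon from the key
        new_key = key.replace('\xa0', ' ').strip().rstrip(':').strip()
        # Collapse newlines and repeated spaces into single spaces
        new_value = re.sub(r'\s+', ' ', value).strip()
        cleaned[new_key] = new_value
    return cleaned

# Two entries copied from the scraped output
sample = {
    'Réf. Fiche\xa0:': '2021-11-0273',
    'Date début/Fin de commercialisation': 'Du\r\n                            11/07/2019\r\n                            au\r\n                            18/09/2021',
}

print(clean_record(sample))
```

Normalizing keys this way may be worth doing before storing the records in a database, since the raw keys would otherwise differ between fields with and without the trailing '\xa0:'.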

【Comments】:

Thank you very much, it takes fewer lines and works just as well :D
