How to extract all links from a website using python [duplicate]

Posted: 2021-07-21 10:14:06

Question:

I wrote a script to extract links from a website, and it works well. Here is the source code:


import requests
from bs4 import BeautifulSoup
Web=requests.get("https://www.google.com/")
soup=BeautifulSoup(Web.text,'lxml')
for link in soup.findAll('a'):
    print(link['href'])

## Output
https://www.google.com.sa/imghp?hl=ar&tab=wi
https://maps.google.com.sa/maps?hl=ar&tab=wl
https://www.youtube.com/?gl=SA&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://calendar.google.com/calendar?tab=wc
https://www.google.com.sa/intl/ar/about/products?tab=wh
http://www.google.com.sa/history/optout?hl=ar
/preferences?hl=ar
https://accounts.google.com/ServiceLogin?hl=ar&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/search?safe=strict&ie=UTF-8&q=%D9%86%D9%88%D8%B1+%D8%A7%D9%84%D8%B4%D8%B1%D9%8A%D9%81&oi=ddle&ct=174786979&hl=ar&kgmid=/m/0562zv&sa=X&ved=0ahUKEwiq8feoiqDwAhUK8BQKHc7UD7oQPQgD
/advanced_search?hl=ar-SA&authuser=0
https://www.google.com/setprefs?sig=0_mwAqJUgnrqSouOmGk0UvVz7GgkY%3D&hl=en&source=homepage&sa=X&ved=0ahUKEwiq8feoiqDwAhUK8BQKHc7UD7oQ2ZgBCAU
/intl/ar/ads/
http://www.google.com/intl/ar/services/
/intl/ar/about.html
https://www.google.com/setprefdomain?prefdom=SA&prev=https://www.google.com.sa/&sig=K_e_0jdE_IjI-G5o1qMYziPpQwHgs%3D
/intl/ar/policies/privacy/
/intl/ar/policies/terms/

But the problem is that when I change the website to https://www.jarir.com/, it no longer works:

import requests
from bs4 import BeautifulSoup
Web=requests.get("https://www.jarir.com/")
soup=BeautifulSoup(Web.text,'lxml')
for link in soup.findAll('a'):
    print(link['href'])

# Output
#

The output is just a single #.

Comments:

Have you searched for this question already? What did you find?

Answer 1:

I suggest adding a random-header function so the website does not detect python-requests as the client. The code below returns all the links as requested.

Note the randomization of the headers and how the code passes the headers parameter to the requests.get method.

import requests
from bs4 import BeautifulSoup
from random import choice

desktop_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']


def random_headers():
    # Build a headers dict with a randomly chosen desktop User-Agent.
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

Web=requests.get("https://www.jarir.com", headers=random_headers())
soup=BeautifulSoup(Web.text,'lxml')
for link in soup.findAll('a'):
    print(link['href'])
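
One caveat worth noting (my addition, not something the answer addresses): link['href'] raises a KeyError for any <a> tag that has no href attribute. Using Tag.get skips those safely, as in this minimal sketch:

for link in soup.findAll('a'):
    href = link.get('href')  # returns None instead of raising KeyError
    if href:
        print(href)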

Comments:

Answer 2:

The website blocks Python bots and returns an access-denied page:

<h1>Access denied</h1>
<p>This website is using a security service to protect itself from online attacks.</p>
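
To confirm this is what is happening, you can print the raw response before parsing it (a minimal diagnostic sketch, assuming the same request as in the question):

import requests

# Fetch the page with the default python-requests User-Agent.
Web = requests.get("https://www.jarir.com/")

# A blocked request typically returns an error status and the
# "Access denied" page shown above instead of the real markup.
print(Web.status_code)
print(Web.text[:300])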

You can try adding a user agent to your code, like this:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
Web=requests.get("https://www.jarir.com/", headers=headers)
soup=BeautifulSoup(Web.text,'lxml')
for link in soup.findAll('a'):
    print(link['href'])

The output looks like this:

https://www.jarir.com/wishlist/
https://www.jarir.com/sales/order/history/
https://www.jarir.com/afs/
https://www.jarir.com/contacts/
tel:+966920000089
/cdn-cgi/l/email-protection#6300021106230902110a114d000c0e
https://www.jarir.com/faq/
https://www.jarir.com/warranty_policy/
https://www.jarir.com/return_exchange/
https://www.jarir.com/contacts/
https://www.jarir.com/terms-of-service/
https://www.jarir.com/privacy-policy/
https://www.jarir.com/storelocator/
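
Note that the output mixes absolute URLs, relative paths, and tel: links. If you only want absolute http(s) URLs, urllib.parse.urljoin can resolve the relative paths against the base address (a follow-up sketch, not part of the original answer):

from urllib.parse import urljoin

base = "https://www.jarir.com/"
for link in soup.findAll('a'):
    href = link.get('href')
    # Skip non-web schemes; urljoin leaves absolute URLs unchanged.
    if href and not href.startswith(('tel:', 'mailto:')):
        print(urljoin(base, href))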

Comments:
