Can't scrape product title from a webpage
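【Title】Can't scrape product title from a webpage 【Posted】2021-08-25 11:45:38
【Question】:
I'm trying to scrape the title of a product available on this webpage using the requests module. Even though the product title is present in the page source (Ctrl + U), the script always throws AttributeError.
I've tried with (throws AttributeError):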
import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError:
    # select_one() returned None: no "h1 > span" element in the response
    product_title = ""
print(product_title)
Expected output:
Gigabyte GeForce RTX 3070 Aorus Master 8GB OC GPU
How can I scrape the product title from that webpage?
PS: I also tried the cloudscraper library, with no luck.
Edit:
This is what I get when I run the code below: raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 403 Client Error: Forbidden for url:
import cfscrape

url = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}
token, agent = cfscrape.get_tokens(url, headers=headers)
print(token, agent)
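I know I could use the value of cf_clearance in a cookie to access the page content, if I could get the token value from the attempt above.
【Comments】:
Have you tried print(soup) to check whether the source resembles what your browser shows and contains the information you're looking for?
I think you've just hit the Cloudflare DDoS wall.
You probably can't get past the Cloudflare wall with plain requests. Your best bet may be selenium.
I'm not surprised. I hit your URL, and you need to solve an obfuscated JavaScript challenge to get through. (For me) it would take a lot of time to configure a bypass for that wall. Note that the "challenge" Cloudflare presents changes very regularly.
Neither cloudscraper nor cfscrape can solve the Cloudflare JavaScript challenge, because these packages aren't publicly maintained. In the issues section of cfscrape, the developer states that there is now a paid subscription model that is actively maintained.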
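【Answer 1】:
This is just a placeholder for research that may be useful to others looking into this Cloudflare bypass problem.
Use case
Scraping information from a website that uses a Cloudflare CAPTCHA or JavaScript challenge for enhanced protection.
Python Requests
With a standard Python requests.get call, the Cloudflare service returns a 403 Forbidden error code.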
import requests

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
response = requests.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
If we look at response.headers, we can see that Cloudflare's servers are proxying our request to the target URL.
# ...continued from the code above
for key, value in response.headers.items():
    print(f'KEY NAME: {key}')
    print(f'KEY VALUE: {value}')
    print('-----------------------')
# output
KEY NAME: Date
KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
-----------------------
KEY NAME: Content-Type
KEY VALUE: text/html; charset=UTF-8
-----------------------
KEY NAME: Transfer-Encoding
KEY VALUE: chunked
-----------------------
KEY NAME: Connection
KEY VALUE: close
-----------------------
KEY NAME: Permissions-Policy
KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
-----------------------
KEY NAME: Cache-Control
KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
-----------------------
KEY NAME: Expires
KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
-----------------------
KEY NAME: X-Frame-Options
KEY VALUE: SAMEORIGIN
-----------------------
KEY NAME: cf-request-id
KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
-----------------------
KEY NAME: Expect-CT
KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
-----------------------
KEY NAME: Set-Cookie
KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
-----------------------
KEY NAME: Vary
KEY VALUE: Accept-Encoding
-----------------------
KEY NAME: Server
KEY VALUE: cloudflare
-----------------------
KEY NAME: CF-RAY
KEY VALUE: 65ecc0b9383b07ff-ATL
-----------------------
KEY NAME: Content-Encoding
KEY VALUE: gzip
-----------------------
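The header dump above suggests a quick programmatic check. This is a minimal sketch, assuming the `Server: cloudflare` and `CF-RAY` headers seen in that output are the ones worth keying on. The name `served_by_cloudflare` is hypothetical, and Cloudflare may change which headers it sends.

```python
def served_by_cloudflare(headers):
    """Return True if the response headers look like they came from Cloudflare.

    `headers` is any mapping of header name -> value, e.g. response.headers.
    Names are lower-cased first so a plain dict works as well.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").lower()
    return server.startswith("cloudflare") or "cf-ray" in lowered


# Values copied from the header dump above
sample = {"Server": "cloudflare", "CF-RAY": "65ecc0b9383b07ff-ATL"}
print(served_by_cloudflare(sample))               # True
print(served_by_cloudflare({"Server": "nginx"}))  # False
```

A True result only says the response passed through Cloudflare; it does not by itself mean the request was challenged.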
If we look at the response.text of the Python Requests response, we can see further evidence of Cloudflare protection.
# ...continued from the code above
print(response.text)
# output
truncated...
<title>Please Wait... | Cloudflare</title>
<meta name="captcha-bypass" id="captcha-bypass" />
truncated...
<form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">
truncated...
<input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">
<input type="hidden" name="cf_captcha_kind" value="h">
<input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">
truncated...
<script type="text/javascript">
//<![CDATA[
(function(){
var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
var trkjs = isIE ? new Image() : document.createElement('img');
trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
trkjs.id = "trk_managed_js";
trkjs.setAttribute("alt", "");
document.body.appendChild(trkjs);
var cpo=document.createElement('script');
cpo.type='text/javascript';
cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
document.getElementsByTagName('head')[0].appendChild(cpo);
}());
//]]>
</script>
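The information above shows that the Python requests call to the target URL was intercepted by Cloudflare's servers, which issued a challenge. That challenge has to be passed before the original request is allowed to proceed.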
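The same evidence can be checked in code before any attempt to parse the product title. The sketch below uses only the standard library and looks for two markers taken from the response body shown above (`id="challenge-form"` and `/cdn-cgi/challenge-platform/`) plus the word "Cloudflare" in the page title. The names `is_challenge_page` and `_TitleParser` are hypothetical, and the markers are assumptions based on this one captured response.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def is_challenge_page(status_code, html):
    """Heuristic: a 403 whose body carries Cloudflare challenge markers."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    markers = ('id="challenge-form"', "/cdn-cgi/challenge-platform/")
    has_marker = any(marker in html for marker in markers)
    return status_code == 403 and (has_marker or "Cloudflare" in parser.title)


snippet = '<title>Please Wait... | Cloudflare</title><form id="challenge-form"></form>'
print(is_challenge_page(403, snippet))                           # True
print(is_challenge_page(200, "<title>Some product</title>"))     # False
```

With requests, the call would be `is_challenge_page(response.status_code, response.text)`.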
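cfscrape package
The OP indicated that they tried to use the cfscrape Python package to obtain token information from the Cloudflare servers.
A standard cfscrape request produces the same response as Python Requests.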
import cfscrape

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
scraper = cfscrape.create_scraper(delay=10)
response = scraper.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
The cfscrape package also supports the get_tokens and get_cookie_string functions, but both of them produce a 403 Forbidden error code as well.
From the cfscrape source code:
def is_cloudflare_captcha_challenge(resp):
    return (
        resp.status_code == 403
        and resp.headers.get("Server", "").startswith("cloudflare")
        and b"/cdn-cgi/l/chk_captcha" in resp.content
    )

# the function above is called from this
def request(self, method, url, *args, **kwargs):
    resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

    # Check if Cloudflare captcha challenge is presented
    if self.is_cloudflare_captcha_challenge(resp):
        self.handle_captcha_challenge(resp, url)

    # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
    if self.is_cloudflare_iuam_challenge(resp):
        resp = self.solve_cf_challenge(resp, **kwargs)

    return resp
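The handle_captcha_challenge function attempts to solve the Cloudflare JavaScript challenge. This section of the code is where the failure occurs. It isn't yet clear which part of that section fails, so additional research and testing are needed.
Please note: according to the package's developer, the module is no longer supported.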
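One way to start that testing might be to run the same kind of check against a captured response and see which markers it actually contains. The sketch below reuses the condition from `is_cloudflare_captcha_challenge` shown above and adds a second marker, `/cdn-cgi/challenge-platform/`, taken from the response body printed earlier in this answer. `FakeResponse`, `MARKERS` and `which_markers` are illustrative names, not part of cfscrape, and treating the second marker as significant is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Stand-in for requests.Response with only the attributes used here."""
    status_code: int
    content: bytes
    headers: dict = field(default_factory=dict)


# First marker: the one cfscrape checks (from the source above).
# Second marker: seen in the challenge page printed earlier (an assumption).
MARKERS = {
    "legacy_captcha": b"/cdn-cgi/l/chk_captcha",
    "challenge_platform": b"/cdn-cgi/challenge-platform/",
}


def which_markers(resp):
    """Return the names of the known markers present in a Cloudflare 403."""
    from_cloudflare = resp.headers.get("Server", "").startswith("cloudflare")
    if resp.status_code != 403 or not from_cloudflare:
        return []
    return [name for name, marker in MARKERS.items() if marker in resp.content]


resp = FakeResponse(
    status_code=403,
    content=b'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1"',
    headers={"Server": "cloudflare"},
)
print(which_markers(resp))  # ['challenge_platform']
```

If a real response reports only `challenge_platform`, the captcha branch in `request` would not be taken for it, which would be worth confirming against the full, untruncated body.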
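cloudscraper package
The OP also indicated that they tried to use the cloudscraper Python package to obtain token information from the Cloudflare servers. It's worth noting that cloudscraper is a fork of cfscrape, so the syntax is similar.
cloudscraper gets the same 403 Forbidden error code as cfscrape.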
import cloudscraper

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}
scraper = cloudscraper.create_scraper()
response = scraper.get(URL)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
Status Code: 403
Status Code Reason: Forbidden
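The cloudscraper package also supports the get_tokens and get_cookie_string functions, but both of them produce a 403 Forbidden error code as well.
selenium package
The OP also indicated that they tried to use the selenium Python package.
Special note: during my testing I used selenium with the web drivers for Google Chrome, Mozilla Firefox and Microsoft Edge.
Within the last 12 months, the following options could be used with selenium to bypass Cloudflare protection. Unfortunately, they no longer work today.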
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
# additional disable-blink-features are available in Chromium source code on Github
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
Below is a selenium code example that uses the Chrome web driver with the switches above.
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)
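The code above opens a browser session that is presented with the Cloudflare JavaScript challenge. During testing with the switches above, the challenge never completed. The Cloudflare Ray ID, a unique ID for each request, rotated several times before I manually terminated the session.
seleniumwire is required to obtain the status code
Below is a Chrome web driver session in headless mode, which also shows the 403 Forbidden error code for the target URL. The session also shows that hCaptcha (hcaptcha.com) anti-bot technology is now part of the mix.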
from seleniumwire import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)
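for request in driver.requests:
    # request.response is None when no response was captured
    if request.response:
        print(f'Status Code: {request.response.status_code}')
        print(f'Host Name: {request.host}')
        print('-----------------------')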
# output
Status Code: 403
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 200
Host Name: www.cclonline.com
-----------------------
Status Code: 302
Host Name: hcaptcha.com
-----------------------
Status Code: 200
Host Name: newassets.hcaptcha.com
-----------------------
driver.quit()
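To make third-party hosts such as hcaptcha.com stand out in a longer capture, the per-request output could be grouped by host. A small sketch, assuming the requests have already been reduced to `(host, status)` pairs; the helper name `statuses_by_host` is hypothetical, and the pairs below are transcribed from the output above.

```python
from collections import defaultdict


def statuses_by_host(pairs):
    """Group HTTP status codes by host name, keeping first-seen host order."""
    grouped = defaultdict(list)
    for host, status in pairs:
        grouped[host].append(status)
    return dict(grouped)


# (host, status) pairs transcribed from the output above
observed = [("www.cclonline.com", 403)] + [("www.cclonline.com", 200)] * 7 + [
    ("hcaptcha.com", 302),
    ("newassets.hcaptcha.com", 200),
]
summary = statuses_by_host(observed)
third_party = [host for host in summary if not host.endswith("cclonline.com")]
print(third_party)  # ['hcaptcha.com', 'newassets.hcaptcha.com']
```

In a live selenium-wire session the pairs would presumably come from `driver.requests`, skipping entries whose `response` is None.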
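A standard Chrome web driver session with a UI displays an iFrame containing an "I am human" checkbox.
If I click that checkbox, either manually or through the selenium session, I'm prompted with an image CAPTCHA, which adds to the complexity of bypassing the Cloudflare protection.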
cf_clearance cookie
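Once the Cloudflare CAPTCHA or JavaScript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, but this is configurable by the Cloudflare customer.
If you open the OP's target URL manually in Google Chrome, you can see the cf_clearance cookie in Developer Tools. In this case the cookie's lifetime appears to be set to 60 minutes, judging by the UTC time at which the session started and the expiry date set on the cookie.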
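The lifetime arithmetic can be sketched as follows. No cf_clearance timestamps appear in this answer, so the example uses the `Date` header and the `expires` attribute of the `__cf_bm` cookie from the header dump earlier; the function name `cookie_lifetime` is hypothetical.

```python
from datetime import datetime

# Both formats match the strings in the header dump earlier in this answer.
# Day and month names are parsed with the default (English) locale.
DATE_HEADER_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"
COOKIE_EXPIRES_FORMAT = "%a, %d-%b-%y %H:%M:%S GMT"


def cookie_lifetime(date_header, cookie_expires):
    """Return the cookie's expiry time minus the response time."""
    issued = datetime.strptime(date_header, DATE_HEADER_FORMAT)
    expires = datetime.strptime(cookie_expires, COOKIE_EXPIRES_FORMAT)
    return expires - issued


lifetime = cookie_lifetime(
    "Sun, 13 Jun 2021 16:39:03 GMT",  # Date header
    "Sun, 13-Jun-21 17:09:03 GMT",    # expires attribute of __cf_bm
)
print(lifetime.total_seconds() / 60)  # 30.0
```

The same subtraction applied to the start time and expiry shown in Developer Tools is what gives the 60-minute figure for cf_clearance.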
So far I haven't found a way to extract this cookie with Python.
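One avenue that might be worth testing: Selenium's `driver.get_cookies()` returns the session's cookies as a list of dicts with `name` and `value` keys. Assuming a browser session had already cleared the challenge (which the tests above did not achieve), a sketch like the one below could turn that list into a `Cookie` header value. `cookie_header` is a hypothetical helper and the cookie values are placeholders.

```python
def cookie_header(cookies, wanted=("cf_clearance",)):
    """Build a Cookie header value from Selenium-style cookie dicts.

    Only cookies whose name is listed in `wanted` are kept.
    """
    pairs = [f"{c['name']}={c['value']}" for c in cookies if c["name"] in wanted]
    return "; ".join(pairs)


# Placeholder data in the shape returned by driver.get_cookies()
cookies = [
    {"name": "__cf_bm", "value": "placeholder-bm", "domain": ".cclonline.com"},
    {"name": "cf_clearance", "value": "placeholder-clearance", "domain": ".cclonline.com"},
]
print(cookie_header(cookies))  # cf_clearance=placeholder-clearance
```

The resulting string would go into the `Cookie` request header, sent together with the same User-Agent the browser session used.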
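【Comments】:
Why is this the accepted answer? It doesn't solve the problem.
【Answer 2】:
Things needed in the request headers:
- the "cf_clearance" cookie
- the User-Agent
Steps to get the cookie:
- open Chrome DevTools
- switch to the "Network" tab
- copy the request headers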
import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
# Request headers copied from the browser's Network tab
h = '''cookie: cf_clearance=718abb68f064be7612ee987ab9d8bc755016f3c2-1623437208-0-150
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4539.2 Safari/537.36'''
# Split on the first ': ' only, so values that contain ': ' stay intact
h = dict(l.split(': ', 1) for l in h.split('\n') if ': ' in l)
res = requests.get(link, headers=h)
soup = BeautifulSoup(res.text, "lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError:
    product_title = ""
print(product_title)
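When the whole header block is copied from DevTools, it can include blank lines and HTTP/2 pseudo-headers such as `:authority`, which should not be sent as ordinary headers. A slightly more defensive parser could look like the sketch below; `parse_raw_headers` is a hypothetical name, and the cookie value is a placeholder to be replaced with the one copied from the browser.

```python
def parse_raw_headers(raw):
    """Turn a block of 'name: value' lines into a dict.

    Skips blank lines and pseudo-headers that start with ':', and splits
    on the first colon only so values may themselves contain colons.
    """
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


raw = """
:authority: www.cclonline.com
cookie: cf_clearance=<value copied from DevTools>
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
"""
print(parse_raw_headers(raw))
```

The returned dict can be passed as `headers=` to `requests.get` in place of `h` above.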