Learning Python: scraping an Amazon page fails at first, then succeeds after modifying the HTTP request headers
By modifying the HTTP request headers, the request that was rejected at first is accepted by the server.
>>> import requests
>>> r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>>
>>> r.encoding = r.apparent_encoding
>>> r.text
<!DOCTYPE html>\\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\\n<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\\n<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\\n<!--[if gt IE 8]><!-->\\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\\n<meta charset="utf-8">\\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\\n<title dir="ltr">Amazon CAPTCHA</title>\\n<meta name="viewport" content="width=device-width">\\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\\n<script>\\n\\nif (true === true) \\n var ue_t0 = (+ new Date()),\\n ue_csm = window,\\n ue = t0: ue_t0, d: function() return (+new Date() - ue_t0); ,\\n ue_furl = "fls-cn.amazon.cn",\\n ue_mid = "AAHKV2X7AFYLW",\\n ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\\n ue_sn = "opfcaptcha.amazon.cn",\\n ue_id = \\T33DQKKSXC8PZYZ4E5NQ\\;\\n\\n</script>\\n</head>\\n<body>\\n\\n<!--\\n To discuss automated access to Amazon data please contact api-services-support@amazon.com.\\n For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\\n-->\\n\\n<!--\\nCorreios.DoNotSend\\n-->\\n\\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\\n\\n <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\\n\\n <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\\n\\n <div class="a-box a-alert a-alert-info a-spacing-base">\\n <div class="a-box-inner">\\n <i class="a-icon 
a-icon-alert"></i>\\n <h4>请输入您在下方看到的字符</h4>\\n <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>\\n </div>\\n </div>\\n\\n <div class="a-section">\\n\\n <div class="a-box a-color-offset-background">\\n <div class="a-box-inner a-padding-extra-large">\\n\\n <form method="get" action="/errors/validateCaptcha" name="">\\n <input type=hidden name="amzn" value="FPQ3hpXcLlqNYaQR2w10gA==" /><input type=hidden name="amzn-r" value="/gp/product/B01M8L5Z3Y" />\\n <div class="a-row a-spacing-large">\\n <div class="a-box">\\n <div class="a-box-inner">\\n <h4>请输入您在这个图片中看到的字符:</h4>\\n <div class="a-row a-text-center">\\n <img src="https://images-na.ssl-images-amazon.com/captcha/cucusdhr/Captcha_bxyxvfyusz.jpg">\\n </div>\\n <div class="a-row a-spacing-base">\\n <div class="a-row">\\n <div class="a-column a-span6">\\n <label for="captchacharacters">输入字符</label>\\n </div>\\n <div class="a-column a-span6 a-span-last a-text-right">\\n <a οnclick="window.location.reload()">换一张图</a>\\n </div>\\n </div>\\n <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\\n </div>\\n </div>\\n </div>\\n </div>\\n\\n <div class="a-section a-spacing-extra-large">\\n\\n <div class="a-row">\\n <span class="a-button a-button-primary a-span12">\\n <span class="a-button-inner">\\n <button type="submit" class="a-button-text">继续购物</button>\\n </span>\\n </span>\\n </div>\\n\\n </div>\\n </form>\\n\\n </div>\\n </div>\\n\\n </div>\\n\\n </div>\\n\\n <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\\n\\n <div class="a-text-center a-spacing-small a-size-mini">\\n <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用条件</a>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <a 
href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隐私声明</a>\\n </div>\\n\\n <div class="a-text-center a-size-mini a-color-secondary">\\n © 1996-2015, Amazon.com, Inc. or its affiliates\\n <script>\\n if (true === true) \\n document.write(\\<img src="https://fls-cn.amaz\\+\\on.cn/\\+\\1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=1" />\\);\\n ;\\n </script>\\n <noscript>\\n <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=0" />\\n </noscript>\\n </div>\\n </div>\\n <script>\\n if (true === true) \\n var elem = document.createElement("script");\\n elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\\n document.getElementsByTagName(\\head\\)[0].appendChild(elem);\\n \\n </script>\\n</body></html>\\n
>>>
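# A response did come back, so this is not a network error; the server simply refuses to serve the page (status 503 plus a CAPTCHA page). This is a restriction on web crawlers. Sites restrict crawlers in two ways: by inspecting the HTTP request headers, and by publishing a crawler protocol (presumably robots.txt).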
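Assuming the "protocol" mentioned above means robots.txt, here is a minimal sketch of how a crawler could check it using only the standard library. The robots.txt content, the crawler name `my-crawler`, and the `example.com` URLs are all made up for illustration; nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been obtained somehow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_public = parser.can_fetch("my-crawler", "https://example.com/products/1")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

Unlike header inspection, robots.txt is only advisory: the server does not enforce it, so it is up to the crawler to check it and comply.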
Take a look at the headers of the request that was sent:
>>> r.request.headers
{'User-Agent': 'python-requests/2.20.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
# The default User-Agent, python-requests/2.20.1, honestly tells the server "I am a crawler".
Change the header:
>>> k = {'user-agent': 'Mozilla/5.0'}
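# Build a key-value pair (a dict) that overrides the header. Mozilla/5.0 is the User-Agent prefix that browsers send, so the request now presents itself as a browser.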
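To see what the override does before sending anything, the headers can be previewed offline. This is only a sketch: `preview_headers` is a helper name invented here, and it relies on `requests.Session.prepare_request`, which merges the session's default headers with the ones passed in, without contacting the server.

```python
import requests

URL = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"  # URL used in this article


def preview_headers(url, extra_headers=None):
    """Return the headers requests would send; no network traffic occurs."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=extra_headers)
    prepared = session.prepare_request(request)
    return prepared.headers  # case-insensitive mapping


default_headers = preview_headers(URL)
custom_headers = preview_headers(URL, {"user-agent": "Mozilla/5.0"})

print(default_headers["User-Agent"])  # the library's default identification
print(custom_headers["User-Agent"])   # the overridden value
```

Header names are matched case-insensitively, which is why a lowercase `user-agent` key replaces the default `User-Agent` entry instead of adding a second one.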
Then:
>>> url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url,headers = k)
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
# The User-Agent has changed to Mozilla/5.0.
>>> r.status_code
200
# 200: the server accepted the request.
>>> r.text[:10000]
<!DOCTYPE html>\\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\\n<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\\n<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\\n<!--[if gt IE 8]><!-->\\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\\n<meta charset="utf-8">\\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\\n<title dir="ltr">Amazon CAPTCHA</title>\\n<meta name="viewport" content="width=device-width">\\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\\n<script>\\n\\nif (true === true) \\n var ue_t0 = (+ new Date()),\\n ue_csm = window,\\n ue = t0: ue_t0, d: function() return (+new Date() - ue_t0); ,\\n ue_furl = "fls-cn.amazon.cn",\\n ue_mid = "AAHKV2X7AFYLW",\\n ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\\n ue_sn = "opfcaptcha.amazon.cn",\\n ue_id = \\Q7EMQ0MRT0KYC536Q05R\\;\\n\\n</script>\\n</head>\\n<body>\\n\\n<!--\\n To discuss automated access to Amazon data please contact api-services-support@amazon.com.\\n For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\\n-->\\n\\n<!--\\nCorreios.DoNotSend\\n-->\\n\\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\\n\\n <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\\n\\n <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\\n\\n <div class="a-box a-alert a-alert-info a-spacing-base">\\n <div class="a-box-inner">\\n <i class="a-icon 
a-icon-alert"></i>\\n <h4>请è¾\\x93å\\x85¥æ\\x82¨å\\x9c¨ä¸\\x8bæ\\x96¹ç\\x9c\\x8bå\\x88°ç\\x9a\\x84å\\xad\\x97符</h4>\\n <p class="a-last">æ\\x8a±æ\\xad\\x89ï¼\\x8cæ\\x88\\x91们å\\x8fªæ\\x98¯æ\\x83³ç¡®è®¤ä¸\\x80ä¸\\x8bå½\\x93å\\x89\\x8d访é\\x97®è\\x80\\x85并é\\x9d\\x9eè\\x87ªå\\x8a¨ç¨\\x8båº\\x8fã\\x80\\x82为äº\\x86è¾¾å\\x88°æ\\x9c\\x80ä½³æ\\x95\\x88æ\\x9e\\x9cï¼\\x8c请确ä¿\\x9dæ\\x82¨æµ\\x8fè§\\x88å\\x99¨ä¸\\x8aç\\x9a\\x84 Cookie å·²å\\x90¯ç\\x94¨ã\\x80\\x82</p>\\n </div>\\n </div>\\n\\n <div class="a-section">\\n\\n <div class="a-box a-color-offset-background">\\n <div class="a-box-inner a-padding-extra-large">\\n\\n <form method="get" action="/errors/validateCaptcha" name="">\\n <input type=hidden name="amzn" value="byevvYbW69v2h8EgJ6MuPw==" /><input type=hidden name="amzn-r" value="/gp/product/B01M8L5Z3Y" />\\n <div class="a-row a-spacing-large">\\n <div class="a-box">\\n <div class="a-box-inner">\\n <h4>请è¾\\x93å\\x85¥æ\\x82¨å\\x9c¨è¿\\x99个å\\x9b¾ç\\x89\\x87ä¸\\xadç\\x9c\\x8bå\\x88°ç\\x9a\\x84å\\xad\\x97符ï¼\\x9a</h4>\\n <div class="a-row a-text-center">\\n <img src="https://images-na.ssl-images-amazon.com/captcha/qamfifum/Captcha_biitgjptru.jpg">\\n </div>\\n <div class="a-row a-spacing-base">\\n <div class="a-row">\\n <div class="a-column a-span6">\\n <label for="captchacharacters">è¾\\x93å\\x85¥å\\xad\\x97符</label>\\n </div>\\n <div class="a-column a-span6 a-span-last a-text-right">\\n <a οnclick="window.location.reload()">æ\\x8d¢ä¸\\x80å¼\\xa0å\\x9b¾</a>\\n </div>\\n </div>\\n <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\\n </div>\\n </div>\\n </div>\\n </div>\\n\\n <div class="a-section a-spacing-extra-large">\\n\\n <div class="a-row">\\n <span class="a-button a-button-primary a-span12">\\n <span class="a-button-inner">\\n <button type="submit" class="a-button-text">继ç»\\xadè´\\xadç\\x89©</button>\\n </span>\\n </span>\\n </div>\\n\\n </div>\\n 
</form>\\n\\n </div>\\n </div>\\n\\n </div>\\n\\n </div>\\n\\n <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\\n\\n <div class="a-text-center a-spacing-small a-size-mini">\\n <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使ç\\x94¨æ\\x9d¡ä»¶</a>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <span class="a-letter-space"></span>\\n <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">é\\x9a\\x90ç §\\x81声æ\\x98\\x8e</a>\\n </div>\\n\\n <div class="a-text-center a-size-mini a-color-secondary">\\n © 1996-2015, Amazon.com, Inc. or its affiliates\\n <script>\\n if (true === true) \\n document.write(\\<img src="https://fls-cn.amaz\\+\\on.cn/\\+\\1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=1" />\\);\\n ;\\n </script>\\n <noscript>\\n <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=0" />\\n </noscript>\\n </div>\\n </div>\\n <script>\\n if (true === true) \\n var elem = document.createElement("script");\\n elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\\n document.getElementsByTagName(\\head\\)[0].appendChild(elem);\\n \\n </script>\\n</body></html>\\n
>>>
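# The status is now 200 and text comes back. Note that in this capture the page title is still "Amazon CAPTCHA", and the Chinese text is garbled because r.encoding was not set to r.apparent_encoding for the new response.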
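The garbled sequences in the second output (for example `è¾\x93å\x85¥`) are what UTF-8 bytes look like when decoded as ISO-8859-1, which is the encoding `r.encoding` reported earlier. A small offline sketch of the effect, using a sample string chosen here for illustration:

```python
original = "请输入"  # sample Chinese text, chosen for this example

# What the server sends on the wire: UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never raises, but produces mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the correct encoding recovers the text.
recovered = raw_bytes.decode("utf-8")

# An already garbled string can be repaired by undoing the wrong step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered == original, repaired == original)
```

With requests, setting `r.encoding = r.apparent_encoding` before reading `r.text` (as done for the first response) has the same effect as decoding with the right encoding in the first place.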
In short, changing the HTTP request headers is what turned the 503 into a 200 response.
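A 200 status alone does not say what kind of page came back, so it can help to look at the page title as well. The sketch below uses only the standard library; `page_title` and `looks_like_captcha` are names invented here, and treating a title that contains "captcha" as a block page is a heuristic assumption based on the output shown above, not a documented Amazon behaviour.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def looks_like_captcha(html):
    """Heuristic (assumption): a block page mentions CAPTCHA in its title."""
    return "captcha" in page_title(html).lower()


# Two tiny hand-written pages; the first mimics the title seen in the output.
blocked = '<html><head><title dir="ltr">Amazon CAPTCHA</title></head><body></body></html>'
normal = "<html><head><title>Some product</title></head><body></body></html>"

print(looks_like_captcha(blocked), looks_like_captcha(normal))
```

In a real session the function would be called as `looks_like_captcha(r.text)` right after checking `r.status_code`.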