Learning Python: scraping an Amazon page failed, then succeeded after modifying the HTTP request headers!



By modifying the HTTP request headers, the page content can be fetched successfully!

 

 

python
import requests
r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
r.status_code
r.encoding

 

 

>>> import requests
>>> r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>>
>>> r.encoding = r.apparent_encoding
>>> r.text
<!DOCTYPE html>\\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\\n<!--[if IE 7]>    <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\\n<!--[if IE 8]>    <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\\n<!--[if gt IE 8]><!-->\\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\\n<meta charset="utf-8">\\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\\n<title dir="ltr">Amazon CAPTCHA</title>\\n<meta name="viewport" content="width=device-width">\\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\\n<script>\\n\\nif (true === true) \\n    var ue_t0 = (+ new Date()),\\n        ue_csm = window,\\n        ue = t0: ue_t0, d: function() return (+new Date() - ue_t0); ,\\n        ue_furl = "fls-cn.amazon.cn",\\n        ue_mid = "AAHKV2X7AFYLW",\\n        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\\n        ue_sn = "opfcaptcha.amazon.cn",\\n        ue_id = \\T33DQKKSXC8PZYZ4E5NQ\\;\\n\\n</script>\\n</head>\\n<body>\\n\\n<!--\\n        To discuss automated access to Amazon data please contact api-services-support@amazon.com.\\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\\n-->\\n\\n<!--\\nCorreios.DoNotSend\\n-->\\n\\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\\n\\n    <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\\n\\n        <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\\n\\n        <div class="a-box a-alert 
a-alert-info a-spacing-base">\\n            <div class="a-box-inner">\\n                <i class="a-icon a-icon-alert"></i>\\n                <h4>请输入您在下方看到的字符</h4>\\n                <p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>\\n                </div>\\n            </div>\\n\\n            <div class="a-section">\\n\\n                <div class="a-box a-color-offset-background">\\n                    <div class="a-box-inner a-padding-extra-large">\\n\\n                        <form method="get" action="/errors/validateCaptcha" name="">\\n                            <input type=hidden name="amzn" value="FPQ3hpXcLlqNYaQR2w10gA==" /><input type=hidden name="amzn-r" value="&#047;gp&#047;product&#047;B01M8L5Z3Y" />\\n                            <div class="a-row a-spacing-large">\\n                                <div class="a-box">\\n                                    <div class="a-box-inner">\\n                                        <h4>请输入您在这个图片中看到的字符:</h4>\\n                                        <div class="a-row a-text-center">\\n                                            <img src="https://images-na.ssl-images-amazon.com/captcha/cucusdhr/Captcha_bxyxvfyusz.jpg">\\n                                        </div>\\n                                        <div class="a-row a-spacing-base">\\n                                            <div class="a-row">\\n                                                <div class="a-column a-span6">\\n                                                    <label for="captchacharacters">输入字符</label>\\n                                                </div>\\n                                                <div class="a-column a-span6 a-span-last a-text-right">\\n                                                    <a οnclick="window.location.reload()">换一张图</a>\\n                                                </div>\\n                                            </div>\\n                                    
        <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\\n                                        </div>\\n                                    </div>\\n                                </div>\\n                            </div>\\n\\n                            <div class="a-section a-spacing-extra-large">\\n\\n                                <div class="a-row">\\n                                    <span class="a-button a-button-primary a-span12">\\n                                        <span class="a-button-inner">\\n                                            <button type="submit" class="a-button-text">继续购物</button>\\n                                        </span>\\n                                    </span>\\n                                </div>\\n\\n                            </div>\\n                        </form>\\n\\n                    </div>\\n                </div>\\n\\n            </div>\\n\\n        </div>\\n\\n        <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\\n\\n        <div class="a-text-center a-spacing-small a-size-mini">\\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用条件</a>\\n            <span class="a-letter-space"></span>\\n            <span class="a-letter-space"></span>\\n            <span class="a-letter-space"></span>\\n            <span class="a-letter-space"></span>\\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隐私声明</a>\\n        </div>\\n\\n        <div class="a-text-center a-size-mini a-color-secondary">\\n          &copy; 1996-2015, Amazon.com, Inc. 
or its affiliates\\n          <script>\\n           if (true === true) \\n             document.write(\\<img src="https://fls-cn.amaz\\+\\on.cn/\\+\\1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=1" />\\);\\n           ;\\n          </script>\\n          <noscript>\\n            <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=T33DQKKSXC8PZYZ4E5NQ&js=0" />\\n          </noscript>\\n        </div>\\n    </div>\\n    <script>\\n    if (true === true) \\n        var elem = document.createElement("script");\\n        elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\\n        document.getElementsByTagName(\\head\\)[0].appendChild(elem);\\n    \\n    </script>\\n</body></html>\\n
>>>
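The body returned above is not the product page: its title is "Amazon CAPTCHA" and its form submits to /errors/validateCaptcha. Below is a minimal sketch of how a script could notice this. The function name looks_like_captcha and the choice of these two markers are assumptions based only on the output captured here; they are a heuristic, and the page may change at any time.

```python
def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML resembles the CAPTCHA page captured above.

    The markers come from the captured output; they are a heuristic,
    not a documented Amazon interface.
    """
    markers = ("Amazon CAPTCHA", "/errors/validateCaptcha")
    return any(marker in html for marker in markers)


# Small stand-ins for real response bodies.
captcha_page = '<title dir="ltr">Amazon CAPTCHA</title>'
product_page = "<title>Some product</title>"

print(looks_like_captcha(captcha_page))  # True
print(looks_like_captcha(product_page))  # False
```

In a real script one would pass r.text to the function after each request.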

 

 

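# So this is not a network error; the site is refusing access. Sites restrict web crawlers in two ways: one is by inspecting the HTTP request headers, the other is a protocol (presumably the Robots Exclusion Protocol, robots.txt).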
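As a sketch of the second kind of restriction: assuming the "protocol" mentioned above is the Robots Exclusion Protocol (robots.txt), Python's standard urllib.robotparser can report whether a given user agent may fetch a path. The rules and URLs below are a made-up example, not Amazon's actual robots.txt, and "my-crawler" is an arbitrary name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /gp/cart
""".splitlines()

parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

# Paths not covered by a Disallow rule are allowed by default.
product_allowed = parser.can_fetch("my-crawler", "https://example.com/gp/product/123")
cart_allowed = parser.can_fetch("my-crawler", "https://example.com/gp/cart/view")

print(product_allowed)
print(cart_allowed)
```

For a live site, one would call parser.set_url() with the site's robots.txt address and then parser.read() instead of parse().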
Check the request headers:
>>> r.request.headers
{'User-Agent': 'python-requests/2.20.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> # User-Agent: python-requests/2.20.1 honestly tells the server "I am a crawler!"

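Change the header information:
>>> k = {'user-agent': 'Mozilla/5.0'}

# Build a key-value pair (a dict) to override the header. Mozilla/5.0 is a browser-style User-Agent, so the request masquerades as coming from a browser.
Then: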
>>> url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url, headers=k)
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> # The User-Agent is now Mozilla/5.0
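To see the effect of the headers argument without contacting the server, one option is to build the request and inspect it before sending. This is a sketch using requests' Session.prepare_request; the variable names are arbitrary, and nothing here is sent over the network.

```python
import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
custom_headers = {"user-agent": "Mozilla/5.0"}

session = requests.Session()

# Prepare (but do not send) one request with the default headers and
# one with the overridden User-Agent.
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(
    requests.Request("GET", url, headers=custom_headers)
)

# Header lookup is case-insensitive, so "User-Agent" finds "user-agent".
print(default_req.headers["User-Agent"])
print(custom_req.headers["User-Agent"])
```

The first line printed starts with python-requests/ followed by the installed version; the second is the overridden value.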

>>> r.status_code
200    # success!
>>> r.text[:10000]
<!DOCTYPE html>\\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\\n<!--[if IE 7]>    <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\\n<!--[if IE 8]>    <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\\n<!--[if gt IE 8]><!-->\\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\\n<meta charset="utf-8">\\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\\n<title dir="ltr">Amazon CAPTCHA</title>\\n<meta name="viewport" content="width=device-width">\\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\\n<script>\\n\\nif (true === true) \\n    var ue_t0 = (+ new Date()),\\n        ue_csm = window,\\n        ue = t0: ue_t0, d: function() return (+new Date() - ue_t0); ,\\n        ue_furl = "fls-cn.amazon.cn",\\n        ue_mid = "AAHKV2X7AFYLW",\\n        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\\n        ue_sn = "opfcaptcha.amazon.cn",\\n        ue_id = \\Q7EMQ0MRT0KYC536Q05R\\;\\n\\n</script>\\n</head>\\n<body>\\n\\n<!--\\n        To discuss automated access to Amazon data please contact api-services-support@amazon.com.\\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\\n-->\\n\\n<!--\\nCorreios.DoNotSend\\n-->\\n\\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\\n\\n    <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\\n\\n        <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\\n\\n        <div class="a-box a-alert 
a-alert-info a-spacing-base">\\n            <div class="a-box-inner">\\n                <i class="a-icon a-icon-alert"></i>\\n                <h4>请è¾\\x93å\\x85¥æ\\x82¨å\\x9c¨ä¸\\x8bæ\\x96¹ç\\x9c\\x8bå\\x88°ç\\x9a\\x84å\\xad\\x97符</h4>\\n                <p class="a-last">æ\\x8a±æ\\xad\\x89ï¼\\x8cæ\\x88\\x91们å\\x8fªæ\\x98¯æ\\x83³ç¡®è®¤ä¸\\x80ä¸\\x8bå½\\x93å\\x89\\x8d访é\\x97®è\\x80\\x85并é\\x9d\\x9eè\\x87ªå\\x8a¨ç¨\\x8båº\\x8fã\\x80\\x82为äº\\x86è¾¾å\\x88°æ\\x9c\\x80ä½³æ\\x95\\x88æ\\x9e\\x9cï¼\\x8c请确ä¿\\x9dæ\\x82¨æµ\\x8fè§\\x88å\\x99¨ä¸\\x8aç\\x9a\\x84 Cookie å·²å\\x90¯ç\\x94¨ã\\x80\\x82</p>\\n                </div>\\n            </div>\\n\\n            <div class="a-section">\\n\\n                <div class="a-box a-color-offset-background">\\n                    <div class="a-box-inner a-padding-extra-large">\\n\\n                        <form method="get" action="/errors/validateCaptcha" name="">\\n                            <input type=hidden name="amzn" value="byevvYbW69v2h8EgJ6MuPw==" /><input type=hidden name="amzn-r" value="&#047;gp&#047;product&#047;B01M8L5Z3Y" />\\n                            <div class="a-row a-spacing-large">\\n                                <div class="a-box">\\n                                    <div class="a-box-inner">\\n                                        <h4>请è¾\\x93å\\x85¥æ\\x82¨å\\x9c¨è¿\\x99个å\\x9b¾ç\\x89\\x87ä¸\\xadç\\x9c\\x8bå\\x88°ç\\x9a\\x84å\\xad\\x97符ï¼\\x9a</h4>\\n                                        <div class="a-row a-text-center">\\n                                            <img src="https://images-na.ssl-images-amazon.com/captcha/qamfifum/Captcha_biitgjptru.jpg">\\n                                        </div>\\n                                        <div class="a-row a-spacing-base">\\n                                            <div class="a-row">\\n                                                <div class="a-column a-span6">\\n                                                    <label 
for="captchacharacters">è¾\\x93å\\x85¥å\\xad\\x97符</label>\\n                                                </div>\\n                                                <div class="a-column a-span6 a-span-last a-text-right">\\n                                                    <a οnclick="window.location.reload()">æ\\x8d¢ä¸\\x80å¼\\xa0å\\x9b¾</a>\\n                                                </div>\\n                                            </div>\\n                                            <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\\n                                        </div>\\n                                    </div>\\n                                </div>\\n                            </div>\\n\\n                            <div class="a-section a-spacing-extra-large">\\n\\n                                <div class="a-row">\\n                                    <span class="a-button a-button-primary a-span12">\\n                                        <span class="a-button-inner">\\n                                            <button type="submit" class="a-button-text">继ç»\\xadè´\\xadç\\x89©</button>\\n                                        </span>\\n                                    </span>\\n                                </div>\\n\\n                            </div>\\n                        </form>\\n\\n                    </div>\\n                </div>\\n\\n            </div>\\n\\n        </div>\\n\\n        <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\\n\\n        <div class="a-text-center a-spacing-small a-size-mini">\\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使ç\\x94¨æ\\x9d¡ä»¶</a>\\n            <span class="a-letter-space"></span>\\n            <span class="a-letter-space"></span>\\n            <span 
class="a-letter-space"></span>\\n            <span class="a-letter-space"></span>\\n            <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">é\\x9a\\x90ç §\\x81声æ\\x98\\x8e</a>\\n        </div>\\n\\n        <div class="a-text-center a-size-mini a-color-secondary">\\n          &copy; 1996-2015, Amazon.com, Inc. or its affiliates\\n          <script>\\n           if (true === true) \\n             document.write(\\<img src="https://fls-cn.amaz\\+\\on.cn/\\+\\1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=1" />\\);\\n           ;\\n          </script>\\n          <noscript>\\n            <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=Q7EMQ0MRT0KYC536Q05R&js=0" />\\n          </noscript>\\n        </div>\\n    </div>\\n    <script>\\n    if (true === true) \\n        var elem = document.createElement("script");\\n        elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";\\n        document.getElementsByTagName(\\head\\)[0].appendChild(elem);\\n    \\n    </script>\\n</body></html>\\n
>>>
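# The status code is now 200 and a page body comes back.
# In this capture r.encoding was not set to r.apparent_encoding again, so the Chinese text shows up garbled, and the title still reads "Amazon CAPTCHA", so the content is worth checking as well as the status code.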
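The garbled Chinese in the second r.text output is what UTF-8 bytes look like when decoded as ISO-8859-1, which requests falls back to when a text response declares no charset in its Content-Type header. The first session avoided this by setting r.encoding = r.apparent_encoding before reading r.text. A small standard-library sketch of the effect, using the word 输入 from the page as sample text:

```python
original = "输入"  # sample text taken from the CAPTCHA page

# The server sends UTF-8 bytes; decoding them with the wrong codec
# produces the garbled characters seen in the output above.
raw_bytes = original.encode("utf-8")
garbled = raw_bytes.decode("iso-8859-1")
print(repr(garbled))

# Decoding the same bytes as UTF-8 recovers the text.
recovered = raw_bytes.decode("utf-8")
print(recovered)
```

With requests, the equivalent fix is to set r.encoding before accessing r.text, as done earlier in this post.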

 

 

That's it: modifying the HTTP request headers makes it possible to fetch the page content!
