Handling a 521 status code when scraping a web page with Python
Posted by 银色之刃
While scraping data from a web page, the request came back with status code: 521. References:
https://blog.csdn.net/fm345689/article/details/84980340
https://zhuanlan.zhihu.com/p/25957793
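Use the execjs library (installed from the PyExecJS package) to run the page's JavaScript. PyV8 only supports Python up to 2.7 and does not work on Python 3.7.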
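Before running the script, it can help to check which JavaScript bridge is importable in the current interpreter. The following is a minimal sketch using only the standard library; the helper names is_installed and pick_js_backend are hypothetical and not part of execjs or PyV8.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def pick_js_backend(candidates=("execjs", "PyV8")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        if is_installed(name):
            return name
    return None


if __name__ == "__main__":
    backend = pick_js_backend()
    if backend is None:
        print("No JavaScript bridge found; try: pip install PyExecJS")
    else:
        print("Using JavaScript bridge:", backend)
```

Note that execjs itself also needs a JavaScript runtime (Node.js, for example) on the machine; this check only covers the Python side.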
    # -*- coding: utf-8 -*-

    import re

    import execjs
    import requests_html


    def parse_js(html):
        # Extract the obfuscated JS function from the 521 challenge page
        js_string = re.search(r'(function.*)</script>', html).group(1)
        # Patch the JS so that it returns the generated code instead of eval-ing it
        js_string = js_string.replace('eval("qo=eval;qo(po);")', 'return po')
        # Extract the JS function name and its argument, e.g.
        # window.onload = setTimeout("hu(60)", 200)
        js_func, js_args = re.search(r'setTimeout\("(.*?)\((.*?)\)",\s*\d+\)', html).group(1, 2)
        # Run the JS to obtain the cookie-setting code
        js_result = execjs.compile(js_string).call(js_func, js_args)
        # Extract the cookie, e.g.
        # document.cookie='_ydclearance=f530d15ec2689d8c524213bf-aacb-4966-bedd-cc982c6bb4ea-1549534240; expires=Thu, 07-Feb-19 10:10:40 GMT; domain=.66ip.cn; path=/'; window.document.location=document.URL
        cookie_str = re.search(r"cookie='(.*?);", js_result).group(1)
        return cookie_str


    if __name__ == '__main__':
        session = requests_html.HTMLSession()
        url = 'http://www.66ip.cn/1.html'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36'
        }
        response = session.get(url, headers=headers)
        print(response.status_code)
        # Cookies set by the first response (not used below; the session keeps them)
        cookies = response.cookies
        cookies_text = ';'.join(['='.join(item) for item in cookies.items()])
        if response.status_code == 521:
            # Solve the JS challenge, then retry with the resulting cookie
            headers['cookie'] = parse_js(response.text)
            response = session.get(url, headers=headers)
            print(response.status_code)
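The two extraction steps in parse_js can be tried offline, without execjs or a network request. The sketch below is an assumption-laden illustration: the sample strings are adapted from the comments in the script above, the surrounding script tag and function body in SAMPLE_HTML are made up, and the helper names extract_call and extract_cookie are hypothetical. A real challenge page will contain different values.

```python
import re

# Sample input adapted from the comments in the script above (illustrative only).
SAMPLE_HTML = (
    '<script>function hu(x){/* obfuscated body */} '
    'window.onload = setTimeout("hu(60)", 200);</script>'
)
SAMPLE_JS_RESULT = (
    "document.cookie='_ydclearance=f530d15ec2689d8c524213bf-aacb-4966-bedd-"
    "cc982c6bb4ea-1549534240; expires=Thu, 07-Feb-19 10:10:40 GMT; "
    "domain=.66ip.cn; path=/'; window.document.location=document.URL"
)


def extract_call(html):
    """Return (function_name, argument) from the setTimeout("name(arg)", delay) call."""
    match = re.search(r'setTimeout\("(.*?)\((.*?)\)",\s*\d+\)', html)
    if match is None:
        raise ValueError("no setTimeout call found in the page")
    return match.group(1, 2)


def extract_cookie(js_result):
    """Return the name=value pair assigned to document.cookie."""
    match = re.search(r"cookie='(.*?);", js_result)
    if match is None:
        raise ValueError("no cookie assignment found in the JS result")
    return match.group(1)


if __name__ == "__main__":
    print(extract_call(SAMPLE_HTML))
    print(extract_cookie(SAMPLE_JS_RESULT))
```

Checking for a failed match before calling group() gives a readable error if the site changes its challenge script, instead of an AttributeError on None.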