Python Web Crawling
1. You can retrieve the response headers via r.headers:
>>> r = requests.get('http://www.zhidaow.com')
>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'content-type': 'text/html; charset=utf-8',
    ...
}
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r.headers.get('content-type')
'text/html; charset=utf-8'
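The reason both lookups above succeed regardless of case is that requests stores response headers in a case-insensitive dictionary. A minimal sketch of that behavior, built locally so it runs without any network access:

```python
# requests keeps response headers in a CaseInsensitiveDict, which is why
# r.headers['Content-Type'] and r.headers.get('content-type') return the
# same value. Demonstrated here with a locally built dict (no network).
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict()
headers['Content-Type'] = 'text/html; charset=utf-8'

# All of these lookups hit the same entry, regardless of key case.
print(headers['content-type'])    # text/html; charset=utf-8
print(headers.get('CONTENT-TYPE'))  # text/html; charset=utf-8
print('content-TYPE' in headers)  # True
```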
2. Setting a timeout
>>> requests.get('http://github.com', timeout=0.001)
A timeout this short will almost certainly raise requests.exceptions.Timeout before the connection completes.
3. Accessing through a proxy
proxies = { "http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080", } requests.get("http://www.zhidaow.com", proxies=proxies)
If the proxy requires a username and password, use this form:
proxies = {
"http": "http://user:[email protected]:3128/",
}
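One caveat with the user:password@host form: credentials containing characters such as @ or : must be percent-encoded, or the URL will not parse correctly. A small helper (hypothetical, not part of requests) that handles this:

```python
# Hypothetical helper that builds an authenticated proxy URL,
# percent-encoding the credentials so characters like '@' or ':' in the
# password don't break URL parsing.
from urllib.parse import quote

def proxy_url(user, password, host, port, scheme='http'):
    return (f'{scheme}://{quote(user, safe="")}:'
            f'{quote(password, safe="")}@{host}:{port}/')

proxies = {'http': proxy_url('user', 'p@ss:word', '10.10.1.10', 3128)}
print(proxies['http'])  # http://user:p%40ss%3Aword@10.10.1.10:3128/
# requests.get(url, proxies=proxies) would then route through the proxy.
```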
4. Disabling redirects
>>> r = requests.get('http://www.baidu.com/link?url=QeTRFOS7TuUQRppa0wlTJJr6FfIYI1DJprJukx4Qy0XnsDO_s9baoO8u1wvjxgqN', allow_redirects=False)
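With allow_redirects=False, requests hands back the redirect response itself (a 30x status plus a Location header) instead of following it. A sketch using a throwaway local server, so it runs without reaching the URL above (the server and target URL are illustrative only):

```python
# Inspecting a redirect response with allow_redirects=False.
# A local one-off server issues the 302, so no internet access is needed.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', 'http://example.com/target')
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f'http://127.0.0.1:{server.server_address[1]}/'
r = requests.get(url, allow_redirects=False)
print(r.status_code)          # 302
print(r.headers['Location'])  # http://example.com/target
print(r.is_redirect)          # True
server.shutdown()
```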
5. Uploading files
>>> url = 'http://httpbin.org/post'
>>> files = {'file': open('report.xls', 'rb')}
>>> r = requests.post(url, files=files)
>>> r.text
{
...
"files": {
"file": "<censored...binary...data>"
},
...
}
You can also set the filename explicitly:
>>> url = 'http://httpbin.org/post'
>>> files = {'file': ('report.xls', open('report.xls', 'rb'))}
>>> r = requests.post(url, files=files)
>>> r.text
{
...
"files": {
"file": "<censored...binary...data>"
},
...
}
If you want, you can also send a string that will be received as a file:
>>> url = 'http://httpbin.org/post'
>>> files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
>>> r = requests.post(url, files=files)
>>> r.text
{
...
"files": {
"file": "some,data,to,send\\nanother,row,to,send\\n"
},
...
}
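You can also inspect the multipart body that requests would send, without any network access, by building a PreparedRequest. A sketch (the URL is only a placeholder; nothing is actually sent):

```python
# Building the multipart/form-data payload locally via a PreparedRequest;
# the URL is a placeholder and no request is sent.
import requests

req = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')},
).prepare()

# The prepared body is the multipart payload, carrying the filename
# and the string content encoded as file data.
print(req.headers['Content-Type'])           # multipart/form-data; boundary=...
print(b'filename="report.csv"' in req.body)  # True
print(b'some,data,to,send' in req.body)      # True
```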