python爬虫实践

Posted 2021-02-07 Macaulish

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了python爬虫实践相关的知识，希望对你有一定的参考价值。

参考链接

https://blog.csdn.net/u012662731/article/details/78537432
详解 python3 urllib
https://www.jianshu.com/p/2e190438bd9c

需要的包

requests

官方文档：
https://docs.python.org/3/library/urllib.html

urllib.request for opening and reading URLs
- 函数原型：urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- data: 发送数据，
  - params 需要被转码成字节流。而 params 是一个字典
  - 使用 urllib.parse.urlencode() 将字典转化为字符串。\\n
  - 再使用 bytes() 转为字节流。最后使用 urlopen() 发起请求，请求是模拟用 POST 方式提交表单数据。
  - data = bytes(urllib.parse.urlencode(params), encoding=\'utf8\')
  - response = urllib.request.urlopen(url, data=data)
  - 使用 data 参数，请求方式变成以 POST 方式提交表单。使用标准格式是application/x-www-form-urlencoded
- timeout 参数是用于设置请求超时时间。单位是秒。
- cafile和capath代表 CA 证书和 CA 证书的路径。如果使用HTTPS则需要用到。
- context参数必须是ssl.SSLContext类型，用来指定SSL设置
- cadefault参数已经被弃用，可以不用管了。
- 该方法也可以单独传入urllib.request.Request对象
- 该函数返回结果是一个http.client.HTTPResponse对象。
- 函数原型：urllib.request.Request(url, data=None, headers={},origin_req_host=None,unverifiable=False, method=None)
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files

以上是关于python爬虫实践的主要内容，如果未能解决你的问题，请参考以下文章