A Roundup of Web Crawling Tools
Posted by carlous
Requests
Requests is an Apache2 Licensed HTTP library written in Python. It is a high-level wrapper over Python's built-in modules that makes issuing network requests far more pleasant: with Requests you can easily perform essentially any HTTP operation a browser can. A short usage sketch follows the feature list below.
Features:
- Keep-Alive & connection pooling
- International domains and URLs
- Sessions with cookie persistence
- Browser-style SSL verification
- Automatic content decoding
- Basic/Digest authentication
- Elegant key/value cookies
- Automatic decompression
- Unicode response bodies
- HTTP(S) proxy support
- Multipart file uploads
- Streaming downloads
- Connection timeouts
- Chunked requests
- .netrc support
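With those features in mind, basic usage looks like this (a minimal sketch; httpbin.org is only an assumed public echo service used for illustration):

    import requests

    # GET with query parameters and a timeout
    r = requests.get("https://httpbin.org/get", params={"k1": "v1"}, timeout=5)
    print(r.status_code, r.url)

    # POST with a JSON body; requests serializes it and sets Content-Type: application/json
    r = requests.post("https://httpbin.org/post", json={"k1": "v1"}, timeout=5)
    print(r.json()["json"])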
The module-level API of Requests is a thin wrapper that creates a Session per call (simplified from requests' api.py):

    from . import sessions

    # Core function
    def request(method, url, **kwargs):
        """(long docstring, reproduced in full below)"""
        with sessions.Session() as session:
            return session.request(method=method, url=url, **kwargs)

    # All of the methods below are implemented on top of request()
    def get(url, params=None, **kwargs):
        pass

    def options(url, **kwargs):
        pass

    def head(url, **kwargs):
        pass

    def post(url, data=None, json=None, **kwargs):
        pass

    def put(url, data=None, **kwargs):
        pass

    def patch(url, data=None, **kwargs):
        pass

    def delete(url, **kwargs):
        pass
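Because every helper above simply forwards to request(), the two calls below are interchangeable (a small sketch; httpbin.org is again just an assumed test endpoint):

    import requests

    # get() is sugar for request('get', ...), so these produce equivalent requests
    r1 = requests.get("https://httpbin.org/get", params={"k": "v"})
    r2 = requests.request("get", "https://httpbin.org/get", params={"k": "v"})
    print(r1.status_code, r2.status_code)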
The full docstring of request() documents every supported keyword argument:

    def request(method, url, **kwargs):
        """Constructs and sends a :class:`Request <Request>`.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary, list of tuples or bytes to send
            in the query string for the :class:`Request`.
        :param data: (optional) Dictionary, list of tuples, bytes, or file-like
            object to send in the body of the :class:`Request`.
        :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
        :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
            ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
            or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
            defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
            to add for the file.
        :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How many seconds to wait for the server to send data
            before giving up, as a float, or a :ref:`(connect timeout, read
            timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
        :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
        :param stream: (optional) if ``False``, the response content will be immediately downloaded.
        :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response

        Usage::

          >>> import requests
          >>> req = requests.request('GET', 'https://httpbin.org/get')
          <Response [200]>
        """

        # By using the 'with' statement we are sure the session is closed, thus we
        # avoid leaving sockets open which can trigger a ResourceWarning in some
        # cases, and look like a memory leak in others.
        with sessions.Session() as session:
            return session.request(method=method, url=url, **kwargs)
Typical usage of each keyword argument, collected into small example functions (these examples assume a local test server at http://127.0.0.1:8000/test/):

    import requests


    def param_method_url():
        # requests.request(method='get', url='http://127.0.0.1:8000/test/')
        # requests.request(method='post', url='http://127.0.0.1:8000/test/')
        pass


    def param_param():
        # params may be:
        # - a dict
        # - a string
        # - bytes (ASCII only)

        # requests.request(method='get',
        #                  url='http://127.0.0.1:8000/test/',
        #                  params={'k1': 'v1', 'k2': '水电费'})

        # requests.request(method='get',
        #                  url='http://127.0.0.1:8000/test/',
        #                  params="k1=v1&k2=水电费&k3=v3&k3=vv3")

        # requests.request(method='get',
        #                  url='http://127.0.0.1:8000/test/',
        #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

        # Error: bytes containing non-ASCII characters are not allowed
        # requests.request(method='get',
        #                  url='http://127.0.0.1:8000/test/',
        #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
        pass


    def param_data():
        # data may be a dict, a string, bytes, or a file object

        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  data={'k1': 'v1', 'k2': '水电费'})

        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  data="k1=v1; k2=v2; k3=v3; k3=v4")

        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
        #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})

        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
        #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})
        pass


    def param_json():
        # The json argument is serialized to a string with json.dumps(...) and sent
        # in the request body, with the Content-Type header set to 'application/json'
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         json={'k1': 'v1', 'k2': '水电费'})


    def param_headers():
        # Send custom request headers
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         json={'k1': 'v1', 'k2': '水电费'},
                         headers={'Content-Type': 'application/x-www-form-urlencoded'})


    def param_cookies():
        # Send cookies to the server
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         data={'k1': 'v1', 'k2': 'v2'},
                         cookies={'cook1': 'value1'})
        # A CookieJar can also be used (the dict form is a wrapper around it)
        from http.cookiejar import CookieJar
        from http.cookiejar import Cookie

        obj = CookieJar()
        obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                              discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                              port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False))
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         data={'k1': 'v1', 'k2': 'v2'},
                         cookies=obj)


    def param_files():
        # Upload a file
        # file_dict = {
        #     'f1': open('readme', 'rb')
        # }
        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  files=file_dict)

        # Upload a file with a custom filename
        # file_dict = {
        #     'f1': ('test.txt', open('readme', 'rb'))
        # }
        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  files=file_dict)

        # Upload in-memory content with a custom filename
        # file_dict = {
        #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
        # }
        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  files=file_dict)

        # Upload with a filename, content type, and extra headers
        # file_dict = {
        #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
        # }
        # requests.request(method='POST',
        #                  url='http://127.0.0.1:8000/test/',
        #                  files=file_dict)

        pass


    def param_auth():
        from requests.auth import HTTPBasicAuth, HTTPDigestAuth

        ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
        print(ret.text)

        # ret = requests.get('http://192.168.1.1',
        #                    auth=HTTPBasicAuth('admin', 'admin'))
        # ret.encoding = 'gbk'
        # print(ret.text)

        # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
        # print(ret)


    def param_timeout():
        # ret = requests.get('http://google.com/', timeout=1)
        # print(ret)

        # ret = requests.get('http://google.com/', timeout=(5, 1))
        # print(ret)
        pass


    def param_allow_redirects():
        ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
        print(ret.text)


    def param_proxies():
        # proxies = {
        #     "http": "61.172.249.96:80",
        #     "https": "http://61.185.219.126:3128",
        # }

        # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

        # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
        # print(ret.headers)

        # from requests.auth import HTTPProxyAuth
        #
        # proxyDict = {
        #     'http': '77.75.105.165',
        #     'https': '77.75.105.165'
        # }
        # auth = HTTPProxyAuth('username', 'mypassword')
        #
        # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
        # print(r.text)

        pass


    def param_stream():
        ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
        print(ret.content)
        ret.close()

        # from contextlib import closing
        # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
        #     # process the response here
        #     for i in r.iter_content():
        #         print(i)


    def requests_session():
        import requests

        session = requests.Session()

        # 1. First visit any page to obtain a cookie
        i1 = session.get(url="http://dig.chouti.com/help/service")

        # 2. Log in, carrying the cookie from step 1; the backend authorizes the gpsd value inside that cookie
        i2 = session.post(
            url="http://dig.chouti.com/login",
            data={
                'phone': "8615131255089",
                'password': "xxxxxx",
                'oneMonth': ""
            }
        )

        # 3. Vote, reusing the now-authorized session cookie
        i3 = session.post(
            url="http://dig.chouti.com/link/vote?linksId=8589623",
        )
        print(i3.text)
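The example functions above mostly print ret.text, but a Response object exposes several other attributes worth knowing. A brief sketch (again assuming the same local test server):

    import requests

    ret = requests.get("http://127.0.0.1:8000/test/")  # assumed local test endpoint

    print(ret.status_code)         # numeric HTTP status code
    print(ret.url)                 # final URL after any redirects
    print(ret.encoding)            # encoding used to decode ret.text
    print(ret.headers)             # case-insensitive dict of response headers
    print(ret.cookies.get_dict())  # cookies the server set on this response
    print(ret.content[:100])       # raw body bytes
    print(ret.text[:100])          # body decoded to str with ret.encoding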
BeautifulSoup
Beautiful Soup is a Python library for extracting data from HTML or XML documents. It lets you navigate, search, and modify a document in an idiomatic way using the parser of your choice, and can save you hours or even days of work.
pip3 install beautifulsoup4  # be careful not to install the wrong package
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    asdf
        <div class="title">
            <b>The Dormouse's story总共</b>
            <h1>f</h1>
        </div>
    <div class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister0" id="link1">Els<span>f</span>ie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</div>
    ad<br/>sf
    <p class="story">...</p>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_doc, features="lxml")
    # Find the first <a> tag
    tag1 = soup.find(name='a')
    # Find all <a> tags
    tag2 = soup.find_all(name='a')
    # Find the tag with id="link2" using a CSS selector
    tag3 = soup.select('#link2')
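Once a tag has been located, a few common follow-up operations look like this (a short sketch that builds on the soup object above):

    # Continuing from the soup object above
    a2 = soup.find(id="link2")
    print(a2.name)         # 'a'
    print(a2.attrs)        # {'href': 'http://example.com/lacie', 'class': ['sister'], 'id': 'link2'}
    print(a2.get("href"))  # 'http://example.com/lacie'
    print(a2.get_text())   # 'Lacie'

    # Iterate over all <a> tags and pull out href and text
    for a in soup.find_all("a"):
        print(a.get("href"), a.get_text())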
References:
1. http://docs.python-requests.org/zh_CN/latest/index.html
2. https://www.cnblogs.com/wupeiqi/articles/6283017.html
3. http://docs.python-requests.org/en/master/
4. https://www.crummy.com/software/BeautifulSoup/bs4/doc/