The urllib module

Posted 2020-10-17 zzYzz

Fetching a web page:
    import urllib.request

    html = urllib.request.urlopen("http://www.zzyzz.top/")     # sends the request and returns the response
    html2 = urllib.request.Request("http://www.zzyzz.top/")    # only builds a request object; nothing is sent yet
    print("html", html)
    print("html2", html2)

    Output:
        html <http.client.HTTPResponse object at 0x0395DFF0>
        html2 <urllib.request.Request object at 0x03613930>

        Methods of the HTTPResponse object:
            geturl()
                Return the URL of the resource retrieved, commonly used to
                determine whether a redirect was followed. This is the URL of the
                page that was finally delivered, which is not necessarily the URL
                passed in, because the request may have been redirected.

            info()
                Return the meta-information of the page, such as headers, in the
                form of an email.message_from_string() instance (see Quick Reference
                to HTTP Headers).

            getcode()
                Return the HTTP status code of the response.

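The three methods above can be tried without touching a real site. The following is a minimal sketch, assuming a throwaway local server: the `DemoHandler` class, the `/old` and `/new` paths and the `inspect_response` helper are all made up for this illustration and are not part of the original notes. The opener is built with an empty `ProxyHandler` only so that system proxy settings do not interfere with the local demo; the response object it returns is the same kind that `urlopen` returns.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new, anything else returns HTML."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"<html>hello</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def inspect_response():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: let the OS pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    # Ignore any system proxy settings for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base + "/old") as resp:
            return {
                "requested": base + "/old",
                "final_url": resp.geturl(),      # URL after the redirect was followed
                "status": resp.getcode(),        # HTTP status code of the final response
                "content_type": resp.info()["Content-Type"],
                "body": resp.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


result = inspect_response()
print(result)
```

Comparing `requested` with `final_url` is the usual way to tell whether a redirect happened.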
        Attributes and methods of the Request object:
            Request.full_url
                The original URL passed to the constructor.
                Request.full_url is a property with a setter, a getter and a deleter.
                Getting full_url returns the original request URL with the fragment,
                if it was present.
                In other words, this is the URL argument as given (unlike the
                geturl() method of the HTTPResponse object).

            Request.type
                The URI scheme, as a string such as 'http' or 'https'.

            Request.host
                The URI authority, typically a host, but may also contain a port
                separated by a colon.
                This is the host name (or IP address) taken from the URL, possibly
                followed by the port number.

            Request.origin_req_host
                The original host for the request, without port.

            Request.selector
                The URI path. If the Request uses a proxy, then selector will be the
                full URL that is passed to the proxy.
                This is the path requested from the server, relative to the server
                root; for example, '/' is the server's root directory.

            Request.data
                The entity body for the request, or None if not specified.
                For example, the form fields of a POST. The body must be bytes, so a
                dict has to be URL-encoded first:
                    # data = urllib.parse.urlencode({"Hi": "Hello"}).encode("ascii")
                    urllib.request.Request("http://www.zzyzz.top/", data)

            Request.unverifiable
                Boolean; indicates whether the request is unverifiable as defined by RFC 2965.

            Request.method
                The HTTP request method to use. By default its value is None, which means
                that get_method() will do its normal computation of the method to be used.
                Its value can be set (thus overriding the default computation in get_method())
                either by providing a default value by setting it at the class level in a
                Request subclass, or by passing a value in to the Request constructor
                via the method argument.

            Request.get_method()
                Return a string indicating the HTTP request method. If Request.method
                is not None, return its value; otherwise return 'GET' if Request.data
                is None, or 'POST' if it is not. This is only meaningful for HTTP requests.
                Without an explicit method, the result is therefore 'GET' or 'POST'.

            Request.add_header(key, val)
                Add another header to the request. Headers are currently ignored by
                all handlers except HTTP handlers, where they are added to the list
                of headers sent to the server. Note that there cannot be more than
                one header with the same name, and later calls will overwrite previous
                calls in case the key collides. Currently, this is no loss of HTTP
                functionality, since all headers which have meaning when used more
                than once have a (header-specific) way of gaining the same
                functionality using only one header.

            Request.add_unredirected_header(key, header)
                Add a header that will not be added to a redirected request.

            Request.has_header(header)
                Return whether the instance has the named header (checks both
                regular and unredirected).

            Request.remove_header(header)
                Remove named header from the request instance (both from regular
                and unredirected headers).

            Request.get_full_url()
                Return the URL given in the constructor.
                This returns the same value as Request.full_url.

            Request.set_proxy(host, type)
                Prepare the request by connecting to a proxy server. The host and
                type will replace those of the instance, and the instance's selector
                will be the original URL given in the constructor.

            Request.get_header(header_name, default=None)
                Return the value of the given header. If the header is not present,
                return the default value.

            Request.header_items()
                Return a list of tuples (header_name, header_value) of the Request headers.

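Because a Request object is only a description of a request, the attributes and methods listed above can be explored without any network access. The sketch below is illustrative only: the port `8080`, the path `/docs/index.html`, the fragment `#top`, the header values and the proxy address are assumptions added for the demonstration, not values from the original notes.

```python
import urllib.parse
import urllib.request

# The port, path and fragment are made up for illustration.
url = "http://www.zzyzz.top:8080/docs/index.html#top"
req = urllib.request.Request(url)

# scheme, authority (with port), host without port, path sent to the server
parts = (req.type, req.host, req.origin_req_host, req.selector)
print(req.full_url)        # the URL as given, fragment included
print(parts)
print(req.get_method())    # no body, so the computed method is GET

# A body must be bytes; urlencode() builds the form string first.
form = urllib.parse.urlencode({"Hi": "Hello"}).encode("ascii")
post = urllib.request.Request(url, data=form)
print(post.data, post.get_method())

# An explicit method overrides the GET/POST computation.
put = urllib.request.Request(url, data=form, method="PUT")
print(put.get_method())

# Header names are stored capitalized ("User-Agent" becomes "User-agent"),
# and has_header()/get_header() expect that stored spelling.
req.add_header("User-Agent", "demo/1.0")
req.add_unredirected_header("X-Token", "abc")
items = sorted(req.header_items())
print(req.has_header("User-agent"), req.get_header("User-agent"))
print(items)
req.remove_header("User-agent")
print(req.has_header("User-agent"))

# set_proxy() swaps in the proxy host; for an http request the selector
# becomes the full original URL.
req.set_proxy("127.0.0.1:3128", "http")   # hypothetical proxy address
print(req.host, req.selector)
```

The capitalization detail is easy to trip over: after `add_header("User-Agent", ...)`, a lookup with `has_header("User-Agent")` does not find the header, because it was stored as `User-agent`.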
    Example, fetching the HTML source:
        urlobj = urllib.request.Request("http://www.zzyzz.top/")
        with urllib.request.urlopen(urlobj) as FH:           # file-like object
            print(FH.read().decode('utf8'))

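Authentication:
    Requesting a URL that requires authentication returns an HTTP 401 error, which
    means the URL cannot be accessed without credentials.
    Authentication usually takes one of two forms:
        1. The browser shows a pop-up dialog asking for a user name and password.
           This is HTTP Basic Authentication; the credentials travel in the
           Authorization request header, not in cookies.
        2. Form-based authentication: the web page asks for a user name and password,
           and the credentials are sent to the server with a POST request.
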
        Basic HTTP Authentication (the browser pop-up dialog):
            import urllib.request
            # Create an OpenerDirector with support for Basic HTTP Authentication...
            # HTTPPasswordMgrWithDefaultRealm treats realm=None as "any realm";
            # the default password manager would not match realm=None.
            password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
            password_mgr.add_password(realm=None,
                                      uri="http://www.zzyzz.top/",
                                      user='userid',
                                      passwd='password')
            auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
            opener = urllib.request.build_opener(auth_handler)
            # ...and install it globally so it can be used with urlopen.
            urllib.request.install_opener(opener)
            html = urllib.request.urlopen("http://www.zzyzz.top/")
            print(html.read().decode('utf8'))

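For reference, what a Basic authentication handler ultimately sends is an `Authorization` header of the form `Basic <base64 of "user:password">`. A minimal sketch of building that header by hand is shown below; the helper name `basic_auth_request` is made up for this example, and the credentials are the placeholder values used in these notes. Note that base64 is an encoding, not encryption, so this scheme is only reasonable over HTTPS.

```python
import base64
import urllib.request


def basic_auth_request(url, user, password):
    """Build a Request that carries Basic credentials up front (hypothetical helper)."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    req = urllib.request.Request(url)
    req.add_header("Authorization", "Basic " + token.decode("ascii"))
    return req


req = basic_auth_request("http://www.zzyzz.top/", "userid", "password")
header = req.get_header("Authorization")
print(header)
```

Sending the header up front avoids the initial 401 round trip, at the cost of sending credentials even when the server would not have asked for them.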
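        Form-based authentication:
            The server typically validates the form data submitted (POSTed) by the
            user. If validation succeeds it redirects to the protected page; otherwise
            it redirects to the login page and asks the user to POST the credentials
            again.
            This kind of authentication can therefore be handled like any other form POST.
            import urllib.parse
            # the form must be URL-encoded and converted to bytes before it is sent
            form = urllib.parse.urlencode({"id": "userid", "pw": "password"}).encode("ascii")
            urlobj = urllib.request.Request("http://www.zzyzz.top/", form)
            with urllib.request.urlopen(urlobj) as FH:           # file-like object
                print(FH.read().decode('utf8'))
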
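In practice a form login usually hands back a session cookie, and the redirect target only works if that cookie is sent along. The following is a rough sketch of the whole round trip against a throwaway local server, assuming a cookie-based session; the `LoginHandler` class, the `/login` and `/home` paths, the `session=demo` cookie and the `login` helper are all invented for this illustration. `HTTPCookieProcessor` with a `CookieJar` is what keeps the cookie between the POST and the redirected GET.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical server: POST /login sets a cookie, GET /home requires it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("ascii"))
        ok = form.get("id") == ["userid"] and form.get("pw") == ["password"]
        self.send_response(302)
        if ok:
            self.send_header("Set-Cookie", "session=demo; Path=/")
        # success: go to the protected page; failure: back to the login page
        self.send_header("Location", "/home" if ok else "/login")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "session=demo" in self.headers.get("Cookie", "")
        body = b"welcome" if (self.path == "/home" and logged_in) else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def login(user, password):
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),           # ignore system proxies for the local demo
        urllib.request.HTTPCookieProcessor(jar),   # store and resend cookies
    )
    form = urllib.parse.urlencode({"id": user, "pw": password}).encode("ascii")
    try:
        with opener.open(base + "/login", form) as resp:
            return resp.geturl(), resp.read()
    finally:
        server.shutdown()
        server.server_close()


good = login("userid", "password")
bad = login("userid", "wrong")
print(good)
print(bad)
```

Checking the final URL after the POST is a simple way to tell a successful login from a bounce back to the login page.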
Error handling

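The original notes leave this section empty. As a hedged sketch of the usual pattern: `urllib.error.HTTPError` is raised when the server answers with an error status (such as the 401 mentioned above), and `urllib.error.URLError` covers failures where no usable response arrives. `HTTPError` is a subclass of `URLError`, so it has to be caught first. The `fetch` helper and the bogus `foo://` scheme are made up for this example.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status, body), or (None, reason) when no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401 or 404.
        return err.code, err.read()
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme, ...
        return None, str(err.reason)


# "foo" is a deliberately bogus scheme, so this fails without any network access.
status, detail = fetch("foo://www.zzyzz.top/")
print(status, detail)
```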
Other protocols (besides HTTP)

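This section is also empty in the original notes. As a small illustration, `urlopen` accepts more than `http:` and `https:` URLs; the sketch below reads a `data:` URL and a `file:` URL, both of which work offline (`ftp:` URLs are handled in a similar way). The temporary file and its contents are assumptions made up for the demonstration.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# data: URL, the payload is embedded in the URL itself
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello") as resp:
    data_body = resp.read()

# file: URL, pointing at a temporary local file created for the demo
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("local file")
file_url = Path(tmp.name).as_uri()      # absolute path converted to file:///...
with urllib.request.urlopen(file_url) as resp:
    file_body = resp.read()
os.remove(tmp.name)

print(data_body, file_body)
```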
Reference:
    https://docs.python.org/3/library/urllib.request.html#module-urllib.request