Web Crawler - day02 - Fetching and Analysis

Posted 2020-11-03 albert-w


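###Page Fetching###

1. urllib3

    A powerful, easy-to-use HTTP client that makes up for shortcomings in the HTTP support of the Python standard library.

    Install: pip install urllib3

    Usage: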

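import urllib3

# A PoolManager handles connection pooling and reuse
http = urllib3.PoolManager()
response = http.request('GET', 'http://news.qq.com')
print(response.headers)
# response.data is raw bytes; it is decoded as GBK here, so adjust if the page uses another charset
result = response.data.decode('gbk')
print(result)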
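
The example above hard-codes 'gbk'. As a minimal sketch, the charset can instead be read from the Content-Type response header when the server sends one. The helper names `charset_from_content_type` and `decode_body` are hypothetical and not part of urllib3; the fallback to UTF-8 is an assumption.

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`."""
    if not content_type:
        return default
    match = re.search(r"charset=['\"]?([\w.:-]+)", content_type, re.IGNORECASE)
    return match.group(1).lower() if match else default


def decode_body(data, content_type, default="utf-8"):
    """Decode raw bytes with the declared charset, falling back to `default`."""
    encoding = charset_from_content_type(content_type, default)
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode leniently instead of raising
        return data.decode(default, errors="replace")


# Offline demonstration with bytes encoded as GBK
sample = "新闻".encode("gbk")
print(decode_body(sample, "text/html; charset=GBK"))
```

With a real response this would be called as `decode_body(response.data, response.headers.get('Content-Type'))`.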

Sending HTTPS requests

Install the dependency: pip install certifi

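import certifi
import urllib3

# Verify the server certificate against certifi's CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
# Use an https:// URL so that the certificate check actually takes place
resp = http.request('GET', 'https://news.baidu.com/')
print(resp.data.decode('utf-8'))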
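
For reference, `cert_reqs='CERT_REQUIRED'` corresponds to the `ssl.CERT_REQUIRED` verification mode of the standard library. The sketch below uses only the `ssl` module, sends no request, and simply inspects the default client context to show what that mode means; it illustrates the setting and is not how urllib3 is configured internally.

```python
import ssl

# create_default_context() builds a client-side context that requires a valid
# certificate chain and checks that the certificate matches the host name
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

With this mode a handshake fails if the server cannot present a certificate that chains to a trusted CA, which is why a CA bundle such as the one from certifi is supplied.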

####Passing parameters

import urllib3
from urllib.parse import urlencode

http = urllib3.PoolManager()
args = {'wd': '人民币'}
# Wrong: formatting the dict directly puts its repr into the URL
# url = 'http://www.baidu.com/s?%s' % (args)
# Right: urlencode turns the dict into a percent-encoded query string
url = 'http://www.baidu.com/s?%s' % (urlencode(args))
print(url)
# resp = http.request('GET', url)
# print(resp.data.decode('utf-8'))
 
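headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    # 'br' is left out: urllib3 only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.baidu.com',
    # Header values must be Latin-1 encodable, so percent-encode the query
    'Referer': 'https://www.baidu.com/s?' + urlencode(args),
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
}
# Pass the base URL and let urllib3 encode `fields` into the query string
resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
print(resp.data.decode('utf-8'))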
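
To see what `urlencode` does with the Chinese keyword, the following sketch runs offline using only the standard library. It builds the query string, parses it back, and shows what the commented-out `% (args)` variant would have produced. The variable names `query`, `parsed` and `broken` are illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

args = {'wd': '人民币'}

# urlencode percent-encodes the UTF-8 bytes of each non-ASCII character
query = urlencode(args)
print(query)  # wd=%E4%BA%BA%E6%B0%91%E5%B8%81

url = 'http://www.baidu.com/s?' + query

# Round trip: parsing the query string recovers the original value
parsed = parse_qs(urlsplit(url).query)
print(parsed)

# Formatting the dict directly embeds its repr, which is not a valid query string
broken = 'http://www.baidu.com/s?%s' % (args)
print(broken)
```

The same encoding is applied when a dict is passed through the `fields` argument of a urllib3 GET request, so the URL does not have to be built by hand.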
