Python Web Crawling: Beginner's Notes

Posted 2020-06-17

Reference: http://www.cnblogs.com/xin-xin/p/4297852.html

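I. Introduction

A crawler, short for web crawler, is also called a spider: if the Internet is pictured as one huge web, the crawler is the spider moving across it. Whenever it comes across a resource, it fetches it.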

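II. How it works

When we browse the web we see all kinds of pages. Behind each one, the same thing happens: we enter a URL, DNS resolves the host name to an IP address, the browser uses that address to reach the server and sends it a request, and the server processes the request and sends HTML, JS, and other resources back for the browser to display.

A crawler works in much the same way, except that once it has fetched the HTML we use regular expressions to pick out the content we want.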
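As a rough illustration of that last step, the sketch below applies regular expressions to a small HTML string. The sample markup, the two patterns, and the names `SAMPLE_HTML`, `extract_title`, and `extract_links` are assumptions made up for this note rather than anything from the referenced article. It is written for Python 3 and needs no network access.

```python
import re

# Hypothetical page content standing in for HTML fetched by a crawler.
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="http://example.com/a">First</a>
    <a href="http://example.com/b">Second</a>
  </body>
</html>
"""

# Non-greedy groups capture the text between the tags or quotes.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\s+href="(.*?)"', re.IGNORECASE)


def extract_title(html):
    """Return the page title, or None if no <title> element is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return every href value found in <a href="..."> tags, in order."""
    return LINK_RE.findall(html)


if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
    print(extract_links(SAMPLE_HTML))
```

Real pages are often irregular enough that patterns like these miss cases (single quotes, extra attributes before href), so an HTML parser may be the safer choice once the basics are clear.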

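III. Using the urllib library

The examples in these notes are Python 2 code: they use urllib2 and the print statement. In Python 3 the same functionality lives in urllib.request, urllib.parse, and http.cookiejar.

1. Fetching a page's HTML:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

url = 'http://www.baidu.com'
# urlopen() sends the request and returns a file-like response object
response = urllib2.urlopen(url)
html = response.read()
print html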

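2. Building a Request

For example, the code above can be rewritten like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2

url = 'http://www.baidu.com'
# Wrap the URL in a Request object, then hand that object to urlopen()
request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
print html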
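For readers on Python 3, where urllib2 no longer exists, roughly the same request can be written with urllib.request. This is a sketch and not part of the original notes: the helper name `fetch` is hypothetical, and so that the demo runs without network access it fetches a `data:` URL (which urllib.request also understands) in place of a real site.

```python
import urllib.parse
import urllib.request


def fetch(url, encoding="utf-8"):
    """Open url through an explicit Request object and return the body as text."""
    request = urllib.request.Request(url)
    # The response is a context manager; read() returns bytes in Python 3.
    with urllib.request.urlopen(request) as response:
        return response.read().decode(encoding)


if __name__ == "__main__":
    # Assumption: a data: URL stands in for a real page such as http://www.baidu.com
    page = "<html><title>hello</title></html>"
    demo_url = "data:text/html;charset=utf-8," + urllib.parse.quote(page)
    print(fetch(demo_url))
```

Passing an ordinary http:// URL to `fetch` follows the same path through Request and urlopen, with the caveat that the page's real encoding may not be UTF-8.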

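3. Sending data with GET and POST

POST:

Note: this only demonstrates the method. The site also validates headers, cookies, and so on, so this code cannot actually log in.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
# urlencode() turns the dict into a query string such as "username=xxxxxx&password=xxxxxx"
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
# Passing data as the second argument makes this a POST request
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()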

GET:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2

values = {"username": "xxxxxx", "password": "xxxxxx"}
data = urllib.urlencode(values)
url = "http://www.xiyounet.org/checkout/"
# For GET, the encoded parameters are appended to the URL after "?"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
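The difference between the two listings comes down to where the encoded parameters go. The following Python 3 sketch builds both kinds of request and inspects them without sending anything; it reuses the URL and placeholder credentials from above, while the variable names `BASE_URL`, `query`, `post_request`, and `get_request` are assumptions for illustration.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://www.xiyounet.org/checkout/"
values = {"username": "xxxxxx", "password": "xxxxxx"}

# urlencode() turns the dict into "username=xxxxxx&password=xxxxxx".
query = urllib.parse.urlencode(values)

# POST: the encoded form travels in the request body, which must be bytes in Python 3.
post_request = urllib.request.Request(BASE_URL, data=query.encode("ascii"))

# GET: the same string is appended to the URL after "?" and no body is sent.
get_request = urllib.request.Request(BASE_URL + "?" + query)

if __name__ == "__main__":
    # Nothing is sent here; the Request objects are only inspected.
    print(post_request.get_method(), post_request.data)
    print(get_request.get_method(), get_request.full_url)
```

Because a GET request puts the parameters in the URL, they tend to end up in server logs and browser history, which is one reason login forms normally use POST.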

4. Setting headers

Most sites will not let you log in the way shown above. To imitate a browser more completely, we need to learn about headers.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2

url = "http://www.xiyounet.org/checkout/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
referer = "http://www.xiyounet.org/checkout/"
values = {"username": "xxxxx", "password": "xxxxx"}
# User-Agent identifies the client; Referer says which page the request came from
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(values)
# The third argument of Request is the dict of headers to send
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()
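To check which headers a request will carry before it is sent, the Request object can be inspected directly. The sketch below is the Python 3 counterpart of the listing above; the `Accept-Language` header and its value are assumptions added only to show `add_header`, and no request is actually made.

```python
import urllib.request

url = "http://www.xiyounet.org/checkout/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36",
    "Referer": "http://www.xiyounet.org/checkout/",
}

request = urllib.request.Request(url, headers=headers)
# Headers can also be added one at a time after construction.
request.add_header("Accept-Language", "zh-CN,zh;q=0.8")  # illustrative value

if __name__ == "__main__":
    # Request stores header names capitalized, e.g. "User-agent", "Accept-language".
    for name, value in request.header_items():
        print(name + ": " + value)
```

One detail to watch for: because the names are stored in capitalized form, `request.get_header("User-agent")` finds the value while `request.get_header("User-Agent")` does not.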

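5. Using cookies

⑴ A cookie is data that a website stores on the user's local machine (usually encrypted) in order to identify the user and track the session. When a crawler meets a site that requires login, the protected pages cannot be fetched without logging in. We can obtain the cookie and then simulate a login to reach that content.

Two important concepts in urllib2:

openers: the familiar urlopen() function is simply urllib2's default opener, and we can also build an opener of our own.
handlers: an opener delegates the actual work to its handlers. See:

http://www.jb51.net/article/46495.htm

⑵ The cookielib module: it provides objects that store cookies and work together with urllib2 when accessing the Internet. An instance of its CookieJar class can be used to capture cookies.

It is combined with urllib2 to simulate logins. Its main classes are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.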

#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

# Create a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# Create a cookie handler from urllib2's HTTPCookieProcessor
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# open() also accepts a Request object instead of a URL string
response = opener.open("http://www.xiyounet.org/checkout/")
for item in cookie:
    print 'Name = ' + item.name
    print 'value = ' + item.value
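To make the relationship between openers and handlers a little more concrete, here is a Python 3 sketch (cookielib became http.cookiejar, urllib2 became urllib.request). It only builds the opener and lists its handlers; no request is sent, so the jar stays empty. The variable names are assumptions chosen for this note.

```python
import http.cookiejar
import urllib.request

# The jar starts empty; the handler fills it from Set-Cookie response headers.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener() adds the default handlers (HTTP, redirects, errors, ...) itself.
opener = urllib.request.build_opener(cookie_handler)

# Optional: make urllib.request.urlopen() use this opener from now on.
urllib.request.install_opener(opener)

if __name__ == "__main__":
    # No request is made; this only lists what the opener is built from.
    for handler in opener.handlers:
        print(type(handler).__name__)
    print("cookies stored so far:", len(cookie_jar))
```

Calling `install_opener` is a design choice: it changes the behavior of every later `urlopen()` call in the process, so in larger programs it may be cleaner to keep the opener in a variable and call `opener.open()` explicitly.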

⑶ Saving cookies to a file

#!/usr/bin/python
# coding: utf-8
import urllib2
import cookielib

filename = 'cookie.txt'
# Create a MozillaCookieJar instance to hold the cookies; it will write them to filename
cookie = cookielib.MozillaCookieJar(filename)
# Create a cookie handler from urllib2's HTTPCookieProcessor
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# open() also accepts a Request object instead of a URL string
response = opener.open("http://www.xiyounet.org/checkout/")
# The two arguments of save():
# ignore_discard: save cookies even if they are marked to be discarded (session cookies)
# ignore_expires: save cookies even if they have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

⑷ Loading cookies from a file:

#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib2

# Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# Read the cookies from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib2.Request("http://www.xiyounet.org/checkout/")
# Use urllib2's build_opener() to create an opener that sends the loaded cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
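The effect of `ignore_discard` is easiest to see with a session cookie, that is, one with no expiry date. The Python 3 sketch below builds such a cookie by hand, saves it, and loads the file again, all without network access. The helper names `make_session_cookie` and `count_after_round_trip`, the cookie name and value, and the domain `www.example.com` are all assumptions for illustration; in a real crawl the server sets the cookie.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a cookie by hand (hypothetical helper; normally a server sets it)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=None,   # no expiry date: a session cookie
        discard=True,   # marked to be thrown away when the session ends
        comment=None, comment_url=None, rest={},
    )


def count_after_round_trip(ignore_discard):
    """Save one session cookie, load the file again, return how many survived."""
    path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_session_cookie("sessionid", "abc123", "www.example.com"))
    jar.save(ignore_discard=ignore_discard, ignore_expires=True)

    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return len(loaded)


if __name__ == "__main__":
    print(count_after_round_trip(ignore_discard=False))
    print(count_after_round_trip(ignore_discard=True))
```

With `ignore_discard=False` the session cookie is skipped when the file is written, so nothing comes back on load. Login cookies are frequently session cookies, which is why the listings above pass `ignore_discard=True` to both `save()` and `load()`.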

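⑸ In practice: logging in to the sign-in system

When these notes were written, this request came back with HTTP 400. The author suspected some permission setting on the server. Another likely cause is that the Host header contained a full URL (http://www.xiyounet.org), whereas it should hold only the host name, so the listing below sends www.xiyounet.org.

#!/usr/bin/python
# coding: utf-8
import cookielib
import urllib2
import urllib

url = "http://www.xiyounet.org/checkout/index.php"
passdata = urllib.urlencode({'Username': 'songxl', 'Password': 'Songxl123456'})
# Request headers; Host holds the bare host name, without "http://"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer": "http://www.xiyounet.org/checkout/",
    "Host": "www.xiyounet.org",
}
# File that will hold the cookies: cookie.txt in the current directory
filename = 'cookie.txt'
# Create a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Create a cookie handler from urllib2's HTTPCookieProcessor
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
req = urllib2.Request(url, passdata, headers)
result = opener.open(req)
# Write the cookies received during login to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
print result.read()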
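A Host header is meant to carry only the host name (plus a port when one is given), never the scheme or path, and a malformed value is one plausible reason for a server to answer 400. Rather than typing the value by hand, it can be derived from the URL. The Python 3 sketch below shows one way to do that; `host_header_for` is a hypothetical helper name, and it assumes the URL contains no embedded user name or password.

```python
from urllib.parse import urlsplit


def host_header_for(url):
    """Return the value a Host header should carry for url (hypothetical helper)."""
    parts = urlsplit(url)
    # For URLs without credentials, netloc is "host" or "host:port";
    # the scheme and the path are not part of it.
    return parts.netloc


if __name__ == "__main__":
    print(host_header_for("http://www.xiyounet.org/checkout/index.php"))
```

In practice both urllib2 and urllib.request fill in the Host header on their own when it is not supplied, so leaving it out of the headers dict is usually the simplest option.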
