Python Web Crawler

Posted 2020-07-02


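　　This post mainly records my learning process; think of it as a set of study notes.

　　It mainly follows Cui Qingcai's Python crawler tutorial series (http://cuiqingcai.com/1052.html).

　　It covers some basic Python crawling knowledge and a worked example that scrapes Qiushibaike (糗事百科):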

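　　Part 1: Basics

　　　　1. Crawler: a spider crawling over the web that fetches the resources it wants whenever it comes across them.

　　　　2. How a page is loaded: the user enters a web address -> DNS server -> the server host is located -> a request is sent to the server -> the server processes the request -> the server sends the corresponding files to the browser -> the browser parses them

　　　　3. URL: Uniform Resource Locator (web address). It describes where a resource on the internet is located and how to access it, and it is the standard address of a resource on the internet. Every file on the internet corresponds to a URL. (protocol + host, given as an IP address or domain name and sometimes a port number + the specific path)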
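
To make the three parts of a URL concrete, here is a minimal sketch that splits one apart. It assumes Python 3 (`urllib.parse`), and the URL itself is a made-up placeholder, not one taken from these notes.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that every component is present.
parts = urlsplit("http://example.com:8080/hot/page/2?sort=new")

print(parts.scheme)    # the protocol
print(parts.hostname)  # the host (a domain name or an IP address)
print(parts.port)      # the port, or None when the URL does not name one
print(parts.path)      # the path to the specific resource
print(parts.query)     # the query string after "?"
```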

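　　Part 2: Using the urllib library (the code here is Python 2, where urlopen and Request live in urllib2):

　　　　urlopen(url, data, timeout): data is the optional data to send when accessing the URL; timeout sets the timeout in seconds (it has a default value)

　　　　response = urllib2.urlopen(URL)

　　　　print response.read() : prints the content of the fetched page

　　　　print response : prints a description of the response object (my own understanding: it is like the difference between a pointer and the content it points to)

　　　　request = urllib2.Request(URL)

　　　　response = urllib2.urlopen(request)

　　　　(Build a request, the server responds, and the user receives the data)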
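
The request/response steps can be sketched in Python 3, where the same functions live in `urllib.request`. The helper names `build_request` and `fetch` and the `example.com` address are assumptions for illustration. Only `fetch` touches the network, so the lines at the bottom run offline.

```python
import urllib.request


def build_request(url):
    """Wrap a URL in a Request object; nothing is sent until urlopen is called."""
    return urllib.request.Request(url)


def fetch(url, timeout=10):
    """Send the request and return the page body as bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


request = build_request("http://example.com/")  # placeholder address
print(request.full_url)      # the URL the request targets
print(request.get_method())  # "GET", because no data is attached
```

Calling `fetch("http://example.com/")` would perform the actual download and return the page body.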

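　　　　POST and GET:

　　　　GET accesses the resource directly through the link, and the parameters are part of the link; POST sends the parameters in the request body, so they do not show up in the link

　　　　POST:

　　　　　　values = {"name":"[email protected]","pwd":"xxx"}

　　　　　　data = urllib.urlencode(values)  # think of this as serializing the dict

　　　　　　url = "URL"

　　　　　　request = urllib2.Request(url, data)

　　　　　　response = urllib2.urlopen(request)

　　　　GET:

　　　　　　values = {"name":"[email protected]","pwd":"xxx"}

　　　　　　data = urllib.urlencode(values)  # think of this as serializing the dict

　　　　　　url = "URL"

　　　　　　gurl = url + "?" + data

　　　　　　request = urllib2.Request(gurl)

　　　　　　response = urllib2.urlopen(request)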
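
The difference between the two methods can be checked without sending anything. This is a minimal Python 3 sketch: the form fields and the `example.com` address are placeholders, and in Python 3 the POST body has to be bytes, so the encoded string is converted before it is attached.

```python
import urllib.parse
import urllib.request

values = {"name": "user", "pwd": "xxx"}   # hypothetical form fields
query = urllib.parse.urlencode(values)    # serialize the dict to "key=value&key=value"
base_url = "http://example.com/login"     # placeholder address

# POST: the parameters travel in the request body (bytes in Python 3).
post_request = urllib.request.Request(base_url, data=query.encode("ascii"))

# GET: the parameters are appended to the URL after "?".
get_request = urllib.request.Request(base_url + "?" + query)

print(query)
print(post_request.get_method(), post_request.full_url)
print(get_request.get_method(), get_request.full_url)
```

Attaching `data` is what switches the request from GET to POST; the URL of the POST request stays free of parameters.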

　　　　Setting headers: to imitate a browser, the request needs to carry an identity (the User-Agent header)

　　　　　　user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

　　　　　　headers = {'User-Agent': user_agent}

　　　　　　data = DATA

　　　　　　request = urllib2.Request(url, data, headers)

　　　　　　response = urllib2.urlopen(request)
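
As a small Python 3 sketch of the same idea, the header can be attached and read back before anything is sent. The `example.com` address is a placeholder; the User-Agent string is the one used in these notes.

```python
import urllib.request

user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
request = urllib.request.Request(
    "http://example.com/",                # placeholder address
    headers={"User-Agent": user_agent},
)

# Request stores header names via str.capitalize(), so the key becomes "User-agent".
print(request.get_header("User-agent"))
```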

　　　　Proxy: switch to a different proxy every so often. The code below installs a single proxy globally:

　　　　　　enable_proxy = True

　　　　　　proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})

　　　　　　null_proxy_handler = urllib2.ProxyHandler({})

　　　　　　if enable_proxy:

　　　　　　　　opener = urllib2.build_opener(proxy_handler)

　　　　　　else:

　　　　　　　　opener = urllib2.build_opener(null_proxy_handler)

　　　　　　urllib2.install_opener(opener)
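
One way to switch proxies is to cycle through a list and build a fresh opener each time. This is only a sketch in Python 3: the proxy addresses and the helper `next_opener` are hypothetical, and nothing here contacts a proxy until the opener is used to open a URL.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Build an opener that routes HTTP traffic through the next proxy in the list."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy})
    return proxy, urllib.request.build_opener(handler)


first_proxy, first_opener = next_opener()
second_proxy, second_opener = next_opener()
third_proxy, third_opener = next_opener()  # wraps around to the first proxy again
print(first_proxy, second_proxy, third_proxy)
```

Using `opener.open(url)` on the returned opener keeps the proxy choice local to that call, whereas `install_opener` changes it for every later `urlopen`.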

　　Error handling:
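
A minimal sketch of error handling, assuming Python 3, where the exceptions live in `urllib.error`. `HTTPError` is a subclass of `URLError` that also carries a status code, so one `except` clause can cover both. The helper names `describe_error` and `fetch` and the URLs are assumptions; the two errors at the bottom are built by hand so the example runs offline.

```python
import io
import urllib.error
import urllib.request


def describe_error(error):
    """Summarize a URLError; only HTTPError has a status code."""
    parts = []
    if hasattr(error, "code"):
        parts.append("code=%s" % error.code)
    if hasattr(error, "reason"):
        parts.append("reason=%s" % error.reason)
    return ", ".join(parts)


def fetch(url, timeout=10):
    """Return the page body, or None after printing what went wrong (needs network)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as error:
        print(describe_error(error))
        return None


# Hand-built errors standing in for a missing page and an unreachable host.
not_found = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", None, io.BytesIO()
)
unreachable = urllib.error.URLError("connection refused")
print(describe_error(not_found))
print(describe_error(unreachable))
```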

　　Cookies:
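
A minimal sketch of cookie handling, assuming Python 3 (`http.cookiejar` together with `urllib.request`). It only wires the pieces together; the jar stays empty until the opener actually fetches a page that sets cookies.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()                      # in-memory cookie store
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
opener = urllib.request.build_opener(cookie_handler)

# opener.open(url) would save cookies set by the server into cookie_jar
# and send them back automatically on later requests made with this opener.
print(len(cookie_jar))  # no request has been made yet
```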

　　And a crawler example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 22 19:44:06 2016

@author: mz
"""

import re
import urllib2

# Python 2 script: fetch one page of the Qiushibaike hot list and print each post.
page = 2
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # Four lazy groups: the <h2> text, the post content, and the two "number" fields.
    # re.S lets "." match newlines, so one pattern can span several lines of HTML.
    pattern = re.compile('<div class="author clearfix">.*?title.*?>\n<h2>(.*?)</h2>.*?<div class="content">(.*?)<!--.*?-->.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<span class="dash">.*?<i class="number">(.*?)</i>.*?', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
    print "no"
except urllib2.URLError, e:
    # HTTPError (a URLError subclass) carries a status code; plain URLError only a reason.
    if hasattr(e, 'code'):
        print e.code
    if hasattr(e, 'reason'):
        print e.reason
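
The extraction step can be tried offline on a small piece of HTML. This is a sketch in Python 3 under stated assumptions: the HTML fragment is invented, and the pattern is a simplified one written for that fragment, not the pattern from the script above. It shows the same two ideas: lazy `(.*?)` groups and `re.S` so that `.` also matches newlines.

```python
import re

# Invented HTML fragment that imitates two posts.
sample = '''<div class="author clearfix">
<h2>alice</h2>
</div>
<div class="content">
first joke
</div>
<span class="stats-vote"><i class="number">12</i> votes</span>
<div class="author clearfix">
<h2>bob</h2>
</div>
<div class="content">
second joke
</div>
<span class="stats-vote"><i class="number">7</i> votes</span>
'''

# Three lazy groups: author, post text, vote count.
pattern = re.compile(
    r'<div class="author clearfix">.*?<h2>(.*?)</h2>'
    r'.*?<div class="content">(.*?)</div>'
    r'.*?<i class="number">(.*?)</i>',
    re.S,
)

items = [
    (author, text.strip(), int(votes))
    for author, text, votes in pattern.findall(sample)
]
for author, text, votes in items:
    print(author, text, votes)
```

Because the groups are lazy, each match stops at the first closing tag it meets, so the two posts are returned as separate tuples and not as one large match.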
