Implementing a Distributed Crawler with Redis

Posted by harryblog


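Redis distributed crawler

Concept: the same crawler program runs on several machines at once, and together they crawl a site's data.
Native Scrapy cannot run as a distributed crawler, for two reasons:

  • The scheduler cannot be shared across machines
  • The item pipeline cannot be shared across machines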
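
To see why sharing matters, here is a minimal sketch of the idea in plain Python. A single in-memory object stands in for Redis (an assumption made so the example runs without a server); SharedStore, worker and the example.com URLs are hypothetical names, not part of Scrapy or scrapy-redis.

```python
from collections import deque


class SharedStore:
    """In-memory stand-in for Redis: one request queue and one 'seen' set."""

    def __init__(self):
        self.queue = deque()   # plays the role of the shared scheduler queue
        self.seen = set()      # plays the role of the shared dedup set

    def push(self, url):
        # Enqueue a URL only if no worker has seen it before.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        # Hand the next URL to whichever worker asks first.
        return self.queue.popleft() if self.queue else None


def worker(name, store, links_found, log):
    """Take one URL from the shared queue and enqueue the links found on it."""
    url = store.pop()
    if url is None:
        return
    log.append((name, url))
    for link in links_found.get(url, []):
        store.push(link)


store = SharedStore()
# Hypothetical link graph: both listing pages point to the same detail page.
links_found = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/detail"],
    "http://example.com/b": ["http://example.com/detail"],
}
log = []
store.push("http://example.com/")
# Two workers take turns until the shared queue is empty.
while store.queue:
    worker("machine-1", store, links_found, log)
    worker("machine-2", store, links_found, log)

crawled = [url for _, url in log]
print(log)
```

Because both workers read from one queue and check one seen set, the detail page is crawled once even though two pages link to it. With a separate scheduler per machine, each machine would keep its own queue and its own seen set, and the work would be duplicated.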

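The scrapy-redis component: a set of components written specifically for Scrapy that lets it crawl in a distributed fashion. Install it with pip install scrapy-redis

The distributed crawling workflow:

1 Edit the Redis configuration file

  •  Comment out the line bind 127.0.0.1
  •  Set protected-mode no to turn off protected mode

2 Start the Redis server with that configuration file

3 After creating the Scrapy project, create a spider file based on CrawlSpider

4 Import the RedisCrawlSpider class: from scrapy_redis.spiders import RedisCrawlSpider

5 Replace start_urls with redis_key = 'xxx'

6 Write the parsing code

7 Configure the project's pipeline and scheduler to use the ones provided by scrapy-redis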

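ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Use the deduplication filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dedup set in Redis so a crawl can be paused and resumed
SCHEDULER_PERSIST = True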

8 Configure the Redis server address and port

# Needed when the Redis server is not on the local machine
REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': '123456'}

9 Run the spider file

scrapy runspider qiubai.py

10 Push a start URL into the scheduler queue (from a Redis client): lpush <value of redis_key> <start URL>

lpush qiubaispider https://www.qiushibaike.com/pic/

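Implementation

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
# RedisproItem is defined in the project's items.py; import it from there
# (the module path depends on the project name).


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com/pic']
    # start_urls = ['http://www.qiushibaike.com/pic/']
    redis_key = 'qiubaispider'  # plays the same role as start_urls
    link = LinkExtractor(allow=r'/pic/page/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('Start crawling')
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            print(div)
            img_url = "http://" + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item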
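
The allow pattern passed to LinkExtractor is a regular expression that is searched for inside each extracted URL. One quick way to check which links the rule would follow is to try the pattern with Python's re module. The sample URLs below are made up for illustration.

```python
import re

# Same pattern as in the spider's LinkExtractor
pattern = re.compile(r'/pic/page/\d+')

sample_urls = [
    'https://www.qiushibaike.com/pic/page/2/',
    'https://www.qiushibaike.com/pic/page/13/?s=5',
    'https://www.qiushibaike.com/pic/',
    'https://www.qiushibaike.com/text/page/2/',
]

# Keep only the URLs in which the pattern occurs somewhere
followed = [url for url in sample_urls if pattern.search(url)]
for url in followed:
    print(url)
```

Only the two paginated /pic/page/N URLs match, so only those would be handed to parse_item.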

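Distributed crawler based on RedisSpider

Case requirement: crawl text-based news data (domestic, international, military, aviation)

  • 1 Import the webdriver class in the spider file
  • 2 Instantiate the browser in the spider class's constructor
  • 3 Close the browser in the spider class's closed method
  • 4 Drive the browser automation in the downloader middleware's process_response method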

wangyi.py:

 

middlewares.py:

import time

from scrapy import signals
from scrapy.http import HtmlResponse


class WanyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Intercept the response object (the one the downloader passes to the spider)
        # request: the request that produced this response
        # response: the intercepted response
        # spider: the instance of the spider class defined in the spider file
        print(request.url + " handled by the downloader middleware")
        # Replace the page data stored in the response
        if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/', 'http://war.163.com/',
                           'http://news.163.com/air/']:
            spider.bro.get(url=request.url)
            js = 'window.scrollTo(0,document.body.scrollHeight)'
            spider.bro.execute_script(js)
            time.sleep(2)  # give the browser time to load the dynamic data
            # page_source now includes the dynamically loaded news data
            page_text = spider.bro.page_source
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
        else:
            return response
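
For this middleware to run, it has to be enabled in settings.py. A minimal sketch, assuming the project package is called wangyiPro (the post does not give the project name, so the module path is a guess); 543 is the priority that Scrapy's project template suggests for its example downloader middleware.

```python
# settings.py (fragment); 'wangyiPro' is an assumed project package name
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WanyiproDownloaderMiddleware': 543,
}
```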

User-Agent pool and proxy IP pool:

from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

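# User-Agent pool: a dedicated downloader middleware class for the UA pool
# Subclasses the imported UserAgentMiddleware
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random UA string from the list
        ua = random.choice(user_agent_list)
        # Write it into the User-Agent header of the intercepted request
        request.headers.setdefault('User-Agent', ua)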


# Candidate proxy IPs
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]

# Swap the IP of every intercepted request
class Proxy(object):
    def process_request(self, request, spider):
        # Check the scheme of the intercepted request's URL (http or https)
        # request.url looks like: http://www.xxx.com
        h = request.url.split(':')[0]  # the scheme of the request
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip
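
Splitting the URL on ':' works for ordinary URLs, but urllib.parse returns the scheme directly. The sketch below moves the selection logic into a standalone function so it can be tried without Scrapy. choose_proxy is a hypothetical helper, and the proxy addresses are the ones listed above, which may no longer be reachable.

```python
import random
from urllib.parse import urlparse

PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']


def choose_proxy(url):
    """Return a proxy URL whose scheme matches the scheme of `url`."""
    scheme = urlparse(url).scheme
    if scheme == 'https':
        return 'https://' + random.choice(PROXY_https)
    return 'http://' + random.choice(PROXY_http)


print(choose_proxy('https://www.xxx.com'))
print(choose_proxy('http://www.xxx.com'))
```

Inside process_request the middleware would then set request.meta['proxy'] = choose_proxy(request.url).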

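Steps for a distributed crawler based on RedisSpider

1 Import: from scrapy_redis.spiders import RedisSpider
2 Change the spider class's parent class to RedisSpider
3 Comment out the start URL list and add a redis_key attribute (the name of the scheduler queue)
4 Edit the Redis configuration file:

  • Comment out the line bind 127.0.0.1
  • Set protected-mode no to turn off protected mode

5 Configure Redis in settings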

REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': '123456'}

# Use the deduplication filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dedup set in Redis so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

6 Run the spider file

scrapy runspider wangyi.py

7 Push a start URL into the scheduler queue

lpush wangyi https://news.163.com/
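
The DUPEFILTER_CLASS setting above is what stops two machines from crawling the same page: every request is reduced to a fingerprint, and the fingerprints are recorded in a set held in Redis. The sketch below shows the idea only. It hashes the method and URL with hashlib and uses a Python set in place of Redis. The fingerprinting that scrapy-redis actually relies on comes from Scrapy and covers more of the request, so fingerprint and seen_before are illustrative names rather than library functions.

```python
import hashlib

seen = set()  # stand-in for the Redis set shared by all machines


def fingerprint(method, url):
    """Reduce a request to a fixed-length hex digest (simplified)."""
    data = (method.upper() + ' ' + url).encode('utf-8')
    return hashlib.sha1(data).hexdigest()


def seen_before(method, url):
    """Record the request and report whether it was already recorded."""
    fp = fingerprint(method, url)
    if fp in seen:
        return True
    seen.add(fp)
    return False


first = seen_before('GET', 'https://news.163.com/')
second = seen_before('get', 'https://news.163.com/')
other = seen_before('GET', 'https://news.163.com/world/')
print(first, second, other)
```

The second call repeats the first request, so it is reported as already seen and would be dropped. The third is a new URL and passes through.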

 
