Scrapy from Bronze to King, Part 2: Advanced Scrapy

Posted by J哥.


-- Distributed crawling:
     - Concept: set up a cluster of machines (multiple computers) and have them jointly crawl the same set of resources.
     - Purpose: improve the efficiency of data crawling.
     - How is distribution implemented?
        - Install the scrapy-redis component.
        - Native Scrapy cannot run as a distributed crawler on its own; it must be combined with the scrapy-redis component, which provides a shared scheduler.
        - Why can't native Scrapy be distributed?
            - 1. The scheduler cannot be shared across the cluster.
            - 2. The pipeline cannot be shared across the cluster.
        - What scrapy-redis provides: persistent storage in a Redis database
            - It gives native Scrapy a pipeline and a scheduler that can be shared.
        - Implementation steps:
            - Create a project.
            - Create a CrawlSpider-based spider file:
                    scrapy startproject fbsPro
                    fbsPro>scrapy genspider -t crawl fsb www.xxx.com
            - Modify the spider file:
                    1. Import: from scrapy_redis.spiders import RedisCrawlSpider
                    2. Comment out start_urls and allowed_domains (they only hold the seed links).
                    3. Add a new attribute: redis_key = 'sun', the name of the shared scheduler queue.
                    4. Write the data-parsing logic.
                    5. Change the spider's base class to RedisCrawlSpider: class FsbSpider(RedisCrawlSpider)
                       (a minimal sketch of the resulting spider file is shown right after this list)
                    6. Edit the configuration file settings.py:
                        - Specify the shareable pipeline:
                        # enable the Redis pipeline
                                ITEM_PIPELINES = {
                                    'scrapy_redis.pipelines.RedisPipeline': 400
                                }
                        - Specify the scheduler:
                            # shared scheduler
                            SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that handles request deduplication
                            SCHEDULER_PERSIST = True        # keep the scheduler queue and dedup records on close; True = keep, False = flush
                            SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # scheduler provided by scrapy_redis
                    7. Redis-related setup:
                        - Edit the Redis configuration file:
                            - Linux or macOS file name: redis.conf
                            - Windows: redis.windows.conf
                            - Open the configuration file and change:
                                - comment out bind 127.0.0.1
                                - turn off protected mode: change protected-mode yes to protected-mode no
                            - Start the Redis server with this configuration file:
                                - redis-server <config file>
                                - start the client: redis-cli
                        - Run the project:
                            - scrapy runspider xxx.py
                        - Push a start URL into the scheduler's queue:
                            - the scheduler's queue lives in the Redis client:
                                - lpush sun <url>    (sun is the queue name, i.e. the spider's redis_key)
                        - Point the project at Redis:
                        # Redis connection
                        REDIS_HOST = '127.0.0.1'  # IP of the remote Redis server (change this; 127.0.0.1 is only for local testing)
                        REDIS_PORT = 6379
                        - The scraped data is stored in the Redis data structure proName:items (here: fsb:items).
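
To make steps 1–5 concrete, here is a minimal sketch of what the modified spider file might look like. The source only gives the skeleton, so the link pattern, XPath selectors and item fields below are illustrative placeholders; the file path fbsPro/spiders/fsb.py simply follows from the genspider command above.

# fbsPro/spiders/fsb.py -- illustrative sketch; link pattern, selectors and
# item fields are placeholders, not the original project's parsing code.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class FsbSpider(RedisCrawlSpider):            # 5. base class changed to RedisCrawlSpider
    name = 'fsb'
    # allowed_domains = ['www.xxx.com']       # 2. commented out
    # start_urls = ['http://www.xxx.com/']    # 2. commented out
    redis_key = 'sun'                         # 3. name of the shared scheduler queue

    rules = (
        # placeholder pattern: follow paginated listing links
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # 4. data-parsing logic (placeholder selectors)
        for li in response.xpath('//ul/li'):
            yield {
                'title': li.xpath('./a/text()').get(),
            }

With one or more machines running scrapy runspider fsb.py, issuing lpush sun <url> in redis-cli seeds the shared queue and every node starts pulling requests from it. The full settings.py used by the fbsPro project is reproduced below.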

# Scrapy settings for fbsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbsPro'

SPIDER_MODULES = ['fbsPro.spiders']
NEWSPIDER_MODULE = 'fbsPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'fbsPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'fbsPro.pipelines.FbsproPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


# enable the Redis pipeline (items are persisted to Redis)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# shared scheduler
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that handles request deduplication
SCHEDULER_PERSIST = True  # keep the scheduler queue and dedup records on close; True = keep, False = flush
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # scheduler provided by scrapy_redis

# Redis connection
REDIS_HOST = '127.0.0.1'  # IP of the remote Redis server (change this; 127.0.0.1 is only for local testing)
REDIS_PORT = 6379

# Each spider keeps its own history records in Redis
'''
{
All spiders live here, each with its own crawl records, e.g.
chouti:requests (wrapping >> url:'', callback='') : 'result'
Redis cannot store Request objects directly, so each request is serialized
with pickle.dumps into a string and saved in Redis under the spider's key.
The request wraps the URL to visit and its callback; chouti:requests is that
key, and reading the data back out also goes through the connection,
e.g. conn.smembers('chouti:requests').
}
'''
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data saved to Redis; pickle by default
# # request objects are serialized and stored under the spider's key
# # SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup records on close; True = keep, False = flush
# SCHEDULER_FLUSH_ON_START = True  # flush the scheduler queue and dedup records on startup; True = flush, False = keep
# # SCHEDULER_IDLE_BEFORE_CLOSE = 10  # max seconds to wait when fetching from an empty scheduler queue
# SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key for the dedup fingerprints, e.g. chouti:dupefilter
# SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that handles request deduplication
# START_URLS_KEY = '%(name)s:start_urls'  # key holding the start URLs
# REDIS_START_URLS_AS_SET = False  # store the start URLs in a list rather than a set
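
Because RedisPipeline persists every item into the Redis list named %(spider)s:items, the results (and the other keys scrapy_redis creates) can be inspected with a few lines of redis-py. This is a rough sketch, assuming the default key names for the fsb spider and a local Redis on port 6379:

# Inspect the data structures scrapy_redis leaves in Redis.
# Assumes the default key names (fsb:requests, fsb:dupefilter, fsb:items)
# and a local Redis instance; adjust host/port for a real cluster.
import json
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)

print('queued requests :', conn.zcard('fsb:requests'))    # pending request queue (a sorted set by default)
print('dedup records   :', conn.scard('fsb:dupefilter'))  # request fingerprints (a set)
print('items scraped   :', conn.llen('fsb:items'))        # scraped items (a list)

# RedisPipeline serializes each item to JSON, so items can be read back directly:
for raw in conn.lrange('fsb:items', 0, 9):                # first 10 items
    print(json.loads(raw))

If SCHEDULER_PERSIST is True, these keys survive after the crawl stops, which is what makes an interrupted run resumable.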
