Scrapy from Bronze to King, Part 2: Advanced Scrapy
Posted by J哥.
-- Distributed crawling:
- Concept: a distributed cluster of machines (multiple computers) jointly crawls the same set of resources.
- Purpose: improve the efficiency of data crawling.
- How is it implemented?
  - Install the scrapy-redis component.
  - Native Scrapy cannot run as a distributed crawler on its own; it must be combined with the scrapy-redis component, which provides a shared scheduler.
  - Why can't native Scrapy do distributed crawling by itself?
    - 1. The scheduler cannot be shared across the cluster.
    - 2. The pipeline cannot be shared across the cluster.
  - What the scrapy-redis component provides: persistent storage of the scraped data in a Redis database,
    plus a pipeline and a scheduler that native Scrapy can share across machines.
- Implementation steps:
    - Create the project
    - Create a CrawlSpider-based spider file
        scrapy startproject fbsPro
        fbsPro>scrapy genspider -t crawl fsb www.xxx.com
    - Modify the spider file (a full sketch of the finished spider follows this outline):
        1. Import: from scrapy_redis.spiders import RedisCrawlSpider
        2. Comment out start_urls and allowed_domains (they normally hold the start links)
        3. Add a new attribute: redis_key = 'sun', the name of the queue used by the shared scheduler
        4. Write the data-parsing logic
        5. Change the spider's parent class to RedisCrawlSpider: FsbSpider(RedisCrawlSpider)
        6. Edit the settings file (settings.py)
            - Specify the shareable pipeline:
                # enable the Redis pipeline
                ITEM_PIPELINES = {
                    'scrapy_redis.pipelines.RedisPipeline': 400
                }
            - Specify the shareable scheduler:
                # scheduler settings
                SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements request deduplication
                SCHEDULER_PERSIST = True  # keep the scheduler queue and dedup records when the spider closes (True = keep, False = clear)
                SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # use scrapy-redis's scheduler instead of the built-in one
        7. Redis-related configuration
            - Edit the Redis configuration file:
                - on Linux or macOS the file is redis.conf
                - on Windows: redis.windows.conf
            - Changes to make in the config file:
                - comment out the line bind 127.0.0.1
                - turn off protected mode: change protected-mode yes to protected-mode no
            - Start the Redis server with that config file:
                - redis-server <config file>
            - Start the client: redis-cli
    - Run the project:
        - scrapy runspider xxx.py
    - Push a start URL into the scheduler's queue:
        - the scheduler's queue lives in Redis, so do it from the Redis client
        - lpush sun <url>   (sun is the queue name, i.e. the spider's redis_key)
    - Point the project at Redis:
        # Redis connection settings
        REDIS_HOST = '127.0.0.1'  # IP of the Redis server (change this; here it is the local machine for testing)
        REDIS_PORT = 6379
    - The scraped data is stored in Redis in the <spider name>:items list (here fsb:items).
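Putting steps 1-6 together, a minimal sketch of the finished spider file might look like the following. The LinkExtractor pattern, XPath expressions and item fields are illustrative placeholders, not taken from a real target site.

# fbsPro/spiders/fsb.py -- sketch of the RedisCrawlSpider-based spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class FsbSpider(RedisCrawlSpider):
    name = 'fsb'
    # start_urls and allowed_domains are commented out; the start URL is
    # pushed into Redis instead (lpush sun <url>)
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']

    # name of the request queue shared through Redis
    redis_key = 'sun'

    rules = (
        # follow pagination links and parse every listing page (placeholder pattern)
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder parsing logic: yield one dict per table row
        for row in response.xpath('//table//tr'):
            yield {
                'title': row.xpath('./td[1]/a/text()').get(),
                'status': row.xpath('./td[2]/text()').get(),
            }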
# Scrapy settings for fbsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'fbsPro'
SPIDER_MODULES = ['fbsPro.spiders']
NEWSPIDER_MODULE = 'fbsPro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'fbsPro (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'fbsPro.pipelines.FbsproPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Enable the Redis pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Scheduler settings
# configure the deduplication container class used by the shared scheduler
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements request deduplication
SCHEDULER_PERSIST = True  # keep the scheduler queue and dedup records when the spider closes (True = keep, False = clear)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # use scrapy-redis's scheduler instead of the built-in one
# Redis connection settings
REDIS_HOST = '127.0.0.1'  # IP of the Redis server (change this; here it is the local machine for testing)
REDIS_PORT = 6379
# Every spider keeps its own history in Redis
'''
{
    Redis holds an entry per spider (each with its own records), for example:
    chouti:requests (wraps url='' and callback=''): 'result'
    Redis cannot store Request objects directly, so each request is serialized
    (pickle.dumps) into a string before it is saved; chouti:requests is the key
    under which they live, and reading them back looks like
    conn.smembers('chouti:requests').
}
'''
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for requests saved to Redis; pickle by default
# # SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup records on close (True = keep, False = clear)
# SCHEDULER_FLUSH_ON_START = True  # clear the scheduler queue and dedup records when the spider starts (True = clear, False = keep)
# # SCHEDULER_IDLE_BEFORE_CLOSE = 10  # max seconds to wait when the scheduler queue is empty before giving up
# SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key holding the dedup records, e.g. chouti:dupefilter
# SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements request deduplication
# START_URLS_KEY = '%(name)s:start_urls'  # key the spider reads its start URLs from
# REDIS_START_URLS_AS_SET = False
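As a rough sketch of the Redis side, the script below seeds the start URL (the redis-cli equivalent is lpush sun <url>) and then reads back a few scraped items. It assumes the local Redis instance configured above, the 'sun' queue from redis_key, and the default items key '<spider name>:items' (so 'fsb:items' for this spider); the URL is a placeholder.

# seed_and_check.py -- requires the redis-py package (pip install redis)
import json

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

# push the start URL into the shared queue named by the spider's redis_key
r.lpush('sun', 'http://www.xxx.com/list?page=1')

# RedisPipeline stores JSON-serialized items in the "<spider name>:items" list,
# so for the 'fsb' spider the key is 'fsb:items'
for raw in r.lrange('fsb:items', 0, 9):
    print(json.loads(raw))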