Python: Scrapy and Reddit
I am implementing a data pipeline for a chatbot. I use Scrapy to crawl specific subreddits and collect submission IDs (which is not possible with PRAW - the Python Reddit API Wrapper).
In addition, I use PRAW to recursively retrieve all comments for those submissions. Both implementations already work.
However, after a few pages Reddit refuses to serve the subreddit crawl (depending on how fast the requests come in, ...).
I don't want to break any rules, but is there a Scrapy configuration (DOWNLOAD_DELAY or some other throttling mechanism) that stays within Reddit's rules while collecting this information?
My Scrapy spider:
# -*- coding: utf-8 -*-
import scrapy


class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ["reddit.com"]

    def __init__(self, subreddit=None, pages=None, *args, **kwargs):
        super(RedditSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.reddit.com/r/%s/new/' % subreddit]
        self.pages = int(pages)
        self.page_count = 0

    def parse(self, response):
        # Extract the content using CSS selectors
        titles = response.css('.title.may-blank::text').extract()
        # votes = response.css('.score.unvoted::text').extract()
        # times = response.css('time::attr(title)').extract()
        # comments = response.css('.comments::text').extract()
        submission_id = response.css('.title.may-blank').xpath('@data-outbound-url').extract()
        # submission_id = submission_id[24:33]

        # Yield the extracted content row by row
        # for item in zip(titles, votes, times, comments, titles_full):
        for item in zip(titles, submission_id):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'submission_id': item[1][23:32]
                # 'vote': item[2],
                # 'created_at': item[3],
                # 'comments': item[4]
            }
            # Yield the scraped info to Scrapy
            yield scraped_info

        if (self.pages > 1) and (self.page_count < self.pages):
            self.page_count += 1
            next_page = response.css('span.next-button a::attr(href)').extract_first()
            if next_page is not None:
                print("next page ... " + next_page)
                yield response.follow(next_page, callback=self.parse)
            if next_page is None:
                print("no more pages ... lol")
My spider configuration (settings.py):
# -*- coding: utf-8 -*-
# Scrapy settings for reddit_crawler_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'reddit_crawler_scrapy'
SPIDER_MODULES = ['reddit_crawler_scrapy.spiders']
NEWSPIDER_MODULE = 'reddit_crawler_scrapy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'reddit_crawler_scrapy university project m.reichart@hotmail.com'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.RedditCrawlerScrapySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'reddit_crawler_scrapy.pipelines.RedditCrawlerScrapyPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Export as CSV feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
# RANDOMIZE_DOWNLOAD_DELAY = False
LOG_FILE = 'scrapy_log.txt'
I have set DOWNLOAD_DELAY to 5 seconds, which (because RANDOMIZE_DOWNLOAD_DELAY is enabled by default) gets multiplied by a random factor between 0.5 and 1.5. That works out to one request/download every 2.5 to 7.5 seconds, which is already quite slow, but it would get the job done over a few hours/days.
Still, after a few pages I no longer get a next page; the last page I request leads me to a submission provided by Reddit that contains a link on how to set up a bot properly (IMHO with an ironic undertone - well played, Reddit).
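Regarding the throttling part of the question: besides a fixed DOWNLOAD_DELAY, Scrapy's AutoThrottle extension (already present as commented-out lines in the settings above) adapts the delay to the server's response times. A minimal sketch of enabling it in settings.py, with delay values that are assumptions rather than anything Reddit documents:

# Hedged sketch: enable AutoThrottle instead of relying only on a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay in seconds (assumed value)
AUTOTHROTTLE_MAX_DELAY = 60            # back off up to 60 s when responses slow down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim at one request in flight per remote server
AUTOTHROTTLE_DEBUG = True              # log every throttling decision while tuning
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never send parallel requests to reddit.com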
IMO, fighting Reddit's anti-crawling mechanisms will cost you too much time, and I would not try to go down that path.
They have an API for getting all posts of a subreddit: for example, https://www.reddit.com/r/subreddit/top.json?sort=top returns all posts of /r/subreddit in JSON format, and it contains the same content you see on their website.
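As an illustration of that JSON route, here is a minimal sketch using the requests library; the subreddit name is a placeholder, the User-Agent string is reused from the settings above, and pagination relies on the listing's after token:

# Hedged sketch: collect submission ids from the public JSON listing instead of scraping HTML
import requests

HEADERS = {"User-Agent": "reddit_crawler_scrapy university project m.reichart@hotmail.com"}

def fetch_submission_ids(subreddit, pages=3):
    """Yield (id, title) tuples from /r/<subreddit>/new.json, one listing page at a time."""
    after = None
    for _ in range(pages):
        params = {"limit": 25}
        if after:
            params["after"] = after
        resp = requests.get(
            "https://www.reddit.com/r/%s/new.json" % subreddit,
            headers=HEADERS, params=params, timeout=10)
        resp.raise_for_status()
        listing = resp.json()["data"]
        for child in listing["children"]:
            yield child["data"]["id"], child["data"]["title"]
        after = listing["after"]
        if after is None:  # no more pages to paginate through
            break

# Example usage:
# for sid, title in fetch_submission_ids("learnpython", pages=2):
#     print(sid, title)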
In addition, their docs suggest you use OAuth, which allows you 60 requests per minute. I would go this route. It is also more robust than scraping, because the scraper breaks as soon as they change something in the HTML layout.
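And since PRAW is already used for the comments, the OAuth route could cover the submission IDs as well. A minimal sketch, with placeholder credentials you would obtain from Reddit's app preferences:

# Hedged sketch: collect submission ids with PRAW over OAuth (credentials are placeholders)
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from https://www.reddit.com/prefs/apps
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit_crawler_scrapy university project m.reichart@hotmail.com",
)

# Iterate over the newest submissions of a subreddit; PRAW handles the rate limiting.
for submission in reddit.subreddit("learnpython").new(limit=100):
    print(submission.id, submission.title)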