Python: Scrapy and Reddit

I am implementing a data pipeline for a chatbot. I am using scrapy to crawl specific subreddits and collect submission IDs (this is not possible with praw, the Python Reddit API Wrapper).

In addition, I am using praw to recursively collect all comments. Both implementations already work.
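
For context, the praw side of such a pipeline typically looks like the sketch below. The credentials and submission ID are placeholders, and this is not the author's exact code: replace_more expands the "load more comments" stubs and comments.list() flattens the whole tree.

# Minimal sketch of recursive comment collection with praw
# (placeholder credentials; not the author's exact implementation).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholders from a registered app
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit_crawler_scrapy university project",
)

submission = reddit.submission(id="abc123")   # a submission ID collected by the spider
submission.comments.replace_more(limit=None)  # resolve all "load more comments" stubs
for comment in submission.comments.list():    # flattened comment tree
    print(comment.id, comment.body[:80])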

However, after a few pages reddit starts refusing the crawl of the subreddits (depending on how fast the requests are made, ...).

I do not want to break any rules, but is there a scrapy configuration (DOWNLOAD_DELAY or another throttling mechanism) that stays within reddit's rules while still collecting this information?

My scrapy spider:

# -*- coding: utf-8 -*-
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ["reddit.com"]

    def __init__(self, subreddit=None, pages=None, *args, **kwargs):
        super(RedditSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.reddit.com/r/%s/new/' % subreddit]
        self.pages = int(pages)
        self.page_count = 0

    def parse(self, response):

        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        # votes = response.css('.score.unvoted::text').extract()
        # times = response.css('time::attr(title)').extract()
        # comments = response.css('.comments::text').extract()
        submission_id = response.css('.title.may-blank').xpath('@data-outbound-url').extract()
        # submission_id = submission_id[24:33]

        # Give the extracted content row wise
        # for item in zip(titles, votes, times, comments, titles_full):
        for item in zip(titles, submission_id):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'submission_id': item[1][23:32]
                # 'vote': item[2],
                # 'created_at': item[3],
                # 'comments': item[4]
            }

            # yield or give the scraped info to scrapy
            yield scraped_info

        if (self.pages > 1) and (self.page_count < self.pages):
            self.page_count += 1
            next_page = response.css('span.next-button a::attr(href)').extract_first()
            if next_page is not None:
                print("next page ... " + next_page)
                yield response.follow(next_page, callback=self.parse)

            if next_page is None:
                print("no more pages ... lol")

My spider settings:

# -*- coding: utf-8 -*-

# Scrapy settings for reddit_crawler_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'reddit_crawler_scrapy'

SPIDER_MODULES = ['reddit_crawler_scrapy.spiders']
NEWSPIDER_MODULE = 'reddit_crawler_scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'reddit_crawler_scrapy university project m.reichart@hotmail.com'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'reddit_crawler_scrapy.middlewares.RedditCrawlerScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'reddit_crawler_scrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'reddit_crawler_scrapy.pipelines.RedditCrawlerScrapyPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"


# RANDOMIZE_DOWNLOAD_DELAY = False

LOG_FILE='scrapy_log.txt'

I have already set DOWNLOAD_DELAY to 5 seconds; with RANDOMIZE_DOWNLOAD_DELAY (enabled by default) this is multiplied by a random factor between 0.5 and 1.5, so roughly one request/download every 2.5 to 7.5 seconds. That is already quite slow, but it would get the job done within a few hours or days.
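
For reference, the other throttling mechanism Scrapy ships with is the AutoThrottle extension, left commented out in the settings above; a minimal sketch of enabling it, with purely illustrative values rather than a rate endorsed by reddit:

# Sketch: adaptive throttling instead of a fixed DOWNLOAD_DELAY.
# Values are illustrative only.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # back off up to a minute on slow responses
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request in flight per server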

Still, after a few pages I no longer get a next page; the last page requested leads me to a submission posted by reddit itself, containing a link on how to set up a bot properly (a sarcastic touch imho; well played, reddit).

Answer

IMO, fighting reddit's anti-crawling mechanisms would cost you too much time, and I would not try to go down that path.

They have an API for getting all posts in a subreddit; for example, https://www.reddit.com/r/subreddit/top.json?sort=top returns the posts in /r/subreddit as JSON, and it contains the same content you see on their website.
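
A minimal sketch of that approach, using plain requests against the public .json listing (the subreddit name is chosen only for illustration; a descriptive User-Agent is assumed to be required, since generic ones tend to get rate limited):

# Sketch: fetch a subreddit listing via the public .json endpoint.
import requests

headers = {"User-Agent": "reddit_crawler_scrapy university project"}
url = "https://www.reddit.com/r/learnpython/new.json"   # subreddit chosen for illustration
resp = requests.get(url, params={"limit": 100}, headers=headers)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["id"], post["title"])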

Also, their docs suggest that you use OAuth; they then allow you 60 requests per minute. I would go that route. It is also safer than scraping, because scraping breaks as soon as they change something in the HTML layout.
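
A rough sketch of that OAuth route using an application-only token (the client ID/secret are placeholders from a registered app, and the subreddit is again chosen for illustration):

# Sketch: obtain an application-only OAuth token, then query oauth.reddit.com.
import requests

CLIENT_ID = "YOUR_CLIENT_ID"          # placeholders from a registered reddit app
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
USER_AGENT = "reddit_crawler_scrapy university project"

# Application-only ("client credentials") token request.
token_resp = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=(CLIENT_ID, CLIENT_SECRET),
    data={"grant_type": "client_credentials"},
    headers={"User-Agent": USER_AGENT},
)
token = token_resp.json()["access_token"]

# Authenticated listing request against the OAuth endpoint.
listing = requests.get(
    "https://oauth.reddit.com/r/learnpython/new",   # subreddit chosen for illustration
    params={"limit": 100},
    headers={"Authorization": "bearer " + token, "User-Agent": USER_AGENT},
)
for child in listing.json()["data"]["children"]:
    print(child["data"]["id"], child["data"]["title"])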
