Python: Scrapy and Reddit
I am implementing a data pipeline for a chatbot. I use Scrapy to crawl specific subreddits and collect submission IDs (which is not possible with PRAW - the Python Reddit API Wrapper).
In addition, I use PRAW to recursively retrieve all comments for those submissions. Both implementations already work.
However, after a few pages Reddit refuses to serve the subreddit crawl (depending on how fast the requests come in, ...).
I don't want to break any rules, but is there a Scrapy configuration (DOWNLOAD_DELAY or some other throttling mechanism) that stays within Reddit's rules while collecting this information?
My Scrapy spider:
# -*- coding: utf-8 -*-
import scrapy


class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ["reddit.com"]

    def __init__(self, subreddit=None, pages=None, *args, **kwargs):
        super(RedditSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.reddit.com/r/%s/new/' % subreddit]
        self.pages = int(pages)
        self.page_count = 0

    def parse(self, response):
        # Extract the content using CSS selectors
        titles = response.css('.title.may-blank::text').extract()
        # votes = response.css('.score.unvoted::text').extract()
        # times = response.css('time::attr(title)').extract()
        # comments = response.css('.comments::text').extract()
        submission_id = response.css('.title.may-blank').xpath('@data-outbound-url').extract()
        # submission_id = submission_id[24:33]

        # Yield the extracted content row by row
        # for item in zip(titles, votes, times, comments, titles_full):
        for item in zip(titles, submission_id):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'submission_id': item[1][23:32]
                # 'vote': item[2],
                # 'created_at': item[3],
                # 'comments': item[4]
            }
            # Yield the scraped info to Scrapy
            yield scraped_info

        if (self.pages > 1) and (self.page_count < self.pages):
            self.page_count += 1
            next_page = response.css('span.next-button a::attr(href)').extract_first()
            if next_page is not None:
                print("next page ... " + next_page)
                yield response.follow(next_page, callback=self.parse)
            if next_page is None:
                print("no more pages ... lol")
My spider configuration (settings.py):
# -*- coding: utf-8 -*-
# Scrapy settings for reddit_crawler_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'reddit_crawler_scrapy'
SPIDER_MODULES = ['reddit_crawler_scrapy.spiders']
NEWSPIDER_MODULE = 'reddit_crawler_scrapy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'reddit_crawler_scrapy university project m.reichart@hotmail.com'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.RedditCrawlerScrapySpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'reddit_crawler_scrapy.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'reddit_crawler_scrapy.pipelines.RedditCrawlerScrapyPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Export as CSV feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
# RANDOMIZE_DOWNLOAD_DELAY = False
LOG_FILE = 'scrapy_log.txt'
I have set DOWNLOAD_DELAY to 5 seconds, which (because RANDOMIZE_DOWNLOAD_DELAY is enabled by default) gets multiplied by a random factor between 0.5 and 1.5. That works out to one request/download every 2.5 to 7.5 seconds, which is already quite slow, but it would get the job done over a few hours/days.
Still, after a few pages I no longer get a next page; the last page I request leads me to a submission provided by Reddit that contains a link on how to set up a bot properly (IMHO with an ironic undertone - well played, Reddit).
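Regarding the throttling part of the question: besides a fixed DOWNLOAD_DELAY, Scrapy's AutoThrottle extension (already present as commented-out lines in the settings above) adapts the delay to the server's response times. A minimal sketch of enabling it in settings.py, with delay values that are assumptions rather than anything Reddit documents:

# Hedged sketch: enable AutoThrottle instead of relying only on a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay in seconds (assumed value)
AUTOTHROTTLE_MAX_DELAY = 60            # back off up to 60 s when responses slow down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim at one request in flight per remote server
AUTOTHROTTLE_DEBUG = True              # log every throttling decision while tuning
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never send parallel requests to reddit.com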
IMO, fighting Reddit's anti-crawling mechanisms will cost you too much time, and I would not try to go down that path.
They have an API for getting all posts of a subreddit: for example, https://www.reddit.com/r/subreddit/top.json?sort=top returns all posts of /r/subreddit in JSON format, and it contains the same content you see on their website.
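As an illustration of that JSON route, here is a minimal sketch using the requests library; the subreddit name is a placeholder, the User-Agent string is reused from the settings above, and pagination relies on the listing's after token:

# Hedged sketch: collect submission ids from the public JSON listing instead of scraping HTML
import requests

HEADERS = {"User-Agent": "reddit_crawler_scrapy university project m.reichart@hotmail.com"}

def fetch_submission_ids(subreddit, pages=3):
    """Yield (id, title) tuples from /r/<subreddit>/new.json, one listing page at a time."""
    after = None
    for _ in range(pages):
        params = {"limit": 25}
        if after:
            params["after"] = after
        resp = requests.get(
            "https://www.reddit.com/r/%s/new.json" % subreddit,
            headers=HEADERS, params=params, timeout=10)
        resp.raise_for_status()
        listing = resp.json()["data"]
        for child in listing["children"]:
            yield child["data"]["id"], child["data"]["title"]
        after = listing["after"]
        if after is None:  # no more pages to paginate through
            break

# Example usage:
# for sid, title in fetch_submission_ids("learnpython", pages=2):
#     print(sid, title)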
In addition, their docs suggest you use OAuth, which allows you 60 requests per minute. I would go this route. It is also more robust than scraping, because the scraper breaks as soon as they change something in the HTML layout.
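And since PRAW is already used for the comments, the OAuth route could cover the submission IDs as well. A minimal sketch, with placeholder credentials you would obtain from Reddit's app preferences:

# Hedged sketch: collect submission ids with PRAW over OAuth (credentials are placeholders)
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from https://www.reddit.com/prefs/apps
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit_crawler_scrapy university project m.reichart@hotmail.com",
)

# Iterate over the newest submissions of a subreddit; PRAW handles the rate limiting.
for submission in reddit.subreddit("learnpython").new(limit=100):
    print(submission.id, submission.title)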