Scrapy - Reactor not Restartable [duplicate]

Posted 2023-02-15

【Title】Scrapy - Reactor not Restartable [duplicate] 【Published】2017-05-20 14:06:13 【Question】:

With:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()

But since I moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2)

and began calling the method through a class instance, like so:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

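What is wrong?

【Comments】:

Are you running web_crawler() more than once per script? You cannot start the Twisted reactor more than once.

I don't know. What I am doing is defining the crawler function inside a class and running the process through the __call__ method, like: results = test.web_crawler().

@Rejected I have edited the question, thanks.

【Answer 1】: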

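According to the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

The error you are receiving is thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I have seen it done), there is no guarantee it will work.

Honestly, if you think you need to restart the reactor, you are likely doing something wrong.

Depending on what you want to do, I would also review the Running Scrapy from a Script section of the documentation.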
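To make the failure mode concrete, here is a minimal sketch of a run-once event loop. It is a toy model only: ToyReactor, its _started_before flag, and this web_crawler are hypothetical stand-ins, not Twisted's or Scrapy's actual code. It only illustrates why a second call into a shared, module-level reactor fails.

```python
class ReactorNotRestartable(Exception):
    """Toy stand-in for twisted.internet.error.ReactorNotRestartable."""


class ToyReactor:
    """Simplified model of a run-once event loop (not Twisted's real code)."""

    def __init__(self):
        self._started_before = False

    def run(self):
        # Refuse to start again once the loop has been run.
        if self._started_before:
            raise ReactorNotRestartable()
        self._started_before = True
        # A real reactor would block here until it is stopped.


reactor = ToyReactor()  # one shared, module-level instance


def web_crawler():
    """Hypothetical crawl function: every call tries to run the shared reactor."""
    reactor.run()
    return ("result1", "result2")


first_call = web_crawler()  # works: the reactor has never run

try:
    web_crawler()  # fails: the same reactor is asked to run again
    second_call_error = None
except ReactorNotRestartable as exc:
    second_call_error = exc
```

Because the reactor object is shared by the whole process, it does not matter that each call builds a fresh CrawlerProcess; the second start still reaches the same, already used reactor.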

【Comments】:

【Answer 2】:

The error is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

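web_crawler() returns two results, and to get them this code calls it twice. The second call tries to start the process again, which restarts the reactor, as @Rejected pointed out.

Run the process once and unpack both results from the returned tuple; that is the way to go here: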

def __call__(self):
    result1, result2 = test.web_crawler()
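The difference between the two versions is only the number of calls, which a small counter makes visible. This is a sketch with a hypothetical stand-in for web_crawler() that counts its invocations instead of crawling.

```python
calls = 0


def web_crawler():
    """Hypothetical stand-in for the crawler method; counts its invocations."""
    global calls
    calls += 1
    return ("result1", "result2")


# Indexing two separate calls runs the function twice.
second = web_crawler()[1]
first = web_crawler()[0]
calls_when_indexing = calls

# Unpacking the tuple from a single call runs it once.
calls = 0
result1, result2 = web_crawler()
calls_when_unpacking = calls
```

With the real crawler, the second invocation is the one that reaches process.start() again and raises ReactorNotRestartable.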

【Comments】:

【Answer 3】:

You cannot restart the reactor, but you should be able to run it more times by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

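【Comments】:

This solution works. I tested it with Jupyter (Google Colab). [⚠️BEWARE⚠️] There is one big caveat: you must restart the runtime before using it for the first time. Otherwise, the bloated corpses of your previous reactors are still around, so your forked processes will carry them too. After that, everything runs smoothly, because the parent process never touches its own reactor again.

Thanks, it works for me too. By the way, can you help me get the results? I am stuck on getting the results.

I get an error when trying to run the above code: AttributeError: Can't pickle local object 'run_spider.<locals>.f'

I noticed that the same code runs smoothly when running Python in WSL, so this seems to be an issue with Python for Windows.

I had the same small issue with "AttributeError: Can't pickle local object 'run_spider.<locals>.f'", but moving the function f outside of run_spider solved my problem and I could run the code.

【Answer 4】: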

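This solved my problem. Put the code below after reactor.run() or process.start():

import os, sys, time

time.sleep(0.5)

# replace the current process with a fresh run of the same script
os.execl(sys.executable, sys.executable, *sys.argv)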
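Since os.execl replaces the running process, nothing after that line executes, and an unguarded script would re-execute itself indefinitely. The sketch below shows one way to bound it, assuming a made-up CRAWL_RUN environment variable as a run counter; the child script is a hypothetical placeholder that prints instead of crawling, and it is launched in a subprocess so its output can be collected.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical child script: does its work, then re-executes itself once.
CHILD_SOURCE = textwrap.dedent(
    """
    import os
    import sys

    run = int(os.environ.get("CRAWL_RUN", "1"))  # made-up variable name
    print(f"run {run}", flush=True)  # flush first: exec does not flush buffers

    if run < 2:  # guard, otherwise the script would re-exec forever
        os.environ["CRAWL_RUN"] = str(run + 1)
        os.execl(sys.executable, sys.executable, *sys.argv)
    """
)


def run_child():
    """Write the child script to a temp file, run it, return its stdout lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reexec_demo.py")
        with open(path, "w") as fh:
            fh.write(CHILD_SOURCE)
        # Start from a clean counter even if the caller's environment has one.
        env = {k: v for k, v in os.environ.items() if k != "CRAWL_RUN"}
        done = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, env=env
        )
    return done.stdout.splitlines()


output_lines = run_child()
```

Each re-exec starts a fresh interpreter, so any state held in memory, including crawl results, is gone unless it was written somewhere first.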

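【Comments】:

You should put your code in a code block, either by surrounding it with backticks (`) or by highlighting it and pressing ctrl + K (Windows) or command + K (Mac).

This kills the process.

【Answer 5】:

Here is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.

0) pip install crochet
1) import it: from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
   a) d.addBoth(lambda _: reactor.stop())
   b) reactor.run()

I had the same problem with this error and spent more than 4 hours solving it, reading all the questions about it here. I finally found that one and am sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last two lines of my code: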

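from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # crochet starts and manages the reactor in a background thread

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   # dynamically import the selected spider module
    spiderCls = scrapy_var.mySpider           # get the mySpider class (crawl() expects a class, not an instance)
    crawler = CrawlerRunner(get_project_settings())   # from Scrapy docs
    crawler.crawl(spiderCls)                          # from Scrapy docs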

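This code lets me choose which spider to run just by passing its name to the run_spider function and, once scraping has finished, choose another spider and run it again. I hope this helps someone, as it helped me :)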
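The dynamic import step can be tried in isolation. This is a minimal sketch in which load_class is a hypothetical helper and the stdlib names json, decoder and JSONDecoder stand in for first_scrapy.spiders, the spider module name and mySpider.

```python
from importlib import import_module


def load_class(package, module_name, class_name):
    """Import package.module_name and return the attribute named class_name."""
    module = import_module("{}.{}".format(package, module_name))
    return getattr(module, class_name)


# Stdlib stand-ins for "first_scrapy.spiders", the spider name and "mySpider".
decoder_cls = load_class("json", "decoder", "JSONDecoder")
```

Note the "{}" placeholder in the format string; without it, format() returns the string unchanged and the spider name never becomes part of the module path.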

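【Comments】:

When I call import_module I get an error: NameError: name 'import_module' is not defined

@olegario Check: from importlib import import_module

I got it, but when I call this function, the spider is not triggered.

@olegario Are there any messages or errors?

No spider gets executed for me.

【Answer 6】:

As some people have already pointed out: you should not need to restart the reactor.

Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks.

For example, I have been using a looping spider that follows this pattern: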

1. Crawl A
2. Sleep N
3. goto 1

This is how it looks in Scrapy:

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()

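To explain the process in more detail: the crawl function schedules a crawl and adds two extra callbacks that are called when the crawl is over: a blocking sleep and a recursive call to itself (which schedules another crawl).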

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
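The same crawl, sleep, repeat shape can be sketched without Scrapy to see the control flow on its own. This is an analogy only: it swaps in the stdlib asyncio event loop for the Twisted reactor, crawl_once is a hypothetical placeholder that records a message instead of fetching anything, and the loop is bounded by max_runs so that it terminates.

```python
import asyncio


async def crawl_once(run_number, log):
    """Hypothetical stand-in for one crawl; a real one would fetch pages."""
    log.append(f"crawl {run_number}")


async def loop_crawl(max_runs, delay, log):
    """Crawl, sleep, repeat, all inside one event loop that starts once."""
    for run_number in range(1, max_runs + 1):
        await crawl_once(run_number, log)
        log.append(f"sleeping for: {delay}")
        await asyncio.sleep(delay)  # yields to the loop instead of blocking it


events = []
asyncio.run(loop_crawl(max_runs=2, delay=0.01, log=events))
```

The point carried over from the answer is that the event loop is started exactly once and every later crawl is scheduled from inside it.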

【Comments】:

I actually wrote an extensive blog post about this at crawl.blog/scrapy-loop, with a feature-rich implementation at gitlab.com/granitosaurus/scrapy-loop