Running Multiple Spiders Sequentially


class Myspider1(scrapy.Spider):
    # do something...

class Myspider2(scrapy.Spider):
    # do something...

That is the structure of my spider.py file. I want to run Myspider1 first, and then run Myspider2 multiple times depending on some conditions. How can I do that? Any tips?

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks  # without this decorator the yields would not chain the crawls
def crawl():
    yield runner.crawl(Myspider1, arg.....)
    yield runner.crawl(Myspider2, arg.....)
    reactor.stop()  # stop the reactor once both crawls have finished

crawl()
reactor.run()

I tried doing it this way, but I don't know how to run it. Should I run some command at the command line (and if so, which command?), or just run the Python file?

Thanks a lot!!!

Answer

Run the Python file directly, for example test.py:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = "dmoz1"  # each spider needs its own unique name
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        print("first spider")

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = "dmoz2"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print("second spider")

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Now run python test.py > output.txt. From output.txt you can see that your spiders ran sequentially.
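The same pattern also covers the "run Myspider2 multiple times depending on some condition" part of the question. Here is a minimal sketch, reusing the MySpider1/MySpider2 definitions from test.py above and assuming a hypothetical should_run_again() helper that encodes your own condition:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    # should_run_again() is a hypothetical stand-in for your own condition,
    # e.g. a check against a file, a database flag, or the first run's results
    while should_run_again():
        yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()

Each yield waits for the previous crawl's Deferred to fire, so the loop only decides whether to launch another MySpider2 run after the previous one has finished.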

Another answer

You need to use the Deferred object returned by process.crawl(), which lets you add a callback that runs when the crawl finishes.

Here is my code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def start_sequentially(process: CrawlerProcess, crawlers: list):
    print('start crawler {}'.format(crawlers[0].__name__))
    deferred = process.crawl(crawlers[0])
    if len(crawlers) > 1:
        # chain the next crawl onto the Deferred that fires when this one ends
        deferred.addCallback(lambda _: start_sequentially(process, crawlers[1:]))

def main():
    crawlers = [Crawler1, Crawler2]
    process = CrawlerProcess(settings=get_project_settings())
    start_sequentially(process, crawlers)
    process.start()  # blocks until all chained crawls have finished

if __name__ == '__main__':
    main()
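A note on the design choice between the two answers: CrawlerProcess starts and stops the Twisted reactor itself, so this version needs no explicit reactor.run(); the first answer's CrawlerRunner leaves reactor management to your script, which is the option the Scrapy docs recommend when the crawl runs inside a larger Twisted application.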
