How to run multiple Scrapy spiders from a Python script

Posted 2023-05-08

Preface: this article was compiled by the editors of 小常识网 (cha138.com). It covers how to run multiple Scrapy spiders from a Python script and is intended as a reference.

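Reference answer A

1. Create several spiders:
scrapy genspider spidername domain
scrapy genspider CnblogsHomeSpider cnblogs.com
The command above creates a spider whose name is CnblogsHomeSpider; its start_urls is generated from the given domain.

2. List the spiders in the project with scrapy list:
[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider
This shows that my project contains two spiders, one named CnblogsHomeSpider and the other named CnblogsSpider.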
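
Answer A ends once the spiders are listed. One possible next step is to have a script call the same command-line tool for each listed name. The following is a minimal sketch, assuming the scrapy command is on PATH and the script is started from the project directory (the one containing scrapy.cfg). The function names parse_spider_names, crawl_command and run_all_spiders are illustrative and do not come from the original answer.

```python
import subprocess
from typing import List


def parse_spider_names(listing: str) -> List[str]:
    """Turn the text printed by `scrapy list` into a list of spider names."""
    return [line.strip() for line in listing.splitlines() if line.strip()]


def crawl_command(spider_name: str) -> List[str]:
    """Build the argument list for crawling one spider with the Scrapy CLI."""
    return ["scrapy", "crawl", spider_name]


def run_all_spiders(project_dir: str = ".") -> None:
    """Run every spider reported by `scrapy list`, one after another."""
    listing = subprocess.run(
        ["scrapy", "list"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for name in parse_spider_names(listing):
        # Each crawl runs in its own process and finishes before the next starts.
        # check=True raises CalledProcessError if a crawl exits with a non-zero status.
        subprocess.run(crawl_command(name), cwd=project_dir, check=True)
```

Calling run_all_spiders() from the cnblogs project directory would, under these assumptions, crawl CnblogsHomeSpider and then CnblogsSpider. Because every spider gets a separate process, this approach avoids managing the Twisted reactor by hand, at the cost of starting Scrapy once per spider.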

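Code snippet: starting multiple Scrapy spiders in order (Python 3)

Question: when running Scrapy, how do you start spiders in a specific order?

Background: spider A collects dynamic proxy IPs, and spider B uses those proxy IPs to disguise itself while crawling the target. A therefore has to run before B. How can that be enforced?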

IDE: PyCharm

Version: Python 3

Framework: Scrapy

OS: Windows 10

The code is as follows (adapt it to your own project):

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from scrapy import cmdline
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from torrentSpider.spiders.proxy_ip_spider import ProxyIpSpider
from torrentSpider.spiders.douban_spider import DoubanSpider
from scrapy.utils.project import get_project_settings

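'''
Run several spiders one after another
'''
configure_logging()
# Pass in the project settings; otherwise the project configuration does not take effect.
# get_project_settings() returns the configuration from settings.py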

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(ProxyIpSpider)
    yield runner.crawl(DoubanSpider)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished

'''
Run a single spider
'''
# def execute():
    # cmdline.execute(['scrapy', 'crawl', 'proxy_ip_spider'])
#
#
# execute()
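
The script above calls reactor.stop() only after both crawls return, so a failing runner.crawl call (for example, an unknown spider name) would leave reactor.run() blocked. The following is a hedged sketch of a variant that takes spider names, runs them in the given order, and stops the reactor in a finally block. It assumes a Scrapy version in which importing twisted.internet.reactor directly is compatible with the project's reactor settings. The helper put_first and the function run_in_order are hypothetical names, not part of Scrapy or of the original script.

```python
from typing import Iterable, List


def put_first(names: Iterable[str], first: str) -> List[str]:
    """Return the names with `first` moved to the front, keeping the rest in order."""
    names = list(names)
    if first not in names:
        raise ValueError(f"unknown spider name: {first!r}")
    return [first] + [n for n in names if n != first]


def run_in_order(spider_names: List[str]) -> None:
    """Crawl the named spiders one after another in a single process."""
    # Imported here so put_first stays usable without Scrapy installed.
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        try:
            for name in spider_names:
                # Each yield waits for that crawl to finish before the next one starts.
                yield runner.crawl(name)
        finally:
            # Stop the reactor even if one of the crawl calls raised.
            reactor.stop()

    # Start crawling only once the reactor is running, so reactor.stop() is always valid.
    reactor.callWhenRunning(crawl)
    reactor.run()
```

For the scenario in the background section, a call such as run_in_order(put_first(names, "proxy_ip_spider")) would run the proxy spider before the others, where names is whatever list of spider names the project provides.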
