How can I use different pipelines for different spiders in a single Scrapy project

Posted 2023-02-15

【Posted】: 2012-01-12 10:46:33

【Question】:

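I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable to every spider.

Thanks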

【Question comments】:

Thank you for a very good question. Please select an answer for all future googlers. The answer provided by mstringer worked very well for me.

【Answer 1】:

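The simplest and most effective solution is to set custom settings in each spider itself.

custom_settings = {'ITEM_PIPELINES': {'project_name.pipelines.SecondPipeline': 300}}

After that you need to set the pipelines in the settings.py file

ITEM_PIPELINES = {
   'project_name.pipelines.FirstPipeline': 300,
   'project_name.pipelines.SecondPipeline': 400
}

This way each spider uses its own pipelines.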
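How the two settings interact can be modelled in a few lines. The sketch below is a simplified stand-in rather than Scrapy's own code. It assumes that a spider's custom_settings entry for ITEM_PIPELINES replaces the project-wide value as a whole instead of merging with it, and resolve_pipelines is a hypothetical helper name used only for illustration.

```python
# Simplified model of settings precedence. resolve_pipelines is a hypothetical
# helper for illustration; it is not part of Scrapy.

PROJECT_SETTINGS = {
    "ITEM_PIPELINES": {
        "project_name.pipelines.FirstPipeline": 300,
        "project_name.pipelines.SecondPipeline": 400,
    }
}


def resolve_pipelines(project_settings, custom_settings=None):
    """Return pipeline paths in run order (lowest number first)."""
    # Spider-level settings win over project-level ones, key by key.
    merged = {**project_settings, **(custom_settings or {})}
    pipelines = merged.get("ITEM_PIPELINES", {})
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


# A spider without custom_settings gets the project-wide pipelines.
print(resolve_pipelines(PROJECT_SETTINGS))

# A spider that sets ITEM_PIPELINES gets only what it lists.
second_only = {"ITEM_PIPELINES": {"project_name.pipelines.SecondPipeline": 300}}
print(resolve_pipelines(PROJECT_SETTINGS, second_only))
```

Under that assumption, a spider that should run no pipelines at all can set 'ITEM_PIPELINES' to an empty dict in its custom_settings.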

【Comments】:

As of 2020, this is the cleanest solution to the problem.

【Answer 2】:

A simple but still useful solution.

Spider code

    def parse(self, response):
        item = {}
        # ... do parse stuff
        item['info'] = {'spider': 'Spider2'}
        yield item

Pipeline code

    import logging

    def process_item(self, item, spider):
        # pick the branch from the tag the spider stored on the item
        if item['info']['spider'] == 'Spider1':
            logging.error('Spider1 pipeline works')
        elif item['info']['spider'] == 'Spider2':
            logging.error('Spider2 pipeline works')
        elif item['info']['spider'] == 'Spider3':
            logging.error('Spider3 pipeline works')
        return item

Hope this saves someone some time!
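If the chain of elif branches grows, the same idea can be written as a lookup table. This is only a sketch of one possible layout: TaggedItemPipeline, the handler methods and the handled_by key are made-up names, and items are assumed to be plain dicts carrying the item['info']['spider'] tag set in parse().

```python
import logging

logger = logging.getLogger(__name__)


class TaggedItemPipeline:
    """Routes each item to a handler chosen by item['info']['spider']."""

    def __init__(self):
        # Map the tag set in parse() to the method that handles it.
        self.handlers = {
            "Spider1": self.handle_spider1,
            "Spider2": self.handle_spider2,
        }

    def handle_spider1(self, item):
        item["handled_by"] = "Spider1"
        return item

    def handle_spider2(self, item):
        item["handled_by"] = "Spider2"
        return item

    def process_item(self, item, spider):
        tag = item.get("info", {}).get("spider")
        handler = self.handlers.get(tag)
        if handler is None:
            # Unknown or missing tag: pass the item through untouched.
            logger.debug("no handler for tag %r, passing item through", tag)
            return item
        return handler(item)
```

Using .get() for the lookup means an item without the tag passes through instead of raising KeyError.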

【Comments】:

This doesn't scale well and also clutters the code. It mixes responsibilities together.

【Answer 3】:

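You can set the item pipelines setting inside the spider like this:

class CustomSpider(Spider):
    name = 'custom_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.PagePipeline': 400,
            '__main__.ProductPipeline': 300,
        },
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2
    }

I can then split up a pipeline (or even use multiple pipelines) by adding a value to the loader/returned item that identifies which part of the spider sent the item. This way I won't get any KeyError exceptions, and I know which fields should be available.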

    ...
    def scrape_stuff(self, response):
        pageloader = PageLoader(
                PageItem(), response=response)

        pageloader.add_xpath('entire_page', '/html//text()')
        pageloader.add_value('item_type', 'page')
        yield pageloader.load_item()

        productloader = ProductLoader(
                ProductItem(), response=response)

        productloader.add_xpath('product_name', '//span[contains(text(), "Example")]')
        productloader.add_value('item_type', 'product')
        yield productloader.load_item()

class PagePipeline:
    def process_item(self, item, spider):
        if item['item_type'] == 'product':
            pass  # do product stuff

        if item['item_type'] == 'page':
            pass  # do page stuff

        # always hand the item on to the next pipeline
        return item
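To see how two pipelines with different order numbers cooperate on tagged items, here is a small stand-alone simulation. It is a sketch under assumptions: run_chain and CHAIN are hypothetical helpers that imitate Scrapy calling pipelines from the lowest number to the highest, items are plain dicts, and the cleanup steps are invented for the example.

```python
class ProductPipeline:
    def process_item(self, item, spider):
        # Only touch items tagged as products.
        if item.get("item_type") == "product":
            item["product_name"] = item["product_name"].strip()
        return item


class PagePipeline:
    def process_item(self, item, spider):
        # Only touch items tagged as pages.
        if item.get("item_type") == "page":
            item["length"] = len(item["entire_page"])
        return item


# Order numbers mirror the ITEM_PIPELINES values: 300 runs before 400.
CHAIN = {300: ProductPipeline(), 400: PagePipeline()}


def run_chain(item, ordered):
    """Pass item through the pipelines sorted by their order number."""
    for _, pipeline in sorted(ordered.items()):
        item = pipeline.process_item(item, spider=None)
    return item


print(run_chain({"item_type": "product", "product_name": "  Example  "}, CHAIN))
print(run_chain({"item_type": "page", "entire_page": "abc"}, CHAIN))
```

Each pipeline returns the item whether or not it acted on it, so the next pipeline in the chain always receives something to work with.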

【Comments】:

This should be the accepted answer. More flexible and less cumbersome.

【Answer 4】:

We can use some conditions in the pipeline like this

# -*- coding: utf-8 -*-
from scrapy_app.items import x

class SaveItemPipeline(object):
    def process_item(self, item, spider):
        # only items of type x are saved; everything else passes through
        if isinstance(item, x):
            item.save()
        return item
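The same type check can be tried without a project by using dataclasses as item classes. This is an illustrative sketch: ProductItem, PageItem and SaveProductPipeline are made-up names, the saved flag stands in for item.save(), and it assumes a Scrapy version that accepts dataclass instances as items.

```python
from dataclasses import dataclass


@dataclass
class ProductItem:
    name: str
    saved: bool = False


@dataclass
class PageItem:
    url: str
    saved: bool = False


class SaveProductPipeline:
    def process_item(self, item, spider):
        # Only products are persisted; every other item type passes through.
        if isinstance(item, ProductItem):
            item.saved = True  # stand-in for item.save()
        return item


pipeline = SaveProductPipeline()
print(pipeline.process_item(ProductItem(name="Example"), spider=None))
print(pipeline.process_item(PageItem(url="http://example.com"), spider=None))
```

Dispatching on the item class keeps spiders free of pipeline knowledge: a spider only decides which item type it yields.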

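【Comments】:

【Answer 5】:

The other solutions given here are good, but I think they could be slow, because they do not actually stop a spider from using a pipeline; instead they check whether the pipeline applies every time an item is returned (which in some cases could reach millions of items).

A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler, which works for all extensions, like this: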

pipelines.py

from scrapy.exceptions import NotConfigured

class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
   'myproject.pipelines.SomePipeline': 300,
}

SOMEPIPELINE_ENABLED = True # you could have the pipeline enabled by default

spider1.py

class Spider1(Spider):

    name = 'spider1'

    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

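As you can see, we have specified custom_settings, which overrides what is specified in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.

Now when you run this spider, check for something like: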

[scrapy] INFO: Enabled item pipelines: []

Now Scrapy has completely disabled the pipeline and does not bother with its existence for the whole run. The same approach also works for Scrapy extensions and middlewares.
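The mechanism can be illustrated without Scrapy installed. The sketch below uses local stand-ins throughout: NotConfigured mimics scrapy.exceptions.NotConfigured, while FakeSettings, FakeCrawler and build_enabled are hypothetical helpers that imitate a loader skipping any component whose from_crawler raises that exception. The real getbool also understands string values; this stand-in only handles Python booleans.

```python
class NotConfigured(Exception):
    """Local stand-in for scrapy.exceptions.NotConfigured."""


class FakeSettings(dict):
    """Minimal stand-in for the crawler settings object."""

    def getbool(self, name, default=False):
        return bool(self.get(name, default))


class FakeCrawler:
    def __init__(self, settings):
        self.settings = FakeSettings(settings)


class SomePipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Opt out entirely unless the flag is switched on.
        if not crawler.settings.getbool("SOMEPIPELINE_ENABLED"):
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        return item


def build_enabled(pipeline_classes, crawler):
    """Instantiate pipelines, skipping those that opt out."""
    enabled = []
    for cls in pipeline_classes:
        try:
            enabled.append(cls.from_crawler(crawler))
        except NotConfigured:
            continue
    return enabled


print(build_enabled([SomePipeline], FakeCrawler({"SOMEPIPELINE_ENABLED": True})))
print(build_enabled([SomePipeline], FakeCrawler({"SOMEPIPELINE_ENABLED": False})))
```

Because the decision is made once at start-up, a disabled pipeline adds no per-item cost, which is the point this answer makes.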

【Comments】:

【Answer 6】:

I am using two pipelines, one for image download (MyImagesPipeline) and a second one for saving data to MongoDB (MongoPipeline).

Suppose we have many spiders (spider1, spider2, ...); in my example spider1 and spider5 must not use MyImagesPipeline.

settings.py

ITEM_PIPELINES = {'scrapycrawler.pipelines.MyImagesPipeline' : 1,'scrapycrawler.pipelines.MongoPipeline' : 2}
IMAGES_STORE = '/var/www/scrapycrawler/dowload'

Below is the complete code of the pipelines

import scrapy
import pymongo
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        # spider1 and spider5 skip the image download step
        if spider.name not in ['spider1', 'spider5']:
            return super().process_item(item, spider)
        else:
            return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # shard images into two directory levels named after the
        # first two characters of the file name
        image_name = request.url.split('/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' + image_name

class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url = 'snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # an item may name its own target collection
        collection_name = item.get('collection_name', self.collection_name)
        self.db[collection_name].insert_one(dict(item))
        # mark every URL record with this base_id as downloaded
        self.db[self.collection_url].update_many(
            {'base_id': item['base_id']},
            {'$set': {'image_download': 1}},
            upsert=False)
        return item
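The directory sharding done in file_path can be checked in isolation. The helper below is a sketch with a hypothetical name, sharded_image_path. Unlike the answer's code it parses the URL first, so a query string does not end up in the file name, and it falls back to a flat path for names shorter than two characters; both choices are assumptions added here.

```python
import posixpath
from urllib.parse import urlparse


def sharded_image_path(url):
    """Return '<c1>/<c2>/<filename>' for the file name at the end of url."""
    image_name = posixpath.basename(urlparse(url).path)
    if len(image_name) < 2:
        # Too short to shard; keep it at the top level.
        return image_name
    return "/".join([image_name[0], image_name[1], image_name])


print(sharded_image_path("http://example.com/img/abcdef.jpg"))
print(sharded_image_path("http://example.com/img/xy.png?size=2"))
```

Spreading files over two directory levels keeps any single directory from collecting every downloaded image.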

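【Comments】:

【Answer 7】:

Just remove all pipelines from the main settings and declare them inside the spider instead.

This defines the pipelines to use for each spider.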

class testSpider(InitSpider):
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }

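【Comments】:

For those who are wondering what the '400' is, like me: from the docs, "The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It's customary to define these numbers in the 0-1000 range" - docs.scrapy.org/en/latest/topics/item-pipeline.html

Not sure why this isn't the accepted answer. It works perfectly and is much cleaner and simpler than the accepted answer. This is exactly what I was looking for. Still working in Scrapy 1.8.

Just checked in Scrapy 1.6. It isn't necessary to remove the pipeline settings in settings.py. custom_settings in the spider overrides the pipeline settings in settings.py.

Works perfectly for my scenario!

For 'app.MyPipeline', substitute the full name of the pipeline class, e.g. project.pipelines.MyPipeline, where project is the name of the project, pipelines is the pipelines.py file and MyPipeline is the pipeline class.

【Answer 8】:

Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. For example: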

import functools

def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.logger.debug(msg % 'executing')
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.logger.debug(msg % 'skipping')
            return item

    return wrapper

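For this decorator to work correctly, the spider must have a pipeline attribute holding a container of the Pipeline classes that you want to use to process the item, for example:

from scrapy import Spider
from myproject import pipelines  # hypothetical package name; import your own pipelines module

class MySpider(Spider):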

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item

class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

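All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order; it would be nice to change this so that the order could be specified on the Spider, too).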
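A variant of the decorator that runs every pipeline when a spider declares no pipeline attribute can be tried on its own. This is a sketch, not the answer author's code: the spiders are SimpleNamespace stand-ins, the validated and saved keys are invented for the example, and logging is left out to keep it short.

```python
import functools
from types import SimpleNamespace


def check_spider_pipeline(process_item_method):
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        allowed = getattr(spider, "pipeline", None)
        # No `pipeline` attribute means "run everything".
        if allowed is None or self.__class__ in allowed:
            return process_item_method(self, item, spider)
        # Otherwise skip this step and pass the item on untouched.
        return item

    return wrapper


class Validate:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["validated"] = True
        return item


class Save:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["saved"] = True
        return item


opted_in = SimpleNamespace(name="a", pipeline={Validate})
legacy = SimpleNamespace(name="b")  # declares no pipeline attribute

print(Save().process_item({}, opted_in))
print(Validate().process_item({}, opted_in))
print(Save().process_item({}, legacy))
```

Defaulting to "run everything" means existing spiders keep working when the decorator is introduced, and only spiders that declare a pipeline set narrow the selection.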

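【Comments】:

I'm trying to implement your way of switching between pipelines, but I'm getting a NameError! I get "pipelines is not defined". Have you tested this code yourself? Would you help me?

@mehdix_ Yes, it works for me. Where do you get a NameError?

The error comes right after the scrapy crawl <spider name> command. Python does not recognize the names I set within the spider class for the pipelines to run. I will give you links to my spider.py and pipeline.py so you can take a look. Thanks.

Thanks for the clarification. Where does the first code snippet go? Somewhere at the end of spider.py, right?

I edited the condition so that it does not fail on already defined spiders that have no pipeline set; this also makes it execute all pipelines by default unless told otherwise. if not hasattr(spider, 'pipeline') or self.__class__ in spider.pipeline:

【Answer 9】:

You can use the name attribute of the spider in your pipeline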

class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item

Defining all pipelines this way can accomplish what you want.
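When several pipelines need the same name check, it can be moved into a small base class. The following is a sketch with hypothetical names (SpiderScopedPipeline, its spiders attribute, handle and PriceCleanupPipeline); the spider objects are SimpleNamespace stand-ins that carry only a name.

```python
from types import SimpleNamespace


class SpiderScopedPipeline:
    """Base class: subclasses list the spider names they apply to."""

    spiders = frozenset()  # empty means the pipeline applies to no spider

    def process_item(self, item, spider):
        if spider.name in self.spiders:
            return self.handle(item, spider)
        # Not our spider: pass the item through unchanged.
        return item

    def handle(self, item, spider):
        raise NotImplementedError


class PriceCleanupPipeline(SpiderScopedPipeline):
    spiders = frozenset({"spider1"})

    def handle(self, item, spider):
        item["price"] = float(item["price"])
        return item


pipeline = PriceCleanupPipeline()
print(pipeline.process_item({"price": "9.5"}, SimpleNamespace(name="spider1")))
print(pipeline.process_item({"price": "9.5"}, SimpleNamespace(name="spider2")))
```

The name check still runs once per item, so this keeps the code tidy but does not remove the per-item cost that Answer 5 points out.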

【Comments】:

【Answer 10】:

I can think of at least four approaches:

scrapy settings

default_settings['ITEM_PIPELINES']

process_item()

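【Comments】:

Thanks for your response. I was using method 1, but I feel that having one project is cleaner and allows me to reuse code. Can you please elaborate on method 3? How would I isolate spiders into their own tool commands?

According to the link posted on another answer, you can't override pipelines, so I guess number 3 wouldn't work.

Could you help me here please? ***.com/questions/25353650/…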