The three downloader middleware methods and their different return values


To activate a middleware, add it to the settings. For example:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

The key is the import path of the middleware to activate, and the value is its order number. Scrapy itself already ships with many built-in middlewares, so when activating one you wrote yourself, look up the default middlewares' order numbers in the documentation so you can insert yours at the right position.

The default middlewares are:

{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

The smaller a middleware's number, the closer it is to the engine; the larger, the closer it is to the downloader.

Each downloader middleware defines at most four methods: process_request, process_response, process_exception, and from_crawler. A middleware you write must implement at least one of them.
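A minimal skeleton might look like the following (the class name is a placeholder; the hook signatures follow the Scrapy API described below):

```python
class CustomDownloaderMiddleware:
    """Skeleton downloader middleware; implement at least one hook."""

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the instance; `crawler` exposes
        # settings and signals.
        return cls()

    def process_request(self, request, spider):
        # Return None to continue, a Response/Request to short-circuit,
        # or raise IgnoreRequest.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response or a Request, or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Return None to keep propagating, or a Response/Request to recover.
        return None
```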

On the path where the engine sends a request to the downloader, every middleware is called in turn to process that request (in ascending order of number).

On the path where the downloader sends a response back to the engine, every middleware is called in turn to process that response (in descending order of number).
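This traversal order can be illustrated with a small simulation (the three middlewares here are a subset of the defaults listed above; only the ordering matters):

```python
# Priority number -> middleware name (a subset of the defaults).
middlewares = {
    100: "RobotsTxtMiddleware",
    500: "UserAgentMiddleware",
    900: "HttpCacheMiddleware",
}

# engine -> downloader: process_request is called in ascending order.
request_order = [name for _, name in sorted(middlewares.items())]

# downloader -> engine: process_response is called in descending order.
response_order = [name for _, name in sorted(middlewares.items(), reverse=True)]

print(request_order)   # ascending: RobotsTxtMiddleware first
print(response_order)  # descending: HttpCacheMiddleware first
```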


The four methods are described below:

process_request(request, spider)

Parameters
request (Request object) – the request being processed

spider (Spider object) – the spider for which this request is intended

process_request can return None, return a Response object, return a Request object, or raise IgnoreRequest.

  1. Return None: the (possibly modified) request is passed on down the chain.
  2. Return a Response object: the downloader is never reached; instead, every middleware's process_response is called on this response in turn (in descending order).
  3. Return a Request object: the new request is placed back at the start of the engine-to-downloader path, where every middleware's process_request will be called on it in turn (in ascending order).
  4. Raise IgnoreRequest: every middleware's process_exception is called on it in turn (in descending order). If no process_exception handles it, the Request's errback is called. If there is no errback either, the error is ignored.
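As a sketch of cases 1 and 2, the hypothetical middleware below answers from a local cache instead of downloading when it can; the stub Request/Response classes stand in for Scrapy's, so the snippet runs on its own:

```python
class Request:
    """Stub standing in for scrapy.Request."""
    def __init__(self, url):
        self.url = url

class Response:
    """Stub standing in for scrapy.http.Response."""
    def __init__(self, url, body):
        self.url, self.body = url, body

class LocalCacheMiddleware:
    """Return a cached Response instead of downloading, when available."""

    def __init__(self):
        self.cache = {}  # url -> Response

    def process_request(self, request, spider):
        if request.url in self.cache:
            # Case 2: returning a Response skips the downloader and goes
            # straight into the process_response chain.
            return self.cache[request.url]
        # Case 1: returning None lets the request continue downstream.
        return None
```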

 

process_response(request, response, spider)

Parameters
request (Request object) – the request that originated the response

response (Response object) – the response being processed

spider (Spider object) – the spider for which this response is intended

process_response can return a Response object, return a Request object, or raise an IgnoreRequest exception.

  1. Return a Response object: this (modified or unmodified) response is passed to the process_response of the remaining middlewares in turn, continuing toward the engine.
  2. Return a Request object: the request is handed back to the engine to be scheduled, heading toward the downloader.
  3. Raise IgnoreRequest: the Request's errback is called. If there is none, the error is ignored.
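As a sketch of cases 1 and 2, the hypothetical middleware below re-schedules a request when the server answered 503 and otherwise passes the response through (stub Request/Response classes again stand in for Scrapy's):

```python
class Request:
    """Stub standing in for scrapy.Request."""
    def __init__(self, url):
        self.url = url

class Response:
    """Stub standing in for scrapy.http.Response."""
    def __init__(self, url, status):
        self.url, self.status = url, status

class RetryOn503Middleware:
    """Retry requests that came back with HTTP 503."""

    def process_response(self, request, response, spider):
        if response.status == 503:
            # Case 2: returning a Request hands it back to the engine
            # to be scheduled toward the downloader again.
            return Request(request.url)
        # Case 1: returning the Response lets the remaining middlewares'
        # process_response run, continuing toward the engine.
        return response
```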

 

process_exception(request, exception, spider)

Parameters
request (Request object) – the request that generated the exception

exception (an Exception object) – the raised exception

spider (Spider object) – the spider for which this request is intended

process_exception can return either None, a Response object, or a Request object.

  1. Return None: the exception keeps propagating; the process_exception of the remaining middlewares is executed in turn.
  2. Return a Response object: the response enters the process_response chain as if it had come from the downloader, heading toward the engine.
  3. Return a Request object: the request is handed back to the engine to be scheduled, heading toward the downloader.
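As a sketch of cases 1 and 3, the hypothetical middleware below retries a timed-out request through a proxy and lets every other exception keep propagating (the stub classes and the proxy address are invented for the example):

```python
class Request:
    """Stub standing in for scrapy.Request, with a proxy attribute."""
    def __init__(self, url, proxy=None):
        self.url, self.proxy = url, proxy

class DownloadTimeout(Exception):
    """Stand-in for a download timeout exception."""

class ProxyOnTimeoutMiddleware:
    """On a timeout, retry the request through a proxy."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, DownloadTimeout):
            # Case 3: returning a Request re-schedules it toward
            # the downloader.
            return Request(request.url, proxy="http://proxy.example:8080")
        # Case 1: returning None passes the exception to the remaining
        # middlewares' process_exception.
        return None
```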

 

from_crawler(cls, crawler)
If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for a middleware to access them and hook its functionality into Scrapy.

Parameters
crawler (Crawler object) – crawler that uses this middleware
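A sketch of how from_crawler is typically used to read project settings (the Crawler/Settings stubs and the MYMW_ENABLED setting name are invented for the example):

```python
class Settings(dict):
    """Minimal stand-in for scrapy.settings.Settings."""
    def getbool(self, name, default=False):
        return bool(self.get(name, default))

class Crawler:
    """Minimal stand-in for scrapy.crawler.Crawler."""
    def __init__(self, settings):
        self.settings = Settings(settings)

class ToggleableMiddleware:
    """Built via from_crawler so it can read the project settings."""

    def __init__(self, enabled):
        self.enabled = enabled

    @classmethod
    def from_crawler(cls, crawler):
        # Pull a hypothetical flag out of the crawler's settings.
        return cls(enabled=crawler.settings.getbool("MYMW_ENABLED", True))
```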

 
