Extending Scrapy with a downloader middleware: retry through a proxy on specific response status codes

Posted by my8100


0. References

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy

1. Approach

In practice, when a crawler sends requests too frequently it is often temporarily redirected to a login page (302), or even blocked outright (403). For such responses we can retry the request once through a proxy:

(1) Following the built-in redirect.py module, pass the response straight through when dont_redirect, handle_httpstatus_list, or similar conditions apply (see the sketch after this list).

(2) If condition (1) does not apply and the response status code is 302 or 403, re-issue the request through a proxy.

(3) If the status code is still 302 or 403 after using the proxy, drop the response.
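
For the opt-out conditions in step (1), here is a minimal sketch of a spider that declares handle_httpstatus_list or sets dont_redirect in request.meta; the spider name and URLs are illustrative, not part of the original post:

import scrapy


class OptOutSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate condition (1):
    # matching responses are handed to the spider untouched, so the
    # proxy retry in condition (2) is never triggered for them.
    name = 'opt_out_example'
    handle_httpstatus_list = [302, 403]

    def start_requests(self):
        # The same opt-out can also be made per request via request.meta.
        yield scrapy.Request('http://httpbin.org/status/302',
                             meta={'dont_redirect': True},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Got %s for %s', response.status, response.url)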

 

2. Implementation

Save the following as /site-packages/my_middlewares.py:

from w3lib.url import safe_url_string
from six.moves.urllib.parse import urljoin

from scrapy.exceptions import IgnoreRequest


class MyAutoProxyDownloaderMiddleware(object):

    def __init__(self, settings):
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            settings=crawler.settings
        )

    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['Location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''

            # AutoProxy for the first time
            if not request.meta.get('auto_proxy'):
                request.meta.update({'auto_proxy': True, 'proxy': self.proxy_config})
                new_request = request.replace(meta=request.meta, dont_filter=True)
                new_request.priority = request.priority + 2

                spider.log('Will AutoProxy for <{} {}> {}'.format(
                            response.status, request.url, redirected_url))
                return new_request

            # IgnoreRequest for the second time
            else:
                spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                                       response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
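
Since the whole decision lives in process_response(), the middleware can be exercised in isolation before wiring it into a project. A rough sketch, assuming my_middlewares.py is importable and using throwaway httpbin URLs and placeholder proxy settings:

# Quick standalone check of the middleware (illustrative only).
from scrapy import Spider
from scrapy.http import Request, Response
from scrapy.settings import Settings

from my_middlewares import MyAutoProxyDownloaderMiddleware

settings = Settings({'PROXY_STATUS': [302, 403],
                     'PROXY_CONFIG': 'http://username:password@some_proxy_server:port'})
mw = MyAutoProxyDownloaderMiddleware(settings)
spider = Spider(name='test')

request = Request('http://httpbin.org/status/302')
response = Response(request.url, status=302,
                    headers={'Location': 'http://httpbin.org/redirect/1'},
                    request=request)

# First pass: the middleware should return a new request carrying the proxy.
retry = mw.process_response(request, response, spider)
assert isinstance(retry, Request)
assert retry.meta.get('auto_proxy') is True and 'proxy' in retry.meta

On the first pass the middleware returns a higher-priority copy of the request carrying request.meta['proxy'], which the downstream HttpProxyMiddleware then applies; a second 302/403 response for the same request raises IgnoreRequest instead.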

 

3. Usage

(1) Add the following to the project's settings.py. Note that the middleware's priority must fall between the default RedirectMiddleware (600) and HttpProxyMiddleware (750).

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
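
(2) The run log in the next section was produced by a spider named test that requests a few httpbin.org status endpoints. The spider itself is not shown in the original post; a minimal reconstruction might look like this (the class name and start_urls are assumptions):

import scrapy


class TestSpider(scrapy.Spider):
    # Hypothetical spider that exercises the 200/302/403 handling shown below.
    name = 'test'
    start_urls = [
        'http://httpbin.org/',
        'http://httpbin.org/status/302',
        'https://httpbin.org/status/403',
    ]

    def parse(self, response):
        self.logger.info('Got %s for %s', response.status, response.url)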

 

4. Results

2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy

 

Proxy server log:

squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT

 
