URL deduplication: separating crawling from dedup logic


# Create a new file: duplication.py

# A dedicated file just for deduplication. Scrapy's source already defines the
# required interface, so we can copy its skeleton over as a starting point.
from scrapy.dupefilters import BaseDupeFilter  # "scrapy.dupefilter" in Scrapy < 1.0
'''
class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass
'''
# The code above is Scrapy's own BaseDupeFilter. The framework has the structure
# laid out for us, so all we need to do is customize it.


class DupeFilter(object):

    # In the constructor, set up the same set-based filtering as before
    def __init__(self):
        self.urls = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request.url is the URL about to be crawled.
        # If it is already in the set, return True: this is a duplicate
        # and should not be crawled again.
        if request.url in self.urls:
            return True

        # Not in the set yet: record it and return False, meaning the
        # spider has not crawled this URL and should proceed.
        self.urls.add(request.url)
        return False

    def open(self):  # called when the spider opens
        pass

    def close(self, reason):  # called when the spider closes
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

# Note the @classmethod from_settings, which simply returns cls(). This pattern
# is very common in Scrapy: we never instantiate the class ourselves. Scrapy
# calls this method and builds the instance automatically, so we only need to
# provide the expected structure.
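Storing full URLs in a set works, but memory use grows with URL length; Scrapy's built-in filter stores fixed-size request fingerprints instead. As a rough sketch of that idea (a plain MD5 of the URL, which is a simplification of Scrapy's real fingerprint that also covers the HTTP method and body):

```python
import hashlib


class FingerprintDupeFilter:
    """Hypothetical variant of the DupeFilter above that stores MD5
    fingerprints instead of raw URLs, keeping memory per entry fixed."""

    def __init__(self):
        self.fingerprints = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # Hash the URL to a fixed-size hex digest before membership testing
        fp = hashlib.md5(request.url.encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass
```

The interface is identical, so pointing `DUPEFILTER_CLASS` at this class instead would work the same way; only the internal representation changes.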

The spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request



class GetChoutiSpider(scrapy.Spider):
    name = 'get_chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # # parse() runs repeatedly during a recursive crawl, so a set such as
    # # md5_urls must not be defined inside the parse function
    # md5_urls = set()
    # Adding URLs to a set is our hand-rolled approach; Scrapy actually ships
    # with a better dedup mechanism of its own

    def parse(self, response):
        # The printed URLs confirm that Scrapy really did deduplicate for us
        print(response.url)
        '''
        https://dig.chouti.com/
        https://dig.chouti.com/all/hot/recent/2
        https://dig.chouti.com/all/hot/recent/10
        https://dig.chouti.com/all/hot/recent/8
        https://dig.chouti.com/all/hot/recent/6
        ...
        https://dig.chouti.com/all/hot/recent/118
        https://dig.chouti.com/all/hot/recent/119
        (output truncated: every pagination URL appears exactly once)
        '''
        # How do we deduplicate here? Define a filter class in a new file.
        res2 = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for url in res2:
            # Build the absolute URL from the relative pagination href
            url = "https://dig.chouti.com%s" % url
            yield Request(url=url, callback=self.parse)
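The spider above builds absolute URLs with string formatting. For illustration, `urljoin` from the standard library performs the same join and also copes with hrefs that are already absolute:

```python
from urllib.parse import urljoin

# Join relative pagination hrefs (as extracted by the xpath above)
# onto the site root to get absolute, crawlable URLs
base = "https://dig.chouti.com/"
hrefs = ["/all/hot/recent/2", "/all/hot/recent/3"]
urls = [urljoin(base, h) for h in hrefs]
```

Scrapy responses also expose `response.urljoin(href)`, which resolves against the response's own URL, so either approach avoids hard-coding the domain.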

settings.py:

# 0 means no depth limit: the spider follows pagination as deep as it goes
DEPTH_LIMIT = 0

# The settings file must point at the filter class to use; only then will
# Scrapy run our custom class for deduplication
DUPEFILTER_CLASS = 'chouti.duplication.DupeFilter'
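The `DUPEFILTER_CLASS` setting is just a dotted path string. A minimal sketch of how such a path can be resolved to the actual class (Scrapy provides `scrapy.utils.misc.load_object` for the real thing):

```python
import importlib


def load_object(path):
    # Split "chouti.duplication.DupeFilter" into module path and attribute
    # name, import the module, and fetch the class from it
    module_path, _, name = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)
```

Once resolved, Scrapy calls the class's `from_settings` classmethod, which is why our filter defines it even though it ignores the settings argument.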
