Python爬虫学习——布隆过滤器
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python爬虫学习——布隆过滤器相关的知识,希望对你有一定的参考价值。
布隆过滤器的实现方法1:自己实现
参考 http://www.cnblogs.com/naive/p/5815433.html
bllomFilter两个参数分别代表,布隆过滤器的大小和hash函数的个数
#coding:utf-8 #!/usr/bin/env python from bitarray import bitarray # 3rd party import mmh3 import scrapy from BeautifulSoup import BeautifulSoup as BS import os ls = os.linesep class BloomFilter(set): def __init__(self, size, hash_count): super(BloomFilter, self).__init__() self.bit_array = bitarray(size) self.bit_array.setall(0) self.size = size self.hash_count = hash_count def __len__(self): return self.size def __iter__(self): return iter(self.bit_array) def add(self, item): for ii in range(self.hash_count): index = mmh3.hash(item, ii) % self.size self.bit_array[index] = 1 return self def __contains__(self, item): out = True for ii in range(self.hash_count): index = mmh3.hash(item, ii) % self.size if self.bit_array[index] == 0: out = False return out class DmozSpider(scrapy.Spider): name = "baidu" allowed_domains = ["baidu.com"] start_urls = [ "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0" ] def parse(self, response): # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp" # # html = response.xpath(‘//html‘).extract()[0] # fobj = open(fname, ‘w‘) # fobj.writelines(html.encode(‘utf-8‘)) # fobj.close() bloom = BloomFilter(1000, 10) animals = [‘dog‘, ‘cat‘, ‘giraffe‘, ‘fly‘, ‘mosquito‘, ‘horse‘, ‘eagle‘, ‘bird‘, ‘bison‘, ‘boar‘, ‘butterfly‘, ‘ant‘, ‘anaconda‘, ‘bear‘, ‘chicken‘, ‘dolphin‘, ‘donkey‘, ‘crow‘, ‘crocodile‘] # First insertion of animals into the bloom filter for animal in animals: bloom.add(animal) # Membership existence for already inserted animals # There should not be any false negatives for animal in animals: if animal in bloom: print(‘{} is in bloom filter as expected‘.format(animal)) else: print(‘Something is terribly went wrong for {}‘.format(animal)) print(‘FALSE NEGATIVE!‘) # Membership existence for not inserted animals # There could be false positives other_animals = [‘badger‘, ‘cow‘, ‘pig‘, ‘sheep‘, ‘bee‘, ‘wolf‘, ‘fox‘, ‘whale‘, ‘shark‘, ‘fish‘, ‘turkey‘, ‘duck‘, ‘dove‘, ‘deer‘, ‘elephant‘, ‘frog‘, ‘falcon‘, ‘goat‘, ‘gorilla‘, ‘hawk‘] for other_animal in other_animals: if other_animal in bloom: print(‘{} is not in the bloom, but a false positive‘.format(other_animal)) else: print(‘{} is not in the bloom filter as expected‘.format(other_animal))
布隆过滤器的实现方法2:使用pybloom
参考 http://www.jianshu.com/p/f57187e2b5b9
#coding:utf-8 #!/usr/bin/env python from pybloom import BloomFilter import scrapy from BeautifulSoup import BeautifulSoup as BS import os ls = os.linesep class DmozSpider(scrapy.Spider): name = "baidu" allowed_domains = ["baidu.com"] start_urls = [ "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0" ] def parse(self, response): # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp" # # html = response.xpath(‘//html‘).extract()[0] # fobj = open(fname, ‘w‘) # fobj.writelines(html.encode(‘utf-8‘)) # fobj.close() # bloom = BloomFilter(100, 10) bloom = BloomFilter(1000, 0.001) animals = [‘dog‘, ‘cat‘, ‘giraffe‘, ‘fly‘, ‘mosquito‘, ‘horse‘, ‘eagle‘, ‘bird‘, ‘bison‘, ‘boar‘, ‘butterfly‘, ‘ant‘, ‘anaconda‘, ‘bear‘, ‘chicken‘, ‘dolphin‘, ‘donkey‘, ‘crow‘, ‘crocodile‘] # First insertion of animals into the bloom filter for animal in animals: bloom.add(animal) # Membership existence for already inserted animals # There should not be any false negatives for animal in animals: if animal in bloom: print(‘{} is in bloom filter as expected‘.format(animal)) else: print(‘Something is terribly went wrong for {}‘.format(animal)) print(‘FALSE NEGATIVE!‘) # Membership existence for not inserted animals # There could be false positives other_animals = [‘badger‘, ‘cow‘, ‘pig‘, ‘sheep‘, ‘bee‘, ‘wolf‘, ‘fox‘, ‘whale‘, ‘shark‘, ‘fish‘, ‘turkey‘, ‘duck‘, ‘dove‘, ‘deer‘, ‘elephant‘, ‘frog‘, ‘falcon‘, ‘goat‘, ‘gorilla‘, ‘hawk‘] for other_animal in other_animals: if other_animal in bloom: print(‘{} is not in the bloom, but a false positive‘.format(other_animal)) else: print(‘{} is not in the bloom filter as expected‘.format(other_animal))
输出
dog is in bloom filter as expected cat is in bloom filter as expected giraffe is in bloom filter as expected fly is in bloom filter as expected mosquito is in bloom filter as expected horse is in bloom filter as expected eagle is in bloom filter as expected bird is in bloom filter as expected bison is in bloom filter as expected boar is in bloom filter as expected butterfly is in bloom filter as expected ant is in bloom filter as expected anaconda is in bloom filter as expected bear is in bloom filter as expected chicken is in bloom filter as expected dolphin is in bloom filter as expected donkey is in bloom filter as expected crow is in bloom filter as expected crocodile is in bloom filter as expected badger is not in the bloom filter as expected cow is not in the bloom filter as expected pig is not in the bloom filter as expected sheep is not in the bloom filter as expected bee is not in the bloom filter as expected wolf is not in the bloom filter as expected fox is not in the bloom filter as expected whale is not in the bloom filter as expected shark is not in the bloom filter as expected fish is not in the bloom filter as expected turkey is not in the bloom filter as expected duck is not in the bloom filter as expected dove is not in the bloom filter as expected deer is not in the bloom filter as expected elephant is not in the bloom filter as expected frog is not in the bloom filter as expected falcon is not in the bloom filter as expected goat is not in the bloom filter as expected gorilla is not in the bloom filter as expected hawk is not in the bloom filter as expected
以上是关于Python爬虫学习——布隆过滤器的主要内容,如果未能解决你的问题,请参考以下文章
第三百五十八节,Python分布式爬虫打造搜索引擎Scrapy精讲—将bloomfilter(布隆过滤器)集成到scrapy-redis中