Bloom Filter Support for Scrapy-Redis

Scrapy-Redis-BloomFilter

This package adds Bloom filter support to Scrapy-Redis.

Installation

Install the package with pip:

pip install scrapy-redis-bloomfilter

Dependency:

  • Scrapy-Redis >= 0.6.8

Usage

Add these settings to settings.py:

# Use this scheduler
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

# Make all spiders share the same duplicates filter through Redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://localhost:6379'

# Number of hash functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6

# Size of the Bloom filter bit array in Redis, as a power of two:
# 30 means 2^30 bits = 128 MB. Defaults to 30.
# This example uses 10, i.e. 2^10 = 1,024 bits.
BLOOMFILTER_BIT = 10

# Keep the Redis queues and dupefilter after the spider closes, so crawls can be paused and resumed
SCHEDULER_PERSIST = True
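
To get a feel for how BLOOMFILTER_BIT and BLOOMFILTER_HASH_NUMBER trade memory against accuracy, the sketch below estimates the bit array size and the false-positive rate with the standard Bloom filter approximation p = (1 - e^(-k*n/m))^k. The helper names and the request count are illustrative assumptions and are not part of this package.

```python
import math


def bloom_memory_bytes(bit: int) -> int:
    """Bytes needed for a bit array of 2**bit bits."""
    return (1 << bit) // 8


def bloom_false_positive_rate(bit: int, hash_number: int, n_items: int) -> float:
    """Standard approximation: p = (1 - exp(-k * n / m)) ** k."""
    m = 1 << bit  # number of bits in the filter
    return (1.0 - math.exp(-hash_number * n_items / m)) ** hash_number


if __name__ == "__main__":
    # Hypothetical workload: 10 million distinct requests.
    n = 10_000_000
    for bit in (10, 20, 30):
        mem = bloom_memory_bytes(bit)
        p = bloom_false_positive_rate(bit, 6, n)
        print(f"bit={bit}: {mem} bytes, estimated false-positive rate {p:.3g}")
```

A small BLOOMFILTER_BIT saturates quickly, so nearly every new request would be reported as a duplicate; larger values cost more Redis memory but keep the estimated false-positive rate low.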

Test

The repository includes a test project. To run it:

git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test

Note: before running, change REDIS_URL in settings.py to point to your Redis instance.

The test spider looks like this:

from scrapy import Request, Spider

class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # These 100 requests include 10 duplicates of the requests above
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)

The crawl stats look like this:

{'bloomfilter/filtered': 10, # Number of requests filtered by the Bloom filter
 'downloader/request_bytes': 34021,
 'downloader/request_count': 100,
 'downloader/request_method_count/GET': 100,
 'downloader/response_bytes': 72943,
 'downloader/response_count': 100,
 'downloader/response_status_count/200': 100,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
 'log_count/DEBUG': 202,
 'log_count/INFO': 7,
 'memusage/max': 54153216,
 'memusage/startup': 54153216,
 'response_received_count': 100,
 'scheduler/dequeued/redis': 100,
 'scheduler/enqueued/redis': 100,
 'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}
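
The counts above (10 filtered, 100 scheduled) can be reproduced with a small in-memory Bloom filter. This is a minimal sketch for illustration, not the package's implementation: the class and function names are hypothetical, the bit array is a local bytearray rather than a Redis key, the hash scheme (salted SHA-1) is an assumption, and it keys on the raw URL whereas Scrapy dupefilters work on request fingerprints.

```python
import hashlib


class SketchBloomFilter:
    """In-memory Bloom filter; a stand-in for a Redis-backed bit array."""

    def __init__(self, bit: int = 20, hash_number: int = 6):
        self.m = 1 << bit              # number of bits in the filter
        self.k = hash_number           # number of hash functions
        self.bits = bytearray(self.m // 8)

    def _offsets(self, value: str):
        # Derive k bit offsets by salting one hash with the function index.
        for seed in range(self.k):
            digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def exists(self, value: str) -> bool:
        # Seen (or a false positive) only if every one of the k bits is set.
        return all(self.bits[o // 8] & (1 << (o % 8)) for o in self._offsets(value))

    def insert(self, value: str) -> None:
        for o in self._offsets(value):
            self.bits[o // 8] |= 1 << (o % 8)


def count_filtered(urls, bloom) -> int:
    """Count URLs the filter reports as already seen, inserting the rest."""
    filtered = 0
    for url in urls:
        if bloom.exists(url):
            filtered += 1
        else:
            bloom.insert(url)
    return filtered


# Same URL pattern as the test spider: 10 requests, then 100 that repeat the first 10.
base_url = "https://www.baidu.com/s?wd="
urls = [base_url + str(i) for i in range(10)] + [base_url + str(i) for i in range(100)]
filtered = count_filtered(urls, SketchBloomFilter())
print("filtered:", filtered, "scheduled:", len(urls) - filtered)
```

With a 2^20-bit filter and only 100 distinct URLs, a false positive is extremely unlikely, so the sketch should report the 10 repeated URLs as filtered and the remaining 100 as scheduled.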

Files for Scrapy-Redis-BloomFilter, version 0.8.1

  • Scrapy_Redis_BloomFilter-0.8.1-py2.py3-none-any.whl (6.3 kB), wheel, Python version py2.py3
  • Scrapy-Redis-BloomFilter-0.8.1.tar.gz (5.7 kB), source
