Bloom Filter Support for Scrapy-Redis

Project description

Scrapy-Redis-BloomFilter

This package adds Bloom filter based duplicate filtering to Scrapy-Redis: request fingerprints are tracked in a fixed-size Bloom filter rather than an ever-growing Redis set, trading a small false-positive rate for constant memory usage.

Installation

You can easily install this package with pip:

pip install scrapy-redis-bloomfilter

Dependency:

  • Scrapy-Redis >= 0.6.8

Usage

Add these settings to your settings.py:

# Use the scheduler provided by this package
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through Redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://localhost:6379'

# Number of hash functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6

# Size exponent of the Bloom filter: the bit array holds 2^BLOOMFILTER_BIT bits.
# 30 means 2^30 bits = 128 MB of Redis memory; defaults to 30. The value 10
# used here keeps the filter tiny and is only suitable for small test crawls.
BLOOMFILTER_BIT = 10

# Persist
SCHEDULER_PERSIST = True
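
BLOOMFILTER_BIT sets the size m of the bit array (m = 2^BLOOMFILTER_BIT bits) and BLOOMFILTER_HASH_NUMBER the number of hash functions k; after n insertions, the false-positive rate is approximately (1 - e^(-k*n/m))^k. To illustrate the mechanics, here is a minimal sketch of a Redis-backed Bloom filter built on Redis's GETBIT/SETBIT commands. The class name and hashing scheme are assumptions for illustration, not the package's actual implementation:

# A minimal sketch of a Redis-backed Bloom filter (illustrative only)
import hashlib

import redis

class BloomFilterSketch:
    def __init__(self, server, key, bit=30, hash_number=6):
        self.server = server   # a redis.StrictRedis connection
        self.key = key         # Redis key holding the bit array
        self.m = 1 << bit      # bit=30 -> 2^30 bits = 128 MB in Redis
        self.k = hash_number   # number of hash functions

    def _offsets(self, value):
        # Derive k bit positions by salting one hash k different ways
        return [
            int(hashlib.md5(('%d:%s' % (seed, value)).encode()).hexdigest(), 16) % self.m
            for seed in range(self.k)
        ]

    def exists(self, value):
        # All k bits set -> "probably seen"; false positives are
        # possible, false negatives are not
        return all(self.server.getbit(self.key, o) for o in self._offsets(value))

    def insert(self, value):
        for o in self._offsets(value):
            self.server.setbit(self.key, o, 1)

Because every insert and lookup touches exactly k bits, memory stays fixed at 2^BLOOMFILTER_BIT bits no matter how many fingerprints are stored; the cost is that exists() can occasionally return True for a fingerprint that was never inserted.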

Test

The repository ships with a test project. To run it:

git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test

Note: before running, change REDIS_URL in settings.py to point at your own Redis server.
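
For reference, redis-py accepts URLs of this general form; the host, password, and database number below are placeholder values:

# Local Redis, default database
REDIS_URL = 'redis://localhost:6379'

# With a password and an explicit database number
REDIS_URL = 'redis://:yourpassword@your.redis.host:6379/0'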

The test spider looks like this:

from scrapy import Request, Spider

class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # The first 10 URLs repeat here, so 10 of these 100 requests are duplicates
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)
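
Under the hood, the dupefilter computes a fingerprint for each request and asks the Bloom filter about it instead of a Redis set. A simplified sketch of the request_seen check, assuming a Bloom filter object like the one sketched earlier (illustrative, not the package's exact code):

from scrapy.utils.request import request_fingerprint

class DupeFilterSketch:
    def __init__(self, bf):
        self.bf = bf  # a Redis-backed Bloom filter instance

    def request_seen(self, request):
        # The fingerprint hashes the request's method, URL and body
        fp = request_fingerprint(request)
        if self.bf.exists(fp):
            return True   # probably seen before -> request is filtered
        self.bf.insert(fp)
        return False      # definitely new -> let it through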

The spider yields 110 requests in total; the Bloom filter catches the 10 duplicates, so exactly 100 requests reach the downloader. The resulting crawl stats look like this:

{'bloomfilter/filtered': 10, # number of requests filtered by the Bloom filter
 'downloader/request_bytes': 34021,
 'downloader/request_count': 100,
 'downloader/request_method_count/GET': 100,
 'downloader/response_bytes': 72943,
 'downloader/response_count': 100,
 'downloader/response_status_count/200': 100,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
 'log_count/DEBUG': 202,
 'log_count/INFO': 7,
 'memusage/max': 54153216,
 'memusage/startup': 54153216,
 'response_received_count': 100,
 'scheduler/dequeued/redis': 100,
 'scheduler/enqueued/redis': 100,
 'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Scrapy-Redis-BloomFilter-0.8.1.tar.gz (5.7 kB)

Built Distribution

Scrapy_Redis_BloomFilter-0.8.1-py2.py3-none-any.whl (6.3 kB)

File details

Details for the file Scrapy-Redis-BloomFilter-0.8.1.tar.gz.

File metadata

  • Download URL: Scrapy-Redis-BloomFilter-0.8.1.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.22.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.6

File hashes

Hashes for Scrapy-Redis-BloomFilter-0.8.1.tar.gz
Algorithm Hash digest
SHA256 53f769bf2c3a858aa0e8e19ff66f54557d22526f4fb7bb6219e1d9183db7beee
MD5 4fe7039aa027a5c1b62b0a7cab9b365f
BLAKE2b-256 739b40be6b7e3732928e17291040a7a5ad1e59c3a1e542789b9e4c9d18190d49

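To verify a download against the SHA256 digest above, compute the hash locally and compare; a minimal check using Python's standard hashlib:

import hashlib

# Compare the archive's SHA256 digest with the published value above
expected = '53f769bf2c3a858aa0e8e19ff66f54557d22526f4fb7bb6219e1d9183db7beee'
with open('Scrapy-Redis-BloomFilter-0.8.1.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, 'hash mismatch: file may be corrupted or tampered with'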

File details

Details for the file Scrapy_Redis_BloomFilter-0.8.1-py2.py3-none-any.whl.

File metadata

  • Download URL: Scrapy_Redis_BloomFilter-0.8.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.22.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.6

File hashes

Hashes for Scrapy_Redis_BloomFilter-0.8.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 8da71c092081f91f2091f93e0f240bef1869be69293138a643d1eb69fa55383f
MD5 d8f64ceefdd0cb3f10e075c4968f3473
BLAKE2b-256 5110c4cb99217b54fe769e8a45c75be19a400d196001d58d3bd0b53e5753cc34

