Bloom Filter Support for Scrapy-Redis
Scrapy-Redis-BloomFilter
This package adds Bloom filter support to Scrapy-Redis.
Installation
You can easily install this package with pip:
pip install scrapy-redis-bloomfilter
Dependency:
- Scrapy-Redis >= 0.6.8
Usage
Add these settings to settings.py:
# Use this scheduler
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"
# Ensure all spiders share the same duplicates filter through Redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
# Redis URL
REDIS_URL = 'redis://localhost:6379'
# Number of Hash Functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6
# Size of the Bloom filter bitmap in Redis as a power of two: 30 means 2^30 bits = 128 MB; defaults to 30
BLOOMFILTER_BIT = 10
# Keep the Redis queue and duplicates filter when the spider closes, so crawls can be paused and resumed
SCHEDULER_PERSIST = True
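As a rough guide to choosing BLOOMFILTER_BIT and BLOOMFILTER_HASH_NUMBER, the sketch below computes the memory taken by a bitmap of 2^bit bits and the textbook Bloom filter false-positive estimate (1 - e^(-k*n/m))^k. The helper names and the item counts are illustrative assumptions and are not part of this package.

```python
import math


def bloom_memory_bytes(bit: int) -> int:
    """Bytes needed for a bitmap of 2**bit bits."""
    return (1 << bit) // 8


def false_positive_rate(bit: int, hash_number: int, items: int) -> float:
    """Textbook estimate (1 - e^(-k*n/m))^k for m = 2**bit bits,
    k = hash_number hash functions and n = items stored fingerprints."""
    m = 1 << bit
    return (1.0 - math.exp(-hash_number * items / m)) ** hash_number


if __name__ == "__main__":
    # Default settings from above: 2^30 bits and 6 hash functions.
    print(bloom_memory_bytes(30) // (1024 * 1024), "MB")  # 128 MB
    # Hypothetical crawl sizes, chosen only to show the trend.
    for n in (1_000_000, 10_000_000, 100_000_000):
        print(n, false_positive_rate(30, 6, n))
```

The estimate grows quickly as the number of stored requests approaches the number of bits, so a larger BLOOMFILTER_BIT trades Redis memory for fewer wrongly filtered requests.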
Test
The repository includes a test project. To run it:
git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test
Note: change REDIS_URL in settings.py to point to your own Redis server first.
The spider looks like this:
from scrapy import Request, Spider


class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # This loop repeats the 10 requests above, so it contains 10 duplicates
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)
The result looks like this:
{'bloomfilter/filtered': 10, # Number of requests filtered by the Bloom filter
'downloader/request_bytes': 34021,
'downloader/request_count': 100,
'downloader/request_method_count/GET': 100,
'downloader/response_bytes': 72943,
'downloader/response_count': 100,
'downloader/response_status_count/200': 100,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
'log_count/DEBUG': 202,
'log_count/INFO': 7,
'memusage/max': 54153216,
'memusage/startup': 54153216,
'response_received_count': 100,
'scheduler/dequeued/redis': 100,
'scheduler/enqueued/redis': 100,
'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}
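To illustrate where the count of 10 filtered requests comes from, here is a minimal in-memory Bloom filter sketch fed with the same URL sequence as the test spider. It is not this package's implementation: the class, the salted SHA-1 hashing scheme, and the use of raw URLs instead of Scrapy request fingerprints are all assumptions made to keep the example self-contained and runnable without Redis.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter with 2**bit bits and hash_number hash functions.

    Illustrative only; a Redis-backed filter would keep the bitmap in Redis.
    """

    def __init__(self, bit: int = 20, hash_number: int = 6):
        self.m = 1 << bit
        self.hash_number = hash_number
        self.bits = bytearray(self.m // 8)

    def _offsets(self, value: str):
        # One bit offset per seed, derived from a salted SHA-1 digest.
        for seed in range(self.hash_number):
            digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def exists(self, value: str) -> bool:
        # A value may be present only if every one of its bits is set.
        return all(self.bits[o >> 3] & (1 << (o & 7)) for o in self._offsets(value))

    def insert(self, value: str) -> None:
        for o in self._offsets(value):
            self.bits[o >> 3] |= 1 << (o & 7)


def count_filtered(urls, bloom: BloomFilter) -> int:
    """Count URLs rejected as already seen, inserting the unseen ones."""
    filtered = 0
    for url in urls:
        if bloom.exists(url):
            filtered += 1
        else:
            bloom.insert(url)
    return filtered


if __name__ == "__main__":
    base_url = "https://www.baidu.com/s?wd="
    # Same order as the spider: 10 requests, then 100 that repeat the first 10.
    urls = [base_url + str(i) for i in range(10)]
    urls += [base_url + str(i) for i in range(100)]
    print(count_filtered(urls, BloomFilter()))
```

The 10 repeated URLs are always rejected, because a Bloom filter never misses a value it has stored. An unseen URL can occasionally be rejected as well (a false positive), which is the price paid for the fixed memory footprint.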