Bloom Filter Support for Scrapy-Redis
Scrapy-Redis-BloomFilter
This package adds Bloom filter support to Scrapy-Redis.
Installation
You can easily install this package with pip:
pip install scrapy-redis-bloomfilter
Dependency:
- Scrapy-Redis >= 0.6.8
Usage
Add these settings to settings.py:
# Use this scheduler
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"
# Ensure all spiders share the same duplicates filter through Redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
# Redis URL
REDIS_URL = 'redis://localhost:6379'
# Number of Hash Functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6
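# Size of the Bloom filter bit array in Redis, as a power of two:
# 30 means 2^30 bits = 128 MB; defaults to 30.
# The value 10 used here gives 2^10 bits = 128 bytes, which only suits small tests.
BLOOMFILTER_BIT = 10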
# Persist
SCHEDULER_PERSIST = True
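To get a feel for how BLOOMFILTER_BIT and BLOOMFILTER_HASH_NUMBER trade memory against false positives, here is a rough sketch. It uses the textbook approximation (1 - e^(-k*n/m))^k for a Bloom filter with m bits, k hash functions, and n inserted items. The helper names bloom_memory_bytes and false_positive_rate are hypothetical and are not part of this package; the real rate also depends on the hash functions the package uses.

```python
import math


def bloom_memory_bytes(bit: int) -> int:
    """Bytes needed for a bit array of 2**bit bits (hypothetical helper)."""
    return (1 << bit) // 8


def false_positive_rate(bit: int, hash_number: int, n_items: int) -> float:
    """Textbook estimate (1 - e^(-k*n/m))**k with m = 2**bit bits.

    This is an approximation, not a measurement of this package.
    """
    m = 1 << bit
    return (1.0 - math.exp(-hash_number * n_items / m)) ** hash_number


if __name__ == "__main__":
    # Compare a few candidate BLOOMFILTER_BIT values for one million requests.
    for bit in (10, 20, 30):
        size = bloom_memory_bytes(bit)
        rate = false_positive_rate(bit, hash_number=6, n_items=1_000_000)
        print(f"bit={bit} memory={size} bytes estimated_fp_rate={rate:.3e}")
```

Under this estimate, a very small bit array saturates quickly and reports almost every new request as a duplicate, so size BLOOMFILTER_BIT for the number of requests you expect to crawl.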
Test
The repository includes a test project. To run it:
git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test
Note: change REDIS_URL in settings.py to point to your own Redis instance before running.
The spider looks like this:
from scrapy import Request, Spider


class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        # First batch: 10 unique requests
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # Second batch: 100 requests, the first 10 of which duplicate the batch above
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)
The resulting crawl stats look like this:
{'bloomfilter/filtered': 10, # Number of requests filtered out by the Bloom filter
'downloader/request_bytes': 34021,
'downloader/request_count': 100,
'downloader/request_method_count/GET': 100,
'downloader/response_bytes': 72943,
'downloader/response_count': 100,
'downloader/response_status_count/200': 100,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
'log_count/DEBUG': 202,
'log_count/INFO': 7,
'memusage/max': 54153216,
'memusage/startup': 54153216,
'response_received_count': 100,
'scheduler/dequeued/redis': 100,
'scheduler/enqueued/redis': 100,
'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}
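The filtered count of 10 can be reproduced without Scrapy or Redis. The following is a minimal in-memory sketch of Bloom filter deduplication, not the package's implementation: it keeps the bits in a bytearray instead of a Redis bitmap, derives its hash functions from seeded MD5 digests (an arbitrary choice for illustration), and checks raw URL strings rather than whatever key the package's dupefilter actually stores. The class InMemoryBloomFilter and the function count_filtered are hypothetical names.

```python
import hashlib


class InMemoryBloomFilter:
    """Illustrative Bloom filter backed by a local bytearray (not Redis)."""

    def __init__(self, bit=20, hash_number=6):
        self.m = 1 << bit                   # number of bits in the filter
        self.k = hash_number                # number of hash functions
        self.bits = bytearray(self.m // 8)  # bit array, all zeros

    def _offsets(self, value):
        # Derive k bit offsets from seeded MD5 digests (assumed hashing scheme).
        for seed in range(self.k):
            digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def exists(self, value):
        # Reported as present only if every one of the k bits is set.
        return all(self.bits[o >> 3] & (1 << (o & 7)) for o in self._offsets(value))

    def insert(self, value):
        for o in self._offsets(value):
            self.bits[o >> 3] |= 1 << (o & 7)


def count_filtered(urls, bloom):
    """Return how many URLs the filter reports as already seen."""
    filtered = 0
    for url in urls:
        if bloom.exists(url):
            filtered += 1
        else:
            bloom.insert(url)
    return filtered


if __name__ == "__main__":
    base_url = "https://www.baidu.com/s?wd="
    # Same request pattern as the test spider: 10 URLs, then 100 URLs.
    urls = [base_url + str(i) for i in range(10)]
    urls += [base_url + str(i) for i in range(100)]
    print(count_filtered(urls, InMemoryBloomFilter()))
```

With 2^20 bits and only 100 distinct URLs, a false positive is extremely unlikely, so the sketch should report the same 10 duplicates as the stats above. A Bloom filter never misses a true duplicate, but with a smaller bit array it can report new URLs as seen.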