scrapy_redis with Bloom filter deduplication

Project description

Scrapy-Redis-BloomFilter

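This is a Bloom filter version of scrapy_redis. Unlike https://github.com/Python3WebSpider/ScrapyRedisBloomFilter , this project uses the Bloom filter built into Redis rather than implementing one on top of bit operations.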
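To illustrate what server-side Bloom filter deduplication can look like, here is a minimal sketch built on the RedisBloom commands `BF.RESERVE` and `BF.ADD`. It is not this package's actual code: the function names, the key name, and the SHA-1 URL fingerprint are assumptions made for illustration, and `client` is assumed to be any object with a redis-py style `execute_command` method.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Hypothetical fingerprint: SHA-1 of the URL (Scrapy's real one covers more)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def ensure_filter(client, key: str, error_rate: float = 0.001, capacity: int = 10000) -> None:
    """Create the Bloom filter under `key` unless the key already exists."""
    if not client.execute_command("EXISTS", key):
        client.execute_command("BF.RESERVE", key, error_rate, capacity)


def seen_before(client, key: str, url: str) -> bool:
    """Add the URL's fingerprint and report whether it was probably seen already.

    BF.ADD replies 1 when the item is newly added and 0 when the filter
    believes it is already present (which may be a false positive).
    """
    added = client.execute_command("BF.ADD", key, url_fingerprint(url))
    return not added
```

With a real server, `client` could be `redis.Redis.from_url("redis://localhost:6379")`. Because the filter lives in Redis, every crawler process sharing the same key shares the same deduplication state.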

Requirements

Redis must have the Bloom filter module loaded; a default Redis installation does not load it. For details, see: https://redis.io/docs/stack/bloom/quick_start/
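One way to check this requirement before starting a crawl is to send a harmless Bloom filter command and see whether the server recognizes it. The sketch below is an assumption-laden illustration, not part of this package: `bloom_module_loaded` and the probe key are hypothetical names, and it relies on Redis rejecting commands it does not know with an "unknown command" error.

```python
def bloom_module_loaded(client, probe_key: str = "scrapy_redis_bf:probe") -> bool:
    """Return True if the server accepts RedisBloom commands.

    `client` is assumed to expose a redis-py style `execute_command`.
    BF.EXISTS on a missing key simply replies 0 when the module is loaded,
    so the probe does not create any data.
    """
    try:
        client.execute_command("BF.EXISTS", probe_key, "x")
    except Exception as exc:
        # Without the module, Redis replies with an "unknown command" error.
        if "unknown command" in str(exc).lower():
            return False
        raise  # Connection problems and other errors are not swallowed.
    return True
```

With redis-py installed, this could be called as `bloom_module_loaded(redis.Redis.from_url("redis://localhost:6379"))`.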

Installation

Using pip: pip install scrapy-redis-bf

Usage

Add the following settings to settings.py in your Scrapy project:

SCHEDULER = "scrapy_redis_bf.scheduler.Scheduler"

DUPEFILTER_CLASS = "scrapy_redis_bf.dupefilter.RFPDupeFilter"

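# Format: redis://[:password@]host[:port][/database][?[timeout=timeout[d|h|m|s|ms|us|ns]][&database=database]]
REDIS_URL = 'redis://localhost:6379'
# Acceptable false-positive (error) rate of the Bloom filter
BLOOMFILTER_ERRORRATE = 0.001
# Expected number of items to deduplicate (filter capacity)
BLOOMFILTER_CAPACITY = 10000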
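To get a feel for how the error rate and capacity settings trade off against memory, the textbook Bloom filter sizing formulas can be evaluated directly: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. This is a rough estimate only; the function name is hypothetical, and the memory RedisBloom actually allocates may differ from the textbook figure.

```python
import math


def bloom_parameters(capacity: int, error_rate: float) -> tuple[int, int]:
    """Estimate (bits, hash_count) for a classic Bloom filter.

    Uses the standard optimal-sizing formulas, not RedisBloom's internals.
    """
    if capacity <= 0 or not 0.0 < error_rate < 1.0:
        raise ValueError("capacity must be positive and error_rate in (0, 1)")
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / capacity * math.log(2)))
    return bits, hash_count


bits, hash_count = bloom_parameters(capacity=10000, error_rate=0.001)
print(f"{bits} bits (~{bits / 8 / 1024:.1f} KiB), {hash_count} hash functions")
```

For the values shown in the settings above, this comes to 143776 bits (under 18 KiB) and 10 hash functions. Lowering the error rate or raising the capacity increases the size accordingly.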

Testing

Download the project, then run the test spider included in it.

GitHub

https://github.com/kanadeblisst/scrapy_redis_bf

Source Distribution

scrapy_redis_bf-0.1.0.tar.gz (5.4 kB view details)

Uploaded Source

Built Distribution

scrapy_redis_bf-0.1.0-py3-none-any.whl (6.4 kB view details)

Uploaded Python 3

File details

Details for the file scrapy_redis_bf-0.1.0.tar.gz.

File metadata

  • Download URL: scrapy_redis_bf-0.1.0.tar.gz
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for scrapy_redis_bf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 35abc366972ec72da214294d43a7da26aafb4e2eae872640b9624355a43eff60
MD5 f7efbe56958de214d93e0b0a52aae4de
BLAKE2b-256 c271dc6607784c72daab664773023b2b6befda0f8a325670173c65e664dd8ef2

File details

Details for the file scrapy_redis_bf-0.1.0-py3-none-any.whl.

File hashes

Hashes for scrapy_redis_bf-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9674d53313c0fa379d49056937386c7aab5b6aba3834dd553cd3bcb4dcae1bcc
MD5 f7b760f35665aa076dad1be717218627
BLAKE2b-256 78e99c3343c6c40f0f4ee0782e136a8f44f3c69854089900ef37d5256ba9827b
