A scrapy_redis variant that uses a Bloom filter

Project description

Scrapy-Redis-BloomFilter

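This is a Bloom filter version of scrapy_redis. Unlike https://github.com/Python3WebSpider/ScrapyRedisBloomFilter, this project uses the Bloom filter provided by Redis itself rather than implementing one on top of bit operations.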
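To illustrate the idea of delegating deduplication to the Redis Bloom filter, here is a minimal sketch. It is not this package's actual code: the `fingerprint` helper (SHA-1 of the URL) and the `request_seen` function are hypothetical names chosen for illustration. The only Redis behavior assumed is that `BF.ADD` returns 1 when an item is newly added and 0 when the filter reports it may already be present.

```python
import hashlib


def fingerprint(url: str) -> str:
    """Hypothetical stand-in for a request fingerprint: SHA-1 of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def request_seen(client, key: str, url: str) -> bool:
    """Return True if the URL was (probably) seen before.

    `client` is any object with an `execute_command` method, such as a
    redis-py client. BF.ADD replies 1 for a newly added item and 0 when
    the filter says the item may already exist.
    """
    added = client.execute_command("BF.ADD", key, fingerprint(url))
    return int(added) == 0
```

With redis-py, a client could be created with `redis.Redis.from_url("redis://localhost:6379")` and passed in as `client`. Because a Bloom filter can return false positives, a small share of unseen URLs (bounded by the configured error rate) may be reported as already seen.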

Requirements

Redis must have the Bloom filter module loaded; a default Redis installation does not load it. For details, see: https://redis.io/docs/stack/bloom/quick_start/
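One way to check whether a server accepts Bloom filter commands is to probe it with `BF.EXISTS` on a key that does not exist. The sketch below assumes that a server without the module replies with an "unknown command" error; the function name `bloom_commands_available` and the probe key are hypothetical.

```python
def bloom_commands_available(client) -> bool:
    """Return True if the server accepts the BF.EXISTS command.

    `client` is any object with an `execute_command` method, such as a
    redis-py client. The probe key name is arbitrary (an assumption);
    BF.EXISTS is read-only and does not create the key.
    """
    try:
        client.execute_command("BF.EXISTS", "bf:probe:nonexistent", "x")
    except Exception as exc:
        # Assumption: a server without the module reports "unknown command".
        if "unknown command" in str(exc).lower():
            return False
        raise  # Connection problems and other errors are not swallowed.
    return True
```

With redis-py installed, this could be called as `bloom_commands_available(redis.Redis.from_url("redis://localhost:6379"))`.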

Installation

With pip: pip install scrapy-redis-bf

Usage

Add the following settings to settings.py in your Scrapy project:

SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Format: redis://[:password@]host[:port][/database][?[timeout=timeout[d|h|m|s|ms|us|ns]][&database=database]]
REDIS_URL = 'redis://localhost:6379'
# False-positive (error) rate
BLOOMFILTER_ERRORRATE = 0.001
# Capacity: expected number of items to deduplicate
BLOOMFILTER_CAPACITY = 10000
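To get a feel for how the error rate and capacity trade off against memory, the sketch below applies the textbook sizing formulas for a classic Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. This is only an estimate; the Redis Bloom filter implementation may allocate memory differently, and the function name `bloom_size` is hypothetical.

```python
import math
from typing import Tuple


def bloom_size(capacity: int, error_rate: float) -> Tuple[int, int]:
    """Estimate (bits, hash_count) for a classic Bloom filter.

    Textbook formulas only; not necessarily what Redis allocates.
    """
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


# Values taken from the settings above.
bits, hashes = bloom_size(capacity=10000, error_rate=0.001)
print(f"bits={bits}, hashes={hashes}, approx KiB={bits / 8 / 1024:.1f}")
```

Lowering the error rate or raising the capacity increases the estimated size. A filter that receives more items than its planned capacity will generally see a higher false-positive rate than configured.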

Testing

Download this project and run the test spider included in it.

Github

https://github.com/kanadeblisst/scrapy_redis_bf

Download files

Source Distribution

scrapy_redis_bf-0.0.9.tar.gz (5.2 kB)

Uploaded Source

Built Distribution

scrapy_redis_bf-0.0.9-py3-none-any.whl (6.2 kB)

Uploaded Python 3

File details

Details for the file scrapy_redis_bf-0.0.9.tar.gz.

File metadata

  • Download URL: scrapy_redis_bf-0.0.9.tar.gz
  • Size: 5.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for scrapy_redis_bf-0.0.9.tar.gz
Algorithm Hash digest
SHA256 aad1adbe23e0918d8eb954afab1647fe0c728d41eb341e499a7bc733b83fe48f
MD5 0971cadb77176bef845e75809d6d994c
BLAKE2b-256 650ded27c417490fc8f0ef5125956c2def2991e33136c6daecd0c80611933d5f

File details

Details for the file scrapy_redis_bf-0.0.9-py3-none-any.whl.

File hashes

Hashes for scrapy_redis_bf-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 6302b176ef0afd808e3839e0c53202471be8c464f7aa2efe07e1738f9e191839
MD5 6f1bac4709f50f2c383928c4213576c5
BLAKE2b-256 2a737b45f1b8fd3425517b908df4908947048c65f9148a0ecab691339b34eb46
