
Adds a Scrapy pipeline that deduplicates items by a combination of target fields

Project description

Handles several PDF watermark tasks

pip install scrapy-bloomerfilter
  • Add the pipeline to the Scrapy pipeline configuration
ITEM_PIPELINES = {'scrapy_bloomerfiler.bloomerfilerpipeline': 400}
  • Add REDIS_HOST / REDIS_PORT / REDIS_DB / REDIS_PASSWORD to scrapy.cfg
Test environment
[redis_cfg_dev]
REDIS_HOST = ***
REDIS_PORT = ***
REDIS_DB = ***
REDIS_PASSWORD= ***

Production environment
[redis_cfg_prod]
REDIS_HOST = ***
REDIS_PORT = ***
REDIS_DB = ***
REDIS_PASSWORD= ***
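
The two sections above can be read with Python's standard `configparser`. The following is a minimal sketch of picking the section by environment, not the package's actual loading code: the function name `load_redis_config`, the returned key names, and the sample host, port, and password values are all hypothetical placeholders.

```python
import configparser

# Hypothetical sample mirroring the two scrapy.cfg sections; all values are placeholders.
SAMPLE_CFG = """
[redis_cfg_dev]
REDIS_HOST = 127.0.0.1
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_PASSWORD = dev-secret

[redis_cfg_prod]
REDIS_HOST = redis.example.internal
REDIS_PORT = 6379
REDIS_DB = 1
REDIS_PASSWORD = prod-secret
"""


def load_redis_config(cfg_text, if_prod):
    """Return Redis connection parameters from the section matching the environment."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["redis_cfg_prod" if if_prod else "redis_cfg_dev"]
    return {
        "host": section["REDIS_HOST"],
        "port": section.getint("REDIS_PORT"),
        "db": section.getint("REDIS_DB"),
        "password": section["REDIS_PASSWORD"],
    }


dev_config = load_redis_config(SAMPLE_CFG, if_prod=False)
prod_config = load_redis_config(SAMPLE_CFG, if_prod=True)
```

In a real project the text would come from the project's scrapy.cfg file (for example via `parser.read("scrapy.cfg")`) instead of an inline string.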
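  • Add to the Scrapy settings
IF_PROD: whether to use the production configuration, e.g. True
Data_Size: expected data volume, e.g. 1000*10000 / 百万级 (millions) / 千万级 (tens of millions) / 千万 (ten million)
Aim_Set: the set of fields used as the deduplication key (type given as <dict>), e.g. {"title","all_json"}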
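
To illustrate how a field combination and an expected data volume can drive deduplication, here is a small self-contained sketch. It is an assumption about the general technique, not the package's code: the package stores its state in Redis, whereas this `BloomFilter` class is in-memory, and the names `BloomFilter`, `fingerprint`, `is_duplicate`, and the `error_rate` parameter are hypothetical.

```python
import hashlib
import json
import math


class BloomFilter:
    """Minimal in-memory Bloom filter (illustrative stand-in for a Redis-backed one)."""

    def __init__(self, expected_items, error_rate=0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def fingerprint(item, fields):
    """Build a stable key from the chosen fields of an item."""
    # Sort the field names so that set ordering does not change the key.
    selected = {name: item.get(name) for name in sorted(fields)}
    return json.dumps(selected, sort_keys=True, ensure_ascii=False, default=str)


def is_duplicate(item, fields, seen):
    """Return True if the field combination was probably seen before; otherwise record it."""
    key = fingerprint(item, fields)
    if key in seen:
        return True
    seen.add(key)
    return False


AIM_SET = {"title", "all_json"}  # fields forming the deduplication key
seen = BloomFilter(expected_items=10_000)  # plays the role of Data_Size
first = is_duplicate({"title": "a", "all_json": "{}", "url": "u1"}, AIM_SET, seen)
repeat = is_duplicate({"title": "a", "all_json": "{}", "url": "u2"}, AIM_SET, seen)
```

Here `repeat` is a duplicate even though `url` differs, because only the fields in `AIM_SET` enter the key. Inside a Scrapy pipeline's `process_item`, a duplicate would typically be discarded by raising `scrapy.exceptions.DropItem`. A Bloom filter can report false positives but never false negatives, so a small share of new items may be dropped in exchange for bounded memory.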
  • Environment variable (optional)
IF_PROD: whether to use the production configuration, e.g. True
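
Because IF_PROD can come from either the Scrapy settings or the environment, some resolution rule is needed. The sketch below assumes the environment variable takes precedence when it is set; the source does not state the actual precedence, and the helper name `resolve_if_prod` and the accepted truthy strings are hypothetical.

```python
import os

# Strings treated as "true" when read from the environment (an assumption).
TRUE_VALUES = {"1", "true", "yes", "on"}


def resolve_if_prod(settings_value=False, environ=None):
    """Return the IF_PROD flag, preferring the environment variable when it is set."""
    environ = os.environ if environ is None else environ
    raw = environ.get("IF_PROD")
    if raw is None:
        return bool(settings_value)
    # Environment variables are always strings, so "False" must not count as truthy.
    return raw.strip().lower() in TRUE_VALUES
```

For example, `resolve_if_prod(False, {"IF_PROD": "True"})` selects the production configuration, while an unset variable falls back to the value from the settings.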

Source Distribution

scrapy_bloomerfiler-0.1.2.tar.gz (4.7 kB)