Utilities that are used by any spider of the Behoof project
Project description
Overview
The bhfutils package is a collection of utilities that are used by any spider of the Behoof project.
Requirements
Unix-based machine (Linux or OS X)
Python 2.7 or 3.6
Installation
Inside a virtualenv, run pip install -U bhfutils. This will install the latest version of the Behoof Scrapy Cluster Spider utilities. After that, you can use a special settings.py that is compatible with Scrapy Cluster (a template is located at crawler/setting_template.py).
Documentation
Full documentation for the bhfutils package does not exist.
distributed_scheduler.py
The distributed_scheduler module is a Scrapy request scheduler that uses Redis Throttled Priority Queues to moderate scrape requests for different domains within a distributed Scrapy cluster.
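The idea of one throttled priority queue per domain can be illustrated with a small in-memory sketch. This is not the bhfutils implementation: it uses no Redis, and the class names (ThrottledPriorityQueue, DomainScheduler) and the hits/window parameters are hypothetical, chosen only to show how priority ordering and a sliding-window rate limit could interact.

```python
import heapq
import itertools
import time
from collections import deque


class ThrottledPriorityQueue:
    """Hypothetical in-memory stand-in for a throttled priority queue."""

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits        # max successful pops allowed per window
        self.window = window    # window length in seconds
        self.clock = clock      # injectable clock, handy for testing
        self._heap = []         # entries: (-priority, sequence, item)
        self._seq = itertools.count()
        self._recent = deque()  # timestamps of recent successful pops

    def push(self, item, priority=0):
        # Negate the priority so that higher priorities pop first;
        # the sequence number keeps insertion order among equal priorities.
        heapq.heappush(self._heap, (-priority, next(self._seq), item))

    def pop(self):
        """Return the next item, or None if the queue is empty or throttled."""
        now = self.clock()
        # Forget pops that have left the sliding window.
        while self._recent and now - self._recent[0] >= self.window:
            self._recent.popleft()
        if not self._heap or len(self._recent) >= self.hits:
            return None
        self._recent.append(now)
        return heapq.heappop(self._heap)[2]


class DomainScheduler:
    """Keeps one throttled queue per domain and polls them in turn."""

    def __init__(self, hits, window, clock=time.monotonic):
        self._hits, self._window, self._clock = hits, window, clock
        self._queues = {}

    def enqueue(self, domain, url, priority=0):
        queue = self._queues.get(domain)
        if queue is None:
            queue = ThrottledPriorityQueue(self._hits, self._window, self._clock)
            self._queues[domain] = queue
        queue.push(url, priority)

    def next_request(self):
        # Return the first URL whose domain is not currently throttled.
        for queue in self._queues.values():
            url = queue.pop()
            if url is not None:
                return url
        return None
```

In a distributed setup the queues and pop timestamps would have to live in shared storage such as Redis, so that every spider process sees the same throttle state; the sketch keeps them in process memory only to stay self-contained.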
redis_domain_max_page_filter.py
The redis_domain_max_page_filter module is a Redis-based max page filter. It is applied per domain and bounds the maximum number of pages crawled for a particular domain.
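A per-domain page cap can be sketched as a counter keyed by domain. The example below is an assumption-laden illustration, not the package's code: the DomainMaxPageFilter class, its allow method, and the max_pages argument are hypothetical names, and a plain dictionary stands in for Redis.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DomainMaxPageFilter:
    """Hypothetical in-memory per-domain page cap (no Redis involved)."""

    def __init__(self):
        self._counts = defaultdict(int)  # domain -> pages allowed so far

    def allow(self, url, max_pages):
        """Return True and count the page if the domain is under its cap."""
        domain = urlsplit(url).hostname or ""
        if self._counts[domain] >= max_pages:
            return False
        self._counts[domain] += 1
        return True


page_filter = DomainMaxPageFilter()
for path in ("1", "2", "3"):
    url = "http://example.com/" + path
    print(url, page_filter.allow(url, max_pages=2))
```

With max_pages=2 the first two URLs are allowed and the third is rejected, while other domains keep their own independent counters.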
redis_dupefilter.py
The redis_dupefilter module is a Redis-based duplicate request filter.
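Duplicate filtering is commonly done by reducing each request to a fingerprint and remembering the fingerprints already seen. The following is a minimal sketch of that general technique, assuming a fingerprint built from the HTTP method and URL only. The fingerprint function and SeenRequestFilter class are hypothetical, and a Python set replaces the Redis storage the module actually relies on.

```python
import hashlib


def fingerprint(method, url):
    """Hash the method and URL into a fixed-length hex fingerprint."""
    data = "{} {}".format(method.upper(), url).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


class SeenRequestFilter:
    """Hypothetical in-memory duplicate filter (a set instead of Redis)."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, method, url):
        """Return True if an equivalent request was already recorded."""
        fp = fingerprint(method, url)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

Storing fixed-length fingerprints rather than full URLs keeps memory use predictable, which matters more once the set is shared between many spiders.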
redis_global_page_per_domain_filter.py
The redis_global_page_per_domain_filter module is a Redis-based filter on the number of requests. When this filter is enabled, every crawl job is subject to GLOBAL_PAGE_PER_DOMAIN_LIMIT, a hard limit on the maximum number of pages it may crawl for each individual spiderid+domain+crawlid combination.
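The difference from a per-domain cap is the counter key: here every spiderid+domain+crawlid combination gets its own count, checked against a single global limit. The sketch below assumes an in-memory dictionary instead of Redis; the GlobalPagePerDomainFilter class and its allow method are hypothetical, and the limit value is purely illustrative.

```python
GLOBAL_PAGE_PER_DOMAIN_LIMIT = 3  # illustrative value, not a package default


class GlobalPagePerDomainFilter:
    """Hypothetical in-memory version of a global page-per-domain limit."""

    def __init__(self, limit=GLOBAL_PAGE_PER_DOMAIN_LIMIT):
        self.limit = limit
        self._counts = {}  # (spiderid, domain, crawlid) -> pages allowed

    def allow(self, spiderid, domain, crawlid):
        """Count the page and return True unless the combination hit the limit."""
        key = (spiderid, domain, crawlid)
        count = self._counts.get(key, 0)
        if count >= self.limit:
            return False
        self._counts[key] = count + 1
        return True
```

Because the crawl ID is part of the key, starting a new crawl of the same domain begins with a fresh counter rather than inheriting the count of an earlier crawl.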
Download files
Source Distribution
File details
Details for the file bhfutils-0.0.66.tar.gz.
File metadata
- Download URL: bhfutils-0.0.66.tar.gz
- Upload date:
- Size: 81.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.9.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9d0c7439a1272498cb94ce131131d5837d68ee71705fca460b6b20ef84e88b05
MD5 | 0ff12cb579c11f5ba803e91196dfa29c
BLAKE2b-256 | 71ed47c6d698bddd28ebda067d7f52ec9c85f4431fc74f5a8ab735d807ddd33f