
Utilities used by every spider in the Behoof project


Overview

The bhfutils package is a collection of utilities used by every spider in the Behoof project.

Requirements

  • Unix-based machine (Linux or OS X)

  • Python 2.7 or 3.6

Installation

Inside a virtualenv, run pip install -U bhfutils. This installs the latest version of the Behoof Scrapy Cluster spider utilities. You can then use a settings.py that is compatible with Scrapy Cluster; a template is provided in crawler/setting_template.py.
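To confirm that the install worked, one option is to ask Python which version of the distribution it can see. The sketch below uses the standard-library importlib.metadata module, which needs Python 3.8 or newer. The helper name installed_version is made up for this example and is not part of bhfutils.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("bhfutils")
    print(found if found else "bhfutils is not installed in this environment")
```

Run it with the virtualenv's interpreter so that it inspects the same environment pip installed into.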

Documentation

There is no full documentation for the bhfutils package. The modules are summarized below.

custom_cookies.py

The custom_cookies module is a custom cookies middleware that passes our required cookies along with each request but does not persist them between calls.
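The idea of passing cookies along without persisting them can be illustrated with a small stand-in. This is only a sketch under assumptions: requests and responses are plain dicts rather than Scrapy objects, and the class name StatelessCookies and the helper make_request are hypothetical, not the module's actual API.

```python
class StatelessCookies:
    """Attach each request's own cookies; never store cookies from responses."""

    def process_request(self, request):
        cookies = request.get("cookies", {})
        if cookies:
            # Build the Cookie header from this request's cookies only.
            request["headers"]["Cookie"] = "; ".join(
                f"{name}={value}" for name, value in sorted(cookies.items())
            )
        return request

    def process_response(self, request, response):
        # A persistent middleware would save Set-Cookie values into a jar here.
        # This one deliberately ignores them, so nothing leaks into later calls.
        return response


def make_request(url, cookies=None):
    """Hypothetical helper: a request represented as a plain dict."""
    return {"url": url, "cookies": cookies or {}, "headers": {}}


mw = StatelessCookies()
first = mw.process_request(make_request("https://example.com/a", {"session": "abc"}))
mw.process_response(first, {"headers": {"Set-Cookie": "tracker=1"}})
second = mw.process_request(make_request("https://example.com/b"))
```

The second request carries no Cookie header, because neither the first request's cookies nor the response's Set-Cookie value were kept.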

distributed_scheduler.py

The distributed_scheduler module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests for different domains within a distributed Scrapy cluster.
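A throttled priority queue can be sketched in memory to show the behaviour described above. The assumptions here are significant: a heap and a deque stand in for Redis, the throttle is a simple "at most hits pops per window seconds" rule, the clock is passed in explicitly, and every name (ThrottledPriorityQueue, enqueue, queues) is hypothetical rather than taken from the module.

```python
import heapq
import itertools
from collections import deque


class ThrottledPriorityQueue:
    """In-memory stand-in: at most `hits` pops per `window` seconds."""

    def __init__(self, hits, window):
        self.hits = hits
        self.window = window
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._pops = deque()  # timestamps of recent successful pops

    def push(self, item, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def pop(self, now):
        # Forget pops that have fallen out of the throttle window.
        while self._pops and now - self._pops[0] >= self.window:
            self._pops.popleft()
        if not self._heap or len(self._pops) >= self.hits:
            return None  # empty, or throttled for now
        self._pops.append(now)
        return heapq.heappop(self._heap)[2]


queues = {}  # one throttled queue per domain


def enqueue(domain, url, priority, hits=2, window=60):
    """Route a URL to its domain's queue, creating the queue on first use."""
    queue = queues.setdefault(domain, ThrottledPriorityQueue(hits, window))
    queue.push(url, priority)
```

Keeping one queue per domain means a busy domain is throttled on its own, without slowing requests to other domains.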

redis_domain_max_page_filter.py

The redis_domain_max_page_filter module is a Redis-based max page filter. The filter is applied per domain and bounds the maximum number of pages crawled for any particular domain.
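The per-domain bound can be sketched with a counter keyed by domain. This is an illustration only: a collections.Counter stands in for Redis, and the names DomainMaxPageFilter and should_drop are hypothetical and not the module's interface.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainMaxPageFilter:
    """Drop requests once a domain has used up its page budget."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.counts = Counter()  # pages accepted so far, per domain

    def should_drop(self, url):
        domain = urlparse(url).netloc
        if self.counts[domain] >= self.max_pages:
            return True  # budget exhausted for this domain
        self.counts[domain] += 1
        return False


page_filter = DomainMaxPageFilter(max_pages=2)
```

With max_pages=2, the third URL from the same domain is dropped, while URLs from other domains are still accepted.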

redis_dupefilter.py

The redis_dupefilter module is a Redis-based duplicate request filter.
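Duplicate filtering usually reduces each request to a fingerprint and checks whether that fingerprint has been seen before. The sketch below assumes a SHA-1 hash of the method and URL as the fingerprint and uses a Python set in place of a Redis set. Both choices, along with the names DupeFilter and seen_before, are assumptions for illustration and may differ from what the module does.

```python
import hashlib


class DupeFilter:
    """Remember request fingerprints and report repeats."""

    def __init__(self):
        self.seen = set()  # stands in for a shared Redis set

    @staticmethod
    def fingerprint(method, url):
        # Normalise the method so "get" and "GET" hash identically.
        raw = f"{method.upper()} {url}".encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def seen_before(self, method, url):
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupes = DupeFilter()
```

Because the set lives in shared storage in the real setting, every spider in the cluster would consult the same record of what has already been requested.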

redis_global_page_per_domain_filter.py

The redis_global_page_per_domain_filter module is a Redis-based request count filter. When this filter is enabled, every crawl job has GLOBAL_PAGE_PER_DOMAIN_LIMIT as a hard limit on the number of pages it may crawl for each individual spiderid+domain+crawlid combination.
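The distinguishing detail of this filter is that the counter is keyed by the spiderid+domain+crawlid combination rather than by domain alone. A minimal sketch, assuming a colon-separated key layout and an in-memory counter instead of Redis (the key format, the limit value, and the function names are all hypothetical):

```python
from collections import Counter

GLOBAL_PAGE_PER_DOMAIN_LIMIT = 2  # example value, not the package default

page_counts = Counter()  # stands in for counters stored in Redis


def counter_key(spiderid, domain, crawlid):
    # Hypothetical key layout: one counter per spiderid+domain+crawlid.
    return f"{spiderid}:{domain}:{crawlid}"


def allow_page(spiderid, domain, crawlid, limit=GLOBAL_PAGE_PER_DOMAIN_LIMIT):
    """Return True and count the page if the combination is under its limit."""
    key = counter_key(spiderid, domain, crawlid)
    if page_counts[key] >= limit:
        return False  # hard limit reached for this combination
    page_counts[key] += 1
    return True
```

Two crawl jobs that hit the same domain each get their own budget, because their crawlid values produce different keys.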


Download files


Source Distribution

bhfutils-0.1.15.tar.gz (91.1 kB)

Built Distribution

bhfutils-0.1.15-py3-none-any.whl (124.7 kB, Python 3)
