
Utilities shared by all spiders of the Behoof project


Overview

The bhfutils package is a collection of utilities shared by all spiders of the Behoof project.

Requirements

  • Unix-based machine (Linux or OS X)

  • Python 2.7 or 3.6

Installation

Inside a virtualenv, run pip install -U bhfutils. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that you can use a special settings.py that is compatible with Scrapy Cluster; a template is provided in crawler/setting_template.py.

Documentation

Full documentation for the bhfutils package does not exist. The modules it provides are summarized below.

custom_cookies.py

The custom_cookies module is a custom cookies middleware that passes the required cookies along with each request but does not persist them between calls.
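
As a rough illustration of "pass along but do not persist", the sketch below builds the Cookie header only from the cookies attached to the current request and deliberately ignores any Set-Cookie headers in responses. It is not the bhfutils implementation: the class name StatelessCookies and its methods are hypothetical, and plain dictionaries stand in for Scrapy request and response objects.

```python
class StatelessCookies:
    """Hypothetical stand-in for a cookies middleware with no shared jar."""

    def process_request(self, request):
        # Build the Cookie header only from cookies carried by this request.
        cookies = request.get("cookies") or {}
        if cookies:
            header = "; ".join(
                "{}={}".format(name, value) for name, value in cookies.items()
            )
            request.setdefault("headers", {})["Cookie"] = header
        return request

    def process_response(self, request, response):
        # Set-Cookie headers are intentionally not stored anywhere,
        # so nothing carries over into the next call.
        return response
```

Because the object keeps no state, a later request that does not carry its own cookies is sent without a Cookie header, whatever earlier responses tried to set.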

distributed_scheduler.py

The distributed_scheduler module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests to different domains within a distributed Scrapy cluster.
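
To make the idea of a throttled priority queue per domain more concrete, here is a minimal in-memory sketch. It assumes a sliding-window throttle (at most hits pops per window seconds) and uses heapq instead of Redis; the class names ThrottledPriorityQueue and DomainScheduler and all parameters are hypothetical and are not taken from bhfutils.

```python
import heapq
import itertools
import time
from collections import deque


class ThrottledPriorityQueue:
    """In-memory stand-in for one domain's throttled priority queue."""

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits        # maximum pops allowed per window
        self.window = window    # window length in seconds
        self.clock = clock
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._recent = deque()           # timestamps of recent pops

    def push(self, request, priority):
        # heapq is a min-heap, so negate to pop the highest priority first.
        heapq.heappush(self._heap, (-priority, next(self._order), request))

    def pop(self):
        """Return the next request, or None if empty or throttled."""
        now = self.clock()
        while self._recent and now - self._recent[0] >= self.window:
            self._recent.popleft()
        if not self._heap or len(self._recent) >= self.hits:
            return None
        self._recent.append(now)
        return heapq.heappop(self._heap)[2]


class DomainScheduler:
    """Keeps one throttled queue per domain."""

    def __init__(self, hits, window, clock=time.monotonic):
        self._args = (hits, window, clock)
        self._queues = {}

    def enqueue(self, domain, request, priority=0):
        if domain not in self._queues:
            self._queues[domain] = ThrottledPriorityQueue(*self._args)
        self._queues[domain].push(request, priority)

    def next_request(self):
        # Serve the first domain that is neither empty nor throttled.
        for queue in self._queues.values():
            request = queue.pop()
            if request is not None:
                return request
        return None
```

In a real cluster the queues and throttle state would live in Redis so that every spider process sees the same limits; the in-memory version only shows the control flow.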

redis_domain_max_page_filter.py

The redis_domain_max_page_filter module is a Redis-based max page filter. The filter is applied per domain and bounds the maximum number of pages crawled for a particular domain.
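
One plausible way to bound pages per domain is an atomic counter per domain that is incremented on every request. The sketch below assumes that approach; FakeRedis is a tiny in-memory substitute that mimics only the INCR command, and the class name DomainMaxPageFilter, the key prefix, and the request_seen signature are hypothetical rather than the module's actual API.

```python
class FakeRedis:
    """In-memory stand-in exposing only an incr() counter."""

    def __init__(self):
        self._data = {}

    def incr(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class DomainMaxPageFilter:
    """Hypothetical per-domain page counter."""

    def __init__(self, server, key_prefix="maxpages"):
        self.server = server
        self.key_prefix = key_prefix

    def request_seen(self, domain, max_pages):
        """Return True when the request should be dropped."""
        count = self.server.incr("{}:{}".format(self.key_prefix, domain))
        return count > max_pages
```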

redis_dupefilter.py

The redis_dupefilter module is a Redis-based duplicate request filter.
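
A common way to build such a filter is to hash each request into a fingerprint and add it to a shared set, treating "already present" as a duplicate. The sketch below assumes that design; FakeRedisSet imitates only the SADD command (returning 1 for a new member and 0 for an existing one), and fingerprint, DupeFilter, and the key name are hypothetical names, not the module's API.

```python
import hashlib


class FakeRedisSet:
    """In-memory stand-in exposing only sadd()."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def fingerprint(method, url):
    # Simplified fingerprint: HTTP method plus URL, hashed with SHA-1.
    data = "{} {}".format(method.upper(), url).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


class DupeFilter:
    """Hypothetical duplicate filter backed by a set of fingerprints."""

    def __init__(self, server, key="dupefilter"):
        self.server = server
        self.key = key

    def request_seen(self, method, url):
        """Return True if an identical request was recorded before."""
        added = self.server.sadd(self.key, fingerprint(method, url))
        return added == 0
```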

redis_global_page_per_domain_filter.py

The redis_global_page_per_domain_filter module is a Redis-based filter on the number of requests. When this filter is enabled, all crawl jobs have GLOBAL_PAGE_PER_DOMAIN_LIMIT as a hard limit on the maximum number of pages they are allowed to crawl for each individual spiderid+domain+crawlid combination.
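
The point of keying on the spiderid+domain+crawlid combination is that each crawl job gets its own page budget. The sketch below shows only that keying, with a Counter standing in for Redis; the class name, the method signature, and the limit value of 2 are illustrative assumptions.

```python
from collections import Counter

GLOBAL_PAGE_PER_DOMAIN_LIMIT = 2  # illustrative value only


class GlobalPagePerDomainFilter:
    """Hypothetical hard cap per (spiderid, domain, crawlid)."""

    def __init__(self, limit=GLOBAL_PAGE_PER_DOMAIN_LIMIT):
        self.limit = limit
        self._counts = Counter()  # stands in for Redis counters

    def request_seen(self, spiderid, domain, crawlid):
        """Return True once this combination has used up its budget."""
        key = (spiderid, domain, crawlid)
        self._counts[key] += 1
        return self._counts[key] > self.limit
```

With this keying, two crawl jobs with different crawlid values that hit the same domain through the same spider are counted separately.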


This version

0.2.2
