
Scrapy Middleware for limiting requests based on a counter.

Project description

Scrapy-count-filter


Two Downloader Middlewares that allow a Scrapy spider to stop sending requests after a given number of pages or items have been scraped. The CloseSpider extension provides similar functionality, stopping a spider after a number of pages, items, or errors; this middleware, however, allows defining counters per domain, and defining them as spider arguments instead of project settings.

Install

This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.

$ pip install scrapy-count-filter

Usage

To enable the middlewares, add them to the project's settings.py:

DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # it's suggested to have the Count Filters after all the default middlewares
    'scrapy_count_filter.middleware.GlobalCountFilterMiddleware': 995,
    'scrapy_count_filter.middleware.HostsCountFilterMiddleware': 996,
}

You can use either middleware on its own, or both together.

The counter limits must be defined in the spider instance, in a spider.count_limits dict.

The possible fields are:

  • page_count and item_count - used by the GlobalCountFilterMiddleware to stop the spider if the number of requests, or of items scraped, exceeds the value provided
  • page_host_count and item_host_count - used by the HostsCountFilterMiddleware to start ignoring requests to a host if the number of requests to, or items scraped from, that host exceeds the value provided

All field values must be integers.

Note that the spider stops when any of the counters overflows.

Example with both the request counter and the item counter active:

from scrapy.spiders import Spider

class MySpider(Spider):
    name = "my_spider"
    # stop the whole crawl after 99 requests, or after 10 scraped items
    count_limits = {"page_count": 99, "item_count": 10}
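
The per-host limits work the same way. A minimal sketch (the spider name and the limit values here are illustrative):

from scrapy.spiders import Spider

class PerHostSpider(Spider):
    name = "per_host_spider"  # illustrative name
    # ignore further requests to a host after 50 requests to it,
    # or after 5 items scraped from it
    count_limits = {"page_host_count": 50, "item_host_count": 5}

Because the limits live on the spider instance, they can also be supplied as spider arguments on the command line, e.g. scrapy crawl arg_spider -a page_count=99. A minimal sketch, assuming the -a argument names mirror the count_limits keys (the spider and its argument handling are illustrative, not part of the library):

from scrapy.spiders import Spider

class ArgSpider(Spider):
    name = "arg_spider"  # illustrative name

    def __init__(self, page_count=None, item_count=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # build count_limits from whichever limits were passed with -a;
        # Scrapy passes -a arguments as keyword arguments to __init__
        self.count_limits = {}
        if page_count is not None:
            self.count_limits["page_count"] = int(page_count)
        if item_count is not None:
            self.count_limits["item_count"] = int(item_count)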

License

BSD3 © Cristi Constantin.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-count-filter-0.2.0.tar.gz (5.4 kB)

Uploaded Source

Built Distribution

scrapy_count_filter-0.2.0-py3-none-any.whl (6.2 kB)

Uploaded Python 3

File details

Details for the file scrapy-count-filter-0.2.0.tar.gz.

File metadata

  • Download URL: scrapy-count-filter-0.2.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9

File hashes

Hashes for scrapy-count-filter-0.2.0.tar.gz
  • SHA256: 388bd4f78964f02f4f0557a3ee4f9df820a7e0c273f35183d537708fce494b30
  • MD5: 497e9aa5092bff1bff28c1c615bd549d
  • BLAKE2b-256: 6ff9f9a99b97df82439bfdba35b065c4efdc81967effc9d5993b5aa215172203

See the pip documentation on hash-checking mode for more details on using hashes.
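
For example, to check a downloaded archive against the published SHA256 digest (assuming a Unix-like system with sha256sum available):

$ sha256sum scrapy-count-filter-0.2.0.tar.gz
388bd4f78964f02f4f0557a3ee4f9df820a7e0c273f35183d537708fce494b30  scrapy-count-filter-0.2.0.tar.gz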

File details

Details for the file scrapy_count_filter-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: scrapy_count_filter-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9

File hashes

Hashes for scrapy_count_filter-0.2.0-py3-none-any.whl
  • SHA256: 5da87c45770d538d307982968bd96803765b3b8e1dd713df7d65423752d659fd
  • MD5: ad52cd1483e3c723a81fb80c9fc40158
  • BLAKE2b-256: ffa84360acb72a5558ab65b8f09c613c28b49c46d52f18a921b48f158669db86

See the pip documentation on hash-checking mode for more details on using hashes.
