Scrapy Middleware for limiting requests based on a counter.

Scrapy-count-filter

Two Downloader Middlewares that allow a Scrapy Spider to stop requests after a number of pages or items have been scraped. The CloseSpider extension offers similar functionality, stopping spiders after a number of pages, items, or errors, but these middlewares allow defining counters per domain, and defining them as spider arguments instead of project settings.

Install

This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.

$ pip install scrapy-count-filter

Usage

To enable the middlewares, add them to the project's settings.py:

DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # it's suggested to have the Count Filters after all the default middlewares
    'scrapy_count_filter.middleware.GlobalCountFilterMiddleware': 995,
    'scrapy_count_filter.middleware.HostsCountFilterMiddleware': 996,
}

You can use one, or the other, or both middlewares.

The counter limits must be defined in the spider instance, in a spider.count_limits dict.

The possible fields are:

  • page_count and item_count: used by the GlobalCountFilterMiddleware to stop the spider if the number of requests or items scraped is larger than the value provided
  • page_host_count and item_host_count: used by the HostsCountFilterMiddleware to start ignoring requests if the number of requests or items scraped per host is larger than the value provided

All field values must be integers.
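
As a minimal sketch of the rule above, a spider could check its own count_limits dict before a crawl starts. The helper name check_count_limits and the KNOWN_FIELDS set are hypothetical and are not part of scrapy-count-filter; the field names are the four listed above.

```python
# Hypothetical helper for illustration; not provided by scrapy-count-filter.
KNOWN_FIELDS = {"page_count", "item_count", "page_host_count", "item_host_count"}


def check_count_limits(limits):
    """Return a list of problems found in a count_limits dict (empty if none)."""
    problems = []
    for key, value in limits.items():
        if key not in KNOWN_FIELDS:
            problems.append(f"unknown field: {key}")
        # bool is a subclass of int in Python, so it is rejected explicitly.
        elif isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key} must be an integer, got {type(value).__name__}")
    return problems
```

An empty list means that every field is known and every value is an integer.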

Note that the Spider stops when any of the counters overflow.
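
The difference between a global counter and a per-host counter can be sketched in plain Python. This is only an illustration of the idea, not the code of HostsCountFilterMiddleware; the class HostPageCounter is hypothetical, and the exact boundary (whether the request that reaches the limit is still allowed) is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


class HostPageCounter:
    """Illustrative only: count requests per host and report when a host is over its limit."""

    def __init__(self, page_host_count):
        self.limit = page_host_count
        self.seen = Counter()

    def allow(self, url):
        host = urlparse(url).netloc
        self.seen[host] += 1
        # Assumption: a request is ignored only once the count is larger than the limit.
        return self.seen[host] <= self.limit


counter = HostPageCounter(page_host_count=2)
results = [
    counter.allow("https://a.example/1"),
    counter.allow("https://a.example/2"),
    counter.allow("https://a.example/3"),  # third request to the same host
    counter.allow("https://b.example/1"),  # a different host has its own counter
]
```

Each host has its own counter, so going over the limit on one host does not affect requests to another.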

Example with both the request counter and the scraped-item counter active:

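from scrapy.spiders import Spider

class MySpider(Spider):
    # Placeholder name; Scrapy needs one to run the spider
    name = "my_spider"
    # Stop after more than 99 requests or more than 10 scraped items
    count_limits = {"page_count": 99, "item_count": 10}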
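
Scrapy passes spider arguments given with -a as strings, so limits supplied on the command line may need converting before they satisfy the integer requirement. This README does not say how the middlewares treat a string value, so the sketch below assumes the spider does the conversion itself. The helper parse_count_limits is hypothetical.

```python
import json


def parse_count_limits(raw):
    """Turn a count_limits value into a dict of integers.

    Accepts a dict (a converted copy is returned) or a JSON object string.
    """
    if isinstance(raw, str):
        raw = json.loads(raw)
    return {key: int(value) for key, value in raw.items()}
```

A spider could call this helper in its __init__ and assign the result to self.count_limits, for example after being started with scrapy crawl my_spider -a count_limits='{"page_count": 99}' (the spider name is a placeholder).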

License

BSD3 © Cristi Constantin.

Download files

Files for scrapy-count-filter, version 0.2.0:

  • scrapy_count_filter-0.2.0-py3-none-any.whl (6.2 kB): Wheel, Python version py3
  • scrapy-count-filter-0.2.0.tar.gz (5.4 kB): Source
