Scrapy Middleware for limiting requests based on a counter.
Project description
Scrapy-count-filter
Two Downloader Middlewares that allow a Scrapy spider to stop making requests after a given number of pages or items have been scraped. The built-in CloseSpider extension provides similar functionality, stopping a spider after a number of pages, items, or errors; this middleware additionally allows defining the counters per domain, and defining them as spider arguments instead of project settings.
Install
This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.
$ pip install scrapy-count-filter
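For example, a typical installation inside a fresh virtual environment (a sketch assuming the standard venv module; the .venv directory name is just a convention):
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install scrapy-count-filter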
Usage
For the middlewares to be enabled, they must be added to the project's settings.py:
DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # it's suggested to have the Count Filters after all the default middlewares
    'scrapy_count_filter.middleware.GlobalCountFilterMiddleware': 995,
    'scrapy_count_filter.middleware.HostsCountFilterMiddleware': 996,
}
You can enable either middleware on its own, or both together.
The counter limits must be defined on the spider instance, in a spider.count_limits dict.
The possible fields are:
- page_count and item_count are used by the GlobalCountFilterMiddleware to stop the spider if the number of requests, or the number of items scraped, is larger than the value provided
- page_host_count and item_host_count are used by the HostsCountFilterMiddleware to start ignoring requests if the number of requests, or the number of items scraped per host, is larger than the value provided
All field values must be integers.
Note that the spider stops when any one of the counters overflows.
Example with both the request counter and the scraped-item counter active:
from scrapy.spiders import Spider

class MySpider(Spider):
    # stop the whole crawl after 99 requests, or after 10 items scraped
    count_limits = {"page_count": 99, "item_count": 10}
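The per-host counters work the same way; a minimal sketch using the HostsCountFilterMiddleware fields (the limit values and the spider class are illustrative):

from scrapy.spiders import Spider

class PerHostSpider(Spider):
    # requests to a host are ignored after 50 requests to it,
    # or after 5 items scraped from it
    count_limits = {"page_host_count": 50, "item_host_count": 5}

Because count_limits lives on the spider instance, the limits can also be passed in as spider arguments via Scrapy's standard -a mechanism; a minimal sketch (the ArgSpider class and its argument names are illustrative, not part of this library):

from scrapy.spiders import Spider

class ArgSpider(Spider):
    name = "arg_spider"

    def __init__(self, page_count=None, item_count=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes -a arguments as strings, so cast them to int
        self.count_limits = {}
        if page_count is not None:
            self.count_limits["page_count"] = int(page_count)
        if item_count is not None:
            self.count_limits["item_count"] = int(item_count)

which can then be run with, for example:
$ scrapy crawl arg_spider -a page_count=99 -a item_count=10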
License
BSD3 © Cristi Constantin.
Project details
Download files
Source Distribution: scrapy-count-filter-0.2.0.tar.gz (5.4 kB)
Built Distribution: scrapy_count_filter-0.2.0-py3-none-any.whl (6.2 kB)
File details
Details for the file scrapy-count-filter-0.2.0.tar.gz.
File metadata
- Download URL: scrapy-count-filter-0.2.0.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 388bd4f78964f02f4f0557a3ee4f9df820a7e0c273f35183d537708fce494b30
MD5 | 497e9aa5092bff1bff28c1c615bd549d
BLAKE2b-256 | 6ff9f9a99b97df82439bfdba35b065c4efdc81967effc9d5993b5aa215172203
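If you want to check a downloaded archive against the digests above, a minimal sketch using Python's standard hashlib module:

import hashlib

# compute the SHA256 digest of the downloaded source archive
with open("scrapy-count-filter-0.2.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# should print the SHA256 value from the table above
print(digest)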
File details
Details for the file scrapy_count_filter-0.2.0-py3-none-any.whl.
File metadata
- Download URL: scrapy_count_filter-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5da87c45770d538d307982968bd96803765b3b8e1dd713df7d65423752d659fd
MD5 | ad52cd1483e3c723a81fb80c9fc40158
BLAKE2b-256 | ffa84360acb72a5558ab65b8f09c613c28b49c46d52f18a921b48f158669db86