Scrapy Middleware for limiting requests based on a counter.
Project description
Scrapy-count-filter
Two Downloader Middlewares that allow a Scrapy Spider to stop sending requests after a number of pages or items have been scraped. Similar functionality exists in the CloseSpider extension, which stops spiders after a number of pages, items, or errors, but this middleware allows defining counters per domain, and defining them as spider arguments instead of project settings.
Install
This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.
$ pip install scrapy-count-filter
Usage
For the middlewares to be enabled, they must be added to the project's settings.py:
DOWNLOADER_MIDDLEWARES = {
# maybe other Downloader Middlewares ...
# it's suggested to have the Count Filters after all the default middlewares
'scrapy_count_filter.middleware.GlobalCountFilterMiddleware': 995,
'scrapy_count_filter.middleware.HostsCountFilterMiddleware': 996,
}
You can enable one, the other, or both middlewares.
The counter limits must be defined in the spider instance, in a spider.count_limits dict.
The possible fields are:
- page_count and item_count - used by the GlobalCountFilterMiddleware to stop the spider if the number of requests, or items scraped, exceeds the value provided
- page_host_count and item_host_count - used by the HostsCountFilterMiddleware to start ignoring requests if the number of requests, or items scraped per host, exceeds the value provided
All field values must be integers.
Note that the spider stops as soon as any of the global counters overflows; the per-host counters only cause further requests to the offending host to be ignored.
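To make the per-host behavior concrete, here is a minimal sketch of the counting idea in plain Python (this is an illustration only, not the middleware's actual implementation; the class name is made up):

```python
from collections import defaultdict
from urllib.parse import urlparse

class HostCountSketch:
    """Toy per-host request counter, illustrating the idea behind
    HostsCountFilterMiddleware: requests to a host are ignored once
    that host's counter exceeds the limit."""

    def __init__(self, page_host_count):
        self.limit = page_host_count
        self.counts = defaultdict(int)

    def should_ignore(self, url):
        # Count this request against its host, then check the limit.
        host = urlparse(url).netloc
        self.counts[host] += 1
        return self.counts[host] > self.limit
```

With a limit of 2, the first two requests to a host pass and the third is ignored, while other hosts keep their own independent counters.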
Example with both the request counter and the item counter active:
from scrapy.spiders import Spider

class MySpider(Spider):
    count_limits = {"page_count": 99, "item_count": 10}
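Both middlewares can also be combined. A sketch of a spider that sets global and per-host limits together (the spider name and limit values here are illustrative):

```python
from scrapy.spiders import Spider

class MySpider(Spider):
    name = "combined_limits"  # hypothetical name
    # Global limits (GlobalCountFilterMiddleware): stop the whole spider.
    # Per-host limits (HostsCountFilterMiddleware): ignore further
    # requests to a host once its counters exceed these values.
    count_limits = {
        "page_count": 99,
        "item_count": 10,
        "page_host_count": 50,
        "item_host_count": 5,
    }
```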
License
BSD3 © Cristi Constantin.
Project details
Download files
File details
Details for the file scrapy-count-filter-0.2.0.tar.gz.
File metadata
- Download URL: scrapy-count-filter-0.2.0.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 388bd4f78964f02f4f0557a3ee4f9df820a7e0c273f35183d537708fce494b30 |
| MD5 | 497e9aa5092bff1bff28c1c615bd549d |
| BLAKE2b-256 | 6ff9f9a99b97df82439bfdba35b065c4efdc81967effc9d5993b5aa215172203 |
File details
Details for the file scrapy_count_filter-0.2.0-py3-none-any.whl.
File metadata
- Download URL: scrapy_count_filter-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5da87c45770d538d307982968bd96803765b3b8e1dd713df7d65423752d659fd |
| MD5 | ad52cd1483e3c723a81fb80c9fc40158 |
| BLAKE2b-256 | ffa84360acb72a5558ab65b8f09c613c28b49c46d52f18a921b48f158669db86 |