spider-info-webservice
Scrapy extension for monitoring your spiders.
How do you access it if you have a million spiders on one machine? Easy! This extension has a default_start_callback that sends a request to your INFO_SERVICE_REPORT_URL with a unique URL for each spider.
Inspired by Scrapy's built-in Telnet console extension, the deprecated scrapy-jsonrpc (https://github.com/scrapy-plugins/scrapy-jsonrpc), and the WebService extension that was built into Scrapy 0.24 (https://docs.scrapy.org/en/0.24/topics/webservice.html).
Every time I used Telnet, it all came down to calling several methods to show information, so I made this extension, which conveniently serves basic information about the spider over HTTP.
Installation
pip install spider-info-webservice
or
pip install git+https://github.com/abebus/spider-info-webservice
Dependencies
Scrapy >= 2.6
Python 3.8 to 3.12
Usage
Add spider_info_webservice.InfoService to EXTENSIONS in settings.py:
EXTENSIONS = {
"spider_info_webservice.InfoService": 500
}
All done! Now you can access the endpoints with your favourite command-line HTTP tool or browser, as in the sketch below.
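For example, a minimal check using only the Python standard library. This is a sketch, assuming the default scrapy/scrapy credentials and that the service bound the first port of the default INFO_SERVICE_PORTRANGE (6024); the actual port depends on which ports were free at startup.

import base64
import json
import urllib.request

# Default user from INFO_SERVICE_USERS; port 6024 is an assumption
# (the first port of the default INFO_SERVICE_PORTRANGE).
token = base64.b64encode(b"scrapy:scrapy").decode()
req = urllib.request.Request(
    "http://127.0.0.1:6024/info/general",
    headers={"Authorization": f"Basic {token}"},
)
with urllib.request.urlopen(req) as resp:
    info = json.load(resp)
print(info["spider_name"], info["pid"])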
Extension settings
INFO_SERVICE_PORTRANGE: defaults to (6024, 8000).
INFO_SERVICE_HOST: defaults to "127.0.0.1".
INFO_SERVICE_USERS: defaults to {"scrapy": b"scrapy"}. A dictionary of type dict[str, bytes] with username: password pairs, used for HTTP basic auth.
INFO_SERVICE_REPORT_URL: optional. The extension will send a request to the given URL with JSON containing general info about the running spider and the host:port of this service.
INFO_SERVICE_SENSITIVE_KEYS: optional. Defaults to [r"^INFO_SERVICE_USERS$", r".*_PASS(?:WORD)?$", r".*_USER(?:NAME)?$"]. A list of strings that are compiled to regexes and matched against all keys in settings (recursively); when a key matches, its value is replaced with asterisks.
INFO_SERVICE_RESOURCES_CHILD_PREFIX: optional. Prefix for accessing child resources of the extension.
INFO_SERVICE_RESOURCES: optional. A list of resource dicts like:
{
    "name": b"name_of_resource",
    "class": "path.to.ResourceClass",
    "args": [args, that, resource, needs],  # optional
    "kwargs": {"kwarg": for_resource},  # optional
}
All resources are initialised on the scrapy.signals.spider_opened signal in the prep_resources method. To modify the available resources, redefine this list in settings.py (see the sketch below). For more control over the args and kwargs passed to a resource, redefine this setting in the spider_opened method of your Spider class, or derive from this extension and override the prep_resources method.
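A sketch of how these settings might look together in settings.py. The credentials, report URL, regex patterns, and resource class path below are hypothetical placeholders, not defaults shipped with the package:

# settings.py -- illustrative values only
EXTENSIONS = {
    "spider_info_webservice.InfoService": 500,
}

INFO_SERVICE_HOST = "127.0.0.1"
INFO_SERVICE_PORTRANGE = (6024, 8000)

# username -> password (bytes), used for HTTP basic auth
INFO_SERVICE_USERS = {"admin": b"s3cret"}

# Each spider reports its own host:port to this endpoint on start
# (hypothetical URL).
INFO_SERVICE_REPORT_URL = "https://monitoring.example.com/report"

# Mask values of matching settings keys in /info/settings output;
# redefining this replaces the default patterns.
INFO_SERVICE_SENSITIVE_KEYS = [r".*_API_KEY$", r".*_TOKEN$"]

# Redefine the served resources (replaces the default list);
# MyCustomResource is a hypothetical class.
INFO_SERVICE_RESOURCES = [
    {
        "name": b"custom",
        "class": "myproject.resources.MyCustomResource",
        "args": [],
        "kwargs": {},
    },
]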
Endpoints
info/general: General info.
Example response:
{
"pid": 1605,
"project_name": "quotes_scraper/name_from_scrapy.cfg",
"bot_name": "quotes_scraper/name_from_settings",
"spider_name": "quote-spider",
"info_service_host": "127.0.0.1",
"info_service_port": 6024,
"base_versions": {
"Scrapy": "2.11.2",
"lxml": "5.2.2.0",
"libxml2": "2.12.6",
"cssselect": "1.2.0",
"parsel": "1.9.1",
"w3lib": "2.2.1",
"Twisted": "24.3.0",
"Python": "3.12.4 (main, Jun 12 2024, 19:06:53) [GCC 13.2.0]",
"pyOpenSSL": "24.1.0 (OpenSSL 3.2.2 4 Jun 2024)",
"cryptography": "42.0.8",
"Platform": "Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.38"
},
"available_resources": [
{
"name": "/info",
"doc": "Root resource, only used for the /info/ endpoint, no other uses",
"methods": [
"GET"
]
},
{
"name": "/info/engine",
"doc": "Engine status resource, returns get_engine_status(curr_engine) imported from scrapy.utils.engine",
"methods": [
"GET"
]
},
{
"name": "/info/stats",
"doc": "Stats resource, returns crawler.stats.get_stats()",
"methods": [
"GET"
]
},
{
"name": "/info/settings",
"doc": "Settings resource, returns crawler.settings",
"methods": [
"GET"
]
},
{
"name": "/info/slot",
"doc": "Slot resource, returns engine's slot.inprogress request.to_dict()",
"methods": [
"GET"
]
},
{
"name": "/info/general",
"doc": "General data resource, returns the general data of the crawler. (You are currently here)",
"methods": [
"GET"
]
}
]
}
info/stats: Spider stats (crawler.stats.get_stats()).
Example response:
{
"log_count/WARNING": 1,
"log_count/DEBUG": 6,
"log_count/INFO": 16,
"start_time": "2024-08-18T19:33:17.895850+00:00",
"memusage/startup": 76251136,
"memusage/max": 78032896,
"scheduler/enqueued/memory": 2,
"scheduler/enqueued": 2,
"scheduler/dequeued/memory": 2,
"scheduler/dequeued": 2,
"downloader/request_count": 2,
"downloader/request_method_count/GET": 2,
"downloader/request_bytes": 474,
"downloader/response_count": 2,
"downloader/response_status_count/200": 2,
"downloader/response_bytes": 5052,
"httpcompression/response_bytes": 24789,
"httpcompression/response_count": 2,
"response_received_count": 2
}
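Because the stats come back as plain JSON, a minimal polling loop is enough for basic monitoring. The sketch below reuses the host, port, and credential assumptions from the Usage example:

import base64
import json
import time
import urllib.request

TOKEN = base64.b64encode(b"scrapy:scrapy").decode()  # default credentials

def get_stats(port: int = 6024) -> dict:
    # Port 6024 is an assumption (first port of the default range).
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/info/stats",
        headers={"Authorization": f"Basic {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

while True:
    stats = get_stats()
    print(stats.get("response_received_count", 0), "responses so far")
    time.sleep(10)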
info/slot: List of in-progress requests.
Example response:
{
"in_progress_requests": [
{
"url": "http://quotes.toscrape.com/page/2/",
"callback": null,
"errback": null,
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
],
"Accept-Language": [
"en"
],
"User-Agent": [
"Scrapy/2.11.2 (+https://scrapy.org)"
],
"Accept-Encoding": [
"gzip, deflate, br, zstd"
]
},
"method": "GET",
"body": "",
"cookies": {},
"meta": {
"download_timeout": 180,
"download_slot": "quotes.toscrape.com",
"download_latency": 0.4519026279449463,
"depth": 0
},
"encoding": "utf-8",
"priority": 0,
"dont_filter": true,
"flags": [],
"cb_kwargs": {}
},
{
"url": "http://quotes.toscrape.com/page/1/",
"callback": null,
"errback": null,
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
],
"Accept-Language": [
"en"
],
"User-Agent": [
"Scrapy/2.11.2 (+https://scrapy.org)"
],
"Accept-Encoding": [
"gzip, deflate, br, zstd"
]
},
"method": "GET",
"body": "",
"cookies": {},
"meta": {
"download_timeout": 180,
"download_slot": "quotes.toscrape.com",
"download_latency": 0.4561948776245117,
"depth": 0
},
"encoding": "utf-8",
"priority": 0,
"dont_filter": true,
"flags": [],
"cb_kwargs": {}
}
]
}
info/engine: Info about the execution engine.
Example response:
{
"time()-engine.start_time": 423.9953444004059,
"len(engine.downloader.active)": 0,
"engine.scraper.is_idle()": false,
"engine.spider.name": "quote-spider",
"engine.spider_is_idle()": false,
"engine.slot.closing": null,
"len(engine.slot.inprogress)": 2,
"len(engine.slot.scheduler.dqs or [])": 0,
"len(engine.slot.scheduler.mqs)": 0,
"len(engine.scraper.slot.queue)": 0,
"len(engine.scraper.slot.active)": 2,
"engine.scraper.slot.active_size": 24789,
"engine.scraper.slot.itemproc_size": 0,
"engine.scraper.slot.needs_backout()": false
}
info/settings: Spider settings. Passing all=true as a query parameter returns all existing settings; passing all=false returns only non-default settings (see the request sketch after the example response).
Example response:
{
"REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
"TELNETCONSOLE_ENABLED": false,
"INFO_SERVICE_USERS": {
"scrapy": "scrapy",
"test": "test",
"test2": "test2"
},
"VERY_SENSETIVE_INFO": "******",
"SENSETIVE_INFO_1": "******",
"SENSETIVE_INFO_2": "******",
"SENSETIVE_INFO_3": "******",
"INFO_SERVICE_SENSITIVE_KEYS": [
"^.*SENSETIVE_INFO.*$"
]
}
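For instance, fetching only the non-default settings, under the same host, port, and credential assumptions as in the earlier sketches:

import base64
import json
import urllib.request

token = base64.b64encode(b"scrapy:scrapy").decode()
req = urllib.request.Request(
    "http://127.0.0.1:6024/info/settings?all=false",  # port is an assumption
    headers={"Authorization": f"Basic {token}"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), indent=2))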
Tests
Yes.