spider-info-webservice

Scrapy extension for monitoring your spiders.

How do you access it if you have a million spiders on one machine? Easy! The extension has a default_start_callback that sends a request to your INFO_SERVICE_REPORT_URL with each spider's own unique URL.

Inspired by Scrapy's built-in Telnet console extension, the deprecated scrapy-jsonrpc (https://github.com/scrapy-plugins/scrapy-jsonrpc), and the WebService extension built into Scrapy 0.24 (https://docs.scrapy.org/en/0.24/topics/webservice.html).

Every time I used Telnet, it came down to calling the same few methods to show information, so I made this extension, which conveniently serves basic information about a spider over HTTP.

Installation

pip install spider-info-webservice

or

pip install git+https://github.com/abebus/spider-info-webservice

Dependencies

Scrapy >= 2.6

Python 3.8 to 3.12

Usage

Add spider_info_webservice.InfoService to EXTENSIONS in settings.py:

EXTENSIONS = {
    "spider_info_webservice.InfoService": 500
}

All done! Now you can access the endpoints with your favourite CLI HTTP tool or browser.
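For example, here is a minimal Python sketch using the requests library. The port and credentials below assume the defaults described under Extension settings; the actual port is picked from INFO_SERVICE_PORTRANGE at startup, so check the spider log or your INFO_SERVICE_REPORT_URL receiver for the real one.

import requests  # third-party: pip install requests

# Assumes the default credentials ({"scrapy": b"scrapy"}) and that the
# service bound the first port of the default INFO_SERVICE_PORTRANGE.
response = requests.get(
    "http://127.0.0.1:6024/info/general",
    auth=("scrapy", "scrapy"),
    timeout=5,
)
print(response.json())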

Extension settings

INFO_SERVICE_PORTRANGE: defaults to (6024, 8000).

INFO_SERVICE_HOST: defaults to "127.0.0.1".

INFO_SERVICE_USERS: defaults to {"scrapy": b"scrapy"}. Dictionary of type dict[str, bytes] containing username: password pairs, used for HTTP basic auth.

INFO_SERVICE_REPORT_URL: optional. The extension will send a request to the given URL with JSON containing general info about the running spider and the host:port of this service.
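A minimal sketch of a receiver for these reports, assuming the report arrives as an HTTP POST with a JSON body (the exact method and payload fields are determined by the extension and not documented here):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ReportHandler(BaseHTTPRequestHandler):
    # Assumption: the extension delivers the report as a POST with a JSON body.
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        report = json.loads(self.rfile.read(length))
        # Per the description above, the report carries general spider info
        # plus the host:port where that spider's info service is listening.
        print("spider report:", report)
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 9999), ReportHandler).serve_forever()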

INFO_SERVICE_SENSITIVE_KEYS: optional. Defaults to [r"^INFO_SERVICE_USERS$", r".*_PASS(?:WORD)?$", r".*_USER(?:NAME)?$"]. List of strings that are compiled to regexes. They are matched against all keys in the settings (recursively); the value of any matching key is replaced with asterisks.

INFO_SERVICE_RESOURCES_CHILD_PREFIX: optional. Prefix for accessing child resources of the extension.
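Put together, a typical settings.py override of the simple settings might look like this (the report URL points at a hypothetical receiver such as the sketch above):

# settings.py
INFO_SERVICE_PORTRANGE = (6024, 8000)
INFO_SERVICE_HOST = "127.0.0.1"
INFO_SERVICE_USERS = {"scrapy": b"scrapy"}  # username -> password (bytes)
INFO_SERVICE_REPORT_URL = "http://monitoring.example.com:9999/report"  # hypothetical
INFO_SERVICE_SENSITIVE_KEYS = [
    r"^INFO_SERVICE_USERS$",
    r".*_PASS(?:WORD)?$",
    r".*_USER(?:NAME)?$",
]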

INFO_SERVICE_RESOURCES: optional. List of resource dicts like:

{
  "name": b"name_of_resource",
  "class": "path.to.ResourceClass",
  "args": [args, that, resource, needs],  # optional
  "kwargs": {"kwargs": for_resource},  # optional
}

All resources are initialised on the scrapy.signals.spider_opened signal in the prep_resources method. If you want to modify the available resources, redefine this list in settings.py. For more control over the args and kwargs passed to a resource, set this setting in the spider_opened method of your Spider class, or derive from this extension and override the prep_resources method. A sketch of a custom resource follows.
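As a sketch, assuming (as in Scrapy's old WebService) that resources follow the twisted.web resource interface; the exact base class the extension expects is not documented here:

# myproject/resources.py (hypothetical module)
import json

from twisted.web import resource

class HealthResource(resource.Resource):
    """Tiny custom resource that returns a static JSON body."""
    isLeaf = True

    def render_GET(self, request):
        request.setHeader(b"content-type", b"application/json")
        return json.dumps({"healthy": True}).encode()

Then register it in settings.py:

INFO_SERVICE_RESOURCES = [
    {
        "name": b"health",
        "class": "myproject.resources.HealthResource",
    },
]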

Endpoints

info/general: General info.

Example response:

{
  "pid": 1605,
  "project_name": "quotes_scraper/name_from_scrapy.cfg",
  "bot_name": "quotes_scraper/name_from_settings",
  "spider_name": "quote-spider",
  "info_service_host": "127.0.0.1",
  "info_service_port": 6024,
  "base_versions": {
    "Scrapy": "2.11.2",
    "lxml": "5.2.2.0",
    "libxml2": "2.12.6",
    "cssselect": "1.2.0",
    "parsel": "1.9.1",
    "w3lib": "2.2.1",
    "Twisted": "24.3.0",
    "Python": "3.12.4 (main, Jun 12 2024, 19:06:53) [GCC 13.2.0]",
    "pyOpenSSL": "24.1.0 (OpenSSL 3.2.2 4 Jun 2024)",
    "cryptography": "42.0.8",
    "Platform": "Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.38"
  },
  "available_resources": [
    {
      "name": "/info",
      "doc": "Root resource, only used for the /info/ endpoint, no other uses",
      "methods": [
        "GET"
      ]
    },
    {
      "name": "/info/engine",
      "doc": "Engine status resource, returns get_engine_status(curr_engine) imported from scrapy.utils.engine",
      "methods": [
        "GET"
      ]
    },
    {
      "name": "/info/stats",
      "doc": "Stats resource, returns crawler.stats.get_stats()",
      "methods": [
        "GET"
      ]
    },
    {
      "name": "/info/settings",
      "doc": "Settings resource, returns crawler.settings",
      "methods": [
        "GET"
      ]
    },
    {
      "name": "/info/slot",
      "doc": "Slot resource, returns engine's slot.inprogress request.to_dict()",
      "methods": [
        "GET"
      ]
    },
    {
      "name": "/info/general",
      "doc": "General data resource, returns the general data of the crawler. (You are currently here)",
      "methods": [
        "GET"
      ]
    }
  ]
}

info/stats: Spider stats (crawler.stats.get_stats()).

Example response:

{
  "log_count/WARNING": 1,
  "log_count/DEBUG": 6,
  "log_count/INFO": 16,
  "start_time": "2024-08-18T19:33:17.895850+00:00",
  "memusage/startup": 76251136,
  "memusage/max": 78032896,
  "scheduler/enqueued/memory": 2,
  "scheduler/enqueued": 2,
  "scheduler/dequeued/memory": 2,
  "scheduler/dequeued": 2,
  "downloader/request_count": 2,
  "downloader/request_method_count/GET": 2,
  "downloader/request_bytes": 474,
  "downloader/response_count": 2,
  "downloader/response_status_count/200": 2,
  "downloader/response_bytes": 5052,
  "httpcompression/response_bytes": 24789,
  "httpcompression/response_count": 2,
  "response_received_count": 2
}
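Building on the earlier access sketch, a small hypothetical monitor could poll this endpoint (port and credentials again assume the defaults; the stats keys come from crawler.stats and vary per run):

import time

import requests  # third-party: pip install requests

# Print request throughput every 10 seconds.
while True:
    stats = requests.get(
        "http://127.0.0.1:6024/info/stats",
        auth=("scrapy", "scrapy"),
        timeout=5,
    ).json()
    print("requests so far:", stats.get("downloader/request_count", 0))
    time.sleep(10)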

info/slot: List of in-progress requests.

Example response:

{
  "in_progress_requests": [
    {
      "url": "http://quotes.toscrape.com/page/2/",
      "callback": null,
      "errback": null,
      "headers": {
        "Accept": [
          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        ],
        "Accept-Language": [
          "en"
        ],
        "User-Agent": [
          "Scrapy/2.11.2 (+https://scrapy.org)"
        ],
        "Accept-Encoding": [
          "gzip, deflate, br, zstd"
        ]
      },
      "method": "GET",
      "body": "",
      "cookies": {},
      "meta": {
        "download_timeout": 180,
        "download_slot": "quotes.toscrape.com",
        "download_latency": 0.4519026279449463,
        "depth": 0
      },
      "encoding": "utf-8",
      "priority": 0,
      "dont_filter": true,
      "flags": [],
      "cb_kwargs": {}
    },
    {
      "url": "http://quotes.toscrape.com/page/1/",
      "callback": null,
      "errback": null,
      "headers": {
        "Accept": [
          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        ],
        "Accept-Language": [
          "en"
        ],
        "User-Agent": [
          "Scrapy/2.11.2 (+https://scrapy.org)"
        ],
        "Accept-Encoding": [
          "gzip, deflate, br, zstd"
        ]
      },
      "method": "GET",
      "body": "",
      "cookies": {},
      "meta": {
        "download_timeout": 180,
        "download_slot": "quotes.toscrape.com",
        "download_latency": 0.4561948776245117,
        "depth": 0
      },
      "encoding": "utf-8",
      "priority": 0,
      "dont_filter": true,
      "flags": [],
      "cb_kwargs": {}
    }
  ]
}

info/engine: Info about the execution engine.

Example response:

{
  "time()-engine.start_time": 423.9953444004059,
  "len(engine.downloader.active)": 0,
  "engine.scraper.is_idle()": false,
  "engine.spider.name": "quote-spider",
  "engine.spider_is_idle()": false,
  "engine.slot.closing": null,
  "len(engine.slot.inprogress)": 2,
  "len(engine.slot.scheduler.dqs or [])": 0,
  "len(engine.slot.scheduler.mqs)": 0,
  "len(engine.scraper.slot.queue)": 0,
  "len(engine.scraper.slot.active)": 2,
  "engine.scraper.slot.active_size": 24789,
  "engine.scraper.slot.itemproc_size": 0,
  "engine.scraper.slot.needs_backout()": false
}

info/settings: Spider settings. Passing "all=true" as a query param returns all existing settings; "all=false" returns only the non-default settings. A request sketch follows the example response.

Example response:

{
    "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    "TELNETCONSOLE_ENABLED": false,
    "INFO_SERVICE_USERS": {
        "scrapy": "scrapy",
        "test": "test",
        "test2": "test2"
    },
    "VERY_SENSETIVE_INFO": "******",
    "SENSETIVE_INFO_1": "******",
    "SENSETIVE_INFO_2": "******",
    "SENSETIVE_INFO_3": "******",
    "INFO_SERVICE_SENSITIVE_KEYS": [
        "^.*SENSETIVE_INFO.*$"
    ]
}
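To fetch the full settings dump rather than only the overrides (again assuming the default host, port, and credentials):

import requests  # third-party: pip install requests

# "all=true" returns every setting; "all=false" only the non-default ones.
full_settings = requests.get(
    "http://127.0.0.1:6024/info/settings",
    params={"all": "true"},
    auth=("scrapy", "scrapy"),
    timeout=5,
).json()
print(len(full_settings), "settings returned")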

Tests

Yes.
