
🕷️ A lightweight, generic parallel runner for custom scrapers


spidur 🕷️


spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel, even on the same domain.
It coordinates the scrapers, filters out invalid URLs before any scraping work is spent on them, and collects all results in one place.


✨ Core ideas

  • Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
  • Parallel execution — utilizes all CPU cores.
  • Async + multiprocessing safe — works across async methods and process pools.
  • No opinions — you control discovery, validation, and scraping logic.
  • Results collected automatically — each scraper contributes to a single aggregated result set.

📦 Install

pip install spidur

or with Poetry:

poetry add spidur

⚡ Example

from typing import Any

from spidur import Runner, Scraper, ScraperFactory, ScrapeResult, Target


# --- define your scrapers ---
# Implement three small hooks. The discover -> validate -> scrape loop is
# provided for you by Scraper.fetch_round(), so you never write it yourself.

class ArticleScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "article", "url": url, "data": f"Content of {url}"}


class CommentScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}


# --- register both scrapers ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---
# Put the code above in a module (e.g. `my_scrapers.py`) and pass its name as
# `bootstrap` so each worker process re-registers the scrapers. This is
# required wherever multiprocessing uses the `spawn` start method (the default
# on macOS and Windows), because spawned workers do not inherit the parent's
# registrations. The `__main__` guard keeps those workers from re-running the
# launch code when they import this script.

if __name__ == "__main__":
    results = Runner.run(targets, bootstrap=["my_scrapers"])

    for name, items in results.items():
        print(f"Results from {name}:")
        for item in items:
            print("  →", item)
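
The comments in the example say that the discover -> validate -> scrape loop is provided by Scraper.fetch_round(). To picture what one such round might look like, here is a minimal standalone sketch. It does not import spidur and is not the library's implementation: fetch_round_sketch, the demo_* helpers, and the simplified hook signatures (the `page` argument is dropped) are all assumptions made for illustration.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional


async def fetch_round_sketch(
    discover: Callable[[set[str]], Awaitable[list[str]]],
    is_valid: Callable[[str], bool],
    scrape: Callable[[str], Awaitable[Optional[dict]]],
    known: set[str],
) -> list[dict]:
    """Run one discover -> validate -> scrape pass (illustrative only)."""
    results: list[dict] = []
    for url in await discover(known):
        # Skip URLs seen in an earlier step and URLs the predicate rejects,
        # so no scraping work is spent on them.
        if url in known or not is_valid(url):
            continue
        known.add(url)
        item = await scrape(url)
        if item is not None:
            results.append(item)
    return results


# Hypothetical hooks standing in for a Scraper subclass.
async def demo_discover(known: set[str]) -> list[str]:
    return [
        "https://example.com/articles/1",
        "https://example.com/about",       # rejected by the predicate
        "https://example.com/articles/1",  # duplicate, already known
    ]


def demo_is_valid(url: str) -> bool:
    return url.startswith("https://example.com/articles/")


async def demo_scrape(url: str) -> Optional[dict]:
    return {"type": "article", "url": url}


def run_demo() -> tuple[list[dict], set[str]]:
    known: set[str] = set()
    items = asyncio.run(
        fetch_round_sketch(demo_discover, demo_is_valid, demo_scrape, known)
    )
    return items, known
```

In this sketch the `known` set is mutated in place, so a caller could run several rounds and never scrape the same URL twice. Whether spidur tracks `known` this way is not stated in the README.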

🖥️ Command line

spidur ships a small CLI. Point it at the module(s) that register your scrapers and pass one or more name=start_url targets:

spidur --module my_scrapers articles=https://example.com/articles

Results are written to stdout as JSON. Use -v for info-level logging.
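
The name=start_url target format can be illustrated with a small parser. This is only a sketch of how such an argument could be split; parse_target is a hypothetical helper, not part of spidur, and the CLI's real parsing and error messages may differ.

```python
def parse_target(arg: str) -> tuple[str, str]:
    """Split a 'name=start_url' argument into (name, start_url)."""
    # Split on the first '=' only, since URLs may contain '=' in a query string.
    name, sep, start_url = arg.partition("=")
    if not sep or not name or not start_url:
        raise ValueError(f"expected name=start_url, got {arg!r}")
    return name, start_url
```

Splitting on the first `=` means a target such as articles=https://example.com/articles?page=2 keeps its full URL.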


🧠 How it works

  1. Each Scraper subclass defines three single-purpose hooks:

    • is_valid_url(url) — a pure predicate; keeps invalid URLs out of scope.
    • discover_urls(page, known) — finds new pages to scrape.
    • scrape_page(page, url) — extracts structured data.

    The discover → validate → scrape orchestration is provided once by the base class as fetch_round(known), so subclasses never re-implement it.

  2. You register scrapers in ScraperFactory.

  3. The Runner:

    • Spawns multiple processes.
    • Executes all scrapers concurrently.
    • Aggregates their results into a single dictionary keyed by scraper name.
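
The Runner steps above (spawn processes, run scrapers concurrently, aggregate by name) can be sketched with the standard library alone. This is an assumption-laden illustration, not spidur's code: scrape_target, run_one, aggregate, and run_all are hypothetical names, and targets are plain (name, start_url) tuples rather than Target objects.

```python
import asyncio
from collections.abc import Iterable
from concurrent.futures import ProcessPoolExecutor
from typing import Optional


async def scrape_target(name: str, start_url: str) -> list[dict]:
    """Stand-in for a scraper's async work; returns one fake item."""
    await asyncio.sleep(0)
    return [{"type": name, "url": f"{start_url}/1"}]


def run_one(target: tuple[str, str]) -> tuple[str, list[dict]]:
    """Worker entry point: each process drives its own event loop."""
    name, start_url = target
    return name, asyncio.run(scrape_target(name, start_url))


def aggregate(pairs: Iterable[tuple[str, list[dict]]]) -> dict[str, list[dict]]:
    """Merge (name, items) pairs into one dict keyed by scraper name."""
    results: dict[str, list[dict]] = {}
    for name, items in pairs:
        results.setdefault(name, []).extend(items)
    return results


def run_all(
    targets: list[tuple[str, str]], max_workers: Optional[int] = None
) -> dict[str, list[dict]]:
    """Run every target in a process pool and collect the results."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map is consumed inside the with-block, before the pool shuts down.
        return aggregate(pool.map(run_one, targets))
```

Because run_all starts worker processes, a script that calls it should do so under an `if __name__ == "__main__":` guard. The worker function is defined at module level so it can be pickled and sent to the pool.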

🧪 Development

Install the project with its dev tooling and run the full quality gate:

poetry install
poetry run ruff check .     # lint
poetry run black --check .  # formatting
poetry run mypy             # static types (strict)
poetry run pytest           # tests

spidur is fully type-annotated and ships a py.typed marker, so downstream projects get type checking for free.

🧩 Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️


Source Distribution

spidur-0.3.0.tar.gz (9.3 kB)

Uploaded Source

Built Distribution


spidur-0.3.0-py3-none-any.whl (11.5 kB)

Uploaded Python 3

File details

Details for the file spidur-0.3.0.tar.gz.

File metadata

  • Download URL: spidur-0.3.0.tar.gz
  • Upload date:
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for spidur-0.3.0.tar.gz
Algorithm Hash digest
SHA256 23e5df7a3a49825e9be46ef5469664da4abffecfed53cb4db203b1e6260e2d14
MD5 3d634369e423a817bfaf63e123d097b7
BLAKE2b-256 77542d384bd5d69b7fc376c35c6604756b6e69048d1fad74be04b183471eaa09


File details

Details for the file spidur-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: spidur-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for spidur-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f252438f5808ea709361d05d96bfe6c18b384a477cc3d6e59295ef3ba60ec28b
MD5 164ac8f73d35be71de4f1eed775fd098
BLAKE2b-256 6da41b1a6fecd5f5dce2c6cc77598537ee2871dfde53c24578888b7bd20ad663

