🕷️ A lightweight, generic parallel runner for custom scrapers

spidur 🕷️

spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel, even on the same domain.
It helps you coordinate different scrapers, filter out invalid URLs before any scraping work is spent on them, and collect all results at once.


Core ideas

  • Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
  • Parallel execution — utilizes all CPU cores.
  • Async + multiprocessing safe — works across async methods and process pools.
  • No opinions — you control discovery, validation, and scraping logic.
  • Results collected automatically — each scraper contributes to a single aggregated result set.

Install

pip install spidur

or with Poetry:

poetry add spidur

Example

from typing import Any

from spidur import Runner, Scraper, ScraperFactory, ScrapeResult, Target


# --- define your scrapers ---
# Implement three small hooks. The discover -> validate -> scrape loop is
# provided for you by Scraper.fetch_round(), so you never write it yourself.

class ArticleScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "article", "url": url, "data": f"Content of {url}"}


class CommentScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}


# --- register both scrapers ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---
# Put the code above in a module (e.g. `my_scrapers.py`) and pass its name as
# `bootstrap` so each worker process re-registers the scrapers. This is
# required on macOS/Windows, where multiprocessing uses the `spawn` start
# method and workers do not inherit the parent's registrations.

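# Guard the entry point: under `spawn`, workers re-import the main module,
# so an unguarded Runner.run() call would execute again in every worker.
if __name__ == "__main__":
    results = Runner.run(targets, bootstrap=["my_scrapers"])

    for name, items in results.items():
        print(f"Results from {name}:")
        for item in items:
            print("  →", item)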
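
The comment in the example says the discover -> validate -> scrape loop comes from Scraper.fetch_round(). As a rough illustration of what one such round could involve, here is a minimal, self-contained sketch. SketchScraper and DemoScraper are hypothetical stand-ins, not spidur classes, and the real fetch_round may well differ. In particular, the source shows it taking only `known`, so passing `page` explicitly is an assumption made here to keep the sketch runnable.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Optional


class SketchScraper(ABC):
    """Hypothetical base class mirroring the three hooks (not spidur's code)."""

    @abstractmethod
    def is_valid_url(self, url: str) -> bool: ...

    @abstractmethod
    async def discover_urls(self, page: Any, known: set[str]) -> list[str]: ...

    @abstractmethod
    async def scrape_page(self, page: Any, url: str) -> Optional[dict[str, Any]]: ...

    async def fetch_round(self, page: Any, known: set[str]) -> list[dict[str, Any]]:
        """One discover -> validate -> scrape pass over newly found URLs."""
        results: list[dict[str, Any]] = []
        for url in await self.discover_urls(page, known):
            # Skip anything already seen or rejected by the predicate.
            if url in known or not self.is_valid_url(url):
                continue
            known.add(url)
            item = await self.scrape_page(page, url)
            if item is not None:
                results.append(item)
        return results


class DemoScraper(SketchScraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/about",       # rejected by is_valid_url
            "https://example.com/articles/1",  # duplicate, skipped via `known`
        ]

    async def scrape_page(self, page: Any, url: str) -> Optional[dict[str, Any]]:
        return {"type": "article", "url": url}


items = asyncio.run(DemoScraper().fetch_round(page=None, known=set()))
print(items)  # [{'type': 'article', 'url': 'https://example.com/articles/1'}]
```

Keeping is_valid_url a pure predicate is what lets the loop reject out-of-scope URLs before any scraping work happens.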

Command line

spidur ships a small CLI. Point it at the module(s) that register your scrapers and pass one or more name=start_url targets:

spidur --module my_scrapers articles=https://example.com/articles

Results are written to stdout as JSON. Use -v for info-level logging.
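
To make the name=start_url target format concrete, the sketch below shows one way such an argument could be split. TargetSpec and parse_target are hypothetical names used for illustration only; this is not the CLI's actual parser.

```python
from typing import NamedTuple


class TargetSpec(NamedTuple):
    """Hypothetical container for one parsed `name=start_url` argument."""

    name: str
    start_url: str


def parse_target(spec: str) -> TargetSpec:
    """Split on the first '=' only, so '=' inside a query string survives."""
    name, sep, start_url = spec.partition("=")
    if not sep or not name or not start_url:
        raise ValueError(f"expected name=start_url, got {spec!r}")
    return TargetSpec(name, start_url)


specs = [
    "articles=https://example.com/articles",
    "comments=https://example.com/comments?page=1",
]
parsed = [parse_target(s) for s in specs]
for target in parsed:
    print(target.name, "->", target.start_url)
```

Splitting on the first "=" matters because start URLs may carry their own "=" characters in query parameters.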


How it works

  1. Each Scraper subclass defines three single-purpose hooks:

    • is_valid_url(url) — a pure predicate; keeps invalid URLs out of scope.
    • discover_urls(page, known) — finds new pages to scrape.
    • scrape_page(page, url) — extracts structured data.

    The discover → validate → scrape orchestration is provided once by the base class as fetch_round(known), so subclasses never re-implement it.

  2. You register scrapers in ScraperFactory.

  3. The Runner:

    • Spawns multiple processes.
    • Executes all scrapers concurrently.
    • Aggregates their results into a single dictionary keyed by scraper name.
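
The Runner steps above can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not spidur's implementation: scrape_target, run_one, and run_all are hypothetical names, and the scraper body is a placeholder that returns made-up items instead of fetching pages.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Any, Callable

Item = dict[str, Any]


async def scrape_target(name: str, start_url: str) -> list[Item]:
    """Placeholder async scraper; a real one would fetch and parse pages."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return [{"type": name, "url": f"{start_url}/{i}"} for i in (1, 2)]


def run_one(target: tuple[str, str]) -> tuple[str, list[Item]]:
    """Worker entry point: each worker drives its own event loop."""
    name, start_url = target
    return name, asyncio.run(scrape_target(name, start_url))


def run_all(
    targets: list[tuple[str, str]],
    executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
) -> dict[str, list[Item]]:
    """Fan targets out to a pool and merge the results keyed by target name."""
    results: dict[str, list[Item]] = {}
    with executor_factory() as pool:
        for name, items in pool.map(run_one, targets):
            results.setdefault(name, []).extend(items)
    return results
```

With the default ProcessPoolExecutor, call run_all(targets) from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing executor_factory=ThreadPoolExecutor swaps in threads, which is handy for a quick in-process check.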

Development

Install the project with its dev tooling and run the full quality gate:

poetry install
poetry run ruff check .     # lint
poetry run black --check .  # formatting
poetry run mypy             # static types (strict)
poetry run pytest           # tests

spidur is fully type-annotated and ships a py.typed marker, so downstream projects get type checking for free.

Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel.
