🕷️ A lightweight, generic parallel runner for custom scrapers

These details have not been verified by PyPI

Project description

spidur 🕷️

spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel — even on the same domain.
It helps you coordinate different scrapers, filter out invalid URLs before any work is spent on them, and collect all results at once.

✨ Core ideas

Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
Parallel execution — utilizes all CPU cores.
Async + multiprocessing safe — works across async methods and process pools.
No opinions — you control discovery, validation, and scraping logic.
Results collected automatically — each scraper contributes to a single aggregated result set.

📦 Install

pip install spidur

or with Poetry:

poetry add spidur

⚡ Example

from typing import Any

from spidur import Runner, Scraper, ScraperFactory, ScrapeResult, Target


# --- define your scrapers ---
# Implement three small hooks. The discover -> validate -> scrape loop is
# provided for you by Scraper.fetch_round(), so you never write it yourself.

class ArticleScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "article", "url": url, "data": f"Content of {url}"}


class CommentScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}


# --- register both scrapers ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---
# Put the code above in a module (e.g. `my_scrapers.py`) and pass its name as
# `bootstrap` so each worker process re-registers the scrapers. This is
# required on macOS/Windows, where multiprocessing uses the `spawn` start
# method and workers do not inherit the parent's registrations.

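# The `__main__` guard keeps spawned workers from re-running this block when
# they re-import the launching script.
if __name__ == "__main__":
    results = Runner.run(targets, bootstrap=["my_scrapers"])

    for name, items in results.items():
        print(f"Results from {name}:")
        for item in items:
            print("  →", item)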
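
The comment on `bootstrap` above can be made concrete with a small sketch of what re-registering in each worker might look like. This is not spidur's actual implementation: `bootstrap_worker` and `make_pool` are hypothetical names, and the only assumption is that registration happens as an import side effect, as it does when `my_scrapers` calls `ScraperFactory.register` at module level.

```python
import importlib
from concurrent.futures import ProcessPoolExecutor


def bootstrap_worker(module_names: list[str]) -> None:
    """Import each module so its import-time registrations run in this process."""
    for name in module_names:
        # Under the `spawn` start method a worker starts from a fresh
        # interpreter, so nothing registered in the parent exists here yet.
        importlib.import_module(name)


def make_pool(module_names: list[str]) -> ProcessPoolExecutor:
    """Hypothetical helper: a pool whose workers bootstrap themselves first."""
    # The initializer runs once in every worker process before it takes jobs.
    return ProcessPoolExecutor(
        initializer=bootstrap_worker,
        initargs=(module_names,),
    )
```

With a pool built this way, a worker that receives only a scraper name can still look it up, because the registering module has already been imported in that process.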

🖥️ Command line

spidur ships a small CLI. Point it at the module(s) that register your scrapers and pass one or more name=start_url targets:

spidur --module my_scrapers articles=https://example.com/articles

Results are written to stdout as JSON. Use -v for info-level logging.
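
For readers curious how `name=start_url` arguments can be handled, here is a minimal `argparse` sketch. It is an illustration only, not spidur's CLI code: `parse_target`, `build_parser`, and the `spidur-sketch` program name are hypothetical, and repeating `--module` once per module is an assumption.

```python
import argparse


def parse_target(spec: str) -> tuple[str, str]:
    """Split a 'name=start_url' spec into (name, start_url)."""
    # partition() splits on the first '=' only, so an '=' inside the URL's
    # query string (for example '?page=2') is preserved.
    name, sep, start_url = spec.partition("=")
    if not sep or not name or not start_url:
        raise argparse.ArgumentTypeError(f"expected name=start_url, got {spec!r}")
    return name, start_url


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="spidur-sketch")
    # Assumption: --module may be given more than once.
    parser.add_argument("--module", action="append", default=[])
    parser.add_argument("-v", dest="verbose", action="store_true")
    parser.add_argument("targets", nargs="+", type=parse_target)
    return parser


args = build_parser().parse_args(
    ["--module", "my_scrapers", "articles=https://example.com/articles?page=2"]
)
print(args.module, args.targets)
```

Splitting on the first `=` is the design choice that matters here, since start URLs often carry their own `=` characters.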

🧠 How it works

Each Scraper subclass defines three single-purpose hooks:
- is_valid_url(url) — a pure predicate; keeps invalid URLs out of scope.
- discover_urls(page, known) — finds new pages to scrape.
- scrape_page(page, url) — extracts structured data.
The discover → validate → scrape orchestration is provided once by the base class as fetch_round(known), so subclasses never re-implement it.
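
The discover → validate → scrape round described above could look roughly like the following. This is a standalone sketch, not the library's source: `SketchScraper` is a hypothetical class that does not inherit from spidur's `Scraper`, the `page` object is a placeholder, and skipping URLs already in `known` is an assumption about how a round avoids repeat work.

```python
import asyncio
from typing import Any, Optional


class SketchScraper:
    """Hypothetical stand-in for a scraper; not spidur's real base class."""

    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
            "https://example.com/about",  # out of scope, filtered below
        ]

    async def scrape_page(self, page: Any, url: str) -> Optional[dict[str, str]]:
        return {"type": "article", "url": url}

    async def fetch_round(self, known: set[str]) -> list[dict[str, str]]:
        page = None  # placeholder for a browser page or HTTP session
        results: list[dict[str, str]] = []
        for url in await self.discover_urls(page, known):
            # Validate before scraping so no work is spent on URLs that are
            # out of scope or were already seen.
            if url in known or not self.is_valid_url(url):
                continue
            known.add(url)
            item = await self.scrape_page(page, url)
            if item is not None:
                results.append(item)
        return results


known = {"https://example.com/articles/1"}
items = asyncio.run(SketchScraper().fetch_round(known))
print(items)
```

Here only the second article is scraped: the first is already in `known`, and the `/about` URL fails `is_valid_url`.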
You register scrapers in ScraperFactory.
The Runner:
- Spawns multiple processes.
- Executes all scrapers concurrently.
- Aggregates their results into a single dictionary keyed by scraper name.
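
The register, run, and aggregate steps above can be sketched in a few lines. To keep the example self-contained it uses a thread pool as a stand-in for spidur's process pool, and every name in it (`REGISTRY`, `register`, `run_one`, `run_all`) is hypothetical rather than part of the library.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Awaitable, Callable

# Hypothetical registry: scraper name -> coroutine function taking a start URL.
ScrapeFn = Callable[[str], Awaitable[list[dict[str, str]]]]
REGISTRY: dict[str, ScrapeFn] = {}


def register(name: str, fn: ScrapeFn) -> None:
    REGISTRY[name] = fn


async def scrape_articles(start_url: str) -> list[dict[str, str]]:
    return [{"type": "article", "url": f"{start_url}/1"}]


async def scrape_comments(start_url: str) -> list[dict[str, str]]:
    return [{"type": "comment", "url": f"{start_url}/1"}]


def run_one(target: tuple[str, str]) -> tuple[str, list[dict[str, str]]]:
    name, start_url = target
    # Each worker drives its own event loop, which is one way async hooks
    # can run inside a pool worker.
    return name, asyncio.run(REGISTRY[name](start_url))


def run_all(targets: list[tuple[str, str]]) -> dict[str, list[dict[str, str]]]:
    results: dict[str, list[dict[str, str]]] = {}
    # Threads stand in for processes here; the aggregation step is the same.
    with ThreadPoolExecutor() as pool:
        for name, items in pool.map(run_one, targets):
            results.setdefault(name, []).extend(items)
    return results


register("articles", scrape_articles)
register("comments", scrape_comments)
results = run_all(
    [
        ("articles", "https://example.com/articles"),
        ("comments", "https://example.com/comments"),
    ]
)
print(results)
```

Keying the result by scraper name is what lets two scrapers on the same domain report separately, which matches the dictionary shape described above.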

🧪 Development

Install the project with its dev tooling and run the full quality gate:

poetry install
poetry run ruff check .     # lint
poetry run black --check .  # formatting
poetry run mypy             # static types (strict)
poetry run pytest           # tests

spidur is fully type-annotated and ships a py.typed marker, so downstream projects get type checking for free.

🧩 Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️

Project details

Release history

- 0.3.1 (Jun 11, 2026)
- 0.3.0 (Jun 11, 2026), this version
- 0.2.0 (Oct 8, 2025)
- 0.1.0 (Sep 29, 2025)

Download files

Source Distribution

spidur-0.3.0.tar.gz (9.3 kB view details)

Uploaded Jun 11, 2026 Source

Built Distribution

spidur-0.3.0-py3-none-any.whl (11.5 kB view details)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file spidur-0.3.0.tar.gz.

File metadata

Download URL: spidur-0.3.0.tar.gz
Upload date: Jun 11, 2026
Size: 9.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for spidur-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`23e5df7a3a49825e9be46ef5469664da4abffecfed53cb4db203b1e6260e2d14`
MD5	`3d634369e423a817bfaf63e123d097b7`
BLAKE2b-256	`77542d384bd5d69b7fc376c35c6604756b6e69048d1fad74be04b183471eaa09`

File details

Details for the file spidur-0.3.0-py3-none-any.whl.

File metadata

Download URL: spidur-0.3.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 11.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for spidur-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f252438f5808ea709361d05d96bfe6c18b384a477cc3d6e59295ef3ba60ec28b`
MD5	`164ac8f73d35be71de4f1eed775fd098`
BLAKE2b-256	`6da41b1a6fecd5f5dce2c6cc77598537ee2871dfde53c24578888b7bd20ad663`
