🕷️ A lightweight, generic parallel runner for custom scrapers
spidur 🕷️
spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel — even on the same domain.
It helps you coordinate different scrapers, filter out invalid URLs before any work is spent on them, and collect every scraper's results in one place.
✨ Core ideas
- Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
- Parallel execution — utilizes all CPU cores.
- Async + multiprocessing safe — works across async methods and process pools.
- No opinions — you control discovery, validation, and scraping logic.
- Results collected automatically — each scraper contributes to a single aggregated result set.
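To make the "async + multiprocessing" idea concrete, here is a minimal standalone sketch of the general pattern: each worker process starts its own event loop with `asyncio.run`, and the parent merges the per-scraper results into one dictionary. It does not use spidur; `scrape_all`, `run_in_worker`, and `run_parallel` are hypothetical names chosen for illustration, and spidur's internals may differ.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


async def scrape_all(name: str, urls: list[str]) -> list[dict[str, str]]:
    """Stand-in for an async scraper: 'fetch' every URL concurrently."""

    async def fetch(url: str) -> dict[str, str]:
        await asyncio.sleep(0)  # placeholder for real network I/O
        return {"scraper": name, "url": url}

    return list(await asyncio.gather(*(fetch(u) for u in urls)))


def run_in_worker(job: tuple[str, list[str]]) -> tuple[str, list[dict[str, str]]]:
    """Synchronous entry point: each worker process runs its own event loop."""
    name, urls = job
    return name, asyncio.run(scrape_all(name, urls))


def run_parallel(jobs: list[tuple[str, list[str]]]) -> dict[str, list[dict[str, str]]]:
    """Fan jobs out over a process pool and merge results by scraper name."""
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(run_in_worker, jobs))
```

Call `run_parallel(...)` from under an `if __name__ == "__main__":` guard, since process pools on macOS and Windows re-import the main module in every worker. The worker function is synchronous on purpose: coroutines and event loops cannot be pickled across process boundaries, so only plain data goes in and comes out.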
📦 Install
pip install spidur
or with Poetry:
poetry add spidur
⚡ Example
from typing import Any

from spidur import Runner, Scraper, ScraperFactory, ScrapeResult, Target

# --- define your scrapers ---
# Implement three small hooks. The discover -> validate -> scrape loop is
# provided for you by Scraper.fetch_round(), so you never write it yourself.


class ArticleScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "article", "url": url, "data": f"Content of {url}"}


class CommentScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}


# --- register both scrapers ---
ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)

# --- define your scrape targets ---
targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]

# --- run them all in parallel ---
# Put the code above in a module (e.g. `my_scrapers.py`) and pass its name as
# `bootstrap` so each worker process re-registers the scrapers. This is
# required on macOS/Windows, where multiprocessing uses the `spawn` start
# method and workers do not inherit the parent's registrations.
# The __main__ guard keeps workers from re-running this block when they
# import the module.
if __name__ == "__main__":
    results = Runner.run(targets, bootstrap=["my_scrapers"])
    for name, items in results.items():
        print(f"Results from {name}:")
        for item in items:
            print(" →", item)
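The reason workers need to re-register scrapers is easier to see with a toy registry. The sketch below is a hypothetical stand-in, not spidur's code: `_REGISTRY`, `register`, `create`, and `bootstrap` are illustrative names, and the assumption that the `bootstrap` argument works by importing the listed modules is a guess based on the comment in the example above.

```python
import importlib

# Hypothetical stand-in for a name -> class registry; not spidur's actual code.
_REGISTRY: dict[str, type] = {}


def register(name: str, cls: type) -> None:
    """Record which class handles targets with this name."""
    _REGISTRY[name] = cls


def create(name: str) -> object:
    """Instantiate the class registered under `name`."""
    if name not in _REGISTRY:
        raise LookupError(f"no scraper registered for {name!r}")
    return _REGISTRY[name]()


def bootstrap(modules: list[str]) -> None:
    """Import each module so its top-level register() calls run in this process."""
    for module_name in modules:
        importlib.import_module(module_name)


class ArticleScraper:
    """Placeholder class standing in for a real scraper."""


register("articles", ArticleScraper)
print(type(create("articles")).__name__)  # → ArticleScraper
```

A registry like this is ordinary module-level state. Under the `spawn` start method a worker begins with a fresh interpreter, so the dictionary is empty until something imports the module that calls `register()`. Importing by module name is one simple way to rebuild it in each worker.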
🖥️ Command line
spidur ships a small CLI. Point it at the module(s) that register your
scrapers and pass one or more name=start_url targets:
spidur --module my_scrapers articles=https://example.com/articles
Results are written to stdout as JSON. Use -v for info-level logging.
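Because the output is JSON, it can be piped into any script. The sketch below counts items per scraper. It assumes the JSON is an object keyed by scraper name, mirroring the dictionary returned by Runner.run; the sample payload and the `summarize` helper are hypothetical, and the actual fields depend on what your scrape_page hooks return.

```python
import json

# Hypothetical sample of what the CLI might print (assumed shape: an object
# keyed by scraper name, each value a list of scraped items).
sample_stdout = """
{
  "articles": [
    {"type": "article", "url": "https://example.com/articles/1", "data": "..."},
    {"type": "article", "url": "https://example.com/articles/2", "data": "..."}
  ]
}
"""


def summarize(raw: str) -> dict[str, int]:
    """Count scraped items per scraper name."""
    results = json.loads(raw)
    return {name: len(items) for name, items in results.items()}


print(summarize(sample_stdout))  # → {'articles': 2}
```

In a real pipeline, replace `sample_stdout` with `sys.stdin.read()` and pipe the spidur command's output into the script.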
🧠 How it works
- Each Scraper subclass defines three single-purpose hooks:
  - is_valid_url(url): a pure predicate; keeps invalid URLs out of scope.
  - discover_urls(page, known): finds new pages to scrape.
  - scrape_page(page, url): extracts structured data.

  The discover → validate → scrape orchestration is provided once by the base class as fetch_round(known), so subclasses never re-implement it.
- You register scrapers in ScraperFactory.
- The Runner:
  - Spawns multiple processes.
  - Executes all scrapers concurrently.
  - Aggregates their results into a single dictionary keyed by scraper name.
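The division of labour between the three hooks and the shared loop can be sketched as follows. This is an illustration of the pattern only, not spidur's implementation: `SketchScraper` and `DemoScraper` are hypothetical classes, and `page` is passed as `None` because this document does not show how spidur obtains a page handle.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Optional


class SketchScraper(ABC):
    """Hypothetical base class illustrating the three-hook design."""

    @abstractmethod
    def is_valid_url(self, url: str) -> bool: ...

    @abstractmethod
    async def discover_urls(self, page: Any, known: set[str]) -> list[str]: ...

    @abstractmethod
    async def scrape_page(self, page: Any, url: str) -> Optional[dict[str, Any]]: ...

    async def fetch_round(self, known: set[str]) -> list[dict[str, Any]]:
        """One discover -> validate -> scrape pass; adds scraped URLs to `known`."""
        page = None  # assumption: acquiring a real page handle is omitted here
        results = []
        for url in await self.discover_urls(page, known):
            if url in known or not self.is_valid_url(url):
                continue  # skip repeats and out-of-scope URLs before any work
            known.add(url)
            item = await self.scrape_page(page, url)
            if item is not None:
                results.append(item)
        return results


class DemoScraper(SketchScraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/about",       # rejected by is_valid_url
            "https://example.com/articles/1",  # duplicate, skipped
        ]

    async def scrape_page(self, page: Any, url: str) -> Optional[dict[str, Any]]:
        return {"url": url}


known: set[str] = set()
first = asyncio.run(DemoScraper().fetch_round(known))
second = asyncio.run(DemoScraper().fetch_round(known))
print(first)   # → [{'url': 'https://example.com/articles/1'}]
print(second)  # → []
```

Keeping is_valid_url a pure, synchronous predicate means it can be applied cheaply to every discovered URL, and the second round returns nothing because `known` already holds everything in scope.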
🧪 Development
Install the project with its dev tooling and run the full quality gate:
poetry install
poetry run ruff check . # lint
poetry run black --check . # formatting
poetry run mypy # static types (strict)
poetry run pytest # tests
spidur is fully type-annotated and ships a py.typed marker, so downstream
projects get type checking for free.
🧩 Why “spidur”?
Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️
Download files
File details
Details for the file spidur-0.3.0.tar.gz.
File metadata
- Download URL: spidur-0.3.0.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23e5df7a3a49825e9be46ef5469664da4abffecfed53cb4db203b1e6260e2d14 |
| MD5 | 3d634369e423a817bfaf63e123d097b7 |
| BLAKE2b-256 | 77542d384bd5d69b7fc376c35c6604756b6e69048d1fad74be04b183471eaa09 |
File details
Details for the file spidur-0.3.0-py3-none-any.whl.
File metadata
- Download URL: spidur-0.3.0-py3-none-any.whl
- Upload date:
- Size: 11.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f252438f5808ea709361d05d96bfe6c18b384a477cc3d6e59295ef3ba60ec28b |
| MD5 | 164ac8f73d35be71de4f1eed775fd098 |
| BLAKE2b-256 | 6da41b1a6fecd5f5dce2c6cc77598537ee2871dfde53c24578888b7bd20ad663 |