spidur 🕷️
A lightweight, generic parallel runner for custom scrapers.
spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel — even on the same domain.
It helps you coordinate different scrapers, validate URLs before fetching (no wasted work), and collect all results in one place.
✨ Core ideas
- Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
- Parallel execution — utilizes all CPU cores.
- Async + multiprocessing safe — works across async methods and process pools.
- No opinions — you control discovery, validation, and scraping logic.
- Results collected automatically — each scraper contributes to a single aggregated result set.
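Because spidur leaves validation logic to you, the shape of a URL check is entirely up to your scraper. As a plain-Python illustration (no spidur APIs involved; the function name and `seen` set are hypothetical), a check of the kind you might plug into a scraper could combine a scheme/host test with deduplication:

```python
from urllib.parse import urlparse


def is_valid_url(url: str, seen: set[str], host: str = "example.com") -> bool:
    """Accept only http(s) URLs on the expected host that we have not seen yet."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or parts.netloc != host:
        return False
    if url in seen:
        return False  # duplicate: already queued or scraped
    seen.add(url)
    return True


seen: set[str] = set()
print(is_valid_url("https://example.com/articles/1", seen))  # True
print(is_valid_url("https://example.com/articles/1", seen))  # False (duplicate)
print(is_valid_url("ftp://example.com/articles/2", seen))    # False (wrong scheme)
```

Rejecting duplicates and off-host URLs before fetching is what keeps a parallel run from doing wasted work.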
📦 Install

```shell
pip install spidur
```

or with Poetry:

```shell
poetry add spidur
```
⚡ Example

```python
from spidur.core import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner


# --- define your base scrapers ---
class ArticleScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "article", "url": url, "data": f"Content of {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


class CommentScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


# --- register both scrapers for the same domain ---
ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)

# --- define your scrape targets ---
targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]

# --- run them all in parallel ---
results = Runner.run(targets, seen=set())

for name, items in results.items():
    print(f"Results from {name}:")
    for item in items:
        print(" →", item)
```
🧠 How it works

- Each `Scraper` subclass defines:
  - `is_valid_url(url)` — ensures no invalid or duplicate URLs are processed.
  - `discover_urls()` — finds new pages to scrape.
  - `scrape_page()` — extracts structured data.
  - `fetch_page()` — orchestrates the above.
- You register scrapers in `ScraperFactory`.
- The `Runner`:
  - Spawns multiple processes.
  - Executes all scrapers concurrently.
  - Aggregates their results into a single dictionary keyed by scraper name.
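The "run concurrently, then aggregate under each scraper's name" half of this flow can be sketched in plain `asyncio` (the process-pool fan-out is omitted here for brevity, and `fake_scrape`/`run_all` are illustrative stand-ins, not spidur's API):

```python
import asyncio


async def fake_scrape(name: str) -> list[dict]:
    # Stand-in for a scraper's fetch_page(): yield as real I/O would, return items.
    await asyncio.sleep(0)
    return [{"scraper": name, "url": f"https://example.com/{name}/1"}]


async def run_all(names: list[str]) -> dict[str, list[dict]]:
    # Run every scraper concurrently, then key each result list by scraper name.
    batches = await asyncio.gather(*(fake_scrape(n) for n in names))
    return dict(zip(names, batches))


results = asyncio.run(run_all(["articles", "comments"]))
print(sorted(results))  # ['articles', 'comments']
```

The aggregated dictionary has the same shape as what the example above iterates over: one list of structured items per registered scraper name.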
🧪 Running tests

```shell
poetry install
poetry run pytest
```

or with plain pip:

```shell
pip install -e .
pytest
```
🧩 Why “spidur”?
Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️
File details
Details for the file spidur-0.2.0.tar.gz.

File metadata
- Download URL: spidur-0.2.0.tar.gz
- Size: 4.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6487c7921ac6389bb6c4e690d78a91264de0ea0c37502eb4eed6b9f8bb073df9 |
| MD5 | 32bc0d065a5f55f95de4d8926ff45d1c |
| BLAKE2b-256 | 4ff53681c36ff496e324b2b894bf8165b8389bbfb3f0f164ad7c4449f0ded5fb |
Provenance
The following attestation bundles were made for spidur-0.2.0.tar.gz:

Publisher: ci.yaml on ra0x3/spidur

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: spidur-0.2.0.tar.gz
- Subject digest: 6487c7921ac6389bb6c4e690d78a91264de0ea0c37502eb4eed6b9f8bb073df9
- Sigstore transparency entry: 591144860
- Permalink: ra0x3/spidur@46c8daa0d2426f9246f2c23cf25e5bd39c531889
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ra0x3
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yaml@46c8daa0d2426f9246f2c23cf25e5bd39c531889
- Trigger Event: push
File details
Details for the file spidur-0.2.0-py3-none-any.whl.

File metadata
- Download URL: spidur-0.2.0-py3-none-any.whl
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b3933db0d11c25a9ea4d5db4d2f645b3649bcda97aef827534e5ed312e115e6 |
| MD5 | 693e2e601c8c831b34ab5d15b67886c2 |
| BLAKE2b-256 | af29019dbe0dd43a2fac7dac24439c59dca3aaca86e0f5f3b4428d3239171a2e |
Provenance
The following attestation bundles were made for spidur-0.2.0-py3-none-any.whl:

Publisher: ci.yaml on ra0x3/spidur

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: spidur-0.2.0-py3-none-any.whl
- Subject digest: 3b3933db0d11c25a9ea4d5db4d2f645b3649bcda97aef827534e5ed312e115e6
- Sigstore transparency entry: 591144871
- Permalink: ra0x3/spidur@46c8daa0d2426f9246f2c23cf25e5bd39c531889
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ra0x3
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yaml@46c8daa0d2426f9246f2c23cf25e5bd39c531889
- Trigger Event: release