spidur 🕷️


spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel — even on the same domain.
It helps you coordinate different scrapers, ensure valid URLs (no wasted work), and collect all results at once.


✨ Core ideas

  • Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
  • Parallel execution — utilizes all CPU cores.
  • Async + multiprocessing safe — works across async methods and process pools.
  • No opinions — you control discovery, validation, and scraping logic.
  • Results collected automatically — each scraper contributes to a single aggregated result set.

📦 Install

pip install spidur

or with Poetry:

poetry add spidur

⚡ Example

from spidur.core import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner


# --- define your base scrapers ---

class ArticleScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "article", "url": url, "data": f"Content of {url}"}

    async def fetch_page(self, known):
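        # Discover candidate URLs, keep only the valid ones, then scrape each
        # (this demo passes None where a real page handle would go).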
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


class CommentScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


# --- register both scrapers for the same domain ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---

results = Runner.run(targets, seen=set())

for name, items in results.items():
    print(f"Results from {name}:")
    for item in items:
        print("  →", item)
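
Assuming the Runner aggregates each scraper's fetch_page results under its registered name (as described in the next section), the loop above would print something along these lines:

Results from articles:
  → {'type': 'article', 'url': 'https://example.com/articles/1', 'data': 'Content of https://example.com/articles/1'}
  → {'type': 'article', 'url': 'https://example.com/articles/2', 'data': 'Content of https://example.com/articles/2'}
Results from comments:
  → {'type': 'comment', 'url': 'https://example.com/comments/1', 'data': 'Comments from https://example.com/comments/1'}
  → {'type': 'comment', 'url': 'https://example.com/comments/2', 'data': 'Comments from https://example.com/comments/2'}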

🧠 How it works

  1. Each Scraper subclass defines:

    • is_valid_url(url) — ensures no invalid or duplicate URLs are processed.
    • discover_urls() — finds new pages to scrape.
    • scrape_page() — extracts structured data.
    • fetch_page() — orchestrates the above.
  2. You register each scraper with ScraperFactory under a name; each Target refers to its scraper by that name.

  3. The Runner:

    • Spawns multiple processes.
    • Executes all scrapers concurrently.
    • Aggregates their results into a single dictionary keyed by scraper name (see the sketch below).
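
For intuition only, here is a minimal sketch of the fan-out-and-aggregate pattern described above. It is not spidur's actual implementation: the ProcessPoolExecutor, the _run_one helper, and the run_all function are illustrative stand-ins for what Runner.run does internally.

import asyncio
from concurrent.futures import ProcessPoolExecutor


def _run_one(name, scraper_cls, seen):
    # Worker process: run one scraper's async pipeline to completion.
    scraper = scraper_cls()
    items = asyncio.run(scraper.fetch_page(seen))
    return name, items


def run_all(scrapers, seen):
    # scrapers: {name: Scraper subclass}. One process per scraper,
    # results aggregated into a single dict keyed by scraper name.
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_run_one, name, cls, seen)
            for name, cls in scrapers.items()
        ]
        for future in futures:
            name, items = future.result()
            results[name] = items
    return results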

🧪 Running tests

poetry install
poetry run pytest

or with plain pip:

pip install -e .
pytest

🧩 Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️
