spidur 🕷️


spidur is a lightweight, hackable framework for running multiple custom scrapers in parallel — even on the same domain.
It helps you coordinate different scrapers, ensure valid URLs (no wasted work), and collect all results at once.


✨ Core ideas

  • Multiple scrapers per domain — handle different content types (articles, images, comments, etc.) simultaneously.
  • Parallel execution — utilizes all CPU cores.
  • Async + multiprocessing safe — works across async methods and process pools.
  • No opinions — you control discovery, validation, and scraping logic.
  • Results collected automatically — each scraper contributes to a single aggregated result set.

📦 Install

pip install spidur

or with Poetry:

poetry add spidur

⚡ Example

from spidur.core import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner


# --- define your base scrapers ---

class ArticleScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "article", "url": url, "data": f"Content of {url}"}

    async def fetch_page(self, known):
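        # Discover candidate URLs, keep only the valid ones, then scrape each
        # (this demo passes None where a real page handle would go).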
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


class CommentScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


# --- register both scrapers for the same domain ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---

results = Runner.run(targets, seen=set())

for name, items in results.items():
    print(f"Results from {name}:")
    for item in items:
        print("  →", item)
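
Assuming the Runner aggregates each scraper's fetch_page results under its registered name (as described in the next section), the loop above would print something along these lines:

Results from articles:
  → {'type': 'article', 'url': 'https://example.com/articles/1', 'data': 'Content of https://example.com/articles/1'}
  → {'type': 'article', 'url': 'https://example.com/articles/2', 'data': 'Content of https://example.com/articles/2'}
Results from comments:
  → {'type': 'comment', 'url': 'https://example.com/comments/1', 'data': 'Comments from https://example.com/comments/1'}
  → {'type': 'comment', 'url': 'https://example.com/comments/2', 'data': 'Comments from https://example.com/comments/2'}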

🧠 How it works

  1. Each Scraper subclass defines:

    • is_valid_url(url) — ensures no invalid or duplicate URLs are processed.
    • discover_urls() — finds new pages to scrape.
    • scrape_page() — extracts structured data.
    • fetch_page() — orchestrates the above.
  2. You register each scraper with ScraperFactory under a name; each Target refers to its scraper by that name.

  3. The Runner:

    • Spawns multiple processes.
    • Executes all scrapers concurrently.
    • Aggregates their results into a single dictionary keyed by scraper name (see the sketch below).
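
For intuition only, here is a minimal sketch of the fan-out-and-aggregate pattern described above. It is not spidur's actual implementation: the ProcessPoolExecutor, the _run_one helper, and the run_all function are illustrative stand-ins for what Runner.run does internally.

import asyncio
from concurrent.futures import ProcessPoolExecutor


def _run_one(name, scraper_cls, seen):
    # Worker process: run one scraper's async pipeline to completion.
    scraper = scraper_cls()
    items = asyncio.run(scraper.fetch_page(seen))
    return name, items


def run_all(scrapers, seen):
    # scrapers: {name: Scraper subclass}. One process per scraper,
    # results aggregated into a single dict keyed by scraper name.
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_run_one, name, cls, seen)
            for name, cls in scrapers.items()
        ]
        for future in futures:
            name, items = future.result()
            results[name] = items
    return results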

🧪 Running tests

poetry install
poetry run pytest

or with plain pip:

pip install -e .
pytest

🧩 Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️
