🕷️ A lightweight, generic parallel runner for custom scrapers

spidur 🕷️

🕷️ spidur is a tiny, hackable framework for running custom scrapers in parallel.

  • No business logic
  • Just a base class + registry + runner
  • Multiprocessing + async friendly
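To make the "base class + registry + runner" split concrete, here is a minimal sketch of the general pattern. It is not spidur's actual implementation: `SketchScraper`, `SketchRegistry`, `DemoScraper`, and `run_sequentially` are hypothetical names invented for this illustration, and the runner here is deliberately sequential.

```python
import asyncio
from typing import Dict, Type


class SketchScraper:
    """Hypothetical base class; subclasses override fetch()."""

    async def fetch(self, known, overwrite=False):
        raise NotImplementedError


class SketchRegistry:
    """Hypothetical registry mapping a target name to a scraper class."""

    _scrapers: Dict[str, Type[SketchScraper]] = {}

    @classmethod
    def register(cls, name: str, scraper_cls: Type[SketchScraper]) -> None:
        cls._scrapers[name] = scraper_cls

    @classmethod
    def create(cls, name: str) -> SketchScraper:
        if name not in cls._scrapers:
            raise KeyError(f"no scraper registered for {name!r}")
        return cls._scrapers[name]()


class DemoScraper(SketchScraper):
    """Hypothetical scraper that returns one fixed record."""

    async def fetch(self, known, overwrite=False):
        return [{"url": "http://example.com/1"}]


def run_sequentially(names):
    """Hypothetical runner: look up each name and run its scraper in turn."""
    return {
        name: asyncio.run(SketchRegistry.create(name).fetch(set()))
        for name in names
    }


SketchRegistry.register("demo", DemoScraper)
```

With this sketch, `run_sequentially(["demo"])` looks up `DemoScraper` by name and returns its records keyed by target name. Keeping the lookup in a registry means the runner never needs to import individual scraper modules.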

✨ Features

  • Zero assumptions — bring your own scraper code.
  • Base class for scrapers — implement 2 methods and you’re done.
  • Parallel execution — run across all CPU cores.
  • OSS-style — small, clean, and easy to hack.
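One common way to combine multiprocessing with async scrapers is to give each worker process its own event loop. The sketch below shows that general technique using only the standard library. It is an assumption about how such a runner could be built, not a description of spidur's internals, and `scrape_async`, `scrape_in_worker`, and `run_parallel` are hypothetical names.

```python
import asyncio
from multiprocessing import Pool, cpu_count


async def scrape_async(name):
    """Hypothetical async scraper body; a real one would do network I/O."""
    await asyncio.sleep(0)  # yield to the event loop, standing in for I/O
    return {"target": name, "pages": 2}


def scrape_in_worker(name):
    """Runs inside a worker process: start a fresh event loop for the coroutine."""
    return asyncio.run(scrape_async(name))


def run_parallel(names, processes=None):
    """Fan the targets out over a process pool, one task per target."""
    with Pool(processes=processes or cpu_count()) as pool:
        # pool.map returns results in the same order as the input names.
        return pool.map(scrape_in_worker, names)
```

On platforms that start workers with the spawn method (macOS and Windows), call `run_parallel` from under an `if __name__ == "__main__":` guard so that workers do not re-run it when they import the module.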

📦 Install

pip install spidur

Or install with Poetry:

poetry add spidur

Quickstart

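from spidur.base import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner

class MyScraper(Scraper):
    async def discover_urls(self, page, known, overwrite=False):
        # Return the URLs to scrape; this demo returns a fixed list.
        return ["http://example.com/1", "http://example.com/2"]

    async def scrape_page(self, page, url):
        # Return the scraped record for a single URL.
        return {"url": url, "data": "demo"}

    async def fetch(self, known, overwrite=False):
        # Discover URLs, then scrape them one at a time.
        urls = await self.discover_urls(None, known, overwrite)
        return [await self.scrape_page(None, u) for u in urls]

# Register the scraper under the name that targets refer to.
ScraperFactory.register("example", MyScraper)

# Run. The guard keeps worker processes from re-running this block
# when multiprocessing re-imports the module (spawn start method).
if __name__ == "__main__":
    target = Target(name="example", start_url="http://example.com")
    results = Runner.run([target], seen=set(), overwrite=True)
    print(results)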
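The quickstart scraper accepts `known` and `overwrite` but does not use them. The sketch below shows one plausible way to honor them, assuming `known` is a set of already-scraped URLs and `overwrite=True` means "scrape them again anyway". Those semantics are an assumption, not documented spidur behavior. `filter_new` and `DedupingScraper` are hypothetical names, and the class is kept standalone so the example runs without spidur installed.

```python
import asyncio


def filter_new(urls, known, overwrite=False):
    """Return the URLs to scrape, skipping known ones unless overwrite is set."""
    if overwrite:
        return list(urls)
    return [u for u in urls if u not in known]


class DedupingScraper:
    """Hypothetical scraper; in real use it would subclass spidur's Scraper."""

    async def discover_urls(self, page, known, overwrite=False):
        candidates = ["http://example.com/1", "http://example.com/2"]
        return filter_new(candidates, known, overwrite)

    async def scrape_page(self, page, url):
        return {"url": url, "data": "demo"}

    async def fetch(self, known, overwrite=False):
        # Pass overwrite through so discovery can decide what to skip.
        urls = await self.discover_urls(None, known, overwrite)
        return [await self.scrape_page(None, u) for u in urls]
```

Under these assumptions, calling `fetch` with `known={"http://example.com/1"}` scrapes only the second URL, while `overwrite=True` scrapes both.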

Tests

poetry install
poetry run pytest
