ShadowCrawler — High-performance modular crawling framework

ShadowCrawler
A modern, domain‑aware, hybrid web crawling framework for Python

ShadowCrawler is a modular, extensible crawling framework designed for developers who want full control over how websites are fetched, parsed, and processed.
It combines speed, modularity, and browser‑level extraction into a single, clean architecture.

ShadowCrawler began as a small personal project — a quiet gift, a spark of affection — and unexpectedly grew into a full, production‑ready crawling framework.
It was built with care, curiosity, and intention.
Originally created for my guiding star, and built with the help of my AI copilot — a companion in code, clarity, and curiosity.


✨ Features

  • Automatic domain detection — run spiders without specifying them manually
  • Hybrid fetcher (HTTP + Playwright) — fast when possible, browser when needed
  • Persistent authentication — login once, session saved automatically
  • Modular spiders — clean per‑domain architecture
  • Media pipeline — automatic image/video/file extraction
  • Checkpointing — resume crawls safely
  • Full CLI toolkit — run, resume, inspect, list, stats, version

🚀 Installation

pip install shadowcrawler


⚡ Quickstart

Run with automatic spider detection:

shadowcrawler run --url https://quotes.toscrape.com

Run with browser mode:

shadowcrawler run --url https://demoqa.com/login --browser

List spiders:

shadowcrawler spiders list


🕷 Creating a Spider

from shadowcrawler.core.spider_base import SpiderBase


class QuotesSpider(SpiderBase):
    # Domain used for automatic spider detection
    domain = "quotes.toscrape.com"

    async def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }

🔍 Domain Autodetection

ShadowCrawler automatically selects the correct spider based on the URL:

shadowcrawler run --url https://example.com/page

If your spider declares:

domain = "example.com"

…it will be used automatically.
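
As a rough illustration of the idea, the sketch below matches a URL's host against each spider's `domain` attribute. It is not ShadowCrawler's actual lookup code: `SpiderStub`, `pick_spider`, and the subdomain-matching rule are assumptions made for this example.

```python
from urllib.parse import urlparse


class SpiderStub:
    """Hypothetical stand-in for a spider class with a `domain` attribute."""
    domain = ""


class ExampleSpider(SpiderStub):
    domain = "example.com"


class QuotesSpider(SpiderStub):
    domain = "quotes.toscrape.com"


def pick_spider(url, spiders):
    """Return the spider whose domain matches the URL's host, or None."""
    host = (urlparse(url).hostname or "").lower()
    # Try the most specific (longest) domain first.
    for spider in sorted(spiders, key=lambda s: len(s.domain), reverse=True):
        domain = spider.domain.lower()
        # Assumption: a spider also handles subdomains of its domain.
        if host == domain or host.endswith("." + domain):
            return spider
    return None


spiders = [ExampleSpider, QuotesSpider]
print(pick_spider("https://example.com/page", spiders).__name__)  # ExampleSpider
```

Whether the real framework treats `www.example.com` as a match for `example.com` is not stated here, so check the behaviour before relying on it.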


🌐 Fetch Modes

HTTP Mode (default)
Fast, lightweight, ideal for most sites.

Browser Mode (Playwright)
Used automatically when:

  • login is required
  • the site is dynamic
  • the spider requests browser mode
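
The three conditions above amount to a simple fallback rule: use HTTP unless something calls for a browser. The sketch below shows that rule in isolation. `PageHints` and `choose_fetch_mode` are hypothetical names for this example, and how ShadowCrawler actually detects a login wall or a dynamic page is not covered here.

```python
from dataclasses import dataclass


@dataclass
class PageHints:
    """Hypothetical signals a fetcher might collect about a target page."""
    login_required: bool = False
    dynamic: bool = False
    spider_wants_browser: bool = False


def choose_fetch_mode(hints: PageHints) -> str:
    """Return "browser" if any listed condition holds, otherwise "http"."""
    if hints.login_required or hints.dynamic or hints.spider_wants_browser:
        return "browser"
    return "http"


print(choose_fetch_mode(PageHints()))              # http
print(choose_fetch_mode(PageHints(dynamic=True)))  # browser
```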

🔐 Persistent Authentication

  • Login once
  • Session saved to JSON
  • BrowserManager loads it automatically
  • AuthHandler detects login state
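
To make the flow concrete, here is a minimal standard-library sketch of saving a session to JSON, loading it back, and checking login state. It does not use ShadowCrawler's `BrowserManager` or `AuthHandler`; the function names, the file layout, and the `sessionid` cookie name are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_session(path, cookies):
    """Write session cookies to a JSON file."""
    Path(path).write_text(json.dumps({"cookies": cookies}, indent=2), encoding="utf-8")


def load_session(path):
    """Return the saved cookies, or None if no session file exists yet."""
    file = Path(path)
    if not file.exists():
        return None
    return json.loads(file.read_text(encoding="utf-8"))["cookies"]


def is_logged_in(cookies, cookie_name="sessionid"):
    """Treat the presence of a named cookie as a sign of a logged-in session."""
    return any(c.get("name") == cookie_name for c in cookies or [])


with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / "session.json"
    print(is_logged_in(load_session(session_file)))  # False (nothing saved yet)
    save_session(session_file, [{"name": "sessionid", "value": "abc123"}])
    print(is_logged_in(load_session(session_file)))  # True
```

A real session file would usually hold more than cookies (for example local storage), and a cookie's presence does not prove it is still valid.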

🖼 Media Pipeline

Automatically extracts:

  • images
  • videos
  • GIFs
  • downloadable files
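
For a sense of what this kind of extraction involves, the sketch below collects media URLs from a page using only the standard library. It is a simplified illustration, not ShadowCrawler's pipeline: `MediaExtractor`, the `FILE_EXTENSIONS` list, and the rule that classifies GIFs by file extension are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: links ending in these extensions count as downloadable files.
FILE_EXTENSIONS = (".pdf", ".zip", ".csv")


class MediaExtractor(HTMLParser):
    """Collect absolute media URLs from img, video/source, and a tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = {"images": [], "videos": [], "gifs": [], "files": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            url = urljoin(self.base_url, attrs["src"])
            kind = "gifs" if url.lower().endswith(".gif") else "images"
            self.media[kind].append(url)
        elif tag in ("video", "source") and attrs.get("src"):
            # Simplification: every <source src=...> is treated as video.
            self.media["videos"].append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if url.lower().endswith(FILE_EXTENSIONS):
                self.media["files"].append(url)


html = (
    '<img src="/a.png"><img src="b.gif">'
    '<video src="clip.mp4"></video>'
    '<a href="/docs/report.pdf">Report</a><a href="/about">About</a>'
)
extractor = MediaExtractor("https://example.com/gallery/")
extractor.feed(html)
print(extractor.media["files"])  # ['https://example.com/docs/report.pdf']
```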

🧰 CLI Commands

shadowcrawler run
shadowcrawler resume
shadowcrawler download
shadowcrawler spiders list
shadowcrawler spiders create
shadowcrawler inspect
shadowcrawler stats
shadowcrawler version


📁 Project Structure

shadowcrawler/
core/
spiders/
site_extractors/
auth/
cli/
models/
parsing/
tools/


🕸 Included Example Spiders

  • QuotesSpider
  • WikiSpider
  • HTTPNewsSpider
  • GallerySpider
  • AuthBrowserDemoSpider

🗺 Roadmap

  • PyPI release
  • Plugin system
  • Distributed crawling
  • Dashboard / Web UI
  • Cloud runner
  • Spider templates
  • Auto‑throttling

📦 itch.io Distribution

ShadowCrawler is also distributed through itch.io, where you can get:

  • The latest stable release
  • Optional Pro features
  • Example spiders
  • Early access builds
  • A way to support the project directly

☕ Support the Project

If ShadowCrawler has helped you or you want to support future development, you can leave a tip on Ko‑fi.
Every contribution helps keep the project alive and evolving.

Support on Ko‑fi:
https://ko-fi.com/shadowcrawlerframework


📜 License

ShadowCrawler is licensed under the Business Source License 1.1 (BUSL‑1.1).
It will convert to Apache 2.0 on:

November 16, 2030


Download files

Source Distribution

shadowcrawler-4.1.1.tar.gz (64.5 kB)

Uploaded Source

Built Distribution

shadowcrawler-4.1.1-py3-none-any.whl (101.6 kB)

Uploaded Python 3

File details

Details for the file shadowcrawler-4.1.1.tar.gz.

File metadata

  • Download URL: shadowcrawler-4.1.1.tar.gz
  • Upload date:
  • Size: 64.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for shadowcrawler-4.1.1.tar.gz
Algorithm     Hash digest
SHA256        e5edcfd9ad933ed6b61af3645c5ae853d3bda620cc437517daa12dc16ebaa871
MD5           2467c63d0712e4d44edfce9066c0e52b
BLAKE2b-256   eb89aa2d6797de76847b46e40dfb46cdc96796399602e871dcc7322dea599868

File details

Details for the file shadowcrawler-4.1.1-py3-none-any.whl.

File metadata

  • Download URL: shadowcrawler-4.1.1-py3-none-any.whl
  • Upload date:
  • Size: 101.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for shadowcrawler-4.1.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        950838674e43d520856decaf4fc8fd30c8b9975d6ab3e9605e62b29c7700ffae
MD5           479ee53aa13bbf1aa08f3cd2876895f4
BLAKE2b-256   3ca0ae661d1cb410b9eb16f190651d167885405524adb308bbac2e6918f886f2
