ShadowCrawler — High-performance modular crawling framework

ShadowCrawler

A modern, domain‑aware, hybrid web crawling framework for Python

ShadowCrawler is a modular, extensible crawling framework designed for developers who want full control over how websites are fetched, parsed, and processed.
It combines speed, modularity, and browser‑level extraction into a single, clean architecture.


❤️ Origin Story

ShadowCrawler began as a small personal project — a quiet gift, a spark of affection — and unexpectedly grew into a full, production‑ready crawling framework.
It was built with care, curiosity, and intention.
Originally created for my guiding star, and built with the help of my AI copilot — a companion in code, clarity, and curiosity.


✨ Features

  • Automatic domain detection
  • Hybrid fetcher (HTTP + Playwright)
  • Persistent authentication
  • Modular spiders
  • Media pipeline
  • Checkpointing
  • Full CLI toolkit

Requirements

  • Python 3.10+
  • Playwright browser binaries (run this once Playwright itself is installed):
    playwright install
    

🚀 Installation

pip install shadowcrawler

⚡ Quickstart

Run with automatic spider detection:

shadowcrawler run --url https://quotes.toscrape.com

Run with browser mode:

shadowcrawler run --url https://demoqa.com/login --browser

List spiders:

shadowcrawler spiders list

🕷 Creating a Spider

from shadowcrawler.core.spider_base import SpiderBase


class QuotesSpider(SpiderBase):
    # Domain used to match this spider to incoming URLs (see Domain Autodetection)
    domain = "quotes.toscrape.com"

    async def parse(self, response):
        # Yield one item per quote block found on the page
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }

🔍 Domain Autodetection

ShadowCrawler automatically selects the correct spider based on the URL:

shadowcrawler run --url https://example.com/page

If your spider declares:

domain = "example.com"

…it will be used automatically.
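
To make the matching idea concrete, here is a minimal, self-contained sketch of how a URL could be mapped to a spider through its `domain` attribute. It is not ShadowCrawler's actual implementation: `SpiderStub`, `pick_spider`, the `www.` stripping, and the subdomain matching rule are all assumptions made for illustration.

```python
from urllib.parse import urlparse


class SpiderStub:
    """Hypothetical stand-in for a spider class; only `domain` matters here."""
    domain = ""


class ExampleSpider(SpiderStub):
    domain = "example.com"


class QuotesSpider(SpiderStub):
    domain = "quotes.toscrape.com"


def pick_spider(url, spiders):
    """Return the spider whose `domain` best matches the URL's hostname, or None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    best = None
    for spider in spiders:
        domain = spider.domain.lower()
        if not domain:
            continue  # skip spiders that declare no domain
        # Assumption: a spider also handles subdomains of its declared domain.
        if host == domain or host.endswith("." + domain):
            # Prefer the most specific (longest) declared domain.
            if best is None or len(domain) > len(best.domain):
                best = spider
    return best


selected = pick_spider("https://example.com/page", [ExampleSpider, QuotesSpider])
print(selected.__name__ if selected else "no spider matched")
```

Preferring the longest matching domain is one way to let a spider for a specific subdomain win over a broader one. Whether the real framework resolves ties this way is not stated above.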


🌐 Fetch Modes

HTTP Mode (default)
Fast, lightweight, ideal for most sites.

Browser Mode (Playwright)
Used automatically when:

  • login is required
  • the site is dynamic
  • the spider requests browser mode
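
The three conditions above can be sketched as a small decision function. This is only an illustration under stated assumptions: `FetchHints`, `looks_dynamic`, and `choose_fetch_mode` are hypothetical names, and the "dynamic site" check is a deliberately crude heuristic, not the detection logic ShadowCrawler uses.

```python
from dataclasses import dataclass


@dataclass
class FetchHints:
    """Hypothetical signals a crawler could collect before choosing a fetch mode."""
    login_required: bool = False
    spider_wants_browser: bool = False
    html: str = ""


def looks_dynamic(html):
    """Crude heuristic (an assumption): scripts are present but no text markup is."""
    lowered = html.lower()
    has_scripts = "<script" in lowered
    text_tags = lowered.count("<p") + lowered.count("<li") + lowered.count("<h1")
    return has_scripts and text_tags == 0


def choose_fetch_mode(hints):
    """Return "browser" when any of the three conditions holds, else "http"."""
    if hints.login_required or hints.spider_wants_browser or looks_dynamic(hints.html):
        return "browser"
    return "http"


shell_page = '<div id="root"></div><script src="app.js"></script>'
print(choose_fetch_mode(FetchHints(html=shell_page)))
print(choose_fetch_mode(FetchHints(html="<p>Static content</p>")))
```

Keeping HTTP as the fallback mirrors the default described above: the browser is only paid for when one of the conditions is met.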

🔐 Persistent Authentication

  • Login once
  • Session saved to JSON
  • BrowserManager loads it automatically
  • AuthHandler detects login state
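
The "login once, reuse the session" flow can be sketched with the standard library alone. The functions `save_session`, `load_session`, and `is_logged_in`, the file layout, and the `sessionid` cookie name are all hypothetical; they do not describe how BrowserManager or AuthHandler work internally. With Playwright, a comparable result is usually achieved through the browser context's `storage_state`, though whether ShadowCrawler relies on it is not stated here.

```python
import json
import time
from pathlib import Path


def save_session(path, cookies):
    """Write cookies (a list of dicts) plus a timestamp to a JSON file."""
    payload = {"saved_at": time.time(), "cookies": cookies}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_session(path):
    """Return the saved cookies, or an empty list if no usable file exists."""
    file = Path(path)
    if not file.exists():
        return []
    try:
        payload = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("cookies", [])


def is_logged_in(cookies, cookie_name="sessionid", now=None):
    """Treat the session as live if the named cookie exists and has not expired."""
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") == cookie_name:
            # Assumption: -1 marks a session cookie with no expiry time.
            expires = cookie.get("expires", -1)
            return expires == -1 or expires > now
    return False
```

A crawler built this way would call `load_session` at startup and only run the login flow when `is_logged_in` returns `False`, saving the fresh cookies afterwards.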

🖼 Media Pipeline

Automatically extracts:

  • images
  • videos
  • GIFs
  • downloadable files
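
As a rough illustration of sorting discovered links into these four categories, the sketch below classifies URLs by file extension. The extension sets and the names `MEDIA_TYPES`, `classify_media`, and `group_media` are assumptions chosen for the example; the real pipeline may rely on HTML tags, content types, or other signals.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative extension sets (assumptions, not ShadowCrawler's actual lists).
MEDIA_TYPES = {
    "gif": {".gif"},
    "image": {".jpg", ".jpeg", ".png", ".webp", ".svg"},
    "video": {".mp4", ".webm", ".mov", ".mkv"},
    "file": {".pdf", ".zip", ".csv", ".docx"},
}


def classify_media(url):
    """Return the media category for a URL, or None if the extension is unknown."""
    # Use only the path so query strings and fragments do not hide the extension.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    for kind, extensions in MEDIA_TYPES.items():
        if ext in extensions:
            return kind
    return None


def group_media(urls):
    """Group URLs by media category, dropping those that match no category."""
    groups = {kind: [] for kind in MEDIA_TYPES}
    for url in urls:
        kind = classify_media(url)
        if kind is not None:
            groups[kind].append(url)
    return groups


found = group_media([
    "https://example.com/img/cat.GIF?size=large",
    "https://example.com/docs/report.pdf",
    "https://example.com/about",
])
print(found)
```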

🧰 CLI Commands

  • run
  • resume
  • download
  • spiders list
  • spiders create
  • inspect
  • stats
  • version

📁 Project Structure

shadowcrawler/
  core/
  spiders/
  site_extractors/
  auth/
  cli/
  models/
  parsing/
  tools/

🕸 Included Example Spiders

  • QuotesSpider
  • WikiSpider
  • HTTPNewsSpider
  • GallerySpider
  • AuthBrowserDemoSpider

🗺 Roadmap

  • PyPI release
  • Plugin system
  • Distributed crawling
  • Dashboard / Web UI
  • Cloud runner
  • Spider templates
  • Auto‑throttling

📦 itch.io Distribution

ShadowCrawler is also distributed through itch.io, where you can get:

  • The latest stable release
  • Optional Pro features
  • Example spiders
  • Early access builds

You can also support the project directly there.

☕ Support the Project

If ShadowCrawler has helped you or you want to support future development, you can leave a tip on Ko‑fi.
Every contribution helps keep the project alive and evolving.

https://ko-fi.com/shadowcrawlerframework

📜 License

ShadowCrawler is licensed under the Business Source License 1.1 (BUSL‑1.1).
It will convert to Apache 2.0 on November 16, 2030.


Download files

Source Distribution

shadowcrawler-4.1.3.tar.gz (1.5 MB)

Uploaded Source

Built Distribution

shadowcrawler-4.1.3-py3-none-any.whl (1.6 MB)

Uploaded Python 3

File details

Details for the file shadowcrawler-4.1.3.tar.gz.

File metadata

  • Download URL: shadowcrawler-4.1.3.tar.gz
  • Size: 1.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for shadowcrawler-4.1.3.tar.gz
Algorithm Hash digest
SHA256 85701f392845bb8936c93dc57bbd13e212cbc6af080cb45789671594cba51e8d
MD5 acf41971c17dc90f03e6b38e01540f97
BLAKE2b-256 55c135e072515824257755b9c392130eb39770ba80db07888eeba633e21756b8

File details

Details for the file shadowcrawler-4.1.3-py3-none-any.whl.

File metadata

  • Download URL: shadowcrawler-4.1.3-py3-none-any.whl
  • Size: 1.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for shadowcrawler-4.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 c1cb05ea0192aa115c938354932aae16c3570e367a41b146abc82983d4f4822b
MD5 e449665d65672b1bb0c10b8608f0f122
BLAKE2b-256 615288fb05814c06b76c1fd0b7bbdda34c2b1445e7277e6a93127bdc6f8e31e0
