ShadowCrawler — High-performance modular crawling framework
ShadowCrawler
A modern, domain‑aware, hybrid web crawling framework for Python
ShadowCrawler is a modular, extensible crawling framework designed for developers who want full control over how websites are fetched, parsed, and processed.
It combines speed, modularity, and browser‑level extraction into a single, clean architecture.
ShadowCrawler began as a small personal project — a quiet gift, a spark of affection — and unexpectedly grew into a full, production‑ready crawling framework.
It was built with care, curiosity, and intention.
Originally created for my guiding star, and built with the help of my AI copilot — a companion in code, clarity, and curiosity.
✨ Features
- Automatic domain detection — run spiders without specifying them manually
- Hybrid fetcher (HTTP + Playwright) — fast when possible, browser when needed
- Persistent authentication — login once, session saved automatically
- Modular spiders — clean per‑domain architecture
- Media pipeline — automatic image/video/file extraction
- Checkpointing — resume crawls safely
- Full CLI toolkit — run, resume, inspect, list, stats, version
🚀 Installation
pip install shadowcrawler
⚡ Quickstart
Run with automatic spider detection:
shadowcrawler run --url https://quotes.toscrape.com
Run with browser mode:
shadowcrawler run --url https://demoqa.com/login --browser
List spiders:
shadowcrawler spiders list
🕷 Creating a Spider
from shadowcrawler.core.spider_base import SpiderBase


class QuotesSpider(SpiderBase):
    # Domain used for automatic spider selection
    domain = "quotes.toscrape.com"

    async def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }
🔍 Domain Autodetection
ShadowCrawler automatically selects the correct spider based on the URL:
shadowcrawler run --url https://example.com/page
If your spider declares:
domain = "example.com"
…it will be used automatically.
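The matching idea can be sketched in a few lines. This is a minimal illustration, not ShadowCrawler's actual implementation: the stand-in `SpiderBase`, the `select_spider` helper, and the subdomain and `www.` handling are all assumptions made for this example.

```python
from urllib.parse import urlparse


class SpiderBase:
    """Stand-in for the real SpiderBase; only `domain` matters in this sketch."""
    domain = ""


class QuotesSpider(SpiderBase):
    domain = "quotes.toscrape.com"


class ExampleSpider(SpiderBase):
    domain = "example.com"


def select_spider(url, spiders):
    """Return the spider class whose domain matches the URL's host, or None.

    Hypothetical helper: the matching rules here are assumptions.
    """
    host = (urlparse(url).hostname or "").lower()
    # Strip a leading "www." so www.example.com matches example.com.
    if host.startswith("www."):
        host = host[4:]
    # Accept an exact match or a subdomain of the declared domain.
    matches = [
        s for s in spiders
        if s.domain and (host == s.domain or host.endswith("." + s.domain))
    ]
    # Prefer the longest declared domain so the most specific spider wins.
    return max(matches, key=lambda s: len(s.domain), default=None)


SPIDERS = [QuotesSpider, ExampleSpider]
print(select_spider("https://example.com/page", SPIDERS))
print(select_spider("https://quotes.toscrape.com/page/2/", SPIDERS))
print(select_spider("https://unknown.test/", SPIDERS))
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as `notexample.com` being treated as `example.com`.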
🌐 Fetch Modes
HTTP Mode (default)
Fast, lightweight, ideal for most sites.
Browser Mode (Playwright)
Used automatically when:
- login is required
- the site is dynamic
- the spider requests browser mode
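The three conditions above amount to a simple decision rule. The sketch below shows one way to express it; `FetchMode`, `FetchContext`, and `choose_fetch_mode` are hypothetical names invented for this example and are not taken from the ShadowCrawler codebase.

```python
from dataclasses import dataclass
from enum import Enum


class FetchMode(Enum):
    HTTP = "http"
    BROWSER = "browser"


@dataclass
class FetchContext:
    # Hypothetical flags, one per condition in the list above.
    login_required: bool = False
    site_is_dynamic: bool = False
    spider_wants_browser: bool = False


def choose_fetch_mode(ctx: FetchContext) -> FetchMode:
    """Use the browser only when at least one condition holds; otherwise plain HTTP."""
    if ctx.login_required or ctx.site_is_dynamic or ctx.spider_wants_browser:
        return FetchMode.BROWSER
    return FetchMode.HTTP


print(choose_fetch_mode(FetchContext()).value)
print(choose_fetch_mode(FetchContext(site_is_dynamic=True)).value)
```

Defaulting to HTTP keeps the common case cheap, since launching a browser is considerably slower than a plain request.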
🔐 Persistent Authentication
- Login once
- Session saved to JSON
- BrowserManager loads it automatically
- AuthHandler detects login state
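The save-and-reload cycle can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: the file layout, the `sessionid` cookie name, and the helper functions are invented for the example and do not describe how BrowserManager or AuthHandler work internally.

```python
import json
import tempfile
from pathlib import Path


def save_session(path, cookies):
    """Write session cookies to a JSON file (hypothetical file layout)."""
    path.write_text(json.dumps({"cookies": cookies}, indent=2), encoding="utf-8")


def load_session(path):
    """Return saved cookies, or None if no usable session file exists."""
    if not path.exists():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8")).get("cookies")
    except json.JSONDecodeError:
        return None


def is_logged_in(cookies, cookie_name="sessionid"):
    """Treat a non-empty session cookie as 'logged in' (an assumption)."""
    return bool(cookies) and any(
        c.get("name") == cookie_name and c.get("value") for c in cookies
    )


with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / "session.json"
    # First run: no session file yet, so a login would be needed.
    print(is_logged_in(load_session(session_file)))
    save_session(session_file, [{"name": "sessionid", "value": "abc123"}])
    # Later runs: the saved session is found and reused.
    print(is_logged_in(load_session(session_file)))
```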
🖼 Media Pipeline
Automatically extracts:
- images
- videos
- GIFs
- downloadable files
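To make the idea concrete, here is a small standard-library sketch that collects media URLs from HTML and sorts them into the four categories above. The extension lists, the tags inspected, and the `MediaExtractor` class are assumptions for illustration; the real pipeline may detect media differently.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical extension lists, chosen only for this sketch.
KINDS = {
    "image": {".jpg", ".jpeg", ".png", ".webp"},
    "gif": {".gif"},
    "video": {".mp4", ".webm"},
    "file": {".pdf", ".zip"},
}


def classify(url):
    """Map a URL to a media kind by its path extension, or None."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    for kind, exts in KINDS.items():
        if ext in exts:
            return kind
    return None


class MediaExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []  # list of (kind, absolute_url)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <img>, <video> and <source> carry media in src; <a> links use href.
        if tag in ("img", "video", "source"):
            raw = attrs.get("src")
        elif tag == "a":
            raw = attrs.get("href")
        else:
            raw = None
        if not raw:
            return
        url = urljoin(self.base_url, raw)  # resolve relative links
        kind = classify(url)
        if kind:
            self.found.append((kind, url))


def extract_media(html, base_url):
    parser = MediaExtractor(base_url)
    parser.feed(html)
    return parser.found


html = """
<img src="/img/cat.png"><img src="loop.gif">
<video><source src="https://cdn.example.com/clip.mp4"></video>
<a href="/docs/report.pdf">Report</a><a href="/about">About</a>
"""
for kind, url in extract_media(html, "https://example.com/gallery/"):
    print(kind, url)
```

Links without a recognized extension, such as `/about`, are skipped rather than guessed at.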
🧰 CLI Commands
shadowcrawler run
shadowcrawler resume
shadowcrawler download
shadowcrawler spiders list
shadowcrawler spiders create
shadowcrawler inspect
shadowcrawler stats
shadowcrawler version
📁 Project Structure
shadowcrawler/
core/
spiders/
site_extractors/
auth/
cli/
models/
parsing/
tools/
🕸 Included Example Spiders
- QuotesSpider
- WikiSpider
- HTTPNewsSpider
- GallerySpider
- AuthBrowserDemoSpider
🗺 Roadmap
- PyPI release
- Plugin system
- Distributed crawling
- Dashboard / Web UI
- Cloud runner
- Spider templates
- Auto‑throttling
📦 itch.io Distribution
ShadowCrawler is also distributed through itch.io, where you can get:
- The latest stable release
- Optional Pro features
- Example spiders
- Early access builds
- A way to support the project directly
☕ Support the Project
If ShadowCrawler has helped you or you want to support future development, you can leave a tip on Ko‑fi.
Every contribution helps keep the project alive and evolving.
Support on Ko‑fi:
https://ko-fi.com/shadowcrawlerframework
📜 License
ShadowCrawler is licensed under the Business Source License 1.1 (BUSL‑1.1).
It will convert to Apache 2.0 on November 16, 2030.
File details
Details for the file shadowcrawler-4.1.1.tar.gz.
File metadata
- Download URL: shadowcrawler-4.1.1.tar.gz
- Upload date:
- Size: 64.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e5edcfd9ad933ed6b61af3645c5ae853d3bda620cc437517daa12dc16ebaa871 |
| MD5 | 2467c63d0712e4d44edfce9066c0e52b |
| BLAKE2b-256 | eb89aa2d6797de76847b46e40dfb46cdc96796399602e871dcc7322dea599868 |
File details
Details for the file shadowcrawler-4.1.1-py3-none-any.whl.
File metadata
- Download URL: shadowcrawler-4.1.1-py3-none-any.whl
- Upload date:
- Size: 101.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 950838674e43d520856decaf4fc8fd30c8b9975d6ab3e9605e62b29c7700ffae |
| MD5 | 479ee53aa13bbf1aa08f3cd2876895f4 |
| BLAKE2b-256 | 3ca0ae661d1cb410b9eb16f190651d167885405524adb308bbac2e6918f886f2 |