SiteSavvy
Capture the web, your way.
A modern, async, cross-platform web scraper that mirrors entire sites or extracts their readable text — and exports the result as HTML, Markdown, plain text, PDF, EPUB or a single ZIP archive.
Built with aiohttp, BeautifulSoup + lxml,
Typer + Rich, with optional Playwright headless rendering for
JavaScript-heavy pages.
Features
- Two crawl modes
  - full: recursively download every reachable resource (HTML, CSS, JS, images, PDFs, fonts, …) preserving the original directory hierarchy.
  - text: extract the readable text from each HTML page (strips scripts, navigation, ads) and store it in your chosen format.
- Six output formats (repeatable --format): html, md, txt, pdf, epub, zip.
- Polite by default: respects robots.txt, enforces a per-host delay, and auto-throttles on 429/5xx responses.
- Resume & incremental: a JSON manifest records every fetched URL, its local path and ETag/Last-Modified; --resume skips completed work and --incremental re-downloads only what changed.
- Concurrency control with a global semaphore and per-host locks.
- Dry-run mode that lists the URLs that would be fetched.
- Headless rendering via Playwright (falls back to aiohttp automatically).
- Fine-grained --download-types filtering: html, css, js, img, pdf, other.
- External-link gating: stays on the start host unless you pass --external.
- Rich CLI with progress tables and coloured output.
- Cross-platform: runs on Linux, macOS and Windows; ships a CI matrix for all three.
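To illustrate how a global semaphore and per-host locks can work together, here is a minimal asyncio sketch. It is not SiteSavvy's actual code: the PoliteLimiter class and its run method are hypothetical names, and the defaults only mirror the documented --concurrency 4 and --delay 0.5.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteLimiter:
    """Hypothetical helper: global concurrency cap plus per-host spacing."""

    def __init__(self, concurrency: int = 4, delay: float = 0.5) -> None:
        self._global = asyncio.Semaphore(concurrency)
        # One lock per host, created lazily the first time a host is seen.
        self._host_locks = defaultdict(asyncio.Lock)
        self._last_hit = {}
        self._delay = delay

    async def run(self, url, fetch):
        """Await fetch(url) while honouring both limits."""
        host = urlsplit(url).netloc
        async with self._global:
            # The per-host lock serialises the "wait, then stamp" step so two
            # tasks cannot both decide the host is free at the same moment.
            async with self._host_locks[host]:
                earliest = self._last_hit.get(host, float("-inf")) + self._delay
                wait = earliest - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
                self._last_hit[host] = time.monotonic()
            return await fetch(url)
```

In this sketch a task keeps its global slot while it waits out the host delay, which is simple but can idle a slot; a real crawler might take the host lock first to avoid that.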
Installation
From PyPI (once published)
pip install sitesavvy
From source (development)
git clone https://github.com/your-org/sitesavvy.git
cd sitesavvy
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium # optional, only for --headless
A plain pip install -r requirements.txt is also supported if you prefer to
skip the PEP 517 build.
Quick start
Full-site mirror → ZIP
sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
Text-only crawl → Markdown + EPUB
sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
Dry-run (list URLs only)
sitesavvy crawl https://example.com --dry-run --depth 1
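To make the effect of --depth and --external on a dry run concrete, the following sketch plans a breadth-first crawl over an in-memory link graph. It is an illustration under assumptions, not SiteSavvy's code: plan_crawl and the links mapping are hypothetical, and it assumes the start URL sits at level 0 so that a depth of 1 covers the start page plus the pages it links to directly.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start, links, depth=0, external=False):
    """Return URLs in breadth-first order; depth=0 means unlimited."""
    start_host = urlsplit(start).netloc
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if depth and level >= depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            # Stay on the start host unless external links are allowed.
            if not external and urlsplit(nxt).netloc != start_host:
                continue
            seen.add(nxt)
            queue.append((nxt, level + 1))
    return order


# Hypothetical link graph standing in for pages discovered during a crawl.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.test/x"],
    "https://example.com/a": ["https://example.com/b"],
}
print(plan_crawl("https://example.com/", LINKS, depth=1))
```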
Resume an interrupted crawl
sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
Only re-download changed resources
sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
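Incremental mode is described as using conditional GET with the ETag and Last-Modified values stored in the manifest. The sketch below shows the general HTTP technique only; conditional_headers and needs_download are hypothetical helpers, and how SiteSavvy stores these values internally is not documented here.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build validators for a conditional GET from previously stored values."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_download(status: int) -> bool:
    """A 304 Not Modified reply means the stored copy is still current."""
    return status != 304


# Example values such as a server might have returned on an earlier crawl.
print(conditional_headers('"abc123"', "Wed, 21 Oct 2015 07:28:00 GMT"))
```

A server that honours these headers answers 304 with no body when nothing changed, so only changed resources cost a full download.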
Render JavaScript pages
sitesavvy crawl https://spa.example.com --headless --format html
Command reference
| Flag | Default | Description |
|---|---|---|
| url (positional) | - | Starting URL. |
| --depth INT | 0 | Max link depth (0 = unlimited). |
| --mode {full,text} | full | Full-site download or text-only extraction. |
| --format … | html | Output format, repeatable: html md txt pdf epub zip. |
| --out-dir PATH | CWD | Destination folder. |
| --concurrency N | 4 | Simultaneous HTTP requests. |
| --user-agent STR | browser-like | Custom User-Agent header. |
| --respect-robots / --no-respect-robots | on | Obey robots.txt. |
| --delay SECS | 0.5 | Polite delay between same-host requests. |
| --resume | off | Skip URLs already completed in the manifest. |
| --manifest FILE | <out-dir>/manifest.json | Manifest path. |
| --dry-run | off | List URLs that would be fetched. |
| --headless | off | Render JS pages with Playwright. |
| --rate-limit {auto,fixed} | auto | Back off on 429/5xx, or use fixed delay. |
| --download-types … | all | Comma-separated: html,css,js,img,pdf,other. |
| --incremental | off | Re-download only changed resources (conditional GET). |
| --external | off | Follow cross-domain links. |
| --force | off | Proceed even if robots.txt disallows the start URL. |
| --timeout SECS | 30 | Per-request timeout. |
| --verbose / -v | off | Enable debug logging. |
Auxiliary commands:
sitesavvy legal # print the legal / ethical disclaimer
sitesavvy info # show which optional backends are installed
sitesavvy --version
Export-format matrix
| Format | Mode full | Mode text | Backend |
|---|---|---|---|
| html | original bytes, hierarchy preserved | - | built-in |
| md | - | markdownify (ATX headings, links absolute) | markdownify |
| txt | - | html2text (no hard wrap) | html2text |
| pdf | - | WeasyPrint | weasyprint |
| epub | - | ebooklib, one chapter per page | ebooklib |
| zip | archive of the whole crawl | archive of the whole crawl | zipfile |
Sample Markdown output:
# Page Title
## A heading
Some paragraph text with a [link](https://example.com/page).
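The zip row in the matrix above names the standard-library zipfile module as its backend. As a rough sketch of what "archive of the whole crawl" can look like with that module, the hypothetical zip_crawl function below packs an output folder while keeping relative paths; it is an assumption-laden illustration, not the project's conversions.py.

```python
import zipfile
from pathlib import Path


def zip_crawl(out_dir: Path, archive: Path) -> int:
    """Pack every file under out_dir into archive; return the file count."""
    count = 0
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in out_dir.
            if path.is_file() and path.resolve() != archive.resolve():
                # Store POSIX-style relative names so the hierarchy survives.
                zf.write(path, arcname=path.relative_to(out_dir).as_posix())
                count += 1
    return count
```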
Architecture
sitesavvy/
├── __init__.py # package metadata
├── __main__.py # python -m sitesavvy
├── __about__.py # version
├── config.py # CrawlConfig + enums
├── models.py # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py # normalisation, link extraction, path mapping
├── robots.py # async robots.txt (reppy or stdlib fallback)
├── conversions.py # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py # resume / incremental state
├── headless.py # Playwright fetcher
├── crawler.py # the Crawler engine
├── legal.py # disclaimer text
├── cli.py # Typer + Rich CLI
└── main.py # console-script entry point
Networking layer: aiohttp (primary) with an optional Playwright headless
browser for JS-rendered pages. HTML parsing uses beautifulsoup4 + lxml.
robots.txt is parsed with reppy when available, otherwise with the stdlib
urllib.robotparser.
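For readers unfamiliar with the stdlib fallback, this is roughly how urllib.robotparser answers "may I fetch this URL?" once the robots.txt text is available. The sample rules, the build_parser helper and the "SiteSavvy" user-agent string are assumptions for illustration; fetching the file asynchronously is left out.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that was already downloaded elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("SiteSavvy", "https://example.com/docs/")
blocked = parser.can_fetch("SiteSavvy", "https://example.com/private/x")
delay = parser.crawl_delay("SiteSavvy")
print(allowed, blocked, delay)
```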
Troubleshooting
- HTTP 429 Too Many Requests: lower --concurrency, raise --delay, and keep --rate-limit auto (default) so SiteSavvy backs off automatically.
- Large sites: set --depth to bound the crawl, run with --dry-run first to estimate scope, and use --resume so an interruption doesn't waste work.
- PDF export fails: WeasyPrint needs Pango/Cairo system libraries. On Debian/Ubuntu: apt install libpango-1.0-0 libpangoft2-1.0-0. On macOS: brew install pango. The other formats keep working even if PDF is missing.
- Headless mode crashes: run playwright install chromium once after installing the package. Without it, SiteSavvy transparently falls back to aiohttp.
- robots.txt disallows …: by default SiteSavvy honours robots.txt. Add --force only if you have permission and accept responsibility.
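The exact back-off policy behind --rate-limit auto is not spelled out here. One common approach, shown purely as a sketch, doubles the per-host delay on 429/5xx responses, honours a numeric Retry-After header when present, and relaxes toward the base delay after successes. The next_delay function, its cap of 60 seconds and the halving step are all assumptions.

```python
from typing import Optional


def next_delay(current: float, status: int, retry_after: Optional[str] = None,
               base: float = 0.5, cap: float = 60.0) -> float:
    """Return the delay to use before the next request to the same host."""
    throttled = status == 429 or 500 <= status < 600
    if throttled:
        delay = max(current, base) * 2  # exponential back-off
        if retry_after and retry_after.strip().isdigit():
            # Never wait less than the server explicitly asked for.
            delay = max(delay, float(retry_after))
        return min(cap, delay)
    # On success, ease back toward the configured base delay.
    return max(base, current / 2)


print(next_delay(0.5, 429))        # doubled after a 429
print(next_delay(0.5, 429, "10"))  # server-requested wait wins
```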
Legal & ethics
SiteSavvy is provided for personal, non-commercial use only. Respect the
copyright, terms of service, and robots.txt of every site you crawl. The
authors assume no liability for misuse. Run sitesavvy legal to read the full
disclaimer. Licensed under the MIT License.
Contributing
Pull requests are welcome! Please run the full check suite before submitting:
ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing
Coverage must stay at or above 90 %. See the Developer Guide for the project layout, release process and binary-building instructions.