SiteSavvy

Capture the web, your way.

A modern, async, cross-platform web scraper that mirrors entire sites or extracts their readable text — and exports the result as HTML, Markdown, plain text, PDF, EPUB or a single ZIP archive.

Built on aiohttp, BeautifulSoup + lxml, and Typer + Rich, with optional Playwright headless rendering for JavaScript-heavy pages.

Features

- Two crawl modes:
  - full: recursively download every reachable resource (HTML, CSS, JS, images, PDFs, fonts, …), preserving the original directory hierarchy.
  - text: extract the readable text from each HTML page (stripping scripts, navigation and ads) and store it in your chosen format.
- Six output formats (repeatable --format): html, md, txt, pdf, epub, zip.
- Polite by default: respects robots.txt, enforces a per-host delay, and auto-throttles on 429 / 5xx responses.
- Resume & incremental: a JSON manifest records every fetched URL, its local path and its ETag / Last-Modified values; --resume skips completed work and --incremental re-downloads only what changed.
- Concurrency control with a global semaphore and per-host locks.
- Dry-run mode that lists the URLs that would be fetched.
- Headless rendering via Playwright (falls back to aiohttp automatically when the Playwright browser is not installed).
- Fine-grained --download-types filtering: html,css,js,img,pdf,other.
- External-link gating: stays on the start host unless you pass --external.
- Rich CLI with progress tables and coloured output.
- Cross-platform: runs on Linux, macOS and Windows; ships a CI matrix for all three.
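
The concurrency and politeness items above can be pictured with a small asyncio sketch. This is not SiteSavvy's crawler code: the PoliteScheduler class, its run method and the fetch callable are hypothetical names, and only the ideas (one global semaphore, one lock per host, a minimum delay between same-host requests) come from the feature list.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical helper: global concurrency cap plus per-host spacing."""

    def __init__(self, concurrency=4, delay=0.5):
        self._global = asyncio.Semaphore(concurrency)  # caps in-flight requests
        self._host_locks = defaultdict(asyncio.Lock)   # one lock per host
        self._last_hit = {}                            # host -> time of last request
        self._delay = delay

    async def run(self, url, fetch):
        host = urlsplit(url).netloc
        # Take the host lock first so a request that is merely waiting out
        # the per-host delay does not occupy a global slot.
        async with self._host_locks[host]:
            last = self._last_hit.get(host)
            if last is not None:
                wait = last + self._delay - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
            async with self._global:
                try:
                    return await fetch(url)
                finally:
                    self._last_hit[host] = time.monotonic()


async def demo():
    scheduler = PoliteScheduler(concurrency=2, delay=0.05)
    hits = {}

    async def fake_fetch(url):
        # Stand-in for a real HTTP request; records when each host was hit.
        hits.setdefault(urlsplit(url).netloc, []).append(time.monotonic())
        return url.rsplit("/", 1)[-1]

    urls = ["https://a.example/1", "https://a.example/2", "https://b.example/1"]
    results = await asyncio.gather(*(scheduler.run(u, fake_fetch) for u in urls))
    first, second = hits["a.example"]
    return results, second - first
```

Running asyncio.run(demo()) fetches the two a.example URLs at least one delay apart, while the b.example URL is not held back by them.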

Installation

From PyPI

pip install sitesavvy

From source (development)

git clone https://github.com/your-org/sitesavvy.git
cd sitesavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless

A plain pip install -r requirements.txt is also supported if you prefer to skip the PEP 517 build.

Quick start

Full-site mirror → ZIP

sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out

Text-only crawl → Markdown + EPUB

sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader

Dry-run (list URLs only)

sitesavvy crawl https://example.com --dry-run --depth 1
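
To make the interaction between --dry-run and --depth concrete, here is a sketch of a depth-bounded breadth-first walk that only lists URLs. The plan_crawl function and the links_of callback are hypothetical. The sketch also assumes the start URL counts as depth 0, which this README does not state.

```python
from collections import deque


def plan_crawl(start, links_of, max_depth=0):
    """List URLs in breadth-first order without downloading anything.

    max_depth=0 means unlimited, mirroring the documented --depth default.
    links_of is a hypothetical callback returning the links found on a page.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth and depth >= max_depth:
            continue  # do not expand pages that sit at the depth limit
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory link graph standing in for a real site.
SITE = {"/": ["/a", "/b"], "/a": ["/a/deep"], "/b": [], "/a/deep": []}


def site_links(url):
    return SITE.get(url, [])
```

With this graph, plan_crawl("/", site_links, max_depth=1) lists the start page and its two direct links, and leaves /a/deep out.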

Resume an interrupted crawl

sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
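
A rough picture of how a manifest can drive --resume, assuming a layout in which the JSON file maps each fetched URL to its local path and validators. The real manifest schema is not documented here, so the keys and every function name below are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Return the manifest as a dict, or an empty one on a first run."""
    manifest_file = Path(path)
    if not manifest_file.exists():
        return {}
    return json.loads(manifest_file.read_text(encoding="utf-8"))


def record(manifest, url, local_path, etag=None, last_modified=None):
    """Remember a completed download (assumed entry layout)."""
    manifest[url] = {
        "path": str(local_path),
        "etag": etag,
        "last_modified": last_modified,
    }


def save_manifest(manifest, path):
    Path(path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def remaining(urls, manifest):
    """URLs that still need fetching when resuming."""
    return [url for url in urls if url not in manifest]
```

On a resumed run, only the URLs returned by remaining() would be queued; everything already recorded is skipped.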

Only re-download changed resources

sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
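
Incremental mode relies on HTTP conditional GET: the stored ETag and Last-Modified values are sent back as If-None-Match and If-Modified-Since, and a 304 Not Modified response means the local copy is still current. The sketch below shows only that header logic. The entry dict and its keys are an assumed manifest layout, not the documented one.

```python
def conditional_headers(entry):
    """Build conditional-GET request headers from a stored manifest entry."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def needs_download(status):
    """304 Not Modified means the cached file can be kept as-is."""
    return status != 304
```

The resulting dict can be passed as the headers argument of whichever HTTP client performs the request.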

Render JavaScript pages

sitesavvy crawl https://spa.example.com --headless --format html

Command reference

Flag	Default	Description
`url` (positional)	—	Starting URL.
`--depth INT`	`0`	Max link depth (`0` = unlimited).
`--mode {full,text}`	`full`	Full-site download or text-only extraction.
`--format …`	`html`	Output format, repeatable: `html md txt pdf epub zip`.
`--out-dir PATH`	CWD	Destination folder.
`--concurrency N`	`4`	Simultaneous HTTP requests.
`--user-agent STR`	browser-like	Custom `User-Agent` header.
`--respect-robots` / `--no-respect-robots`	on	Obey `robots.txt`.
`--delay SECS`	`0.5`	Polite delay between same-host requests.
`--resume`	off	Skip URLs already completed in the manifest.
`--manifest FILE`	`<out-dir>/manifest.json`	Manifest path.
`--dry-run`	off	List URLs that would be fetched.
`--headless`	off	Render JS pages with Playwright.
`--rate-limit {auto,fixed}`	`auto`	Back off on 429/5xx, or use fixed delay.
`--download-types …`	all	Comma-separated: `html,css,js,img,pdf,other`.
`--incremental`	off	Re-download only changed resources (conditional GET).
`--external`	off	Follow cross-domain links.
`--force`	off	Proceed even if `robots.txt` disallows the start URL.
`--timeout SECS`	`30`	Per-request timeout.
`--verbose` / `-v`	off	Enable debug logging.
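
When driving the CLI from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical convenience, not part of SiteSavvy. It uses only flags from the table above and assumes that "repeatable" means one --format per value.

```python
import shlex


def build_crawl_command(url, *, formats=("html",), mode="full", depth=0,
                        out_dir=None, resume=False):
    """Return an argv list for `sitesavvy crawl` (hypothetical helper)."""
    cmd = ["sitesavvy", "crawl", url, "--mode", mode, "--depth", str(depth)]
    for fmt in formats:
        cmd += ["--format", fmt]  # assumption: the flag is repeated per format
    if out_dir is not None:
        cmd += ["--out-dir", str(out_dir)]
    if resume:
        cmd.append("--resume")
    return cmd


command = build_crawl_command(
    "https://example.com", formats=("md", "epub"), mode="text", depth=2,
    out_dir="./reader",
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The list form can be handed to subprocess.run(command, check=True) without any shell quoting.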

Auxiliary commands:

sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show which optional backends are installed
sitesavvy --version

Export-format matrix

Format	Mode `full`	Mode `text`	Backend
`html`	original bytes, hierarchy preserved	—	built-in
`md`	—	`markdownify` (ATX headings, links absolute)	`markdownify`
`txt`	—	`html2text` (no hard wrap)	`html2text`
`pdf`	—	WeasyPrint	`weasyprint`
`epub`	—	`ebooklib`, one chapter per page	`ebooklib`
`zip`	archive of the whole crawl	archive of the whole crawl	`zipfile`
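
The zip row names the standard-library zipfile module as its backend. As a sketch of what "archive of the whole crawl" can look like, the function below packs an output directory while keeping relative paths. The zip_crawl name and its behaviour are assumptions, not the project's conversion code.

```python
import zipfile
from pathlib import Path


def zip_crawl(out_dir, archive_path):
    """Pack every file under out_dir into a ZIP, preserving relative paths."""
    out_dir = Path(out_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        archive_resolved = archive_path.resolve()
        for file in sorted(out_dir.rglob("*")):
            # Skip directories, and the archive itself if it lives in out_dir.
            if file.is_file() and file.resolve() != archive_resolved:
                zf.write(file, arcname=file.relative_to(out_dir).as_posix())
    return archive_path
```

Storing POSIX-style relative names keeps the archive layout identical on Linux, macOS and Windows.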

Sample Markdown output:

# Page Title

## A heading

Some paragraph text with a [link](https://example.com/page).

Architecture

sitesavvy/
├── __init__.py          # package metadata
├── __main__.py          # python -m sitesavvy
├── __about__.py         # version
├── config.py            # CrawlConfig + enums
├── models.py            # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py         # normalisation, link extraction, path mapping
├── robots.py            # async robots.txt (reppy or stdlib fallback)
├── conversions.py       # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py          # resume / incremental state
├── headless.py          # Playwright fetcher
├── crawler.py           # the Crawler engine
├── legal.py             # disclaimer text
├── cli.py               # Typer + Rich CLI
└── main.py              # console-script entry point
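
To illustrate the kind of work url_utils.py is described as doing (normalisation, host gating, path mapping), here is a standard-library sketch. The function names and the exact rules are assumptions and may differ from the real module.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Lower-case scheme and host, drop the fragment, default the path to '/'.

    Sketch only: assumes the URL carries no credentials in its netloc.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def same_host(url, start_url):
    """True when url stays on the start host (the default without --external)."""
    return urlsplit(url).hostname == urlsplit(start_url).hostname


def local_path(url):
    """Map a URL to a relative file path that mirrors the site hierarchy."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-style URLs need a file name
    # A production mapper would also have to reject '..' segments.
    return PurePosixPath(parts.hostname or "", path.lstrip("/"))
```

For example, local_path("https://example.com/docs/") yields example.com/docs/index.html, so a full-mode mirror can be browsed offline with the same layout as the site.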

Networking layer: aiohttp (primary) with an optional Playwright headless browser for JS-rendered pages. HTML parsing uses beautifulsoup4 + lxml. robots.txt is parsed with reppy when available, otherwise with the stdlib urllib.robotparser.
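
The standard-library fallback mentioned above can be exercised on its own. The sketch below feeds robots.txt text to urllib.robotparser and asks whether a URL may be fetched. The allowed wrapper and the "SiteSavvy" user-agent string are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


RULES = """\
User-agent: *
Disallow: /private/
"""

print(allowed(RULES, "SiteSavvy", "https://example.com/public/page.html"))
print(allowed(RULES, "SiteSavvy", "https://example.com/private/page.html"))
```

Parsing text that was fetched separately (rather than calling the parser's own blocking read method) keeps the network I/O in the async layer.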

Troubleshooting

- HTTP 429 Too Many Requests: lower --concurrency, raise --delay, and keep --rate-limit auto (the default) so SiteSavvy backs off automatically.
- Large sites: set --depth to bound the crawl, run with --dry-run first to estimate scope, and use --resume so an interruption doesn't waste work.
- PDF export fails: WeasyPrint needs Pango/Cairo system libraries. On Debian/Ubuntu: apt install libpango-1.0-0 libpangoft2-1.0-0. On macOS: brew install pango. The other formats keep working even if PDF export is unavailable.
- Headless mode crashes: run playwright install chromium once after installing the package. Without it, SiteSavvy transparently falls back to aiohttp.
- robots.txt disallows …: by default SiteSavvy honours robots.txt. Add --force only if you have permission and accept responsibility.
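
For the 429 case, it may help to see what "backing off automatically" can mean in practice. The function below is one plausible policy (exponential growth on 429 / 5xx, gradual recovery on success, optional Retry-After), written as an assumption. SiteSavvy's actual --rate-limit auto behaviour and constants are not documented here.

```python
def next_delay(current, status, *, base=0.5, factor=2.0, ceiling=60.0,
               retry_after=None):
    """Return the per-host delay to use after a response (hypothetical policy).

    base matches the documented --delay default; factor and ceiling are
    illustrative values.
    """
    throttled = status == 429 or 500 <= status < 600
    if throttled:
        if retry_after is not None:
            # Honour the server's hint, but never shrink the current delay.
            return min(max(retry_after, current), ceiling)
        return min(max(current, base) * factor, ceiling)
    # Healthy response: ease back toward the configured base delay.
    return max(base, current / factor)
```

Starting from the 0.5 s default, two consecutive 429 responses would raise the delay to 1 s and then 2 s, and later successes would bring it back down step by step.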

Legal & ethics

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Run sitesavvy legal to read the full disclaimer. Licensed under the MIT License.

Contributing

Pull requests are welcome! Please run the full check suite before submitting:

ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing

Coverage must stay at or above 90 %. See the Developer Guide for the project layout, release process and binary-building instructions.

Release history

0.6.0 (Jun 24, 2026)
0.5.0 (Jun 23, 2026)
0.1.0 (Jun 22, 2026, this version)

Download files

Source Distribution

sitesavvy-0.1.0.tar.gz (28.3 kB)

Uploaded Jun 22, 2026 Source

Built Distribution

sitesavvy-0.1.0-py3-none-any.whl (31.9 kB)

Uploaded Jun 22, 2026 Python 3

File details

Details for the file sitesavvy-0.1.0.tar.gz.

File metadata

Download URL: sitesavvy-0.1.0.tar.gz
Upload date: Jun 22, 2026
Size: 28.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sitesavvy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e2565bedc9fc0b8548291a262557d52d20caa9284245cc19b4743d33d07c93b9`
MD5	`18b89b26c5046e65c7822196a4231510`
BLAKE2b-256	`a0b8364010c644bc2e1bfa4eed01c9b66ecbcc6a24945867ff9f2f08538a310d`

File details

Details for the file sitesavvy-0.1.0-py3-none-any.whl.

File metadata

Download URL: sitesavvy-0.1.0-py3-none-any.whl
Upload date: Jun 22, 2026
Size: 31.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sitesavvy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3fccd982e2bfa019d028697ecc8fdaaf7223c1a9dff4cda043e59b58e1f0bff`
MD5	`6cc40ea37d8238d2ea12eb2d3b967864`
BLAKE2b-256	`aac6c6f33fd9b4bf78f3762ce7e8b74c37c76229d942fd60330bf88689148ba9`
