SiteSavvy

Capture the web, your way.

A modern, async, cross-platform web scraper that mirrors entire sites or extracts their readable text — and exports the result as HTML, Markdown, plain text, PDF, EPUB or a single ZIP archive.

Built with aiohttp, BeautifulSoup + lxml, Typer + Rich, with optional Playwright headless rendering for JavaScript-heavy pages.


Features

  • Two crawl modes
    • full — recursively download every reachable resource (HTML, CSS, JS, images, PDFs, fonts, …) preserving the original directory hierarchy.
    • text — extract the readable text from each HTML page (strips scripts, navigation, ads) and store it in your chosen format.
  • Six output formats (repeatable --format): html, md, txt, pdf, epub, zip.
  • Polite by default: respects robots.txt, enforces a per-host delay, and auto-throttles on 429 / 5xx responses.
  • Resume & incremental: a JSON manifest records every fetched URL, its local path and ETag / Last-Modified; --resume skips completed work and --incremental re-downloads only what changed.
  • Concurrency control with a global semaphore and per-host locks.
  • Dry-run mode that lists the URLs that would be fetched.
  • Headless rendering via Playwright (falls back to aiohttp automatically when Playwright is not available).
  • Fine-grained --download-types filtering: html,css,js,img,pdf,other.
  • External-link gating — stays on the start host unless you pass --external.
  • Rich CLI with progress tables and coloured output.
  • Cross-platform — runs on Linux, macOS and Windows; ships a CI matrix for all three.
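
As an illustration of the concurrency and politeness features listed above, here is a minimal sketch of how a global semaphore, per-host locks and a per-host delay might be combined. The class name PoliteScheduler and its methods are hypothetical and are not taken from the SiteSavvy code base; the defaults simply mirror the documented --concurrency and --delay values.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical helper: caps total concurrency and spaces out same-host requests."""

    def __init__(self, concurrency: int = 4, delay: float = 0.5) -> None:
        self._global = asyncio.Semaphore(concurrency)  # global cap (cf. --concurrency)
        self._host_locks = defaultdict(asyncio.Lock)   # one lock per host
        self._last_hit = {}                            # host -> time the last request finished
        self._delay = delay                            # cf. --delay

    async def run(self, url: str, fetch):
        """Await fetch(url) once the host's delay has elapsed and a global slot is free."""
        host = urlsplit(url).netloc
        async with self._host_locks[host]:
            last = self._last_hit.get(host, float("-inf"))
            wait = last + self._delay - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            async with self._global:
                try:
                    return await fetch(url)
                finally:
                    self._last_hit[host] = time.monotonic()
```

In this sketch the host lock is held for the whole request, so each host sees at most one request at a time while different hosts proceed in parallel up to the global limit. The real crawler may make a different trade-off.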

Installation

From PyPI (once published)

pip install sitesavvy

From source (development)

git clone https://github.com/your-org/sitesavvy.git
cd sitesavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless

A plain pip install -r requirements.txt is also supported if you prefer to skip the PEP 517 build.


Quick start

Full-site mirror → ZIP

sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out

Text-only crawl → Markdown + EPUB

sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader

Dry-run (list URLs only)

sitesavvy crawl https://example.com --dry-run --depth 1
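
To make the dry-run idea concrete, the following sketch lists the URLs a depth-bounded, breadth-first crawl would visit. It is only an illustration: plan_crawl is a hypothetical function, the links mapping stands in for real link extraction, and the reading of --depth (depth 1 means the start page plus the pages it links to) is an assumption.

```python
from collections import deque


def plan_crawl(start: str, links: dict, max_depth: int = 0) -> list:
    """List URLs a crawl would visit, breadth-first. max_depth=0 means unlimited."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth and depth >= max_depth:
            continue  # at the depth limit: list the page but do not expand its links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

The seen set ensures that pages linking back to each other are listed only once.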

Resume an interrupted crawl

sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
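
Resuming amounts to skipping URLs the manifest already marks as complete. The sketch below shows one way this could work. The manifest layout used here (a JSON object keyed by URL, with a "status" field) is an assumption made for illustration and may not match the actual file format; pending_urls is a hypothetical helper.

```python
import json
from pathlib import Path


def pending_urls(manifest_path: Path, candidates: list) -> list:
    """Return the candidate URLs not yet recorded as completed in the manifest."""
    if not manifest_path.exists():
        return list(candidates)  # no manifest yet: everything is still to do
    entries = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Assumed layout: {"<url>": {"status": "done", "path": "...", "etag": "..."}}
    done = {url for url, entry in entries.items() if entry.get("status") == "done"}
    return [url for url in candidates if url not in done]
```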

Only re-download changed resources

sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
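
Incremental mode relies on HTTP conditional GET: the stored ETag and Last-Modified values are sent back as If-None-Match and If-Modified-Since, and a 304 Not Modified response means the local copy is still current. A minimal sketch follows. The entry keys "etag" and "last_modified" and both function names are hypothetical.

```python
def conditional_headers(entry: dict) -> dict:
    """Build conditional-GET request headers from a stored manifest entry."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def needs_download(status: int) -> bool:
    """A 304 response means the resource is unchanged and the body can be skipped."""
    return status != 304
```

The returned dictionary can be passed as the headers argument of an HTTP client request. An entry with neither validator produces no conditional headers, so the resource is fetched unconditionally.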

Render JavaScript pages

sitesavvy crawl https://spa.example.com --headless --format html

Command reference

Flag                                    Default                  Description
url (positional)                        -                        Starting URL.
--depth INT                             0                        Max link depth (0 = unlimited).
--mode {full,text}                      full                     Full-site download or text-only extraction.
--format …                              html                     Output format, repeatable: html md txt pdf epub zip.
--out-dir PATH                          CWD                      Destination folder.
--concurrency N                         4                        Simultaneous HTTP requests.
--user-agent STR                        browser-like             Custom User-Agent header.
--respect-robots / --no-respect-robots  on                       Obey robots.txt.
--delay SECS                            0.5                      Polite delay between same-host requests.
--resume                                off                      Skip URLs already completed in the manifest.
--manifest FILE                         <out-dir>/manifest.json  Manifest path.
--dry-run                               off                      List URLs that would be fetched.
--headless                              off                      Render JS pages with Playwright.
--rate-limit {auto,fixed}               auto                     Back off on 429/5xx, or use fixed delay.
--download-types …                      all                      Comma-separated: html,css,js,img,pdf,other.
--incremental                           off                      Re-download only changed resources (conditional GET).
--external                              off                      Follow cross-domain links.
--force                                 off                      Proceed even if robots.txt disallows the start URL.
--timeout SECS                          30                       Per-request timeout.
--verbose / -v                          off                      Enable debug logging.
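
The external-link gating controlled by --external can be pictured as a host comparison against the start URL. The sketch below is an assumption about the behaviour, not the project's code: is_allowed is a hypothetical name, and it compares exact hostnames, so a subdomain counts as external here.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, start_url: str, allow_external: bool = False) -> bool:
    """Return True if url may be crawled given the start URL and the --external setting."""
    if allow_external:
        return True
    # urlsplit().hostname is lower-cased and excludes the port
    return urlsplit(url).hostname == urlsplit(start_url).hostname
```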

Auxiliary commands:

sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show which optional backends are installed
sitesavvy --version

Export-format matrix

Format  Mode full                            Mode text                                   Backend
html    original bytes, hierarchy preserved  -                                           built-in
md      -                                    markdownify (ATX headings, links absolute)  markdownify
txt     -                                    html2text (no hard wrap)                    html2text
pdf     -                                    WeasyPrint                                  weasyprint
epub    -                                    ebooklib, one chapter per page              ebooklib
zip     archive of the whole crawl           archive of the whole crawl                  zipfile

Sample Markdown output:

# Page Title

## A heading

Some paragraph text with a [link](https://example.com/page).
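
For the zip format, the matrix names the standard-library zipfile module as the backend. A minimal sketch of archiving an output directory while preserving its hierarchy could look like this. The function name zip_crawl is hypothetical, and the real exporter may choose different compression or naming.

```python
import zipfile
from pathlib import Path


def zip_crawl(out_dir: Path, archive: Path) -> list:
    """Write every file under out_dir into archive and return the stored names."""
    names = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out_dir.rglob("*")):
            # Skip directories and the archive itself if it lives inside out_dir
            if path.is_file() and path.resolve() != archive.resolve():
                name = path.relative_to(out_dir).as_posix()
                zf.write(path, arcname=name)
                names.append(name)
    return names
```

Storing names relative to out_dir with forward slashes keeps the archive layout the same on Linux, macOS and Windows.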

Architecture

sitesavvy/
├── __init__.py          # package metadata
├── __main__.py          # python -m sitesavvy
├── __about__.py         # version
├── config.py            # CrawlConfig + enums
├── models.py            # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py         # normalisation, link extraction, path mapping
├── robots.py            # async robots.txt (reppy or stdlib fallback)
├── conversions.py       # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py          # resume / incremental state
├── headless.py          # Playwright fetcher
├── crawler.py           # the Crawler engine
├── legal.py             # disclaimer text
├── cli.py               # Typer + Rich CLI
└── main.py              # console-script entry point

Networking layer: aiohttp (primary) with an optional Playwright headless browser for JS-rendered pages. HTML parsing uses beautifulsoup4 + lxml. robots.txt is parsed with reppy when available, otherwise with the stdlib urllib.robotparser.
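
The stdlib fallback mentioned above, urllib.robotparser, can evaluate rules from text that has already been downloaded, which suits an async crawler that fetches robots.txt itself. The sketch below illustrates this. build_robots and the "SiteSavvy" user-agent string are hypothetical and not taken from the project.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was fetched elsewhere (for example via aiohttp)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
allowed = rules.can_fetch("SiteSavvy", "https://example.com/public/page")
blocked = rules.can_fetch("SiteSavvy", "https://example.com/private/page")
delay = rules.crawl_delay("SiteSavvy")  # Crawl-delay for this agent, or None if unset
```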


Troubleshooting

  • HTTP 429 Too Many Requests — lower --concurrency, raise --delay, and keep --rate-limit auto (default) so SiteSavvy backs off automatically.
  • Large sites — set --depth to bound the crawl, run with --dry-run first to estimate scope, and use --resume so an interruption doesn't waste work.
  • PDF export fails — WeasyPrint needs Pango/Cairo system libraries. On Debian/Ubuntu: apt install libpango-1.0-0 libpangoft2-1.0-0. On macOS: brew install pango. The other formats keep working even if PDF is missing.
  • Headless mode crashes — run playwright install chromium once after installing the package. Without it, SiteSavvy transparently falls back to aiohttp.
  • robots.txt disallows … — by default SiteSavvy honours robots.txt. Add --force only if you have permission and accept responsibility.
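
To illustrate what backing off on 429 / 5xx responses can mean in practice, here is a sketch of a delay calculation using exponential backoff. Exponential backoff, the 60-second cap and the function name backoff_delay are assumptions; the README does not describe the exact strategy behind --rate-limit auto. Only the delta-seconds form of the Retry-After header is handled.

```python
def backoff_delay(status, attempt, base_delay=0.5, retry_after=None, cap=60.0):
    """Return seconds to wait before retrying, or None if the status needs no retry."""
    if status != 429 and not 500 <= status < 600:
        return None
    # Prefer the server's own hint when it is given as a whole number of seconds
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise double the base delay on each attempt (attempt starts at 0)
    return min(base_delay * 2 ** attempt, cap)
```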

Legal & ethics

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. The authors assume no liability for misuse. Run sitesavvy legal to read the full disclaimer. Licensed under the MIT License.


Contributing

Pull requests are welcome! Please run the full check suite before submitting:

ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing

Coverage must stay at or above 90%. See the Developer Guide for the project layout, release process and binary-building instructions.
