Capture the web, your way. A modern, async, cross-platform web scraper.

SiteSavvy

Capture the web, your way.

SiteSavvy is a modern, async, cross-platform web scraper that mirrors entire sites or extracts their readable text — and now also an AI-powered research tool. Beyond the original HTML / Markdown / text / PDF / EPUB / ZIP exports, v0.5.0 adds LLM content extraction, per-page summaries and auto-categorization, RAG question-answering (sitesavvy ask "..."), an MCP server that exposes crawl / search / ask to Claude, Cursor and VS Code Copilot, and nine output formats including SQLite, WARC and Obsidian vaults. Point it at a docs site, a blog, a shop or a whole wiki, and SiteSavvy will quietly fetch, parse, summarize, index and answer questions about it — politely, resumably, and on your laptop.

pip install sitesavvy

A basic crawl:

sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out

An AI-powered research crawl that extracts clean text, summarizes every page and builds a RAG index you can ask questions of:

sitesavvy crawl https://example.com --mode text --format md sqlite \
    --summarize --categorize --index --out-dir ./research

Then ask a question in natural language:

sitesavvy ask "what does this site say about pricing?"

What's new in v0.6.0

Note: v0.6.0 completes the feature set with 7 new modules covering pagination, authentication, proxy/Tor, stealth, recipes, docs-site mode, and offline full-text search.

📄 Pagination awareness — follows rel="next" / ?page=N without consuming the depth budget
🔐 Authenticated crawling — --login-url / --login-user / --login-pass with CSRF detection
🌐 Proxy / Tor / SOCKS5 — --proxy http://... or socks5://... (SOCKS via optional aiohttp_socks)
🥸 Stealth mode — --stealth rotates the User-Agent among 15 real browser strings, jitters request timing, and sends realistic headers
🍳 Recipe mode — --recipe-mode detects schema.org Recipes and builds a sitesavvy-cookbook.epub
📚 Docs-site mode — --docs-mode strips sidebars/nav for mkdocs / Docusaurus / Sphinx / ReadTheDocs, builds a TOC
🔍 Offline search — --offline-search builds a self-contained search.html + index.json (works from file://)

See the CHANGELOG for the full list.

What's new in v0.5.0

Note: v0.5.0 is a major release that turns SiteSavvy from a scraper into an AI-powered research assistant. The headline additions:

🤖 AI: LLM content extraction, per-page summaries, auto-categorization
💬 RAG: sitesavvy ask "..." — ask questions about crawled sites
🔌 MCP server: expose crawl / search / ask to Claude, Cursor, VS Code Copilot
📋 9 output formats: html, md, txt, pdf, epub, zip, sqlite, warc, obsidian
🎯 URL patterns (--include / --exclude), CSS scope, sitemap seeding
📊 Budgets (--max-pages / --max-bytes / --max-time), HTML reports
🔄 Content diff between crawls, Wayback Machine archiving
⚙️ Config files + 6 presets (docs / blog / wiki / shop / archive / research)

See the CHANGELOG for the full list, and the Architecture section below for the new module layout.

Features

Crawl modes

full — recursively download every reachable resource (HTML, CSS, JS, images, PDFs, fonts, …) preserving the original directory hierarchy.
text — extract the readable text from each HTML page (strips scripts, navigation, ads) and store it in your chosen format.

Output formats

Nine formats, repeatable via --format:

Format	Mode `full`	Mode `text`	Backend
`html`	original bytes, hierarchy preserved	—	built-in
`md`	—	`markdownify` (ATX headings, links absolute)	`markdownify`
`txt`	—	`html2text` (no hard wrap)	`html2text`
`pdf`	—	WeasyPrint	`weasyprint`
`epub`	—	`ebooklib`, one chapter per page	`ebooklib`
`zip`	archive of the whole crawl	archive of the whole crawl	`zipfile`
`sqlite`	one row per page (URL, title, text, meta)	one row per page + embeddings	`sqlite3`
`warc`	ISO 28500:2017 archive (replayweb.page-compatible)	—	built-in
`obsidian`	—	Markdown vault with `[[wikilinks]]` + frontmatter	built-in

AI & intelligence

LLM content extraction (--ai-extract) — let an LLM pull the main article out of cluttered pages in text mode.
Per-page summaries + site digest (--summarize) — every page gets a one-paragraph summary, plus a top-level digest.md overview.
Auto-categorization (--categorize) — each page is tagged with an AI-derived category (e.g. tutorial, pricing, API reference).
RAG question-answering (sitesavvy ask "...") — semantic search over a SQLite vector store (cosine similarity) plus an LLM that synthesizes an answer from the top-k retrieved chunks, with source URLs.
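
The retrieval step in the last bullet can be sketched in a few lines. This is only an illustration of cosine-similarity top-k ranking, not the code in rag.py: the function names and the toy three-dimensional vectors are assumptions, and real vectors would come from the configured embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical data).
store = {
    "https://example.com/pricing": [0.9, 0.1, 0.0],
    "https://example.com/about": [0.0, 1.0, 0.0],
    "https://example.com/plans": [0.8, 0.2, 0.1],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
for chunk_id, score in results:
    print(f"{score:.3f}  {chunk_id}")
```

The top-ranked entries are what would be handed to the LLM as context, together with their source URLs.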

MCP server

sitesavvy mcp starts a Model Context Protocol server (stdio transport) that exposes SiteSavvy to AI assistants. Six tools are available:

Tool	What it does
`crawl`	Run a crawl with the same options as the CLI.
`list_pages`	List pages in a finished crawl's manifest.
`search`	Full-text search over a crawled mirror.
`get_page`	Fetch the body of a single page by URL.
`ask`	RAG question-answering over a crawled mirror.
`info`	Report installed backends and AI configuration.

Configure it once in your client (see MCP server below) and your assistant can crawl, read and reason about the web without leaving the chat.

Scraping power

Sitemap seeding (--sitemap) — discover and parse sitemap.xml, including sitemap indexes.
RSS / Atom feed discovery (--feeds) — seed URLs from feed entries.
URL pattern filtering — --include / --exclude accept globs (* / **) or re:<regex> patterns; the start URL is always allowed.
CSS scope (--scope "main") — restricts both link discovery and content extraction to a subtree.
Budgets — --max-pages, --max-bytes, --max-time stop crawls cleanly and leave a resumable manifest behind.
Proxy support — --proxy http://host:port or socks5://host:port.
Screenshots (--screenshots) — capture full-page PNGs in headless mode.
Headless rendering (--headless) via Playwright (falls back to aiohttp automatically when no browser binary is installed).
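
To make the URL pattern filtering bullet concrete, here is a rough sketch of how glob and re:<regex> patterns could be evaluated. It is not the code in patterns.py. The assumptions are that patterns are tested against the URL path, that `*` stays within one path segment while `**` crosses `/`, and that excludes win over includes; the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Compile a 're:<regex>' pattern, or translate a glob into a regex."""
    if pattern.startswith("re:"):
        return re.compile(pattern[3:])
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # assumed: '**' crosses '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # assumed: '*' stays inside one segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def is_allowed(url: str, start_url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply exclude, then include; the start URL is always allowed."""
    if url == start_url:
        return True
    path = urlsplit(url).path
    if any(compile_pattern(p).search(path) for p in exclude):
        return False
    if not include:
        return True
    return any(compile_pattern(p).search(path) for p in include)


start = "https://example.com/"
print(is_allowed("https://example.com/blog/post-1", start, ["/blog/*"], []))
print(is_allowed("https://example.com/blog/file.pdf", start, ["/blog/**"], [r"re:\.pdf$"]))
```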

Politeness

Robots.txt compliance by default, with --force override.
Per-host delay (--delay) and auto-throttle on 429 / 5xx (--rate-limit auto, the default) with exponential back-off plus jitter.
Resume (--resume) — skip URLs already completed in the manifest.
Incremental (--incremental) — re-download only changed resources via conditional GETs (ETag / Last-Modified, 304 Not Modified).
External-link gating — stays on the start host unless you pass --external.
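
The auto-throttle behaviour above (back off on 429 / 5xx, exponentially, with jitter) can be illustrated as follows. This is a sketch under assumptions, not SiteSavvy's implementation: the base delay, the 60-second cap, the 25 % additive jitter and both function names are made up for the example.

```python
import random
from typing import Optional


def should_back_off(status: int) -> bool:
    """True for the responses that trigger throttling: 429 and any 5xx."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped, plus jitter."""
    rng = rng or random.Random()
    delay = min(cap, base * (2 ** attempt))
    # Additive jitter spreads retries out so clients do not hit the host in lockstep.
    return delay + rng.uniform(0.0, delay * jitter)


rng = random.Random(0)
for attempt in range(4):
    print(attempt, round(backoff_delay(attempt, rng=rng), 2))
```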

Config & UX

Config files — sitesavvy.toml with [default] and [profiles.<name>] sections; --config and --profile flags load them.
6 built-in presets — --preset docs|blog|wiki|shop|archive|research (see Presets).
Interactive wizard — sitesavvy new walks you through a few prompts and prints a ready-to-run crawl command (or writes a sitesavvy.toml).
sitesavvy init-config — writes an example sitesavvy.toml.
sitesavvy list-presets — lists available presets.
HTML report (--report) — writes a self-contained crawl-report.html summarizing URLs fetched, failures, formats produced and AI summaries.
Content diff (sitesavvy diff <old> <new> <old-dir> <new-dir>) — compares two crawls and reports added / removed / changed pages as Markdown.
Wayback Machine (--archive) — submits every fetched page to web.archive.org (fire-and-forget).
Rich CLI with progress tables, coloured output and --verbose / -v debug logging.

Installation

Option 1 — From PyPI (recommended)

pip install sitesavvy

For the MCP server, install the optional extra:

pip install 'sitesavvy[mcp]'

Verify the install:

sitesavvy --version
sitesavvy info   # show which optional backends are installed + AI config

Option 2 — Stand-alone binary (no Python required)

Download the right archive for your OS from the latest release, extract it, and run:

OS	Asset	How to run
Linux (x86_64)	`sitesavvy-0.5.0-linux-x86_64.tar.gz`	`tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help`
macOS (x86_64)	`sitesavvy-0.5.0-macos-x86_64.tar.gz`	`tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help`
Windows (x86_64)	`sitesavvy-0.5.0-windows-x86_64.exe`	Double-click, or run `sitesavvy.exe --help` in PowerShell

These are single-file PyInstaller executables — no Python installation needed.

Option 3 — From source (development)

git clone https://github.com/Bloody-Crow/SiteSavvy.git
cd SiteSavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless / --screenshots

A plain pip install -r requirements.txt is also supported if you prefer to skip the PEP 517 build.

AI configuration

AI features (extraction, summaries, categorization, RAG ask) talk to any OpenAI-compatible endpoint: OpenAI itself, Ollama, vLLM, LM Studio, Groq, Together, OpenRouter, and any other server that implements the /chat/completions and /embeddings routes.

Configure via environment variables:

Variable	Default	Purpose
`SITESAVVY_LLM_BASE_URL`	`https://api.openai.com/v1`	OpenAI-compatible API base URL.
`SITESAVVY_LLM_API_KEY`	(empty)	API key. Required for AI features.
`SITESAVVY_LLM_MODEL`	`gpt-4o-mini`	Chat model for extraction / summaries / `ask`.
`SITESAVVY_EMBED_MODEL`	`text-embedding-3-small`	Embedding model for the RAG index.

Example — OpenAI:

export SITESAVVY_LLM_API_KEY="sk-..."
sitesavvy crawl https://example.com --mode text --format md --summarize --index

Example — local Ollama (no real API key needed; any non-empty placeholder works):

export SITESAVVY_LLM_BASE_URL="http://localhost:11434/v1"
export SITESAVVY_LLM_API_KEY="ollama"   # any non-empty string
export SITESAVVY_LLM_MODEL="llama3.1"
export SITESAVVY_EMBED_MODEL="nomic-embed-text"
sitesavvy crawl https://example.com --mode text --summarize --index

Note: AI features are strictly opt-in. SiteSavvy never makes network calls to an LLM provider unless you explicitly pass --ai-extract, --summarize, --categorize or --index, or invoke sitesavvy ask. Without an API key the AI flags are silently skipped and the crawl completes normally — see Troubleshooting.

Run sitesavvy info at any time to see the configured base URL, model, and whether an API key is set.
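
For readers scripting around SiteSavvy, the four variables and their documented defaults can be resolved like this. The LLMSettings class and settings_from_env function are hypothetical names for this sketch, not the package's own LLMConfig; the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str
    embed_model: str

    @property
    def enabled(self) -> bool:
        """AI features need a non-empty API key."""
        return bool(self.api_key)


def settings_from_env(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Read the SITESAVVY_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return LLMSettings(
        base_url=env.get("SITESAVVY_LLM_BASE_URL", "https://api.openai.com/v1"),
        api_key=env.get("SITESAVVY_LLM_API_KEY", ""),
        model=env.get("SITESAVVY_LLM_MODEL", "gpt-4o-mini"),
        embed_model=env.get("SITESAVVY_EMBED_MODEL", "text-embedding-3-small"),
    )


ollama = settings_from_env({
    "SITESAVVY_LLM_BASE_URL": "http://localhost:11434/v1",
    "SITESAVVY_LLM_API_KEY": "ollama",
    "SITESAVVY_LLM_MODEL": "llama3.1",
})
print(ollama.base_url, ollama.model, ollama.enabled)
```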

MCP server

sitesavvy mcp runs SiteSavvy as a Model Context Protocol server over stdio, exposing the six tools listed in Features to any MCP-compatible client.

Claude Desktop

Add an entry to your Claude Desktop config (macOS: ~/Library/Application Support/Claude/claude_desktop_config.json, Windows: %APPDATA%\Claude\claude_desktop_config.json):

{
  "mcpServers": {
    "sitesavvy": {
      "command": "sitesavvy",
      "args": ["mcp"]
    }
  }
}

Restart Claude Desktop. You can now ask things like "crawl https://docs.example.com and tell me how their auth flow works" and Claude will call the crawl and ask tools for you.
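
If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over them. The helper below is a small sketch of that merge; add_sitesavvy_server is a hypothetical name, and reading and writing the actual file at the paths above is left to the caller.

```python
import json


def add_sitesavvy_server(config_text: str) -> str:
    """Return the config JSON with a 'sitesavvy' entry added under mcpServers."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    # Other servers already registered in the file are left untouched.
    servers["sitesavvy"] = {"command": "sitesavvy", "args": ["mcp"]}
    return json.dumps(config, indent=2)


existing = '{"mcpServers": {"other": {"command": "other-server", "args": []}}}'
print(add_sitesavvy_server(existing))
```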

Cursor, VS Code Copilot, and other MCP clients

The same "command": "sitesavvy", "args": ["mcp"] snippet works in any client that speaks MCP. See your client's docs for where to register MCP servers.

Note: The MCP server requires the optional mcp package. Install it with pip install 'sitesavvy[mcp]'. If it's missing, sitesavvy mcp prints a helpful error pointing at the install command.

Quick start

Full-site mirror → ZIP

sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out

Text-only crawl → Markdown + EPUB

sitesavvy crawl https://example.com --mode text --format md epub --out-dir ./reader

AI research crawl → Markdown + SQLite + summaries + RAG index

sitesavvy crawl https://example.com \
    --mode text --format md sqlite \
    --summarize --categorize --index \
    --out-dir ./research

Ask a question about a crawled site

sitesavvy ask "what does this site say about pricing?"

ask uses the RAG index built by --index (default location ./sitesavvy.index.db). Pass --index /path/to/index.db and --top-k 10 to customise retrieval.

Preset crawl (one-shot sensible defaults)

sitesavvy crawl https://docusaurus.io --preset docs
sitesavvy crawl https://shop.example.com --preset shop
sitesavvy crawl https://wiki.example.com --preset wiki

See Presets for what each one does.

Dry-run, resume, incremental, headless

# List URLs that would be fetched, without writing anything
sitesavvy crawl https://example.com --dry-run --depth 1

# Resume an interrupted crawl
sitesavvy crawl https://example.com --depth 3 --resume \
    --manifest ./out/manifest.json --out-dir ./out

# Only re-download changed resources
sitesavvy crawl https://example.com --incremental \
    --manifest ./out/manifest.json --out-dir ./out

# Render JavaScript-heavy pages with Playwright
sitesavvy crawl https://spa.example.com --headless --format html
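
The --incremental example relies on HTTP conditional requests: the validators saved from the previous crawl are sent back, and a 304 Not Modified response means the stored copy is still current. A minimal sketch follows, assuming the manifest keeps each page's ETag and Last-Modified values under the hypothetical keys "etag" and "last_modified".

```python
def conditional_headers(entry: dict) -> dict:
    """Build If-None-Match / If-Modified-Since from a stored manifest entry."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def needs_download(status: int) -> bool:
    """A 304 response carries no body, so the local copy is kept."""
    return status != 304


# Hypothetical manifest entry from an earlier crawl.
entry = {
    "url": "https://example.com/page",
    "etag": '"abc123"',
    "last_modified": "Wed, 24 Jun 2026 10:00:00 GMT",
}
print(conditional_headers(entry))
print(needs_download(304), needs_download(200))
```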

v0.6.0: offline-searchable mirror, recipes, auth, stealth

# Download a site AND build a self-contained offline search UI
# → open ./out/search.html in any browser (works from file://)
sitesavvy crawl https://docs.example.com --offline-search --format html --out-dir ./out

# Scrape a recipe site and get a cookbook EPUB
# → ./out/sitesavvy-cookbook.epub (one chapter per recipe)
sitesavvy crawl https://recipes.example.com --recipe-mode --out-dir ./out

# Crawl a site that requires a login
sitesavvy crawl https://private.example.com \
    --login-url https://private.example.com/login \
    --login-user alice --login-pass secret --out-dir ./out

# Crawl through a Tor SOCKS5 proxy + stealth mode
pip install aiohttp_socks   # one-time, for SOCKS support
sitesavvy crawl https://example.onion \
    --proxy socks5://127.0.0.1:9050 --stealth --out-dir ./out

# Docs-site mode: strip sidebars, build a clean TOC for mkdocs/Docusaurus
sitesavvy crawl https://docusaurus.io --docs-mode --mode text --format md --out-dir ./out
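
The login example mentions CSRF detection without saying how it works. One common approach, shown here purely as an assumption about the general technique and not as the contents of auth.py, is to scan the login form for a hidden input whose name looks like a token and send it back with the credentials. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() == "hidden" and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def find_csrf_field(html: str) -> Optional[tuple[str, str]]:
    """Return (name, value) of the first hidden field that looks like a CSRF token."""
    parser = HiddenInputParser()
    parser.feed(html)
    for name, value in parser.hidden.items():
        lowered = name.lower()
        if "csrf" in lowered or "token" in lowered:
            return name, value
    return None


LOGIN_PAGE = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
print(find_csrf_field(LOGIN_PAGE))
```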

Command reference

Global options

Flag	Description
`--version`	Print the SiteSavvy version and exit.
`--verbose` / `-v`	Enable debug logging.

`sitesavvy crawl` — the main crawler

Flag	Default	Description
`url` (positional)	—	Starting URL to crawl (`http` / `https`).
`--depth INT`	`0`	Max link depth (`0` = unlimited).
`--mode {full,text}`	`full`	Full-site download or text-only extraction.
`--format …`	`html`	Output format, repeatable: `html md txt pdf epub zip sqlite warc obsidian`.
`--out-dir PATH`	CWD	Destination folder.
`--concurrency N`	`4`	Simultaneous HTTP requests.
`--user-agent STR`	browser-like	Custom `User-Agent` header.
`--respect-robots` / `--no-respect-robots`	on	Obey `robots.txt`.
`--delay SECS`	`0.5`	Polite delay between same-host requests.
`--resume`	off	Skip URLs already completed in the manifest.
`--manifest FILE`	`<out-dir>/manifest.json`	Manifest path.
`--dry-run`	off	List URLs that would be fetched.
`--headless`	off	Render JS pages with Playwright.
`--rate-limit {auto,fixed}`	`auto`	Back off on 429/5xx, or use fixed delay.
`--download-types …`	all	Comma-separated: `html,css,js,img,pdf,other`.
`--incremental`	off	Re-download only changed resources (conditional GET).
`--external`	off	Follow cross-domain links.
`--force`	off	Proceed even if `robots.txt` disallows the start URL.
`--timeout SECS`	`30`	Per-request timeout.
`--include PATTERN`	—	URL pattern to include (repeatable; glob with `*` / `**` or `re:<regex>`).
`--exclude PATTERN`	—	URL pattern to exclude (repeatable; glob or `re:<regex>`).
`--scope SELECTOR`	—	CSS selector restricting link discovery + content extraction.
`--max-pages N`	—	Stop after this many pages (budget).
`--max-bytes N`	—	Stop after downloading this many bytes (budget).
`--max-time SECS`	—	Stop after this many seconds (budget).
`--proxy URL`	—	Proxy URL (`http://`, `https://`, `socks5://`).
`--screenshots`	off	Capture full-page PNGs (headless).
`--archive`	off	Submit every page to the Wayback Machine.
`--ai-extract`	off	Use an LLM to extract main content (text mode).
`--summarize`	off	Generate per-page summaries + a site digest.
`--categorize`	off	Tag each page with an AI category.
`--structured`	off	Emit JSON-LD / Open Graph / table sidecars.
`--sitemap`	off	Seed URLs from `sitemap.xml` (incl. sitemap indexes).
`--index`	off	Build a RAG index for `sitesavvy ask`.
`--report`	off	Write a self-contained `crawl-report.html`.
`--config FILE`	—	Path to a `sitesavvy.toml` config file.
`--profile NAME`	—	Named profile from the config file.
`--preset NAME`	—	Built-in preset: `docs` / `blog` / `wiki` / `shop` / `archive` / `research`.
`--follow-pagination` / `--no-pagination`	on	Follow `rel="next"` / `?page=N` without consuming depth budget.
`--login-url URL`	—	Login form URL for authenticated crawling.
`--login-user STR`	—	Username for the login form.
`--login-pass STR`	—	Password for the login form.
`--stealth`	off	Rotate User-Agent, jitter timing, and use realistic header ordering.
`--docs-mode`	off	Docs-site-aware extraction (strips sidebars, builds TOC).
`--recipe-mode`	off	Collect schema.org Recipes into a `sitesavvy-cookbook.epub`.
`--offline-search`	off	Build a self-contained `search.html` + `index.json` for offline search.
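
The three budget flags in the table (--max-pages, --max-bytes, --max-time) all follow the same pattern: count what has been used and stop once a limit is reached. The class below sketches that bookkeeping. It is an illustration only; the Budget name, its fields and the returned labels are assumptions rather than the contents of budgets.py.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Budget:
    max_pages: Optional[int] = None
    max_bytes: Optional[int] = None
    max_time: Optional[float] = None  # seconds
    pages: int = 0
    bytes_downloaded: int = 0
    started: float = field(default_factory=time.monotonic)

    def record(self, size: int) -> None:
        """Account for one fetched page of `size` bytes."""
        self.pages += 1
        self.bytes_downloaded += size

    def exhausted(self, now: Optional[float] = None) -> Optional[str]:
        """Return the name of the first limit reached, or None to keep crawling."""
        now = time.monotonic() if now is None else now
        if self.max_pages is not None and self.pages >= self.max_pages:
            return "max-pages"
        if self.max_bytes is not None and self.bytes_downloaded >= self.max_bytes:
            return "max-bytes"
        if self.max_time is not None and now - self.started >= self.max_time:
            return "max-time"
        return None


budget = Budget(max_pages=2)
budget.record(1_000)
budget.record(2_500)
print(budget.exhausted())
```

A crawler loop would call exhausted() before scheduling each request and, on a non-None result, stop and flush the manifest so the run can be resumed later.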

`sitesavvy ask` — RAG question-answering

sitesavvy ask "what does this site say about pricing?" [--index PATH] [--top-k N]

Flag	Default	Description
`question` (positional)	—	Question to answer from the crawled mirror.
`--index PATH`	`./sitesavvy.index.db`	Path to the RAG index built by `crawl --index`.
`--top-k N`	`5`	Number of pages to retrieve and feed to the LLM.

`sitesavvy mcp` — MCP server

sitesavvy mcp

Starts the Model Context Protocol server over stdio. No flags. Requires the optional mcp package (pip install 'sitesavvy[mcp]'). See MCP server above.

`sitesavvy new` — interactive wizard

sitesavvy new [--config PATH]

Flag	Default	Description
`--config PATH`	—	Write a `sitesavvy.toml` config file instead of printing a command.

Walks you through a few prompts (URL, mode, formats, scope, budgets, AI flags) and either prints a ready-to-run crawl command or writes a sitesavvy.toml.

`sitesavvy diff` — compare two crawls

sitesavvy diff <old-manifest.json> <new-manifest.json> <old-dir> <new-dir> [--output PATH]

Flag	Default	Description
`old` (positional)	—	Path to the old crawl's `manifest.json`.
`new` (positional)	—	Path to the new crawl's `manifest.json`.
`old_dir` (positional)	—	Old crawl's output directory.
`new_dir` (positional)	—	New crawl's output directory.
`--output PATH`	—	Write the Markdown diff report to this path (default: stdout).

Reports added, removed and changed pages between two crawls, including a diff of the page bodies for changed pages.
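
The comparison itself is a set difference over URLs plus a text diff for pages present in both crawls. Here is a minimal sketch, assuming each crawl has already been loaded into a dict that maps URL to page body; the real manifest format is not described in this README, and the function names are hypothetical.

```python
import difflib


def diff_crawls(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify URLs as added, removed or changed between two crawls."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(url for url in old.keys() & new.keys() if old[url] != new[url])
    return {"added": added, "removed": removed, "changed": changed}


def body_diff(old_body: str, new_body: str, url: str) -> str:
    """Unified diff of one changed page's body."""
    lines = difflib.unified_diff(
        old_body.splitlines(),
        new_body.splitlines(),
        fromfile=f"old/{url}",
        tofile=f"new/{url}",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical page bodies from two crawls.
old = {"https://example.com/a": "one\ntwo", "https://example.com/b": "same"}
new = {"https://example.com/a": "one\nthree", "https://example.com/b": "same", "https://example.com/c": "new"}
report = diff_crawls(old, new)
print(report)
print(body_diff(old["https://example.com/a"], new["https://example.com/a"], "a"))
```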

`sitesavvy list-presets` — list built-in presets

sitesavvy list-presets

No flags. Prints a table of the six built-in presets and their use cases.

`sitesavvy init-config` — write an example config

sitesavvy init-config [--output PATH]

Flag	Default	Description
`--output PATH`	`sitesavvy.toml`	Where to write the example config.

`sitesavvy legal` and `sitesavvy info`

sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show installed backends + AI configuration status

Presets

Six built-in presets cover the common crawl shapes. Pass --preset <name> and SiteSavvy fills in the flags for you. A preset can be combined with --profile from your sitesavvy.toml (the profile overrides the preset).

Preset	Mode	Formats	Highlights	Use case
`docs`	`text`	`md`, `pdf`	`scope=article`, `depth=3`, `delay=0.2`	Documentation sites (Docusaurus, MkDocs, Sphinx).
`blog`	`text`	`md`	`include=/blog/*`, excludes pagination / tags / categories	A blog archive for offline reading.
`wiki`	`text`	`md`, `epub`	`depth=0` (unlimited), `delay=0.3`	Mirror a wiki into an EPUB for an e-reader.
`shop`	`full`	`html`, `zip`	`include=/product/`, `/products/`, excludes cart / checkout / account	Archive an e-commerce catalogue.
`archive`	`full`	`html`, `warc`, `zip`	`depth=0`, `respect_robots=true`	Long-term archival — WARC for replay, ZIP for sharing.
`research`	`text`	`md`, `sqlite`	`summarize=true`, `depth=0`, `delay=0`	Research crawl with AI summaries + a queryable SQLite store.

Example: combine a preset with the --index flag to also build a RAG index for ask:

sitesavvy crawl https://docs.example.com --preset docs --index
sitesavvy ask "how do I configure authentication?"

Output formats

The format matrix in Features lists all nine formats and their backends. A few notes:

html preserves the original byte stream and the site's directory hierarchy, so a full crawl can be served verbatim from disk.
zip packages the entire crawl (any combination of other formats) into a single archive for easy sharing.
sqlite stores one row per page with URL, title, extracted text, metadata and (when --index is set) embedding vectors — perfect for downstream analysis or for sitesavvy ask.
warc writes an ISO 28500:2017 archive that opens in replayweb.page and any Web Archive player.
obsidian exports a Markdown vault with YAML frontmatter and [[wikilinks]] between pages, ready to drop into an Obsidian vault.

Sample Markdown output:

---
url: https://example.com/page
title: Page Title
category: tutorial
summary: A short paragraph summarizing the page.
---

# Page Title

## A heading

Some paragraph text with a [link](https://example.com/other).
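
A note with frontmatter like the sample above could be rendered as follows. This is a sketch, not the exporter's code: render_note is a hypothetical helper, and it quotes every value (unlike the sample) so that titles containing colons stay valid YAML.

```python
import json
from typing import Optional


def render_note(
    url: str,
    title: str,
    body: str,
    category: Optional[str] = None,
    summary: Optional[str] = None,
) -> str:
    """Render a Markdown page with a YAML frontmatter block."""
    fields = {"url": url, "title": title, "category": category, "summary": summary}
    lines = ["---"]
    for key, value in fields.items():
        if value is not None:
            # A JSON string is also a valid double-quoted YAML scalar.
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines += ["---", "", f"# {title}", "", body.strip(), ""]
    return "\n".join(lines)


note = render_note(
    "https://example.com/page",
    "Page Title",
    "Some paragraph text.",
    category="tutorial",
)
print(note)
```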

Architecture

sitesavvy/
├── __init__.py            # package metadata
├── __main__.py            # python -m sitesavvy
├── __about__.py           # version
├── config.py              # CrawlConfig + enums (CrawlMode, OutputFormat, ...)
├── models.py              # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py           # normalisation, link extraction, path mapping
├── robots.py              # async robots.txt (reppy or stdlib fallback)
├── conversions.py         # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py            # resume / incremental state
├── headless.py            # Playwright fetcher
├── crawler.py             # the Crawler engine (orchestrates everything)
├── legal.py               # disclaimer text
├── cli.py                 # Typer + Rich CLI (crawl, ask, mcp, new, diff, ...)
├── main.py                # console-script entry point
├── ai.py                  # LLM client + LLMConfig (OpenAI-compatible)
├── rag.py                 # SQLite vector store + cosine similarity search
├── mcp_server.py          # MCP server exposing 6 tools over stdio
├── feeds.py               # RSS / Atom feed discovery + seeding
├── patterns.py            # glob + regex URL pattern matching
├── structured.py          # JSON-LD / Open Graph / table sidecar extraction
├── warc.py                # ISO 28500:2017 WARC writer
├── sqlite_export.py       # SQLite exporter (rows + embeddings)
├── report.py              # self-contained HTML crawl report
├── config_file.py         # sitesavvy.toml parsing + 6 built-in presets
├── budgets.py             # page / byte / time budget enforcement
├── wayback.py             # Wayback Machine submission
├── scope.py               # CSS selector scoping for discovery + extraction
├── screenshots.py         # full-page PNG capture (headless)
├── diff.py                # cross-crawl added/removed/changed diff
├── obsidian.py            # Obsidian vault exporter (wikilinks + frontmatter)
├── wizard.py              # interactive `sitesavvy new` wizard
├── pagination.py          # rel="next" / ?page=N awareness (no depth cost)
├── auth.py                # form-based login + CSRF detection
├── proxies.py             # http/https/socks5 connector builder
├── stealth.py             # UA rotation + header jitter + timing jitter
├── recipe.py              # schema.org Recipe → cookbook EPUB
├── docs_mode.py           # docs-site-aware extraction (mkdocs/Docusaurus/Sphinx)
└── offline_search.py      # self-contained search.html + index.json

Networking layer: aiohttp (primary) with an optional Playwright headless browser for JS-rendered pages, and httpx as the LLM / embeddings client. HTML parsing uses beautifulsoup4 + lxml. robots.txt is parsed with reppy when available, otherwise with the stdlib urllib.robotparser.
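
The stdlib fallback mentioned above is easy to try out on its own. The snippet below feeds a small, made-up robots.txt to urllib.robotparser and checks two URLs; the "SiteSavvy" user-agent string and the build_parser helper are assumptions for the example, and how robots.py wires this into the async crawler is not shown.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content.
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(ROBOTS)
allowed = parser.can_fetch("SiteSavvy", "https://example.com/docs/")
blocked = parser.can_fetch("SiteSavvy", "https://example.com/private/data")
delay = parser.crawl_delay("SiteSavvy")
print(allowed, blocked, delay)
```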

See the Architecture docs for a deeper dive, including a Mermaid flow diagram of a crawl.

Troubleshooting

HTTP 429 Too Many Requests — lower --concurrency, raise --delay, and keep --rate-limit auto (default) so SiteSavvy backs off automatically.
Large sites — set --depth to bound the crawl, run with --dry-run first to estimate scope, and use --resume so an interruption doesn't waste work. --max-pages, --max-bytes and --max-time add hard budgets that stop the crawl cleanly and leave a resumable manifest behind.
PDF export fails — WeasyPrint needs Pango/Cairo system libraries. On Debian/Ubuntu: apt install libpango-1.0-0 libpangoft2-1.0-0. On macOS: brew install pango. The other formats keep working even if PDF is missing.
Headless mode crashes — run playwright install chromium once after installing the package. Without it, SiteSavvy transparently falls back to aiohttp.
robots.txt disallows … — by default SiteSavvy honours robots.txt. Add --force only if you have permission and accept responsibility.
AI features silently skipped / No LLM API key set — export SITESAVVY_LLM_API_KEY (or set it in your shell profile). Run sitesavvy info to confirm the key is detected. AI flags are no-ops without a key; the crawl itself still completes normally.
MCP server failed to start — the optional mcp package is missing. Install it with pip install 'sitesavvy[mcp]' and try again.
Proxy issues — --proxy accepts http://, https:// and socks5:// URLs. For SOCKS proxies, make sure the optional aiohttp_socks package is installed (pip install aiohttp_socks). If requests time out, check that the proxy allows CONNECT to port 443.
Windows: long path errors — enable long paths (New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1) or use a shorter --out-dir.
command not found: sitesavvy — make sure the bin / Scripts directory that pip installs into is on your PATH, or use python -m sitesavvy.

Legal & ethics

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. AI features send extracted page content to your configured LLM provider — make sure you have the right to do so for the sites you crawl. The authors assume no liability for misuse. Run sitesavvy legal to read the full disclaimer. Licensed under the MIT License.

Contributing

Pull requests are welcome! Please run the full check suite before submitting:

ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing

Coverage must stay at or above 90 %. See the Developer Guide for the project layout, release process and binary-building instructions.

Release history

Version	Released
0.6.0 (this version)	Jun 24, 2026
0.5.0	Jun 23, 2026
0.1.0	Jun 22, 2026

Download files

File	Type	Size	Uploaded	Uploaded via	Trusted Publishing
sitesavvy-0.6.0.tar.gz	Source distribution	104.1 kB	Jun 24, 2026	twine/6.2.0 CPython/3.12.13	No
sitesavvy-0.6.0-py3-none-any.whl	Built distribution (Python 3)	104.8 kB	Jun 24, 2026	twine/6.2.0 CPython/3.12.13	No

Hashes for sitesavvy-0.6.0.tar.gz

Algorithm	Hash digest
SHA256	`f12940f00cf2271492d2cc529ff068f826589a4881cd04cf6a7f003ad08d44f1`
MD5	`32de9e93905c55a5be2b4152746be88f`
BLAKE2b-256	`8361028029241b58e99ab8404f76b16223816b957cbcf4d2087de17c38441afc`

Hashes for sitesavvy-0.6.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`2c9b575c847e427f84982f83f556f4c95acd60b94bfa72389865884f2c2e5754`
MD5	`b34026be644c4a4c883f1f1d1c6128da`
BLAKE2b-256	`3861e19554e8e42fb1168ad8404cff08040e9b61010b3d224a6fa5a21ac9d9bf`
