Skip to main content

Capture the web, your way. A modern, async, cross-platform web scraper.

Project description

SiteSavvy

PyPI version PyPI downloads Python versions CI Coverage License: MIT Release

Capture the web, your way.

SiteSavvy is a modern, async, cross-platform web scraper that mirrors entire sites or extracts their readable text โ€” and now also an AI-powered research tool. Beyond the original HTML / Markdown / text / PDF / EPUB / ZIP exports, v0.5.0 adds LLM content extraction, per-page summaries and auto-categorization, RAG question-answering (sitesavvy ask "..."), an MCP server that exposes crawl / search / ask to Claude, Cursor and VS Code Copilot, and nine output formats including SQLite, WARC and Obsidian vaults. Point it at a docs site, a blog, a shop or a whole wiki, and SiteSavvy will quietly fetch, parse, summarize, index and answer questions about it โ€” politely, resumably, and on your laptop.

pip install sitesavvy

A basic crawl:

sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out

An AI-powered research crawl that extracts clean text, summarizes every page and builds a RAG index you can ask questions of:

sitesavvy crawl https://example.com --mode text --format md sqlite \
    --summarize --categorize --index --out-dir ./research

Then ask a question in natural language:

sitesavvy ask "what does this site say about pricing?"

What's new in v0.6.0

Note: v0.6.0 completes the feature set with 7 new modules covering pagination, authentication, proxy/Tor, stealth, recipes, docs-site mode, and offline full-text search.

  • ๐Ÿ“„ Pagination awareness โ€” follows rel="next" / ?page=N without consuming the depth budget
  • ๐Ÿ” Authenticated crawling โ€” --login-url / --login-user / --login-pass with CSRF detection
  • ๐ŸŒ Proxy / Tor / SOCKS5 โ€” --proxy http://... or socks5://... (SOCKS via optional aiohttp_socks)
  • ๐Ÿฅธ Stealth mode โ€” --stealth rotates UA from 15 real browser strings, jitters timing, realistic headers
  • ๐Ÿณ Recipe mode โ€” --recipe-mode detects schema.org Recipes and builds a sitesavvy-cookbook.epub
  • ๐Ÿ“š Docs-site mode โ€” --docs-mode strips sidebars/nav for mkdocs / Docusaurus / Sphinx / ReadTheDocs, builds a TOC
  • ๐Ÿ” Offline search โ€” --offline-search builds a self-contained search.html + index.json (works from file://)

See the CHANGELOG for the full list.


What's new in v0.5.0

Note: v0.5.0 is a major release that turns SiteSavvy from a scraper into an AI-powered research assistant. The headline additions:

  • ๐Ÿค– AI: LLM content extraction, per-page summaries, auto-categorization
  • ๐Ÿ’ฌ RAG: sitesavvy ask "..." โ€” ask questions about crawled sites
  • ๐Ÿ”Œ MCP server: expose crawl / search / ask to Claude, Cursor, VS Code Copilot
  • ๐Ÿ“‹ 9 output formats: html, md, txt, pdf, epub, zip, sqlite, warc, obsidian
  • ๐ŸŽฏ URL patterns (--include / --exclude), CSS scope, sitemap seeding
  • ๐Ÿ“Š Budgets (--max-pages / --max-bytes / --max-time), HTML reports
  • ๐Ÿ”„ Content diff between crawls, Wayback Machine archiving
  • โš™๏ธ Config files + 6 presets (docs / blog / wiki / shop / archive / research)

See the CHANGELOG for the full list, and the Architecture section below for the new module layout.


Features

Crawl modes

  • full โ€” recursively download every reachable resource (HTML, CSS, JS, images, PDFs, fonts, โ€ฆ) preserving the original directory hierarchy.
  • text โ€” extract the readable text from each HTML page (strips scripts, navigation, ads) and store it in your chosen format.

Output formats

Nine formats, repeatable via --format:

Format Mode full Mode text Backend
html original bytes, hierarchy preserved โ€” built-in
md โ€” markdownify (ATX headings, links absolute) markdownify
txt โ€” html2text (no hard wrap) html2text
pdf โ€” WeasyPrint weasyprint
epub โ€” ebooklib, one chapter per page ebooklib
zip archive of the whole crawl archive of the whole crawl zipfile
sqlite one row per page (URL, title, text, meta) one row per page + embeddings sqlite3
warc ISO 28500:2017 archive (replayweb.page-compatible) โ€” built-in
obsidian โ€” Markdown vault with [[wikilinks]] + frontmatter built-in

AI & intelligence

  • LLM content extraction (--ai-extract) โ€” let an LLM pull the main article out of cluttered pages in text mode.
  • Per-page summaries + site digest (--summarize) โ€” every page gets a one-paragraph summary, plus a top-level digest.md overview.
  • Auto-categorization (--categorize) โ€” each page is tagged with an AI-derived category (e.g. tutorial, pricing, API reference).
  • RAG question-answering (sitesavvy ask "...") โ€” semantic search over a SQLite vector store (cosine similarity) plus an LLM that synthesizes an answer from the top-k retrieved chunks, with source URLs.

MCP server

sitesavvy mcp starts a Model Context Protocol server (stdio transport) that exposes SiteSavvy to AI assistants. Six tools are available:

Tool What it does
crawl Run a crawl with the same options as the CLI.
list_pages List pages in a finished crawl's manifest.
search Full-text search over a crawled mirror.
get_page Fetch the body of a single page by URL.
ask RAG question-answering over a crawled mirror.
info Report installed backends and AI configuration.

Configure it once in your client (see MCP server below) and your assistant can crawl, read and reason about the web without leaving the chat.

Scraping power

  • Sitemap seeding (--sitemap) โ€” discover and parse sitemap.xml, including sitemap indexes.
  • RSS / Atom feed discovery (--feeds) โ€” seed URLs from feed entries.
  • URL pattern filtering โ€” --include / --exclude accept globs (* / **) or re:<regex> patterns; the start URL is always allowed.
  • CSS scope (--scope "main") โ€” restricts both link discovery and content extraction to a subtree.
  • Budgets โ€” --max-pages, --max-bytes, --max-time stop crawls cleanly and leave a resumable manifest behind.
  • Proxy support โ€” --proxy http://host:port or socks5://host:port.
  • Screenshots (--screenshots) โ€” capture full-page PNGs in headless mode.
  • Headless rendering (--headless) via Playwright (falls back to aiohttp automatically when no browser binary is installed).

Politeness

  • Robots.txt compliance by default, with --force override.
  • Per-host delay (--delay) and auto-throttle on 429 / 5xx (--rate-limit auto, the default) with exponential back-off plus jitter.
  • Resume (--resume) โ€” skip URLs already completed in the manifest.
  • Incremental (--incremental) โ€” re-download only changed resources via conditional GETs (ETag / Last-Modified, 304 Not Modified).
  • External-link gating โ€” stays on the start host unless you pass --external.

Config & UX

  • Config files โ€” sitesavvy.toml with [default] and [profiles.<name>] sections; --config and --profile flags load them.
  • 6 built-in presets โ€” --preset docs|blog|wiki|shop|archive|research (see Presets).
  • Interactive wizard โ€” sitesavvy new walks you through a few prompts and prints a ready-to-run crawl command (or writes a sitesavvy.toml).
  • sitesavvy init-config โ€” writes an example sitesavvy.toml.
  • sitesavvy list-presets โ€” lists available presets.
  • HTML report (--report) โ€” writes a self-contained crawl-report.html summarizing URLs fetched, failures, formats produced and AI summaries.
  • Content diff (sitesavvy diff <old> <new> <old-dir> <new-dir>) โ€” compares two crawls and reports added / removed / changed pages as Markdown.
  • Wayback Machine (--archive) โ€” submits every fetched page to web.archive.org (fire-and-forget).
  • Rich CLI with progress tables, coloured output and --verbose / -v debug logging.

Installation

Option 1 โ€” From PyPI (recommended)

pip install sitesavvy

For the MCP server, install the optional extra:

pip install 'sitesavvy[mcp]'

Verify the install:

sitesavvy --version
sitesavvy info   # show which optional backends are installed + AI config

Option 2 โ€” Stand-alone binary (no Python required)

Download the right archive for your OS from the latest release, extract it, and run:

OS Asset How to run
Linux (x86_64) sitesavvy-0.5.0-linux-x86_64.tar.gz tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help
macOS (x86_64) sitesavvy-0.5.0-macos-x86_64.tar.gz tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help
Windows (x86_64) sitesavvy-0.5.0-windows-x86_64.exe Double-click, or run sitesavvy.exe --help in PowerShell

These are single-file PyInstaller executables โ€” no Python installation needed.

Option 3 โ€” From source (development)

git clone https://github.com/Bloody-Crow/SiteSavvy.git
cd SiteSavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless / --screenshots

A plain pip install -r requirements.txt is also supported if you prefer to skip the PEP 517 build.


AI configuration

AI features (extraction, summaries, categorization, RAG ask) talk to any OpenAI-compatible endpoint: OpenAI itself, Ollama, vLLM, LM Studio, Groq, Together, OpenRouter, and any other server that implements the /chat/completions and /embeddings routes.

Configure via environment variables:

Variable Default Purpose
SITESAVVY_LLM_BASE_URL https://api.openai.com/v1 OpenAI-compatible API base URL.
SITESAVVY_LLM_API_KEY (empty) API key. Required for AI features.
SITESAVVY_LLM_MODEL gpt-4o-mini Chat model for extraction / summaries / ask.
SITESAVVY_EMBED_MODEL text-embedding-3-small Embedding model for the RAG index.

Example โ€” OpenAI:

export SITESAVVY_LLM_API_KEY="sk-..."
sitesavvy crawl https://example.com --mode text --format md --summarize --index

Example โ€” local Ollama (no API key needed):

export SITESAVVY_LLM_BASE_URL="http://localhost:11434/v1"
export SITESAVVY_LLM_API_KEY="ollama"   # any non-empty string
export SITESAVVY_LLM_MODEL="llama3.1"
export SITESAVVY_EMBED_MODEL="nomic-embed-text"
sitesavvy crawl https://example.com --mode text --summarize --index

Note: AI features are strictly opt-in. SiteSavvy never makes network calls to an LLM provider unless you explicitly pass --ai-extract, --summarize, --categorize or --index, or invoke sitesavvy ask. Without an API key the AI flags are silently skipped and the crawl completes normally โ€” see Troubleshooting.

Run sitesavvy info at any time to see the configured base URL, model, and whether an API key is set.


MCP server

sitesavvy mcp runs SiteSavvy as a Model Context Protocol server over stdio, exposing the six tools listed in Features to any MCP-compatible client.

Claude Desktop

Add an entry to your Claude Desktop config (macOS: ~/Library/Application Support/Claude/claude_desktop_config.json, Windows: %APPDATA%\Claude\claude_desktop_config.json):

{
  "mcpServers": {
    "sitesavvy": {
      "command": "sitesavvy",
      "args": ["mcp"]
    }
  }
}

Restart Claude Desktop. You can now ask things like "crawl https://docs.example.com and tell me how their auth flow works" and Claude will call the crawl and ask tools for you.

Cursor, VS Code Copilot, and other MCP clients

The same command: sitesavvy, args: ["mcp"] snippet works in any client that speaks MCP. See your client's docs for where to register MCP servers.

Note: The MCP server requires the optional mcp package. Install it with pip install 'sitesavvy[mcp]'. If it's missing, sitesavvy mcp prints a helpful error pointing at the install command.


Quick start

Full-site mirror โ†’ ZIP

sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out

Text-only crawl โ†’ Markdown + EPUB

sitesavvy crawl https://example.com --mode text --format md epub --out-dir ./reader

AI research crawl โ†’ Markdown + SQLite + summaries + RAG index

sitesavvy crawl https://example.com \
    --mode text --format md sqlite \
    --summarize --categorize --index \
    --out-dir ./research

Ask a question about a crawled site

sitesavvy ask "what does this site say about pricing?"

ask uses the RAG index built by --index (default location ./sitesavvy.index.db). Pass --index /path/to/index.db and --top-k 10 to customise retrieval.

Preset crawl (one-shot sensible defaults)

sitesavvy crawl https://docusaurus.io --preset docs
sitesavvy crawl https://shop.example.com --preset shop
sitesavvy crawl https://wiki.example.com --preset wiki

See Presets for what each one does.

Dry-run, resume, incremental, headless

# List URLs that would be fetched, without writing anything
sitesavvy crawl https://example.com --dry-run --depth 1

# Resume an interrupted crawl
sitesavvy crawl https://example.com --depth 3 --resume \
    --manifest ./out/manifest.json --out-dir ./out

# Only re-download changed resources
sitesavvy crawl https://example.com --incremental \
    --manifest ./out/manifest.json --out-dir ./out

# Render JavaScript-heavy pages with Playwright
sitesavvy crawl https://spa.example.com --headless --format html

v0.6.0: offline-searchable mirror, recipes, auth, stealth

# Download a site AND build a self-contained offline search UI
# โ†’ open ./out/search.html in any browser (works from file://)
sitesavvy crawl https://docs.example.com --offline-search --format html --out-dir ./out

# Scrape a recipe site and get a cookbook EPUB
# โ†’ ./out/sitesavvy-cookbook.epub (one chapter per recipe)
sitesavvy crawl https://recipes.example.com --recipe-mode --out-dir ./out

# Crawl a site that requires a login
sitesavvy crawl https://private.example.com \
    --login-url https://private.example.com/login \
    --login-user alice --login-pass secret --out-dir ./out

# Crawl through a Tor SOCKS5 proxy + stealth mode
pip install aiohttp_socks   # one-time, for SOCKS support
sitesavvy crawl https://example.onion \
    --proxy socks5://127.0.0.1:9050 --stealth --out-dir ./out

# Docs-site mode: strip sidebars, build a clean TOC for mkdocs/Docusaurus
sitesavvy crawl https://docusaurus.io --docs-mode --mode text --format md --out-dir ./out

Command reference

Global options

Flag Description
--version Print the SiteSavvy version and exit.
--verbose / -v Enable debug logging.

sitesavvy crawl โ€” the main crawler

Flag Default Description
url (positional) โ€” Starting URL to crawl (http / https).
--depth INT 0 Max link depth (0 = unlimited).
--mode {full,text} full Full-site download or text-only extraction.
--format โ€ฆ html Output format, repeatable: html md txt pdf epub zip sqlite warc obsidian.
--out-dir PATH CWD Destination folder.
--concurrency N 4 Simultaneous HTTP requests.
--user-agent STR browser-like Custom User-Agent header.
--respect-robots / --no-respect-robots on Obey robots.txt.
--delay SECS 0.5 Polite delay between same-host requests.
--resume off Skip URLs already completed in the manifest.
--manifest FILE <out-dir>/manifest.json Manifest path.
--dry-run off List URLs that would be fetched.
--headless off Render JS pages with Playwright.
--rate-limit {auto,fixed} auto Back off on 429/5xx, or use fixed delay.
--download-types โ€ฆ all Comma-separated: html,css,js,img,pdf,other.
--incremental off Re-download only changed resources (conditional GET).
--external off Follow cross-domain links.
--force off Proceed even if robots.txt disallows the start URL.
--timeout SECS 30 Per-request timeout.
--include PATTERN โ€” URL pattern to include (repeatable; glob with * / ** or re:<regex>).
--exclude PATTERN โ€” URL pattern to exclude (repeatable; glob or re:<regex>).
--scope SELECTOR โ€” CSS selector restricting link discovery + content extraction.
--max-pages N โ€” Stop after this many pages (budget).
--max-bytes N โ€” Stop after downloading this many bytes (budget).
--max-time SECS โ€” Stop after this many seconds (budget).
--proxy URL โ€” Proxy URL (http://, https://, socks5://).
--screenshots off Capture full-page PNGs (headless).
--archive off Submit every page to the Wayback Machine.
--ai-extract off Use an LLM to extract main content (text mode).
--summarize off Generate per-page summaries + a site digest.
--categorize off Tag each page with an AI category.
--structured off Emit JSON-LD / Open Graph / table sidecars.
--sitemap off Seed URLs from sitemap.xml (incl. sitemap indexes).
--index off Build a RAG index for sitesavvy ask.
--report off Write a self-contained crawl-report.html.
--config FILE โ€” Path to a sitesavvy.toml config file.
--profile NAME โ€” Named profile from the config file.
--preset NAME โ€” Built-in preset: docs / blog / wiki / shop / archive / research.
--follow-pagination / --no-pagination on Follow rel="next" / ?page=N without consuming depth budget.
--login-url URL โ€” Login form URL for authenticated crawling.
--login-user STR โ€” Username for the login form.
--login-pass STR โ€” Password for the login form.
--stealth off Rotate User-Agent, jitter timing, realistic header ordering.
--docs-mode off Docs-site-aware extraction (strips sidebars, builds TOC).
--recipe-mode off Collect schema.org Recipes into a sitesavvy-cookbook.epub.
--offline-search off Build a self-contained search.html + index.json for offline search.

sitesavvy ask โ€” RAG question-answering

sitesavvy ask "what does this site say about pricing?" [--index PATH] [--top-k N]
Flag Default Description
question (positional) โ€” Question to answer from the crawled mirror.
--index PATH ./sitesavvy.index.db Path to the RAG index built by crawl --index.
--top-k N 5 Number of pages to retrieve and feed to the LLM.

sitesavvy mcp โ€” MCP server

sitesavvy mcp

Starts the Model Context Protocol server over stdio. No flags. Requires the optional mcp package (pip install 'sitesavvy[mcp]'). See MCP server above.

sitesavvy new โ€” interactive wizard

sitesavvy new [--config PATH]
Flag Default Description
--config PATH โ€” Write a sitesavvy.toml config file instead of printing a command.

Walks you through a few prompts (URL, mode, formats, scope, budgets, AI flags) and either prints a ready-to-run crawl command or writes a sitesavvy.toml.

sitesavvy diff โ€” compare two crawls

sitesavvy diff <old-manifest.json> <new-manifest.json> <old-dir> <new-dir> [--output PATH]
Flag Default Description
old (positional) โ€” Path to the old crawl's manifest.json.
new (positional) โ€” Path to the new crawl's manifest.json.
old_dir (positional) โ€” Old crawl's output directory.
new_dir (positional) โ€” New crawl's output directory.
--output PATH โ€” Write the Markdown diff report to this path (default: stdout).

Reports added, removed and changed pages between two crawls, including a diff of the page bodies for changed pages.

sitesavvy list-presets โ€” list built-in presets

sitesavvy list-presets

No flags. Prints a table of the six built-in presets and their use cases.

sitesavvy init-config โ€” write an example config

sitesavvy init-config [--output PATH]
Flag Default Description
--output PATH sitesavvy.toml Where to write the example config.

sitesavvy legal and sitesavvy info

sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show installed backends + AI configuration status

Presets

Six built-in presets cover the common crawl shapes. Pass --preset <name> and SiteSavvy fills in the flags for you. A preset can be combined with --profile from your sitesavvy.toml (the profile overrides the preset).

Preset Mode Formats Highlights Use case
docs text md, pdf scope=article, depth=3, delay=0.2 Documentation sites (Docusaurus, MkDocs, Sphinx).
blog text md include=/blog/*, excludes pagination / tags / categories A blog archive for offline reading.
wiki text md, epub depth=0 (unlimited), delay=0.3 Mirror a wiki into an EPUB for an e-reader.
shop full html, zip include=/product/*, /products/*, excludes cart / checkout / account Archive an e-commerce catalogue.
archive full html, warc, zip depth=0, respect_robots=true Long-term archival โ€” WARC for replay, ZIP for sharing.
research text md, sqlite summarize=true, depth=0, delay=0 Research crawl with AI summaries + a queryable SQLite store.

Example: combine a preset with the --index flag to also build a RAG index for ask:

sitesavvy crawl https://docs.example.com --preset docs --index
sitesavvy ask "how do I configure authentication?"

Output formats

The format matrix in Features lists all nine formats and their backends. A few notes:

  • html preserves the original byte stream and the site's directory hierarchy, so a full crawl can be served verbatim from disk.
  • zip packages the entire crawl (any combination of other formats) into a single archive for easy sharing.
  • sqlite stores one row per page with URL, title, extracted text, metadata and (when --index is set) embedding vectors โ€” perfect for downstream analysis or for sitesavvy ask.
  • warc writes an ISO 28500:2017 archive that opens in replayweb.page and any Web Archive player.
  • obsidian exports a Markdown vault with YAML frontmatter and [[wikilinks]] between pages, ready to drop into an Obsidian vault.

Sample Markdown output:

---
url: https://example.com/page
title: Page Title
category: tutorial
summary: A short paragraph summarizing the page.
---

# Page Title

## A heading

Some paragraph text with a [link](https://example.com/other).

Architecture

sitesavvy/
โ”œโ”€โ”€ __init__.py            # package metadata
โ”œโ”€โ”€ __main__.py            # python -m sitesavvy
โ”œโ”€โ”€ __about__.py           # version
โ”œโ”€โ”€ config.py              # CrawlConfig + enums (CrawlMode, OutputFormat, ...)
โ”œโ”€โ”€ models.py              # CrawlItem, FetchResult, ManifestEntry
โ”œโ”€โ”€ url_utils.py           # normalisation, link extraction, path mapping
โ”œโ”€โ”€ robots.py              # async robots.txt (reppy or stdlib fallback)
โ”œโ”€โ”€ conversions.py         # HTML โ†’ MD/TXT/PDF/EPUB + ZIP
โ”œโ”€โ”€ manifest.py            # resume / incremental state
โ”œโ”€โ”€ headless.py            # Playwright fetcher
โ”œโ”€โ”€ crawler.py             # the Crawler engine (orchestrates everything)
โ”œโ”€โ”€ legal.py               # disclaimer text
โ”œโ”€โ”€ cli.py                 # Typer + Rich CLI (crawl, ask, mcp, new, diff, ...)
โ”œโ”€โ”€ main.py                # console-script entry point
โ”œโ”€โ”€ ai.py                  # LLM client + LLMConfig (OpenAI-compatible)
โ”œโ”€โ”€ rag.py                 # SQLite vector store + cosine similarity search
โ”œโ”€โ”€ mcp_server.py          # MCP server exposing 6 tools over stdio
โ”œโ”€โ”€ feeds.py               # RSS / Atom feed discovery + seeding
โ”œโ”€โ”€ patterns.py            # glob + regex URL pattern matching
โ”œโ”€โ”€ structured.py          # JSON-LD / Open Graph / table sidecar extraction
โ”œโ”€โ”€ warc.py                # ISO 28500:2017 WARC writer
โ”œโ”€โ”€ sqlite_export.py       # SQLite exporter (rows + embeddings)
โ”œโ”€โ”€ report.py              # self-contained HTML crawl report
โ”œโ”€โ”€ config_file.py         # sitesavvy.toml parsing + 6 built-in presets
โ”œโ”€โ”€ budgets.py             # page / byte / time budget enforcement
โ”œโ”€โ”€ wayback.py             # Wayback Machine submission
โ”œโ”€โ”€ scope.py               # CSS selector scoping for discovery + extraction
โ”œโ”€โ”€ screenshots.py         # full-page PNG capture (headless)
โ”œโ”€โ”€ diff.py                # cross-crawl added/removed/changed diff
โ”œโ”€โ”€ obsidian.py            # Obsidian vault exporter (wikilinks + frontmatter)
โ”œโ”€โ”€ wizard.py              # interactive `sitesavvy new` wizard
โ”œโ”€โ”€ pagination.py          # rel="next" / ?page=N awareness (no depth cost)
โ”œโ”€โ”€ auth.py                # form-based login + CSRF detection
โ”œโ”€โ”€ proxies.py             # http/https/socks5 connector builder
โ”œโ”€โ”€ stealth.py             # UA rotation + header jitter + timing jitter
โ”œโ”€โ”€ recipe.py              # schema.org Recipe โ†’ cookbook EPUB
โ”œโ”€โ”€ docs_mode.py           # docs-site-aware extraction (mkdocs/Docusaurus/Sphinx)
โ””โ”€โ”€ offline_search.py      # self-contained search.html + index.json

Networking layer: aiohttp (primary) with an optional Playwright headless browser for JS-rendered pages, and httpx as the LLM / embeddings client. HTML parsing uses beautifulsoup4 + lxml. robots.txt is parsed with reppy when available, otherwise with the stdlib urllib.robotparser.

See the Architecture docs for a deeper dive, including a Mermaid flow diagram of a crawl.


Troubleshooting

  • HTTP 429 Too Many Requests โ€” lower --concurrency, raise --delay, and keep --rate-limit auto (default) so SiteSavvy backs off automatically.
  • Large sites โ€” set --depth to bound the crawl, run with --dry-run first to estimate scope, and use --resume so an interruption doesn't waste work. --max-pages, --max-bytes and --max-time add hard budgets that stop the crawl cleanly and leave a resumable manifest behind.
  • PDF export fails โ€” WeasyPrint needs Pango/Cairo system libraries. On Debian/Ubuntu: apt install libpango-1.0-0 libpangoft2-1.0-0. On macOS: brew install pango. The other formats keep working even if PDF is missing.
  • Headless mode crashes โ€” run playwright install chromium once after installing the package. Without it, SiteSavvy transparently falls back to aiohttp.
  • robots.txt disallows โ€ฆ โ€” by default SiteSavvy honours robots.txt. Add --force only if you have permission and accept responsibility.
  • AI features silently skipped / No LLM API key set โ€” export SITESAVVY_LLM_API_KEY (or set it in your shell profile). Run sitesavvy info to confirm the key is detected. AI flags are no-ops without a key; the crawl itself still completes normally.
  • MCP server failed to start โ€” the optional mcp package is missing. Install it with pip install 'sitesavvy[mcp]' and try again.
  • Proxy issues โ€” --proxy accepts http://, https:// and socks5:// URLs. For SOCKS proxies, make sure the aiohttp-socks extra is installed (pip install aiohttp-socks). If requests time out, check that the proxy allows CONNECT to port 443.
  • Windows: long path errors โ€” enable long paths (New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1) or use a shorter --out-dir.
  • command not found: sitesavvy โ€” make sure pip install's bin / Scripts directory is on your PATH, or use python -m sitesavvy.

Legal & ethics

SiteSavvy is provided for personal, non-commercial use only. Respect the copyright, terms of service, and robots.txt of every site you crawl. AI features send extracted page content to your configured LLM provider โ€” make sure you have the right to do so for the sites you crawl. The authors assume no liability for misuse. Run sitesavvy legal to read the full disclaimer. Licensed under the MIT License.


Contributing

Pull requests are welcome! Please run the full check suite before submitting:

ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing

Coverage must stay at or above 90 %. See the Developer Guide for the project layout, release process and binary-building instructions.


Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sitesavvy-0.6.0.tar.gz (104.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sitesavvy-0.6.0-py3-none-any.whl (104.8 kB view details)

Uploaded Python 3

File details

Details for the file sitesavvy-0.6.0.tar.gz.

File metadata

  • Download URL: sitesavvy-0.6.0.tar.gz
  • Upload date:
  • Size: 104.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sitesavvy-0.6.0.tar.gz
Algorithm Hash digest
SHA256 f12940f00cf2271492d2cc529ff068f826589a4881cd04cf6a7f003ad08d44f1
MD5 32de9e93905c55a5be2b4152746be88f
BLAKE2b-256 8361028029241b58e99ab8404f76b16223816b957cbcf4d2087de17c38441afc

See more details on using hashes here.

File details

Details for the file sitesavvy-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: sitesavvy-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 104.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for sitesavvy-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2c9b575c847e427f84982f83f556f4c95acd60b94bfa72389865884f2c2e5754
MD5 b34026be644c4a4c883f1f1d1c6128da
BLAKE2b-256 3861e19554e8e42fb1168ad8404cff08040e9b61010b3d224a6fa5a21ac9d9bf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page