Mirror any website into a linked Obsidian vault.

Site2Vault

Mirror any website into a linked Obsidian vault. Built for Claude Code and other agentic tools to read documentation offline at near-zero token cost.

Site2Vault crawls a website, extracts the main content of each page as clean Markdown, and wires every internal link as an Obsidian [[wikilink]]. The result is a self-contained vault you can open in Obsidian, search across, and feed to Claude Code without paying for repeated web fetches.

Video overview: Why Site2Vault exists and how it works — the pattern of a self-built wiki as the AI context layer, the six-phase pipeline, and how the manifest enables section-level reads for coding agents.

flowchart LR
    A[Seed URL] --> B[Phase 1: Crawl]
    S[Sitemap] --> B
    B --> C[Phase 1.5: Deboilerplate]
    C --> D[Phase 2: Rewrite Links]
    D --> E[Phase 2.5: Byte Offsets]
    E --> F[Phase 3: Index MOCs]
    F --> G[Phase 4: Manifest]
    G --> V[(Obsidian Vault)]

Why

Feeding documentation to an agentic coder via repeated web fetches burns tokens. A 200-page docs site can cost dollars per query. Site2Vault pulls the entire corpus once into a local Markdown vault. From then on, the agent reads files at near-zero cost, navigates by [[wikilink]] exactly as on the original site, and uses the manifest to read only the sections it needs.

As of April 2026, several other tools mirror sites to Markdown, but to the best of my knowledge none produces a fully interlinked Obsidian vault with a machine-readable corpus index designed for agentic consumption. That is the gap Site2Vault fills.

Install

Standalone Windows executable (no Python required)

Download site2vault-windows.zip from the dist folder, extract, and add the folder to your PATH.

Via pipx (isolated, recommended for end users)

Coming soon — site2vault is not yet published to PyPI. Track #1 for the first release.

pipx install site2vault

Via pip (for development or embedding)

pip install -e .

Optional JS rendering for client-side-rendered sites:

pip install 'site2vault[js]'
playwright install chromium

Obsidian plugin

If you prefer a GUI, the obsidian-site2vault plugin wraps the CLI with a modal dialog, live log view, and settings tab — all inside Obsidian.

Quick start

# Mirror a docs site into your Obsidian vault
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs"

# Capture a single page
site2vault --url example.com/page --single

# Refresh an existing vault (uses conditional GET, skips unchanged pages)
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs" --refresh
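
The --refresh recipe above mentions conditional GET. The source does not show how Site2Vault implements it, so the following is only a minimal sketch of the general HTTP mechanism, assuming an ETag and a Last-Modified value were saved during the previous crawl. The helper names conditional_headers and needs_rewrite are hypothetical and are not part of the site2vault API.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request validators from values stored on a previous crawl (hypothetical helper)."""
    headers = {}
    if etag:
        # The server compares this against the current entity tag.
        headers["If-None-Match"] = etag
    if last_modified:
        # Fallback validator for servers that do not emit ETags.
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_rewrite(status_code: int) -> bool:
    """304 Not Modified means the stored note is still current and can be skipped."""
    return status_code != 304


headers = conditional_headers('"abc123"', "Wed, 01 Apr 2026 10:00:00 GMT")
print(headers)
print(needs_rewrite(304), needs_rewrite(200))
```

A crawler following this pattern would send these headers with each request and rewrite a note only when the response is not a 304.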

The vault that results:

Example Docs/
├── .site2vault/manifest.json    Machine-readable corpus index
├── Index.md                     Root Map of Content
├── api/
│   ├── Index.md                 Folder MOC
│   └── Endpoints.md
├── getting-started/
│   ├── Index.md
│   └── Installation.md
└── log/                         Crawler internals (SQLite, headings, link sidecars)

Example: the OpenAI site mirrored into my Obsidian vault.

[Screenshot: a note page in Obsidian showing extracted API documentation with frontmatter, wikilinks, and folder tree]

Common recipes

# Full docs site with tags
site2vault --url docs.api.com --path ./vault --tag source/web --tag reference

# Restricted scope: only /api/* under the seed
site2vault --url example.com --include "^https://example\.com/api/" --depth 6

# Multiple sites in one vault, isolated by namespace
site2vault --url docs.example.com --path ./vault --namespace docs
site2vault --url blog.example.com --path ./vault --namespace blog

# Slow and polite for fragile sites
site2vault --url small-site.com --rate 0.3 --concurrency 1 --jitter 0.5

# Discovery only (no files written)
site2vault --url docs.example.com --dry-run --max-pages 100

See docs/cli-reference.md for every flag.
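
To illustrate the restricted-scope recipe above, here is a small sketch of how a regex include filter might decide which discovered URLs to keep. It assumes the pattern is applied with a regex search against the absolute URL and that an empty pattern list means everything is in scope; the actual matching rules are in docs/cli-reference.md, and the function name in_scope is hypothetical.

```python
import re


def in_scope(url: str, include_patterns: list[str]) -> bool:
    """Return True if the URL matches at least one include pattern (assumed semantics)."""
    if not include_patterns:
        # Assumption: no patterns means no restriction.
        return True
    return any(re.search(pattern, url) for pattern in include_patterns)


patterns = [r"^https://example\.com/api/"]
for candidate in (
    "https://example.com/api/users",
    "https://example.com/blog/post",
    "https://example.com.evil.net/api/",
):
    print(candidate, in_scope(candidate, patterns))
```

Anchoring the pattern with ^ and escaping the dot, as the recipe does, keeps look-alike hosts out of scope.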

How it works

Six phases, orchestrated by orchestrator.py (labelled 1, 1.5, 2, 2.5, 3, and 4 in the flowchart above):

  1. Crawl — Breadth-first frontier with politeness controls. Fetches via httpx (HTTP/2), extracts main content with trafilatura, converts to Markdown with placeholder link tokens.
  2. Deboilerplate — Cross-page paragraph frequency analysis removes repeated cruft like "Edit on GitHub" footers.
  3. Rewrite — Replaces placeholder tokens with [[wikilinks]] for in-vault pages or [text](url) for external links.
  4. Byte offsets — Computes heading byte offsets in the final Markdown for section-level reading.
  5. Index — Generates root and per-folder Maps of Content.
  6. Manifest — Writes .site2vault/manifest.json with per-note metadata (headings, links, word counts, byte offsets).
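
The Deboilerplate phase listed above can be sketched roughly as follows. This is not the project's implementation: it assumes paragraphs are separated by blank lines, that a paragraph counts as boilerplate when it appears on at least a given share of pages, and that at least two pages must agree. The function name strip_boilerplate and the min_share parameter are hypothetical.

```python
from collections import Counter


def strip_boilerplate(pages: dict[str, str], min_share: float = 0.6) -> dict[str, str]:
    """Drop paragraphs repeated across many pages (illustrative threshold, not the tool's)."""
    paragraphs = {
        name: [p.strip() for p in text.split("\n\n") if p.strip()]
        for name, text in pages.items()
    }
    # Count each distinct paragraph once per page, so in-page repeats do not inflate it.
    counts = Counter(p for paras in paragraphs.values() for p in set(paras))
    # Require at least two pages so a single-page crawl is left untouched.
    threshold = max(2, min_share * len(pages))
    return {
        name: "\n\n".join(p for p in paras if counts[p] < threshold)
        for name, paras in paragraphs.items()
    }


pages = {
    "a": "Intro A\n\nEdit on GitHub",
    "b": "Intro B\n\nEdit on GitHub",
    "c": "Intro C\n\nEdit on GitHub",
}
print(strip_boilerplate(pages))
```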

For the full architecture, see docs/architecture.md.
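
As a rough illustration of the Rewrite phase described above, the sketch below swaps placeholder tokens for wikilinks or standard Markdown links. The token syntax {{LINK:url|text}} is invented for this example, since the source does not specify the real placeholder format, and rewrite_links and url_to_note are hypothetical names.

```python
import re

# Hypothetical placeholder syntax, chosen only for this sketch.
TOKEN = re.compile(r"\{\{LINK:(?P<url>[^|}]+)\|(?P<text>[^}]*)\}\}")


def rewrite_links(markdown: str, url_to_note: dict[str, str]) -> str:
    """Replace placeholder tokens with [[wikilinks]] for known pages, [text](url) otherwise."""

    def replace(match: re.Match) -> str:
        url, text = match.group("url"), match.group("text")
        note = url_to_note.get(url)
        if note is None:
            # Page is not in the vault: keep it as an external Markdown link.
            return f"[{text}]({url})"
        # Obsidian alias form keeps the original anchor text when it differs.
        return f"[[{note}]]" if text == note else f"[[{note}|{text}]]"

    return TOKEN.sub(replace, markdown)


source = (
    "See {{LINK:https://docs.example.com/api/endpoints|the endpoints}} "
    "and {{LINK:https://other.org/|Other}}."
)
mapping = {"https://docs.example.com/api/endpoints": "api/Endpoints"}
print(rewrite_links(source, mapping))
```

Deferring the rewrite until after the crawl means the full set of in-vault pages is known before any link is resolved.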

For Claude Code

Site2Vault is designed for agentic consumption. The single most important fact:

Read .site2vault/manifest.json first. It tells you every note, every heading, every link, and every byte offset, without scanning a single Markdown file.

Then read only what you need, by section if possible:

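# Find the section that covers what you need
import json

# Load the corpus index once; the context manager closes the file promptly
with open("vault/.site2vault/manifest.json", encoding="utf-8") as f:
    manifest = json.load(f)

for note in manifest["notes"]:
    for h in note["headings"]:
        if "authentication" in h["text"].lower():
            # Open in binary mode: the offsets are byte positions, not character positions
            with open(f"vault/{note['file']}", "rb") as f:
                f.seek(h["start_byte"])
                section = f.read(h["end_byte"] - h["start_byte"]).decode("utf-8")
            # Done: one section read, not the whole file.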
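
For context on where such offsets could come from, here is a minimal sketch of computing heading byte offsets over UTF-8 Markdown. It is an assumption-laden illustration, not the project's code: it treats every ATX heading (# through ######) as a section start, ends each section at the next heading of any level, assumes LF line endings, and does not skip # lines inside fenced code blocks. The function name heading_offsets is hypothetical; the keys text, start_byte, and end_byte mirror those used in the snippet above.

```python
import re

# Matches ATX headings on the encoded bytes, so positions are byte offsets.
HEADING = re.compile(rb"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def heading_offsets(markdown: str) -> list[dict]:
    """Return one entry per heading with the byte span of its section (assumed definition)."""
    data = markdown.encode("utf-8")
    matches = list(HEADING.finditer(data))
    headings = []
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        headings.append({
            "text": match.group(2).decode("utf-8"),
            "start_byte": match.start(),
            "end_byte": end,
        })
    return headings


doc = "# Café\nintro\n## Auth\nbody\n"
for entry in heading_offsets(doc):
    print(entry)
```

In this example the second heading starts at byte 14 but character 13, because "é" occupies two bytes in UTF-8. That difference is why the reader above opens the file in binary mode.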

Full guide: docs/claude-integration.md.

Documentation

License

MIT. See LICENSE.

Acknowledgments

Built on trafilatura, httpx, markdownify, and Obsidian.

Download files

Source Distribution

site2vault-0.1.0.tar.gz (103.2 MB view details)

Uploaded Source

Built Distribution

site2vault-0.1.0-py3-none-any.whl (55.1 kB view details)

Uploaded Python 3

File details

Details for the file site2vault-0.1.0.tar.gz.

File metadata

  • Download URL: site2vault-0.1.0.tar.gz
  • Upload date:
  • Size: 103.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for site2vault-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d90c84c267b409ea84ea3c481afd0572d89824178f8f4d41ba0b527ab20ed081
MD5 1b1a72ae31cf8dd080bc29efccc1d2b9
BLAKE2b-256 23e28e2873b5bdbe4edcca6c1531239a34f845758623329035fb0c3f67865144

File details

Details for the file site2vault-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: site2vault-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 55.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for site2vault-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 45b38b1650203a35a068219d0bccb7fe6c06900c40746635e2714ea1e323c124
MD5 2b35c6ddd312e5f2c0a19536d0244168
BLAKE2b-256 446a59cf3312e57daeecaab15478e492f8ff87354043b303edec3644bff29e21
