Site2Vault
Mirror any website into a linked Obsidian vault. Built for Claude Code and other agentic tools to read documentation offline at near-zero token cost.
Site2Vault crawls a website, extracts the main content of each page as clean Markdown, and wires every internal link as an Obsidian [[wikilink]]. The result is a self-contained vault you can open in Obsidian, search across, and feed to Claude Code without paying for repeated web fetches.
Video overview: Why Site2Vault exists and how it works — the pattern of a self-built wiki as the AI context layer, the six-phase pipeline, and how the manifest enables section-level reads for coding agents.
flowchart LR
A[Seed URL] --> B[Phase 1: Crawl]
S[Sitemap] --> B
B --> C[Phase 1.5: Deboilerplate]
C --> D[Phase 2: Rewrite Links]
D --> E[Phase 2.5: Byte Offsets]
E --> F[Phase 3: Index MOCs]
F --> G[Phase 4: Manifest]
G --> V[(Obsidian Vault)]
Why
Feeding documentation to an agentic coder via repeated web fetches burns tokens. A 200-page docs site can cost dollars per query. Site2Vault pulls the entire corpus once into a local Markdown vault. From then on, the agent reads files at near-zero cost, navigates by [[wikilink]] exactly as on the original site, and uses the manifest to read only the sections it needs.
As of April 2026, several other tools mirror sites to Markdown, but to the best of my knowledge none produces a fully interlinked Obsidian vault with a machine-readable corpus index designed for agentic consumption. That gap is what Site2Vault fills.
Install
Standalone Windows executable (no Python required)
Download site2vault-windows.zip from the dist folder, extract, and add the folder to your PATH.
Via pipx (isolated, recommended for end users)
Coming soon: site2vault is not yet published to PyPI. Track #1 for the first release.
pipx install site2vault
Via pip (for development or embedding)
pip install -e .
Optional JS rendering for client-side-rendered sites:
pip install 'site2vault[js]'
playwright install chromium
Obsidian plugin
If you prefer a GUI, the obsidian-site2vault plugin wraps the CLI with a modal dialog, live log view, and settings tab — all inside Obsidian.
Quick start
# Mirror a docs site into your Obsidian vault
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs"
# Capture a single page
site2vault --url example.com/page --single
# Refresh an existing vault (uses conditional GET, skips unchanged pages)
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs" --refresh
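For readers unfamiliar with conditional GET, the sketch below shows the general HTTP mechanism that a refresh of this kind relies on. It is not Site2Vault's code: the helper names `conditional_headers` and `needs_rewrite`, and the idea of storing an ETag and Last-Modified value per page, are assumptions made for illustration.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_rewrite(status_code):
    """304 means the stored copy is still current; anything else is treated as new."""
    return status_code != 304


# Validators saved from a previous crawl (hypothetical values).
headers = conditional_headers(
    etag='"abc123"',
    last_modified="Wed, 01 Apr 2026 00:00:00 GMT",
)
print(headers)
```

A server that honors these headers replies with an empty 304 response for unchanged pages, so a refresh only downloads and rewrites the pages that actually changed.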
The vault that results:
Example Docs/
├── .site2vault/manifest.json Machine-readable corpus index
├── Index.md Root Map of Content
├── api/
│ ├── Index.md Folder MOC
│ └── Endpoints.md
├── getting-started/
│ ├── Index.md
│ └── Installation.md
└── log/ Crawler internals (SQLite, headings, link sidecars)
Common recipes
# Full docs site with tags
site2vault --url docs.api.com --path ./vault --tag source/web --tag reference
# Restricted scope: only /api/* under the seed
site2vault --url example.com --include "^https://example\.com/api/" --depth 6
# Multiple sites in one vault, isolated by namespace
site2vault --url docs.example.com --path ./vault --namespace docs
site2vault --url blog.example.com --path ./vault --namespace blog
# Slow and polite for fragile sites
site2vault --url small-site.com --rate 0.3 --concurrency 1 --jitter 0.5
# Discovery only (no files written)
site2vault --url docs.example.com --dry-run --max-pages 100
See docs/cli-reference.md for every flag.
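To make the restricted-scope recipe concrete, here is a minimal sketch of regex-based URL filtering in Python. It assumes each include pattern is searched against the full URL; whether Site2Vault applies match or search semantics is not stated here, so treat docs/cli-reference.md as the authority. The function name `in_scope` is hypothetical.

```python
import re


def in_scope(url, include_patterns):
    """Return True if the URL matches at least one include pattern."""
    return any(re.search(pattern, url) for pattern in include_patterns)


# Same pattern as the recipe above; the URLs are made-up examples.
patterns = [r"^https://example\.com/api/"]
candidates = [
    "https://example.com/api/endpoints",
    "https://example.com/blog/launch",
    "https://example.com/api/auth/tokens",
]
kept = [url for url in candidates if in_scope(url, patterns)]
print(kept)
```

Because the pattern is anchored with `^` and includes the scheme, a URL on another host or under another path prefix is rejected even if it contains `/api/` somewhere later in the string.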
How it works
Six phases, orchestrated by orchestrator.py:
- Crawl: Breadth-first frontier with politeness controls. Fetches via httpx (HTTP/2), extracts main content with trafilatura, converts to Markdown with placeholder link tokens.
- Deboilerplate: Cross-page paragraph frequency analysis removes repeated cruft like "Edit on GitHub" footers.
- Rewrite: Replaces placeholder tokens with [[wikilinks]] for in-vault pages or [text](url) for external links.
- Byte offsets: Computes heading byte offsets in the final Markdown for section-level reading.
- Index: Generates root and per-folder Maps of Content.
- Manifest: Writes .site2vault/manifest.json with per-note metadata (headings, links, word counts, byte offsets).
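The Deboilerplate phase can be illustrated with a small sketch of cross-page paragraph frequency analysis. This is not the project's implementation: the function name `strip_repeated_paragraphs`, the blank-line paragraph split, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def strip_repeated_paragraphs(pages, threshold=0.6):
    """Drop paragraphs that appear on more than `threshold` of all pages.

    `pages` maps a page name to its Markdown text.
    """
    def split(text):
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    # Count each paragraph once per page so repeats inside one page don't inflate it.
    page_counts = Counter()
    for text in pages.values():
        page_counts.update(set(split(text)))

    cutoff = threshold * len(pages)
    boilerplate = {p for p, n in page_counts.items() if n > cutoff}

    return {
        name: "\n\n".join(p for p in split(text) if p not in boilerplate)
        for name, text in pages.items()
    }


pages = {
    "a.md": "# Auth\n\nUse an API key.\n\nEdit on GitHub",
    "b.md": "# Limits\n\n60 requests per minute.\n\nEdit on GitHub",
    "c.md": "# Errors\n\nErrors return JSON.\n\nEdit on GitHub",
}
cleaned = strip_repeated_paragraphs(pages)
print(cleaned["a.md"])
```

The analysis has to run after the crawl finishes, since a paragraph can only be judged "repeated" once every page has been seen.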
For the full architecture, see docs/architecture.md.
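As an illustration of the Rewrite phase, the sketch below swaps placeholder tokens for wikilinks or plain Markdown links. The token format `@@LINK(url|text)@@`, the function name `rewrite_links`, and the `url_to_note` mapping are hypothetical; the project's real placeholder format is not documented here.

```python
import re

# Hypothetical token format: @@LINK(<url>|<text>)@@
TOKEN = re.compile(r"@@LINK\((?P<url>[^|)]+)\|(?P<text>[^)]*)\)@@")


def rewrite_links(markdown, url_to_note):
    """Swap placeholder tokens for wikilinks (in-vault) or Markdown links (external)."""
    def replace(match):
        url, text = match["url"], match["text"]
        note = url_to_note.get(url)
        if note is None:
            return f"[{text}]({url})"   # page is not in the vault: keep it external
        if text == note:
            return f"[[{note}]]"
        return f"[[{note}|{text}]]"     # Obsidian alias syntax: target|display text

    return TOKEN.sub(replace, markdown)


url_to_note = {"https://docs.example.com/api/endpoints": "Endpoints"}
source = (
    "See @@LINK(https://docs.example.com/api/endpoints|the endpoint list)@@ "
    "or @@LINK(https://github.com/example/repo|the repo)@@."
)
result = rewrite_links(source, url_to_note)
print(result)
```

Deferring the rewrite until after the crawl is what makes this decision possible: only once the full set of mirrored pages is known can a link be classified as in-vault or external.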
For Claude Code
Site2Vault is designed for agentic consumption. The single most important fact:
Read .site2vault/manifest.json first. It tells you every note, every heading, every link, and every byte offset, without scanning a single Markdown file.
Then read only what you need, by section if possible:
# Find the section that covers what you need
import json

with open("vault/.site2vault/manifest.json", encoding="utf-8") as f:
    manifest = json.load(f)

for note in manifest["notes"]:
    for h in note["headings"]:
        if "authentication" in h["text"].lower():
            # Seek straight to the heading and read only that byte range
            with open(f"vault/{note['file']}", "rb") as f:
                f.seek(h["start_byte"])
                section = f.read(h["end_byte"] - h["start_byte"]).decode("utf-8")
            # done. one section, not the whole file.
Full guide: docs/claude-integration.md.
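The snippet above opens notes in binary mode because the offsets are byte positions, not character positions, and the two diverge as soon as a note contains non-ASCII text. The sketch below shows one way such offsets could be computed. It is an illustration, not the project's byte-offset phase: the function name `heading_offsets` is hypothetical, it assumes a section runs from its heading to the next heading of any level (the manifest's real boundaries may differ), and it does not skip `#` lines inside code blocks.

```python
import re

# ATX headings, matched on raw bytes so positions are byte offsets.
HEADING = re.compile(rb"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def heading_offsets(data):
    """Return one dict per heading with UTF-8 byte offsets into `data`."""
    matches = list(HEADING.finditer(data))
    sections = []
    for i, m in enumerate(matches):
        # A section ends where the next heading starts, or at end of file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        sections.append({
            "text": m.group(2).decode("utf-8"),
            "start_byte": m.start(),
            "end_byte": end,
        })
    return sections


note = "# Café API\n\nIntro.\n\n## Authentication\n\nSend a token.\n".encode("utf-8")
sections = heading_offsets(note)
auth = next(s for s in sections if s["text"] == "Authentication")
print(note[auth["start_byte"]:auth["end_byte"]].decode("utf-8"))
```

In this example the "é" occupies two bytes, so the Authentication heading starts at byte 21 even though it is character 20. Seeking by character index would land one byte early and could split a multi-byte character.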
Documentation
- Architecture — pipeline phases, module map, design decisions
- CLI reference — every flag, every default
- Claude integration — how to consume vault output efficiently
- Troubleshooting — common failures and fixes
- Development — setup, testing, building
License
MIT. See LICENSE.
Acknowledgments
Built on trafilatura, httpx, markdownify, and Obsidian.