Mirror any website into a linked Obsidian vault.

Site2Vault

Mirror any website into a linked Obsidian vault. Built for Claude Code and other agentic tools to read documentation offline at near-zero token cost.

Site2Vault crawls a website, extracts the main content of each page as clean Markdown, and wires every internal link as an Obsidian [[wikilink]]. The result is a self-contained vault you can open in Obsidian, search across, and feed to Claude Code without paying for repeated web fetches.

Video overview: Why Site2Vault exists and how it works — the pattern of a self-built wiki as the AI context layer, the six-phase pipeline, and how the manifest enables section-level reads for coding agents.

flowchart LR
    A[Seed URL] --> B[Phase 1: Crawl]
    S[Sitemap] --> B
    B --> C[Phase 1.5: Deboilerplate]
    C --> D[Phase 2: Rewrite Links]
    D --> E[Phase 2.5: Byte Offsets]
    E --> F[Phase 3: Index MOCs]
    F --> G[Phase 4: Manifest]
    G --> V[(Obsidian Vault)]

Why

Feeding documentation to an agentic coder via repeated web fetches burns tokens. A 200-page docs site can cost dollars per query. Site2Vault pulls the entire corpus once into a local Markdown vault. From then on, the agent reads files at near-zero cost, navigates by [[wikilink]] exactly as on the original site, and uses the manifest to read only the sections it needs.

As of April 2026, several other tools mirror sites to Markdown, but to the best of my knowledge none produces a fully interlinked Obsidian vault with a machine-readable corpus index designed for agentic consumption. Site2Vault fills that gap.

Install

Standalone Windows executable (no Python required)

Download site2vault-windows.zip from the dist folder, extract, and add the folder to your PATH.

Via pipx (isolated, recommended for end users)

Coming soon — site2vault is not yet published to PyPI. Track #1 for the first release.

pipx install site2vault

Via pip (for development or embedding)

pip install -e .

Optional JS rendering for client-side-rendered sites:

pip install 'site2vault[js]'
playwright install chromium

Obsidian plugin

If you prefer a GUI, the obsidian-site2vault plugin wraps the CLI with a modal dialog, live log view, and settings tab — all inside Obsidian.

Quick start

# Mirror a docs site into your Obsidian vault
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs"

# Capture a single page
site2vault --url example.com/page --single

# Refresh an existing vault (uses conditional GET, skips unchanged pages)
site2vault --url docs.example.com --path C:\Obsidian\Vault --name "Example Docs" --refresh
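
The `--refresh` comment above mentions conditional GET. As a rough illustration of that HTTP mechanism (not Site2Vault's actual code), a client that saved a page's `ETag` and `Last-Modified` values sends them back as `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` response means the page can be skipped. The function names and the shape of the `cached` dictionary are assumptions made for this sketch.

```python
def conditional_headers(cached: dict) -> dict:
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def is_unchanged(status_code: int) -> bool:
    """A 304 Not Modified response means the stored copy is still current."""
    return status_code == 304


# Hypothetical validators remembered from an earlier crawl of one page
cached = {"etag": '"abc123"', "last_modified": "Wed, 01 Apr 2026 10:00:00 GMT"}
print(conditional_headers(cached))
```

A page fetched for the first time has no stored validators, so the helper returns an empty dictionary and the request is an ordinary GET.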

The vault that results:

Example Docs/
├── .site2vault/manifest.json    Machine-readable corpus index
├── Index.md                     Root Map of Content
├── api/
│   ├── Index.md                 Folder MOC
│   └── Endpoints.md
├── getting-started/
│   ├── Index.md
│   └── Installation.md
└── log/                         Crawler internals (SQLite, headings, link sidecars)

Example of the OpenAI site in my Obsidian vault:

[Screenshot: a note page in Obsidian showing extracted API documentation with frontmatter, wikilinks, and folder tree]

Common recipes

# Full docs site with tags
site2vault --url docs.api.com --path ./vault --tag source/web --tag reference

# Restricted scope: only /api/* under the seed
site2vault --url example.com --include "^https://example\.com/api/" --depth 6

# Multiple sites in one vault, isolated by namespace
site2vault --url docs.example.com --path ./vault --namespace docs
site2vault --url blog.example.com --path ./vault --namespace blog

# Slow and polite for fragile sites
site2vault --url small-site.com --rate 0.3 --concurrency 1 --jitter 0.5

# Discovery only (no files written)
site2vault --url docs.example.com --dry-run --max-pages 100

See docs/cli-reference.md for every flag.
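
To see what a pattern like the `--include` value in the restricted-scope recipe selects, it can help to try it against a few URLs first. The sketch below assumes the pattern is applied with `re.search` to the absolute URL; that is an assumption for illustration, and the authoritative matching rules are in docs/cli-reference.md. The helper name `in_scope` is hypothetical.

```python
import re


def in_scope(url: str, include_patterns: list[str]) -> bool:
    """Return True if no patterns are given or at least one pattern matches."""
    if not include_patterns:
        return True
    return any(re.search(pattern, url) for pattern in include_patterns)


patterns = [r"^https://example\.com/api/"]
candidates = [
    "https://example.com/api/users",
    "https://example.com/blog/post",
    "https://example.org/api/users",
]
for url in candidates:
    print(url, in_scope(url, patterns))
```

Escaping the dot (`example\.com`) matters: an unescaped `.` matches any character, so the pattern would also accept hosts such as `exampleXcom`.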

How it works

Six phases, orchestrated by orchestrator.py:

Crawl — Breadth-first frontier with politeness controls. Fetches via httpx (HTTP/2), extracts main content with trafilatura, converts to Markdown with placeholder link tokens.
Deboilerplate — Cross-page paragraph frequency analysis removes repeated cruft like "Edit on GitHub" footers.
Rewrite — Replaces placeholder tokens with [[wikilinks]] for in-vault pages or [text](url) for external links.
Byte offsets — Computes heading byte offsets in the final Markdown for section-level reading.
Index — Generates root and per-folder Maps of Content.
Manifest — Writes .site2vault/manifest.json with per-note metadata (headings, links, word counts, byte offsets).
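
The Deboilerplate phase above is described as cross-page paragraph frequency analysis. A minimal sketch of that idea follows; it is not the project's implementation, and the function name, the blank-line paragraph split, and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def strip_repeated_paragraphs(pages: dict[str, str], threshold: float = 0.5) -> dict[str, str]:
    """Drop paragraphs that appear on more than `threshold` of all pages."""
    # Split each page into non-empty paragraphs on blank lines
    split = {
        name: [p.strip() for p in text.split("\n\n") if p.strip()]
        for name, text in pages.items()
    }
    # Count each distinct paragraph once per page
    freq: Counter = Counter()
    for paragraphs in split.values():
        freq.update(set(paragraphs))
    limit = threshold * len(pages)
    return {
        name: "\n\n".join(p for p in paragraphs if freq[p] <= limit)
        for name, paragraphs in split.items()
    }


pages = {
    "a.md": "Intro A\n\nEdit on GitHub",
    "b.md": "Intro B\n\nEdit on GitHub",
    "c.md": "Intro C\n\nEdit on GitHub",
}
print(strip_repeated_paragraphs(pages))
```

Counting a paragraph once per page keeps a phrase that is repeated many times within a single page from being mistaken for site-wide boilerplate.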

For the full architecture, see docs/architecture.md.
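
The Rewrite phase in the list above can be pictured with a small example. The placeholder token format is not documented here, so the `{{LINK:url|text}}` syntax below is purely hypothetical, as are the function name and the `url_to_note` mapping. Only the output forms, `[[wikilinks]]` for in-vault pages and `[text](url)` for external links, come from the description above.

```python
import re

# Hypothetical placeholder format: {{LINK:<url>|<link text>}}
TOKEN = re.compile(r"\{\{LINK:(?P<url>[^|}]+)\|(?P<text>[^}]*)\}\}")


def rewrite_links(markdown: str, url_to_note: dict[str, str]) -> str:
    """Turn placeholder tokens into wikilinks (in-vault) or Markdown links (external)."""

    def replace(match: re.Match) -> str:
        url, text = match.group("url"), match.group("text")
        note = url_to_note.get(url)
        if note is None:
            # Page was not crawled: keep an ordinary external link
            return f"[{text}]({url})"
        # Obsidian alias syntax: [[note|displayed text]]
        return f"[[{note}|{text}]]" if text and text != note else f"[[{note}]]"

    return TOKEN.sub(replace, markdown)


url_to_note = {"https://docs.example.com/api/endpoints": "api/Endpoints"}
source = (
    "See {{LINK:https://docs.example.com/api/endpoints|the endpoints}} "
    "and {{LINK:https://github.com/x|GitHub}}."
)
print(rewrite_links(source, url_to_note))
```

Deferring the rewrite until the crawl has finished is what makes the lookup possible: only then is it known which URLs ended up as notes in the vault.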

For Claude Code

Site2Vault is designed for agentic consumption. The single most important fact:

Read .site2vault/manifest.json first. It tells you every note, every heading, every link, and every byte offset, without scanning a single Markdown file.

Then read only what you need, by section if possible:

# Find the section that covers what you need
import json

with open("vault/.site2vault/manifest.json", encoding="utf-8") as mf:
    manifest = json.load(mf)

for note in manifest["notes"]:
    for h in note["headings"]:
        if "authentication" in h["text"].lower():
            # Binary mode: the offsets count bytes, not characters
            with open(f"vault/{note['file']}", "rb") as f:
                f.seek(h["start_byte"])
                section = f.read(h["end_byte"] - h["start_byte"]).decode("utf-8")
            # done. one section, not the whole file.
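
The offsets in the snippet above are byte offsets, which differ from character offsets as soon as a note contains non-ASCII text. The sketch below shows one way such offsets could be computed. It is an illustration under stated assumptions, not Site2Vault's code: the function name is hypothetical, a section is assumed to run from its heading to the next heading of any level, and only the `text`, `start_byte`, and `end_byte` keys are taken from the example above.

```python
import re

# ATX headings only; this simple sketch does not skip '#' lines inside code fences
HEADING = re.compile(rb"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def heading_offsets(markdown: str) -> list[dict]:
    """Return heading text plus the byte range of each section in UTF-8."""
    data = markdown.encode("utf-8")
    matches = list(HEADING.finditer(data))
    headings = []
    for i, m in enumerate(matches):
        # A section ends where the next heading starts, or at end of file
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        headings.append({
            "text": m.group(1).decode("utf-8"),
            "start_byte": m.start(),
            "end_byte": end,
        })
    return headings


doc = "# Café\n\nIntro.\n\n## Authentication\n\nUse a token.\n"
data = doc.encode("utf-8")
for h in heading_offsets(doc):
    print(h, repr(data[h["start_byte"]:h["end_byte"]].decode("utf-8")))
```

In this sample the `é` occupies two bytes, so the second heading starts at byte 17 but at character 16. That is why the reading code opens the file in binary mode before seeking.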

Full guide: docs/claude-integration.md.

Documentation

Architecture — pipeline phases, module map, design decisions
CLI reference — every flag, every default
Claude integration — how to consume vault output efficiently
Troubleshooting — common failures and fixes
Development — setup, testing, building

License

MIT. See LICENSE.

Acknowledgments

Built on trafilatura, httpx, markdownify, and Obsidian.


Release history

0.1.1 (Jun 21, 2026)
0.1.0 (Jun 21, 2026, this version)

