Fully Automated Obsidian Knowledge Management Pipeline - A Production-Grade Knowledge Management Pipeline
schema_version: "1.0.0"
note_id: readme_en-5d661efc
title: "Obsidian Vault Pipeline"
description: "An auditable knowledge state runtime for Obsidian"
date: 2026-04-07
type: meta
Obsidian Vault Pipeline
Current document version: v0.13.0
Primary docs:
- Architecture — durable knowledge model (six core terms; 250-line cap)
- Runtime — pipeline stages and CLIs by stage
- Packs — Core / Domain Pack / Workflow Profile
- Product Surfaces — UI / MCP / CLI / export
- Glossary — every other domain term, mapped back to the six core
- Milestone (Simplified Chinese)
- Active Backlog
- Simplified Chinese: Architecture
What This Is
Obsidian Vault Pipeline is not a loose collection of scripts, and it is not only RAG over Markdown. It is a local knowledge state runtime built around an Obsidian vault:
- Capture receives Pinboard, Clippings, raw Markdown, papers, GitHub repos, and web pages while keeping source lifecycle traceable.
- Compile turns material into deep dives, candidates, claims, evidence, relations, contradictions, registry rows, and graph rows.
- Reuse projects compiled knowledge into reader atlas pages, object pages, graph views, briefings, search, context packs, writing prompts, and the operator workbench.
Internally the runtime executes six pipeline stages: Ingest → Interpret → Absorb → Refine → Normalize → Derive (see RUNTIME). The product narrative is Capture → Compile → Reuse. The state model — Sources, Candidates, Canonical State, Projections, Access Surfaces, with Governance as the cross-cutting control plane — is documented in ARCHITECTURE.
The current release wires those layers into the actual runtime:
- ovp --full runs through knowledge_index by default
- ovp --incremental is the daily incremental entry point, including recent Pinboard + Clippings and downstream stages
- ovp --full --with-refine inserts refine before the final derived refresh
- ovp-autopilot runs real-time absorb -> moc -> knowledge_index
- ovp-autopilot --with-refine adds refine to that path
- ovp-ui provides a local UI. The default / entry is now a reader-first Knowledge Library, the operator dashboard lives under /ops, object pages expose source/backlink context, and /graph (also /map) renders a reader-facing knowledge map.
Why The Architecture Looks Like This
This repository started as a set of Obsidian automation scripts, but that model stopped scaling once the system grew:
- the main runtime and individual scripts drifted apart
- concepts, links, Atlas, graph, and retrieval indexes were tightly coupled without a clean truth boundary
- new domains like media, medical, or engineering research could not be modeled safely with a concept-only core
The current architecture is the direct answer to those failures:
- Capture → Compile → Reuse explains the product value
- The state model (Source / Candidate / Canonical State / Projection / Access Surface, with Governance as cross-cutting control) makes the trust boundary explicit; see ARCHITECTURE
- The six-stage runtime makes orchestration, identity normalization, and projection rebuilds explicit; see RUNTIME
- research-tech is the first standard built-in domain pack
- default-knowledge is retained as a compatibility pack for older vaults
- The Pack API turns future domains into installable packs rather than hardcoded branches; see PACKS
So the project is no longer just a Vault automation repo. It is now:
a reader-first, evidence-backed knowledge atlas over an auditable knowledge state runtime
with:
- research-tech as the first explicit built-in standard pack
- default-knowledge retained as the default compatibility pack
- knowledge.db as a Projection (rebuildable from Canonical State, never authoritative)
- vault markdown + registry + evidence chains as Canonical State (the long-term trust boundary)
Current Roadmap
OVP is evolving from a personal Zettelkasten into a typed knowledge platform — reader-first for humans, programmable for agents, extensible through domain packs.
- active backlog: BACKLOG.md
- current milestone: MILESTONE.md
- current merged roadmap rationale: docs/plans/2026-04-29-consolidated-product-roadmap.md
- reader product-shape note: docs/plans/2026-04-29-reader-product-shape-and-backlog-reconciliation.md
Current milestone sequence:
| Milestone | Status | Meaning |
|---|---|---|
| M0–M3 | Done | Foundation, operator workbench, roadmap consolidation, reader-first atlas |
| M4 KSR Safety And Hot-Path Hardening | Done | projection labels, hot-path audit, wiring evals, evidence spans, candidate risk, JSONL streaming, projection lifecycle hardening |
| M5 Context Pack And Operational Runtime | Done | session snapshots, context budget, runtime state, runtime-state API, action queue health |
| M5a Quality And Dedup Hardening | Done | concept dedup pipeline integration, promote semantic guard, historical data cleanup |
| M8 Type Unification And Extraction Quality | Active | unified object kind taxonomy, Layer 1 entity_type, body-size-aware extraction, quote-grounding, single-pass LLM refactor |
| M9 Pack As Domain Ontology | Next | pack-defined object kind specs, typed relation constraints, schema registry |
| M10 Operational Knowledge Layer | Later | action types, permissions, cross-entity aggregation, decision memory |
| M11 Source Authority And Cross-Source Identity | Done | typed source-authority providers, entity layer (twitter_author / github_project / github_user / person / organization), runtime resolver, refresh wrapper, db backup (PRs #112–#124) |
| M12 Extraction-Time Entity Prime And Auto-Wikilink | Done | entity_aliases view, LLM extractor primed with known entities, auto-wikilink CLI (PRs #126–#128) |
| M13 Synthesis Layer (Crystal) | Done | Louvain communities + LLM-synthesized crystals + contradiction crystals + append-only versioning (PRs #130–#133, closes the L3 gap with NM 0.8) |
| M14 Intake Hardening (BL-058) | Done | URL preservation through deep-dive, deprecate legacy 13-section LLM rewrite, global URL dedup across the active staging chain (Clippings + 4 50-Inbox stages), audit-event stage field, fidelity-sample + prompt-ab measurement CLIs (PRs #170–#172, v0.13.0) |
Recent major changes (PRs #98–#124):
- JSONL streaming hardening, advisory file locks, runtime-state API fixes (#98–#100)
- Concept dedup pipeline + promote semantic guard, historical Evergreen cleanup (#101)
- Typed StepResult contracts + 4 pipeline guardrails (#109–#111)
- Liberate evergreen extractor prompt (#112) — no more 3-5 cap on atomic units per article
- Source authority subsystem (#113/#114): typed SignalProvider Protocol, domain/author whitelists, GitHub/arXiv/Twitter/Substack signals, yaml overrides, LLM-judge for new domains
- Entity layer (#115/#119/#120/#121/#123): twitter_author + github_project + github_user backfills, identity merge with person/organization split, runtime resolver — 1497 entities total on the OVP vault (521 twitter + 922 github + 54 person/organization), ~$0.10 one-shot
- Operational glue (#117/#122): ovp-backup-db SQLite online-backup snapshots, ovp-refresh-source-authority chained refresh + launchd plist
- 12 entity-layer review fixes (#124): read-side write side effects, identity-merge backlinks, lock race, append-only history, GitHub bare profile URLs, etc.
Domain Packs
The core runtime is now being formalized as a pack-aware platform.
- Built-in standard pack: research-tech
- Default compatibility pack: default-knowledge
- Runtime selection is exposed through --pack and --profile
- Third-party packs can be discovered through the ovp.packs entry point group or the OVP_PACK_MANIFESTS manifest list
Examples:
ovp-packs
ovp-doctor --pack research-tech --json
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
ovp --pack default-knowledge --profile full
Pack API documentation for third-party developers lives in:
- docs/pack-api/README.md
- docs/pack-api/manifest-and-hooks.md
- docs/pack-api/dogfooding-with-media-pack.md
Platform Architecture
From a platform perspective, the system now has three layers:
- Core Platform
- Domain Pack
- Workflow Profile
1. Core Platform
Core owns the cross-domain pieces that must remain stable:
- runtime / vault layout
- CLI orchestration
- autopilot / queue / watcher
- canonical identity helpers
- registry framework
- derived knowledge.db
- graph / lint / audit infrastructure
- plugin / pack loading
- base evidence schema contracts
2. Domain Pack
A pack is not just a prompt bundle. It defines domain semantics:
- object kinds
- workflow profiles
- discovery boundaries
- absorb / refine / lint rules
- schemas / templates / prompt resources
The built-in packs are:
- research-tech: the explicit technical research pack and the default workflow pack
- default-knowledge: the compatibility layer
Future domains such as media or medical should arrive as external pack projects.
3. Workflow Profile
A workflow profile is an executable DAG under a pack.
The built-in profiles currently shipped are:
- research-tech/full
- research-tech/autopilot
- default-knowledge/full
Research-Tech Operational Surface
research-tech is no longer only an internal pack. It now has a minimal operational surface:
- ovp-doctor reports default workflow pack, pack roles, operator docs, recipes, and optional vault health
- ovp-export exports minimal compiled artifacts:
  - object-page
  - topic-overview
  - event-dossier
  - contradictions
- ovp-truth reads object / contradiction / neighborhood truth rows directly from knowledge.db
- ovp-ui launches a local UI. The default / entry is the reader-first Knowledge Library; the operator dashboard lives under /ops.
- docs/research-tech/RESEARCH_TECH_SKILLPACK.md
- docs/research-tech/RESEARCH_TECH_VERIFY.md
- docs/recipes/research-tech/*.md
Examples:
ovp-doctor --pack research-tech --json
ovp-truth objects --vault-dir /path/to/vault
ovp-ui --vault-dir /path/to/vault --port 8787
ovp-export --pack research-tech --target topic-overview --output-path /tmp/topic.md
default-knowledge/autopilot
Because research-tech is the default workflow pack, the default workflow path now runs:
ovp --full
ovp-autopilot --yes
You can still select packs explicitly:
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
# compatibility path
ovp --pack default-knowledge --profile full
Plugin Design
The plugin / pack surface is no longer only a design memo. There is now a minimal working integration path.
Two discovery modes are supported:
- Python entry point group: ovp.packs
- Explicit manifest list: OVP_PACK_MANIFESTS=/path/a.yaml:/path/b.yaml
The minimum third-party loading chain is:
- provide a manifest
- declare entrypoints.pack
- return a BaseDomainPack
- pass api_version validation
- select it through --pack / --profile
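The loading chain above can be sketched roughly as follows. Everything beyond the names entrypoints.pack, BaseDomainPack, and api_version is an assumption made for illustration: the "module:attr" entry point format, the placement of api_version on the pack object, the major-version comparison, and the stand-in DemoPack class are all hypothetical and may differ from the real Pack API.

```python
import importlib
import sys
import types

SUPPORTED_API_MAJOR = 1  # assumption: the supported version is not stated in this README


def load_pack(manifest):
    """Resolve entrypoints.pack (assumed 'module:attr' format) and validate api_version."""
    module_name, _, attr = manifest["entrypoints"]["pack"].partition(":")
    factory = getattr(importlib.import_module(module_name), attr)
    pack = factory()
    major = int(str(pack.api_version).split(".")[0])
    if major != SUPPORTED_API_MAJOR:
        raise ValueError(f"unsupported api_version: {pack.api_version}")
    return pack


class DemoPack:
    """Stand-in for a pack; a real one would subclass BaseDomainPack."""

    name = "demo"
    api_version = "1.0"


# Register a throwaway module so the sketch runs without OVP installed.
demo_module = types.ModuleType("demo_pack")
demo_module.get_pack = DemoPack
sys.modules["demo_pack"] = demo_module

pack = load_pack({"entrypoints": {"pack": "demo_pack:get_pack"}})
```

Failing fast on an incompatible api_version keeps a stale third-party pack from being selected through --pack at all.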
Hard boundaries currently enforced by core:
- a pack cannot turn semantic retrieval into canonical identity
- a pack cannot treat knowledge.db as Canonical State
- all Projections must remain rebuildable
Runtime Model
Canonical State Boundary (full definition: ARCHITECTURE)
The system keeps a hard boundary:
- Canonical State: vault markdown + concept registry + evidence + audit
- Projections: Atlas, MOC, graph, knowledge.db, lint, daily delta, crystals
- not Canonical State: knowledge.db
knowledge.db is a Projection. It stores:
- page FTS
- structured links
- mirrored raw sidecars
- timeline / audit events
- deterministic section embeddings
- read-only query / serve surfaces
It is rebuildable and does not own canonical identity resolution.
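To make "rebuildable" concrete, here is a toy projection that is dropped and rebuilt from canonical text without changing query results. It is a sketch only: it borrows the page_fts table name from this README, but the two-column schema, the in-memory database, and the sample pages are assumptions, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def rebuild_page_fts(conn, pages):
    """Drop and rebuild a toy FTS5 table from canonical page text."""
    conn.execute("DROP TABLE IF EXISTS page_fts")
    conn.execute("CREATE VIRTUAL TABLE page_fts USING fts5(slug, body)")
    conn.executemany("INSERT INTO page_fts (slug, body) VALUES (?, ?)", pages.items())
    conn.commit()


def search(conn, query):
    """Return slugs ordered by BM25; FTS5 gives better matches lower scores."""
    rows = conn.execute(
        "SELECT slug FROM page_fts WHERE page_fts MATCH ? ORDER BY bm25(page_fts)",
        (query,),
    )
    return [slug for (slug,) in rows]


canonical_pages = {  # stand-in for vault markdown
    "vector-search": "Vector search ranks chunks by embedding similarity.",
    "bm25": "BM25 ranks documents by keyword relevance.",
}
conn = sqlite3.connect(":memory:")
rebuild_page_fts(conn, canonical_pages)
first = search(conn, "keyword")
rebuild_page_fts(conn, canonical_pages)  # a second rebuild gives the same answers
second = search(conn, "keyword")
```

Because the table can always be regenerated from the canonical pages, losing or deleting it costs rebuild time rather than knowledge.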
The Six Pipeline Stages (full description: RUNTIME)
| Stage | Responsibility | Representative commands | Can the LLM make major decisions here? |
|---|---|---|---|
| Ingest | Normalize incoming material | ovp --step pinboard, ovp --step clippings, ovp-article | No |
| Interpret | Produce deep interpretations | ovp-article, ovp-github, ovp-paper | Yes, with constrained output |
| Absorb | Compile interpretations into lifecycle actions | ovp-absorb, ovp-evergreen | Yes, but only through structured results |
| Refine | Clean up and break down existing notes | ovp-cleanup, ovp-breakdown | Yes, but execution is controlled |
| Normalize | Maintain registry / aliases / identity merges / contradiction detection (formerly Canonical) | ovp-rebuild-registry, ovp-merge-identities, ovp-link-entities, ovp-resolve-contradictions | No |
| Derive | Build Projections: retrieval / graph / crystals / lint | ovp-knowledge-index, ovp-graph, ovp-synthesize-community-crystals, ovp-lint | No |
What ovp --full Actually Runs
Default full pipeline:
pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ knowledge_index
With refine enabled:
pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ refine
→ knowledge_index
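The difference between the two listings is a single insertion before the final step. A minimal sketch, assuming the step names above are the whole story (the function name and list constant are illustrative, not OVP internals):

```python
FULL_STEPS = [
    "pinboard", "pinboard_process", "clippings", "articles", "quality",
    "fix_links", "absorb", "dedup", "note_type_normalize", "registry_sync",
    "moc", "knowledge_index",
]


def build_full_steps(with_refine=False):
    """Return the ordered step list; refine goes right before knowledge_index."""
    steps = list(FULL_STEPS)
    if with_refine:
        steps.insert(steps.index("knowledge_index"), "refine")
    return steps


print(" -> ".join(build_full_steps(with_refine=True)))
```

Keeping knowledge_index as the anchor for the insertion preserves the property that the index is always rebuilt last.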
Important details:
- absorb shells to ovp_pipeline.commands.absorb and emits promoted_slugs for downstream steps
- dedup runs post-absorb concept deduplication scoped to recently promoted slugs (trigram-Jaccard similarity)
- note_type_normalize normalizes note_type metadata across Evergreen files
- refine is a batch wrapper over cleanup + breakdown
- knowledge_index always runs last so knowledge.db reflects final canonical state
- --step evergreen and --from-step evergreen are still accepted and map to absorb
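For readers unfamiliar with the trigram-Jaccard similarity mentioned for dedup, the following sketch shows one common form of it. The README does not say whether OVP uses character or word trigrams or how it normalizes titles, so the character-level trigrams, lowercasing, and whitespace collapsing here are assumptions.

```python
def trigrams(text):
    """Character trigrams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| over character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Strings shorter than three characters have no trigrams to compare.
        return 1.0 if a == b else 0.0
    return len(ta & tb) / len(ta | tb)


score = trigram_jaccard("Knowledge Graph", "knowledge  graphs")
```

A pair of concept titles would then be proposed as duplicates when the score reaches a threshold such as the --threshold 0.82 value shown in the command overview.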
What ovp-autopilot Actually Runs
Default real-time path:
interpretation
→ quality
→ absorb
→ moc
→ knowledge_index
→ auto_commit (optional)
Enable refine explicitly:
ovp-autopilot --watch=inbox --with-refine --yes
That changes the path to:
interpretation
→ quality
→ absorb
→ moc
→ refine
→ knowledge_index
→ auto_commit (optional)
Refine is not hidden or missing. It is wired in but opt-in, so the real-time path does not silently make structural rewrites across the whole knowledge base.
Command Overview
Daily entry points
| Command | Purpose |
|---|---|
| ovp --check | Validate runtime configuration |
| ovp --full | Run the full daily pipeline |
| ovp --full --with-refine | Run full pipeline plus cleanup/breakdown |
| ovp --step absorb | Run only the absorb layer |
| ovp --step refine | Run only the batch refine layer |
| ovp --from-step absorb | Resume from absorb onward |
Content processors
| Command | Purpose |
|---|---|
| ovp-article --process-inbox --vault-dir <vault> | Process raw documents |
| ovp-github --process-single <file> --vault-dir <vault> | Process GitHub inputs |
| ovp-paper --process-single <file> --vault-dir <vault> | Process paper inputs |
Absorb / Refine / Canonical
| Command | Purpose |
|---|---|
| ovp-absorb --recent 7 --json | Absorb recent deep dives |
| ovp-absorb --file <source.md> --dry-run --json | Preview source lifecycle routing before moving or processing source material |
| ovp-evergreen --recent 7 --json | Compatibility alias for ovp-absorb |
| ovp-concept-dedup --vault-dir <vault> --threshold 0.82 | Find and propose concept deduplication clusters |
| ovp-concept-dedup --vault-dir <vault> --apply | Apply deduplication proposal (archive losers, rewrite wikilinks) |
| ovp-cleanup --all --json | Generate cleanup proposals |
| ovp-cleanup --all --write --json | Apply deterministic cleanup |
| ovp-breakdown --all --json | Generate breakdown proposals |
| ovp-breakdown --all --write --json | Apply incremental breakdown |
| ovp-rebuild-registry --json | Reconcile evergreen notes and registry |
| ovp-promote-candidates review | Review candidate lifecycle |
| ovp-moc --scan --vault-dir <vault> | Refresh MOC / Atlas |
Derived layer
| Command | Purpose |
|---|---|
| ovp-knowledge-index --json | Rebuild knowledge.db |
| ovp-knowledge-index --search "query" --json | Run FTS search |
| ovp-knowledge-index --query "question" --json | Run embedding chunk query |
| ovp-knowledge-index --get slug --json | Read a canonical page |
| ovp-knowledge-index --stats --json | Read index stats |
| ovp-knowledge-index --audit-recent --json | Read recent audit events |
| ovp-knowledge-index --tools-json | Emit tool discovery JSON |
| ovp-knowledge-index --serve | Start read-only stdio JSONL service |
| ovp-graph daily today --vault-dir <vault> | Build daily graph delta |
| ovp-lint --check --vault-dir <vault> | Run structure/link checks |
Operations
| Command | Purpose |
|---|---|
| ovp-runtime-state --vault-dir <vault> --write --json | Build the operational runtime state projection from repair markers, workflow actions, pipeline events, and reuse events; writes 60-Logs/runtime-state/current.{json,md} |
| GET /api/runtime-state | Local read endpoint for the provider-facing runtime-state projection; prefers the materialized 60-Logs/runtime-state/current.json and falls back to rebuild when missing |
| POST /api/runtime-state | Refresh and write the materialized runtime-state projection |
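The read and refresh behaviour described for the runtime-state endpoints amounts to "prefer the materialized file, rebuild when it is missing". A minimal sketch of that pattern follows; only the 60-Logs/runtime-state/current.json path comes from this README, while the function names and the rebuild callback are hypothetical stand-ins for OVP's own rebuild logic.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = Path("60-Logs") / "runtime-state" / "current.json"


def read_runtime_state(vault_dir, rebuild):
    """Prefer the materialized projection; fall back to rebuild() when it is missing."""
    path = Path(vault_dir) / STATE_FILE
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return rebuild()


def write_runtime_state(vault_dir, rebuild):
    """Rebuild the state, then materialize it (mirrors the refresh behaviour)."""
    state = rebuild()
    path = Path(vault_dir) / STATE_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


with tempfile.TemporaryDirectory() as vault:
    fresh = read_runtime_state(vault, lambda: {"source": "rebuilt"})
    write_runtime_state(vault, lambda: {"source": "materialized"})
    cached = read_runtime_state(vault, lambda: {"source": "rebuilt"})
```

The first read has nothing on disk and rebuilds; after the write, reads return the materialized copy without calling the rebuild callback.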
Context packs
| Command | Purpose |
|---|---|
| ovp-working-memory --vault-dir <vault> | Write the daily budgeted context pack to 60-Logs/working-memory/YYYY-MM-DD.md and emit trusted reuse events for selected objects |
| ovp-prime --vault-dir <vault> --session-id <id> | Write an OVP Prime session snapshot to 60-Logs/session-snapshots/<id>.md, refresh latest.md, and emit ovp_prime reuse events |
AutoPilot
| Command | Purpose |
|---|---|
| ovp-autopilot --watch=inbox --parallel=1 --yes | Default real-time pipeline |
| ovp-autopilot --watch=inbox,pinboard --yes | Watch multiple sources |
| ovp-autopilot --with-refine --yes | Add refine to the real-time path |
| ovp-autopilot --no-commit --yes | Disable auto-commit |
Directory Layout
vault/
├── 50-Inbox/
│ ├── 01-Raw/
│ ├── 02-Pinboard/
│ └── 03-Processed/
├── 10-Knowledge/
│ ├── Evergreen/
│ └── Atlas/
│ ├── Atlas-Index.md
│ ├── concept-registry.jsonl
│ └── alias-index.json
├── 20-Areas/
│ └── {AI-Research, Investing, Programming, Tools}/Topics/YYYY-MM/
├── 60-Logs/
│ ├── pipeline.jsonl
│ ├── refine-mutations.jsonl
│ ├── transactions/
│ ├── quality-reports/
│ ├── daily-deltas/
│ ├── working-memory/
│ ├── session-snapshots/
│ ├── runtime-state/
│ └── knowledge.db
└── 70-Archive/
What knowledge.db Provides
knowledge.db is a rebuildable local derived index. It currently includes:
- pages_index
- page_fts
- page_links
- raw_data
- timeline_events
- audit_events
- page_embeddings
It exists to power:
- keyword retrieval
- embedding retrieval
- canonical page reads
- audit browsing
- tool discovery and read-only serving
Default discovery now routes through this layer:
- ovp-query uses knowledge.db by default
- keyword retrieval uses FTS5 BM25
- semantic retrieval uses local deterministic embeddings
- QMD is no longer the default runtime dependency; it is opt-in via --engine qmd
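"Local deterministic embeddings" means the same text always maps to the same vector without calling a model service. The README does not describe how OVP computes them, so the feature-hashing scheme below is an assumed illustration of the idea rather than the shipped algorithm; the dimension, tokenization, and function names are all hypothetical.

```python
import hashlib
import math


def embed(text, dim=64):
    """Hash each token into one of `dim` buckets, then L2-normalize the vector."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    # An all-zero vector (empty text or cancelled signs) is returned unnormalized.
    return [v / norm for v in vector] if norm else vector


def cosine(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


query = embed("knowledge graph")
same = embed("knowledge graph")
```

Since no randomness or remote model is involved, rebuilding the embedding table from the same pages reproduces the same vectors, which fits the rebuildable-Projection rule.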
Quick Start
curl -fsSLO https://raw.githubusercontent.com/fakechris/obsidian_vault_pipeline/main/scripts/install-user.sh
less install-user.sh
bash install-user.sh
mkdir -p my-vault
cd my-vault
ovp --check
ovp --full
If you prefer the explicit PyPI two-step flow:
python3 -m pip install --user obsidian-vault-pipeline
python3 -m ovp_pipeline.installer
If your Python installation enforces PEP 668, prefer:
pipx install obsidian-vault-pipeline
The installer prefers a writable, safe bin directory that is already on PATH; if none is available, it falls back to ~/.local/bin. It does not edit your shell configuration.
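The directory choice described above can be sketched as a scan over PATH with a fixed fallback. This is not the installer's code: the README does not define "safe", so the sketch assumes it means "inside the user's home directory", and the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def choose_bin_dir(path_env, home):
    """Pick the first writable PATH entry under home, else fall back to ~/.local/bin."""
    home = Path(home).resolve()
    for entry in path_env.split(os.pathsep):
        if not entry:
            continue
        candidate = Path(entry).resolve()
        inside_home = home in candidate.parents  # assumed meaning of "safe"
        if inside_home and candidate.is_dir() and os.access(candidate, os.W_OK):
            return candidate
    return home / ".local" / "bin"


with tempfile.TemporaryDirectory() as tmp:
    demo_home = Path(tmp).resolve()
    demo_bin = demo_home / "bin"
    demo_bin.mkdir()
    on_path = choose_bin_dir(os.pathsep.join(["/no-such-dir-ovp", str(demo_bin)]), demo_home)
    fallback = choose_bin_dir("/no-such-dir-ovp", demo_home)
    picked_existing = on_path == demo_bin
    picked_fallback = fallback == demo_home / ".local" / "bin"
```

When the fallback is chosen, ~/.local/bin may still need to be added to PATH by hand, since the installer does not edit shell configuration.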
If you want to see the refine layer explicitly:
ovp --full --with-refine
If you want a daemon:
ovp-autopilot --watch=inbox --parallel=1 --yes
Configuration
Put .env in the vault root:
AUTO_VAULT_API_KEY=your_key_here
AUTO_VAULT_API_BASE=https://api.minimaxi.com/anthropic
AUTO_VAULT_MODEL=anthropic/MiniMax-M2.7-highspeed
# Optional
PINBOARD_TOKEN=username:token
HTTP_PROXY=http://127.0.0.1:7897
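Before the first run it can help to confirm that the .env file parses and contains the non-optional keys. The sketch below is a standalone check, not how OVP loads its configuration: the parsing rules and function names are assumptions, and treating the three AUTO_VAULT_* keys as required is an inference from their position above the "# Optional" comment.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("AUTO_VAULT_API_KEY", "AUTO_VAULT_API_BASE", "AUTO_VAULT_MODEL")


def load_env_file(vault_dir):
    """Parse KEY=VALUE lines from <vault>/.env, skipping blanks and # comments."""
    values = {}
    env_path = Path(vault_dir) / ".env"
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return the assumed-required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


with tempfile.TemporaryDirectory() as vault:
    (Path(vault) / ".env").write_text(
        "AUTO_VAULT_API_KEY=your_key_here\n# Optional\nPINBOARD_TOKEN=username:token\n",
        encoding="utf-8",
    )
    env = load_env_file(vault)
    missing = missing_keys(env)
```

In this sample the API base and model are reported as missing, which is the kind of gap ovp --check is meant to surface before a full run.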
Design Principles
- identity consistency before feature growth
- vault files + registry define Canonical State
- knowledge.db is a Projection, never an additional Canonical State
- absorb is part of daily automation; refine is powerful and opt-in by default
- Wiki, MOC, dashboard, briefing, graph, reader pages, and context packs are projections that carry explicit projection metadata and must trace back to source/evidence
- reader-facing UI should explain knowledge first, then expose operator/debug detail
- docs must describe what actually ships, not a future architecture sketch
Related Resources
This document targets: v0.9.3