Fully Automated Obsidian Knowledge Management Pipeline - A Production-Grade Knowledge Management Pipeline
schema_version: "1.0.0"
note_id: readme_en-5d661efc
title: "Obsidian Vault Pipeline"
description: "An auditable knowledge state runtime for Obsidian"
date: 2026-04-07
type: meta
Obsidian Vault Pipeline
Current document version: v0.9.3
Primary docs:
What This Is
Obsidian Vault Pipeline is not a loose collection of scripts, and it is not only RAG over Markdown. It is a local knowledge state runtime built around an Obsidian vault:
- Capture receives Pinboard, Clippings, raw Markdown, papers, GitHub repos, and web pages while keeping source lifecycle traceable.
- Compile turns material into deep dives, candidates, claims, evidence, relations, contradictions, registry rows, and graph rows.
- Reuse projects compiled knowledge into reader atlas pages, object pages, graph views, briefings, search, context packs, writing prompts, and the operator workbench.
Internally the engineering model still uses six layers: Ingest -> Interpret -> Absorb -> Refine -> Canonical -> Derived. The product narrative is Capture -> Compile -> Reuse.
The current release wires those layers into the actual runtime:
- `ovp --full` runs through `knowledge_index` by default
- `ovp --incremental` is the daily incremental entry point, including recent Pinboard + Clippings and downstream stages
- `ovp --full --with-refine` inserts `refine` before the final derived refresh
- `ovp-autopilot` runs real-time `absorb -> moc -> knowledge_index`
- `ovp-autopilot --with-refine` adds `refine` to that path
- `ovp-ui` provides a local UI. The default `/` entry is now a reader-first Knowledge Library, the operator dashboard lives under `/ops`, object pages expose source/backlink context, and `/graph` (also `/map`) renders a reader-facing knowledge map.
Why The Architecture Looks Like This
This repository started as a set of Obsidian automation scripts, but that model stopped scaling once the system grew:
- the main runtime and individual scripts drifted apart
- concepts, links, Atlas, graph, and retrieval indexes were tightly coupled without a clean truth boundary
- new domains like media, medical, or engineering research could not be modeled safely with a concept-only core
The current architecture is the direct answer to those failures:
- Capture -> Compile -> Reuse explains the product value
- source -> observation -> claim -> evidence -> validity -> projection -> permission explains the long-term knowledge state
- the six-layer runtime makes orchestration, canonical state, and derived state explicit
- `research-tech` makes the current engineering research semantics explicit
- `default-knowledge` is being reduced to a default compatibility layer instead of carrying every domain semantic
- Pack API turns future domains into installable packs rather than more hardcoded branches inside the runtime
So the project is no longer just a Vault automation repo. It is now:
a reader-first, evidence-backed knowledge atlas over an auditable knowledge state runtime
with:
- `research-tech` as the first explicit built-in standard pack
- `default-knowledge` retained as the default compatibility pack
- `knowledge.db` as a derived store, never Authority
- vault markdown + registry + evidence chains as the long-term trust boundary
Current Roadmap
OVP is evolving from a personal Zettelkasten into a typed knowledge platform — reader-first for humans, programmable for agents, extensible through domain packs.
- active backlog: BACKLOG.md
- current milestone: MILESTONE.md
- current merged roadmap rationale: docs/plans/2026-04-29-consolidated-product-roadmap.md
- reader product-shape note: docs/plans/2026-04-29-reader-product-shape-and-backlog-reconciliation.md
Current milestone sequence:
| Milestone | Status | Meaning |
|---|---|---|
| M0–M3 | Done | Foundation, operator workbench, roadmap consolidation, reader-first atlas |
| M4 KSR Safety And Hot-Path Hardening | Done | projection labels, hot-path audit, wiring evals, evidence spans, candidate risk, JSONL streaming, projection lifecycle hardening |
| M5 Context Pack And Operational Runtime | Done | session snapshots, context budget, runtime state, runtime-state API, action queue health |
| M5a Quality And Dedup Hardening | Done | concept dedup pipeline integration, promote semantic guard, historical data cleanup |
| M8 Type Unification And Extraction Quality | Active | unified object kind taxonomy, Layer 1 entity_type, body-size-aware extraction, quote-grounding, single-pass LLM refactor |
| M9 Pack As Domain Ontology | Next | pack-defined object kind specs, typed relation constraints, schema registry |
| M10 Operational Knowledge Layer | Later | action types, permissions, cross-entity aggregation, decision memory |
Recent major changes (PRs #98–#101):
- JSONL streaming hardening, advisory file locks, runtime-state API fixes
- Four-phase architecture refactor: module boundary cleanup, route hardening (CSP/CSRF), projection lifecycle
- Concept dedup pipeline integration with scoped `scope_slugs` parameter
- Promote semantic guard: trigram-Jaccard pre-check merges near-duplicate candidates into existing Evergreens
- Historical Evergreen data cleanup (71→61 active Evergreens)
- `find_similar_slugs` utility for similarity checking
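To make the trigram-Jaccard pre-check concrete, here is a minimal sketch of that similarity measure. It is not the project's `find_similar_slugs` implementation: the function names are hypothetical, the choice to lowercase and skip padding is an assumption, and the `0.82` default is borrowed from the `ovp-concept-dedup --threshold 0.82` example rather than from the promote guard itself.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a lowercased string."""
    s = text.lower()
    if len(s) < 3:
        # Strings shorter than three characters form a single gram.
        return {s} if s else set()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(union)


def near_duplicates(candidate: str, existing: list[str],
                    threshold: float = 0.82) -> list[tuple[str, float]]:
    """Hypothetical helper: existing slugs at or above the threshold, best first."""
    scored = [(slug, trigram_jaccard(candidate, slug)) for slug in existing]
    return sorted((p for p in scored if p[1] >= threshold), key=lambda p: -p[1])
```

With this sketch, `knowledge-graph` and `knowledge-graphs` share 13 of 14 distinct trigrams, so the pair would clear a 0.82 threshold, while unrelated slugs score 0.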
Domain Packs
The core runtime is now being formalized as a pack-aware platform.
- Built-in standard pack: `research-tech`
- Default compatibility pack: `default-knowledge`
- Runtime selection is exposed through `--pack` and `--profile`
- Third-party packs can be discovered through the `ovp.packs` entry point group or the `OVP_PACK_MANIFESTS` manifest list
Examples:
ovp-packs
ovp-doctor --pack research-tech --json
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
ovp --pack default-knowledge --profile full
Pack API documentation for third-party developers lives in:
- docs/pack-api/README.md
- docs/pack-api/manifest-and-hooks.md
- docs/pack-api/dogfooding-with-media-pack.md
Platform Architecture
From a platform perspective, the system now has three layers:
- Core Platform
- Domain Pack
- Workflow Profile
1. Core Platform
Core owns the cross-domain pieces that must remain stable:
- runtime / vault layout
- CLI orchestration
- autopilot / queue / watcher
- canonical identity helpers
- registry framework
- derived `knowledge.db`
- graph / lint / audit infrastructure
- plugin / pack loading
- base evidence schema contracts
2. Domain Pack
A pack is not just a prompt bundle. It defines domain semantics:
- object kinds
- workflow profiles
- discovery boundaries
- absorb / refine / lint rules
- schemas / templates / prompt resources
The built-in packs are:
- `research-tech`: the explicit technical research pack and the default workflow pack
- `default-knowledge`: the compatibility layer
Future domains such as media or medical should arrive as external pack projects.
3. Workflow Profile
A workflow profile is an executable DAG under a pack.
The built-in profiles currently shipped are:
- `research-tech/full`
- `research-tech/autopilot`
- `default-knowledge/full`
Research-Tech Operational Surface
research-tech is no longer only an internal pack. It now has a minimal operational surface:
- `ovp-doctor` reports default workflow pack, pack roles, operator docs, recipes, and optional vault health
- `ovp-export` exports minimal compiled artifacts:
  - `object-page`
  - `topic-overview`
  - `event-dossier`
  - `contradictions`
- `ovp-truth` reads object / contradiction / neighborhood truth rows directly from `knowledge.db`
- `ovp-ui` launches a local UI. The default `/` entry is the reader-first Knowledge Library; the operator dashboard lives under `/ops`.
- docs/research-tech/RESEARCH_TECH_SKILLPACK.md
- docs/research-tech/RESEARCH_TECH_VERIFY.md
- docs/recipes/research-tech/*.md
Examples:
ovp-doctor --pack research-tech --json
ovp-truth objects --vault-dir /path/to/vault
ovp-ui --vault-dir /path/to/vault --port 8787
ovp-export --pack research-tech --target topic-overview --output-path /tmp/topic.md
default-knowledge/autopilot
That is why the default workflow path now runs:
ovp --full
ovp-autopilot --yes
You can still select packs explicitly:
ovp --pack research-tech --profile full
ovp-autopilot --pack research-tech --profile autopilot --yes
# compatibility path
ovp --pack default-knowledge --profile full
Plugin Design
The plugin / pack surface is no longer only a design memo. There is now a minimal working integration path.
Two discovery modes are supported:
- Python entry point group: `ovp.packs`
- Explicit manifest list: `OVP_PACK_MANIFESTS=/path/a.yaml:/path/b.yaml`
The minimum third-party loading chain is:
- provide a manifest
- declare `entrypoints.pack`
- return a `BaseDomainPack`
- pass `api_version` validation
- select it through `--pack` / `--profile`
Hard boundaries currently enforced by core:
- a pack cannot turn semantic retrieval into canonical identity
- a pack cannot treat `knowledge.db` as Authority
- a pack cannot bypass audit/logging
- all derived state must remain rebuildable
Runtime Model
Authority Boundary
The system keeps a hard boundary:
- Authority: vault markdown + concept registry
- derived views: Atlas, MOC, graph, `knowledge.db`, lint, daily delta
- not Authority: `knowledge.db`
knowledge.db is the GBrain-inspired derived index layer. It stores:
- page FTS
- structured links
- mirrored raw sidecars
- timeline / audit events
- deterministic section embeddings
- read-only query / serve surfaces
It is rebuildable and does not own canonical identity resolution.
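The README does not say how the deterministic section embeddings are computed. One common way to get local, reproducible vectors without a model is feature hashing, sketched below purely as an assumption; the function names, the 64-dimension default, and the tokenizer are all hypothetical.

```python
import hashlib
import math
import re


def embed(text: str, dim: int = 64) -> list[float]:
    """Hash each token into a signed bucket, then L2-normalize the vector."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because the vector depends only on the text and a fixed hash, rebuilding the index from the same vault would reproduce the same embeddings, which fits the rebuildable, non-Authority role described above.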
The Six Layers
| Layer | Responsibility | Representative commands | Can the LLM make major decisions here? |
|---|---|---|---|
| Ingest | Normalize incoming material | `ovp --step pinboard`, `ovp --step clippings`, `ovp-article` | No |
| Interpret | Produce deep interpretations | `ovp-article`, `ovp-github`, `ovp-paper` | Yes, with constrained output |
| Absorb | Compile interpretations into lifecycle actions | `ovp-absorb`, `ovp-evergreen` | Yes, but only through structured results |
| Refine | Clean up and break down existing notes | `ovp-cleanup`, `ovp-breakdown` | Yes, but execution is controlled |
| Canonical | Maintain registry / aliases / Atlas / MOC | `ovp-rebuild-registry`, `ovp-moc`, `ovp-promote-candidates` | No |
| Derived | Build retrieval / graph / lint views | `ovp-knowledge-index`, `ovp-graph`, `ovp-lint` | No |
What ovp --full Actually Runs
Default full pipeline:
pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ knowledge_index
With refine enabled:
pinboard
→ pinboard_process
→ clippings
→ articles
→ quality
→ fix_links
→ absorb
→ dedup
→ note_type_normalize
→ registry_sync
→ moc
→ refine
→ knowledge_index
Important details:
- `absorb` shells to `ovp_pipeline.commands.absorb` and emits `promoted_slugs` for downstream steps
- `dedup` runs post-absorb concept deduplication scoped to recently promoted slugs (trigram-Jaccard similarity)
- `note_type_normalize` normalizes note_type metadata across Evergreen files
- `refine` is a batch wrapper over `cleanup + breakdown`
- `knowledge_index` always runs last so `knowledge.db` reflects final canonical state
- `--step evergreen` and `--from-step evergreen` are still accepted and map to `absorb`
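The ordering rules above can be summarized in a small sketch. The step names and the `evergreen` alias come from this README; `plan_steps` and the two constants are hypothetical names and do not describe how the runtime is actually structured.

```python
from typing import Optional

FULL_STEPS = [
    "pinboard", "pinboard_process", "clippings", "articles", "quality",
    "fix_links", "absorb", "dedup", "note_type_normalize", "registry_sync",
    "moc", "knowledge_index",
]
# Legacy step name still accepted on the command line.
STEP_ALIASES = {"evergreen": "absorb"}


def plan_steps(with_refine: bool = False,
               from_step: Optional[str] = None) -> list[str]:
    """Return the ordered step list for one full run."""
    steps = list(FULL_STEPS)
    if with_refine:
        # refine goes directly before the final derived refresh.
        steps.insert(steps.index("knowledge_index"), "refine")
    if from_step is not None:
        start = STEP_ALIASES.get(from_step, from_step)
        steps = steps[steps.index(start):]
    return steps
```

For instance, `plan_steps(with_refine=True, from_step="evergreen")` starts at `absorb` and still ends with `refine` followed by `knowledge_index`.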
What ovp-autopilot Actually Runs
Default real-time path:
interpretation
→ quality
→ absorb
→ moc
→ knowledge_index
→ auto_commit (optional)
Enable refine explicitly:
ovp-autopilot --watch=inbox --with-refine --yes
That changes the path to:
interpretation
→ quality
→ absorb
→ moc
→ refine
→ knowledge_index
→ auto_commit (optional)
Refine is not hidden or missing. It is wired in, but opt-in by default to avoid silent real-time structural rewrites of the whole knowledge base.
Command Overview
Daily entry points
| Command | Purpose |
|---|---|
| `ovp --check` | Validate runtime configuration |
| `ovp --full` | Run the full daily pipeline |
| `ovp --full --with-refine` | Run full pipeline plus cleanup/breakdown |
| `ovp --step absorb` | Run only the absorb layer |
| `ovp --step refine` | Run only the batch refine layer |
| `ovp --from-step absorb` | Resume from absorb onward |
Content processors
| Command | Purpose |
|---|---|
| `ovp-article --process-inbox --vault-dir <vault>` | Process raw documents |
| `ovp-github --process-single <file> --vault-dir <vault>` | Process GitHub inputs |
| `ovp-paper --process-single <file> --vault-dir <vault>` | Process paper inputs |
Absorb / Refine / Canonical
| Command | Purpose |
|---|---|
| `ovp-absorb --recent 7 --json` | Absorb recent deep dives |
| `ovp-absorb --file <source.md> --dry-run --json` | Preview source lifecycle routing before moving or processing source material |
| `ovp-evergreen --recent 7 --json` | Compatibility alias for `ovp-absorb` |
| `ovp-concept-dedup --vault-dir <vault> --threshold 0.82` | Find and propose concept deduplication clusters |
| `ovp-concept-dedup --vault-dir <vault> --apply` | Apply deduplication proposal (archive losers, rewrite wikilinks) |
| `ovp-cleanup --all --json` | Generate cleanup proposals |
| `ovp-cleanup --all --write --json` | Apply deterministic cleanup |
| `ovp-breakdown --all --json` | Generate breakdown proposals |
| `ovp-breakdown --all --write --json` | Apply incremental breakdown |
| `ovp-rebuild-registry --json` | Reconcile evergreen notes and registry |
| `ovp-promote-candidates review` | Review candidate lifecycle |
| `ovp-moc --scan --vault-dir <vault>` | Refresh MOC / Atlas |
Derived layer
| Command | Purpose |
|---|---|
| `ovp-knowledge-index --json` | Rebuild `knowledge.db` |
| `ovp-knowledge-index --search "query" --json` | Run FTS search |
| `ovp-knowledge-index --query "question" --json` | Run embedding chunk query |
| `ovp-knowledge-index --get slug --json` | Read a canonical page |
| `ovp-knowledge-index --stats --json` | Read index stats |
| `ovp-knowledge-index --audit-recent --json` | Read recent audit events |
| `ovp-knowledge-index --tools-json` | Emit tool discovery JSON |
| `ovp-knowledge-index --serve` | Start read-only stdio JSONL service |
| `ovp-graph daily today --vault-dir <vault>` | Build daily graph delta |
| `ovp-lint --check --vault-dir <vault>` | Run structure/link checks |
Operations
| Command | Purpose |
|---|---|
| `ovp-runtime-state --vault-dir <vault> --write --json` | Build the operational runtime state projection from repair markers, workflow actions, pipeline events, and reuse events; writes `60-Logs/runtime-state/current.{json,md}` |
| `GET /api/runtime-state` | Local read endpoint for the provider-facing runtime-state projection; prefers the materialized `60-Logs/runtime-state/current.json` and falls back to rebuild when missing |
| `POST /api/runtime-state` | Refresh and write the materialized runtime-state projection |
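The read preference described for `GET /api/runtime-state` (use the materialized file, rebuild only when it is missing) can be sketched as follows. The file path comes from the table above; `read_runtime_state` and the `rebuild` callback are hypothetical, and the shape of the JSON payload is not specified in this README.

```python
import json
from pathlib import Path
from typing import Callable


def read_runtime_state(vault_dir: str, rebuild: Callable[[], dict]) -> dict:
    """Prefer the materialized projection; fall back to a rebuild."""
    current = Path(vault_dir) / "60-Logs" / "runtime-state" / "current.json"
    if current.is_file():
        return json.loads(current.read_text(encoding="utf-8"))
    # No materialized file yet: compute the projection on demand.
    return rebuild()
```

Keeping the rebuild behind a callback mirrors the idea that the materialized file is only a cache of a projection that can always be recomputed.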
Context packs
| Command | Purpose |
|---|---|
| `ovp-working-memory --vault-dir <vault>` | Write the daily budgeted context pack to `60-Logs/working-memory/YYYY-MM-DD.md` and emit trusted reuse events for selected objects |
| `ovp-prime --vault-dir <vault> --session-id <id>` | Write an OVP Prime session snapshot to `60-Logs/session-snapshots/<id>.md`, refresh `latest.md`, and emit `ovp_prime` reuse events |
AutoPilot
| Command | Purpose |
|---|---|
| `ovp-autopilot --watch=inbox --parallel=1 --yes` | Default real-time pipeline |
| `ovp-autopilot --watch=inbox,pinboard --yes` | Watch multiple sources |
| `ovp-autopilot --with-refine --yes` | Add refine to the real-time path |
| `ovp-autopilot --no-commit --yes` | Disable auto-commit |
Directory Layout
vault/
├── 50-Inbox/
│ ├── 01-Raw/
│ ├── 02-Pinboard/
│ └── 03-Processed/
├── 10-Knowledge/
│ ├── Evergreen/
│ └── Atlas/
│ ├── Atlas-Index.md
│ ├── concept-registry.jsonl
│ └── alias-index.json
├── 20-Areas/
│ └── {AI-Research, Investing, Programming, Tools}/Topics/YYYY-MM/
├── 60-Logs/
│ ├── pipeline.jsonl
│ ├── refine-mutations.jsonl
│ ├── transactions/
│ ├── quality-reports/
│ ├── daily-deltas/
│ ├── working-memory/
│ ├── session-snapshots/
│ ├── runtime-state/
│ └── knowledge.db
└── 70-Archive/
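If you want to check an existing vault against this layout before a first run, a small helper along these lines may be useful. It is a sketch, not part of the shipped CLI: `missing_dirs` is a hypothetical name, and whether the pipeline creates any of these directories on its own is not stated here.

```python
from pathlib import Path

# Directories taken from the layout above; files and dated subfolders are skipped.
EXPECTED_DIRS = [
    "50-Inbox/01-Raw",
    "50-Inbox/02-Pinboard",
    "50-Inbox/03-Processed",
    "10-Knowledge/Evergreen",
    "10-Knowledge/Atlas",
    "20-Areas",
    "60-Logs",
    "70-Archive",
]


def missing_dirs(vault_dir: str) -> list[str]:
    """Return the expected directories that do not exist under the vault root."""
    root = Path(vault_dir)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

An empty result means every directory in the sketch's list is present; it says nothing about registry files or `knowledge.db`.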
What knowledge.db Provides
knowledge.db is a rebuildable local derived index. It currently includes:
- `pages_index`
- `page_fts`
- `page_links`
- `raw_data`
- `timeline_events`
- `audit_events`
- `page_embeddings`
It exists to power:
- keyword retrieval
- embedding retrieval
- canonical page reads
- audit browsing
- tool discovery and read-only serving
Default discovery now routes through this layer:
- `ovp-query` uses `knowledge.db` by default
- keyword retrieval uses FTS5 BM25
- semantic retrieval uses local deterministic embeddings
- QMD is no longer the default runtime dependency; it is opt-in via `--engine qmd`
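For readers unfamiliar with FTS5 BM25 ranking, the sketch below shows the general technique with Python's built-in `sqlite3` module. It assumes the local SQLite build includes FTS5, uses a throwaway in-memory table named `demo_fts`, and does not reproduce the real `knowledge.db` schema or the `ovp-query` code path.

```python
import sqlite3


def build_index(pages: dict[str, str]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table from a slug -> body mapping."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE demo_fts USING fts5(slug UNINDEXED, body)")
    conn.executemany("INSERT INTO demo_fts (slug, body) VALUES (?, ?)", pages.items())
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[str]:
    """Return matching slugs, best BM25 match first."""
    rows = conn.execute(
        # bm25() yields lower values for better matches, so ascending order ranks best first.
        "SELECT slug FROM demo_fts WHERE demo_fts MATCH ? "
        "ORDER BY bm25(demo_fts) LIMIT ?",
        (query, limit),
    )
    return [slug for (slug,) in rows]
```

Because the table is built entirely from the input mapping, dropping and rebuilding it loses nothing, which is the same property the README claims for `knowledge.db`.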
Quick Start
curl -fsSLO https://raw.githubusercontent.com/fakechris/obsidian_vault_pipeline/main/scripts/install-user.sh
less install-user.sh
bash install-user.sh
mkdir -p my-vault
cd my-vault
ovp --check
ovp --full
If you prefer the explicit PyPI two-step flow:
python3 -m pip install --user obsidian-vault-pipeline
python3 -m ovp_pipeline.installer
If your Python installation enforces PEP 668, prefer:
pipx install obsidian-vault-pipeline
The installer prefers a writable, safe bin directory that is already on PATH; if none is available, it falls back to ~/.local/bin. It does not edit your shell configuration.
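The selection rule can be pictured roughly like this. The README does not define what counts as a "safe" directory, so the sketch approximates it as "inside the user's home directory"; that criterion and the name `choose_bin_dir` are assumptions, not the installer's actual logic.

```python
import os
from pathlib import Path


def choose_bin_dir(path_value: str, home: Path) -> Path:
    """Pick the first writable PATH entry under home, else ~/.local/bin."""
    for entry in path_value.split(os.pathsep):
        if not entry:
            continue
        candidate = Path(entry)
        # Assumption: "safe" means the directory lives inside the home directory.
        inside_home = home in candidate.parents
        if inside_home and candidate.is_dir() and os.access(candidate, os.W_OK):
            return candidate
    return home / ".local" / "bin"
```

If the fallback is used and `~/.local/bin` is not on PATH, you would need to add it yourself, since the installer does not edit shell configuration.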
If you want to see the refine layer explicitly:
ovp --full --with-refine
If you want a daemon:
ovp-autopilot --watch=inbox --parallel=1 --yes
Configuration
Put .env in the vault root:
AUTO_VAULT_API_KEY=your_key_here
AUTO_VAULT_API_BASE=https://api.minimaxi.com/anthropic
AUTO_VAULT_MODEL=anthropic/MiniMax-M2.7-highspeed
# Optional
PINBOARD_TOKEN=username:token
HTTP_PROXY=http://127.0.0.1:7897
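To sanity-check a `.env` file before running `ovp --check`, a minimal parser such as the one below can help. It is a sketch under stated assumptions: the pipeline's own loader may handle quoting or exports differently, and treating the three `AUTO_VAULT_*` keys as required is inferred from the `# Optional` comment rather than documented.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' or ':'.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_required(
    values: dict[str, str],
    required=("AUTO_VAULT_API_KEY", "AUTO_VAULT_API_BASE", "AUTO_VAULT_MODEL"),
) -> list[str]:
    """Return required keys that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

Running `missing_required(parse_env(open(".env").read()))` from the vault root would list any of the three keys that still need a value.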
Design Principles
- identity consistency before feature growth
- vault files + registry define canonical state
- `knowledge.db` is derived retrieval, never a second Authority
- absorb is part of daily automation; refine is powerful and opt-in by default
- Wiki, MOC, dashboard, briefing, graph, reader pages, and context packs are projections that carry explicit projection metadata and must trace back to source/evidence
- reader-facing UI should explain knowledge first, then expose operator/debug detail
- docs must describe what actually ships, not a future architecture sketch
Related Resources
This document targets: v0.9.3