Knowledge governance plugin for OpenViking — coverage assessment, external search, review, ingest, dedup, and feedback-driven ranking.
OpenViking Curator
Knowledge governance plugin for OpenViking. Curator sits on top of your OV knowledge base — it decides when local knowledge is enough, when to search externally, reviews what comes back, and ingests the good stuff. Your knowledge base grows with every question.
How it works
flowchart TD
Q[Query] --> R[Route]
R --> OV[Retrieve from OV]
OV --> L[Load L0 → L1 → L2]
L --> COV{Coverage OK?}
COV -- yes --> OUT1[Return local context]
COV -- no --> EXT[External search]
EXT --> F{Need fresh?}
F -- yes --> CV[Cross-validate]
F -- no --> J
CV --> J[Judge + conflict]
J --> P{Pass?}
P -- no --> OUT1
P -- yes --> C{Conflict?}
C -- blocked --> OUT1
C -- ok --> ING[Ingest + verify]
ING --> OUT2[Return merged context]
LLM call strategy:
- Coverage sufficient → 0 LLM calls, return immediately
- External search triggered → 1 LLM call (judge + conflict combined)
- Need freshness validation → 2 LLM calls (+ cross-validate)
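The call budget above can be written as a small function. This is only an illustrative sketch: `llm_calls_for_run` and its two boolean parameters are hypothetical names, not part of Curator's API.

```python
def llm_calls_for_run(coverage_ok: bool, need_fresh: bool) -> int:
    """Count the LLM calls for one query, following the strategy listed above."""
    if coverage_ok:
        # Local coverage is sufficient: return immediately, no LLM involved.
        return 0
    calls = 1  # external search triggered: judge + conflict share one call
    if need_fresh:
        calls += 1  # freshness validation adds the cross-validate call
    return calls


print(llm_calls_for_run(coverage_ok=False, need_fresh=True))
```

Note that freshness only matters once external search has been triggered, so a query with sufficient local coverage costs zero calls regardless of `need_fresh`.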
Features
| Feature | Description | Module |
|---|---|---|
| Rule-based routing | Domain, keywords, freshness detection. No LLM needed. JSON-configurable. | router.py |
| Dual-path retrieval | find (vector) + search (LLM intent). URI dedup. | retrieval_v2.py |
| On-demand loading | L0 (abstract) → L1 (overview) → L2 (full). Only deeper when needed. | retrieval_v2.py |
| Coverage assessment | Score gap + keyword overlap signals. 0 LLM calls. | retrieval_v2.py |
| External search | Pluggable providers: Grok, DuckDuckGo, Tavily. Fallback chain or concurrent. | search_providers.py |
| Domain filtering | Whitelist / blacklist external search results by domain. | domain_filter.py |
| Cross-validation | When need_fresh=true. Flags risky/outdated claims. | search.py |
| Judge + conflict | Single LLM call: trust 0-10, freshness, pass/fail, contradiction. Pydantic validated. | review.py |
| Conflict resolution | Configurable: auto / local / external / human. Bidirectional scoring. | pipeline_v2.py |
| Ingest | Writes to OV with metadata (source URLs, version, TTL, quality feedback). | review.py |
| Async ingest | Fire-and-forget background thread. Observable via async job tracker. | pipeline_v2.py + async_jobs.py |
| Auto-summarize | Generates L0/L1 summaries on ingest for new resources. | pipeline_v2.py |
| Dedup scanning | URL hash (exact match) + Jaccard word similarity. Reports only. | dedup.py |
| Freshness scoring | URI timestamp → decay score. Configurable thresholds. | freshness.py |
| Usage-based TTL | Hot / warm / cold tiers. Frequently used → longer TTL. | usage_ttl.py |
| Feedback reranking | up/down/adopt per URI. Score-weighted, rank-aware. OV score stays dominant. | feedback_store.py |
| Decision report | ASCII box, single-line, JSON, HTML export. Always present in run() output. | decision_report.py |
| Session tracking | Records queries + used URIs. Commits to extract long-term memory. | session_manager.py |
| Query logging | Every query → query_log.jsonl with coverage, reasons, LLM calls. | pipeline_v2.py |
| Circuit breaker | 3-state breaker wrapping LLM + search calls. Auto-recovery. | circuit_breaker.py |
| Search cache | LRU + dual TTL. File-locked JSON persistence. | search_cache.py |
| Automated governance | Weekly cycle: audit, flag, proactive search, report. Hybrid async. No auto-deletion. | governance.py |
| Interest analysis | Extract user interests from query log + feedback. Generate proactive search queries. | interest_analyzer.py |
| Background scheduler | APScheduler: periodic freshness scan + weak topic strengthening + governance. | scheduler.py |
| Structured logging | structlog with JSON mode. Per-run context binding (run_id, query). | logging_setup.py |
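As an illustration of the two dedup signals named in the table (URL hash for exact matches, Jaccard word similarity for near-duplicates), here is a minimal sketch. The hash algorithm (SHA-256), the tokenization, and the function names are assumptions made for this example; dedup.py may do this differently.

```python
import hashlib
import re


def url_hash(url: str) -> str:
    """Digest of a URL, so exact duplicates can be matched by hash (SHA-256 assumed)."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercase word sets: |A ∩ B| / |A ∪ B|."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    if not words_a and not words_b:
        return 0.0  # two empty texts: treat as not similar
    return len(words_a & words_b) / len(words_a | words_b)


same_url = url_hash("https://example.com/a") == url_hash("https://example.com/a")
similarity = jaccard("deploy redis in docker", "deploy redis with docker")
print(same_url, round(similarity, 2))
```

In keeping with "Reports only", a scan built on these helpers would list candidate pairs rather than delete anything.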
What Curator does NOT do
- Vector search / indexing → OpenViking handles this
- Answer generation → your LLM; Curator returns structured context, not answers
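Because Curator returns context rather than answers, the caller assembles the prompt for its own LLM. The sketch below shows one way to do that. `build_prompt` is a hypothetical helper, the section labels are arbitrary, and the only assumption about Curator is that the result dict carries the `context_text` and `external_text` keys used in the Python API section of this README.

```python
def build_prompt(question: str, result: dict) -> str:
    """Combine Curator's local and external context into a prompt for your own LLM."""
    parts = []
    local = result.get("context_text") or ""
    external = result.get("external_text") or ""
    if local:
        parts.append("Local knowledge:\n" + local)
    if external:
        parts.append("External findings:\n" + external)
    context = "\n\n".join(parts) if parts else "(no context found)"
    return f"{context}\n\nQuestion: {question}"


# A stand-in result dict; in practice this would come from Curator's run().
fake_result = {"context_text": "Redis runs well in Docker.", "external_text": ""}
print(build_prompt("How to deploy Redis in Docker?", fake_result))
```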
Quick Start
Prerequisites
- Python 3.10+
- A working OpenViking setup (embedded or HTTP mode)
- An OpenAI-compatible API endpoint (for LLM review)
- A search API (Grok recommended, or DuckDuckGo/Tavily)
Install
git clone https://github.com/ponsde/OpenViking_Curator.git
cd OpenViking_Curator
# Recommended: uv (fast, reproducible)
uv sync
source .venv/bin/activate
# Alternative: pip
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # then fill in your keys
Configure
Edit .env:
# OpenViking config (embedded mode)
OPENVIKING_CONFIG_FILE=/path/to/your/ov.conf
# LLM for review & routing (any OpenAI-compatible endpoint)
CURATOR_OAI_BASE=https://your-llm-api.com/v1
CURATOR_OAI_KEY=sk-your-key
# External search (Grok recommended)
CURATOR_GROK_BASE=https://your-grok-endpoint/v1
CURATOR_GROK_KEY=your-grok-key
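Three services to configure, each with its own credentials: OpenViking (via ov.conf), the LLM endpoint, and the search endpoint. That's it.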
Run
python3 curator_query.py --status # Health check
python3 curator_query.py "How to deploy Redis in Docker?" # Query
python3 curator_query.py --review "sensitive topic" # Review mode (no auto-ingest)
python3 mcp_server.py # MCP server (stdio JSON-RPC)
Docker
# Embedded mode: OV runs in-process
cp ov.conf.example ov.conf # fill in your embedding API keys
cp .env.example .env # fill in LLM + search keys
docker compose build
docker compose run --rm curator curator_query.py --status
docker compose run --rm curator curator_query.py "your question"
# HTTP mode: OV runs as a separate service
cp ov.conf.example ov.conf # keep as-is (not read in HTTP mode)
echo "OV_BASE_URL=http://your-ov-host:8080" >> .env
docker compose build
docker compose run --rm curator curator_query.py --status
Python API
from curator.pipeline_v2 import run
result = run("Nginx reverse proxy with SSL?")
print(result["context_text"]) # local context
print(result["external_text"]) # external (if any)
print(result["coverage"]) # 0.0 ~ 1.0
print(result["meta"]["ingested"]) # True if new content stored
print(result["conflict"]) # conflict detection
print(result["decision_report"]) # ASCII decision report
# Reusable pipeline instance (shares session + backend)
from curator.pipeline_v2 import CuratorPipeline
pipeline = CuratorPipeline()
r1 = pipeline.run("how to deploy?")
r2 = pipeline.run("what is RAG?")
# Decision report in other formats
from curator.decision_report import format_report_json, format_report_html
print(format_report_json(result))
print(format_report_html(result))
# Feedback (boosts retrieval ranking next time)
from curator.feedback_store import apply
apply("viking://resources/doc-id", "up") # mark helpful
apply("viking://resources/doc-id", "down") # mark unhelpful
Configuration
All via .env (git-ignored). See .env.example for a full template.
Core (required)
| Variable | Description |
|---|---|
| OPENVIKING_CONFIG_FILE | Path to ov.conf (embedded mode) |
| CURATOR_OAI_BASE | OpenAI-compatible API base URL |
| CURATOR_OAI_KEY | API key for the endpoint above |
| CURATOR_GROK_KEY | Grok API key (external search) |
OV mode
| Variable | Default | Description |
|---|---|---|
| OV_BASE_URL | (empty) | Set to use HTTP mode. Empty = embedded mode. |
| OV_DATA_PATH | ./data | OV data directory (embedded mode) |
Search
| Variable | Default | Description |
|---|---|---|
| CURATOR_SEARCH_PROVIDERS | grok | Comma-separated: grok,duckduckgo,tavily (fallback chain) |
| CURATOR_SEARCH_CONCURRENT | 0 | 1 = fire all providers in parallel |
| CURATOR_SEARCH_TIMEOUT | 60 | Global search timeout (seconds) |
| CURATOR_TAVILY_KEY | (empty) | Tavily API key (if using tavily provider) |
| CURATOR_ALLOWED_DOMAINS | (empty) | Whitelist (comma-separated) |
| CURATOR_BLOCKED_DOMAINS | (empty) | Blacklist (comma-separated) |
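To make the fallback chain and the domain lists concrete, here is a rough sketch of how such settings could be interpreted. All function names are hypothetical and providers are modelled as plain callables. Subdomain matching and "blacklist wins over whitelist" are assumptions of this example, not documented behavior of search_providers.py or domain_filter.py.

```python
from urllib.parse import urlparse


def parse_csv(value: str) -> list[str]:
    """Split a comma-separated setting into lowercase, non-empty entries."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def domain_allowed(url: str, allowed: list[str], blocked: list[str]) -> bool:
    """Apply the blacklist first, then the whitelist (an empty whitelist allows all)."""
    host = (urlparse(url).hostname or "").lower()

    def matches(domain: str) -> bool:
        # Assumption: an entry also covers its subdomains.
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in blocked):
        return False
    return not allowed or any(matches(d) for d in allowed)


def search_with_fallback(query, providers):
    """Try (name, callable) providers in order; return the first non-empty result."""
    for name, search in providers:
        try:
            results = search(query)
        except Exception:
            continue  # provider failed: fall through to the next one
        if results:
            return name, results
    return None, []


def failing_provider(query):
    raise TimeoutError("simulated provider outage")


providers = [("grok", failing_provider), ("duckduckgo", lambda q: ["https://docs.example.com/a"])]
print(search_with_fallback("redis docker", providers))
```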
Thresholds
| Variable | Default | Effect |
|---|---|---|
| CURATOR_THRESHOLD_COV_SUFFICIENT | 0.55 | Above = skip external search |
| CURATOR_THRESHOLD_COV_MARGINAL | 0.45 | Above = marginal (still searches) |
| CURATOR_THRESHOLD_COV_LOW | 0.35 | Below = definitely search |
| CURATOR_THRESHOLD_L0_SUFFICIENT | 0.62 | L0 score to skip L1 |
| CURATOR_THRESHOLD_L1_SUFFICIENT | 0.50 | L1 score to skip L2 |
| CURATOR_MAX_L2_DEPTH | 2 | Max full-text reads per run |
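The sketch below shows how the default thresholds in this table could partition coverage scores and decide how deep to load. It is an interpretation, not the code in retrieval_v2.py. The function names, the "weak" label for the 0.35 to 0.45 gap, and the choice to treat a score exactly at a threshold as meeting it are all assumptions.

```python
# Defaults copied from the table above.
COV_SUFFICIENT = 0.55
COV_MARGINAL = 0.45
COV_LOW = 0.35
L0_SUFFICIENT = 0.62
L1_SUFFICIENT = 0.50


def coverage_band(coverage: float) -> str:
    """Map a coverage score (0.0 to 1.0) to a band name."""
    if coverage >= COV_SUFFICIENT:
        return "sufficient"
    if coverage >= COV_MARGINAL:
        return "marginal"
    if coverage < COV_LOW:
        return "low"
    return "weak"  # assumed label for scores between COV_LOW and COV_MARGINAL


def needs_external_search(coverage: float) -> bool:
    """Only the 'sufficient' band skips external search."""
    return coverage_band(coverage) != "sufficient"


def deepest_level_needed(l0_score: float, l1_score: float) -> str:
    """Stop at the shallowest level whose score clears its threshold."""
    if l0_score >= L0_SUFFICIENT:
        return "L0"
    if l1_score >= L1_SUFFICIENT:
        return "L1"
    return "L2"


print(coverage_band(0.5), needs_external_search(0.5), deepest_level_needed(0.3, 0.55))
```

Lowering CURATOR_THRESHOLD_COV_SUFFICIENT widens the "sufficient" band, which means fewer external searches.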
Background scheduler
| Variable | Default | Description |
|---|---|---|
| CURATOR_SCHEDULER_ENABLED | 0 | 1 to activate background jobs |
| CURATOR_FRESHNESS_INTERVAL_HOURS | 24 | Freshness scan interval |
| CURATOR_STRENGTHEN_INTERVAL_HOURS | 168 | Weak topic strengthening interval (7 days) |
| CURATOR_STRENGTHEN_TOP_N | 3 | Number of weak topics per run |
Governance
| Variable | Default | Description |
|---|---|---|
| CURATOR_GOVERNANCE_ENABLED | 0 | 1 to enable governance cycle |
| CURATOR_GOVERNANCE_INTERVAL_HOURS | 168 | Cycle interval (default 7 days) |
| CURATOR_GOVERNANCE_MODE | normal | normal or team (team adds full audit trail) |
| CURATOR_GOVERNANCE_MAX_PROACTIVE | 5 | Max proactive search queries per cycle |
| CURATOR_GOVERNANCE_SYNC_BUDGET | 0 | Sync queries before async (0 = fully async) |
| CURATOR_GOVERNANCE_LOOKBACK_DAYS | 30 | Query log analysis window |
| CURATOR_GOVERNANCE_DRY_RUN | 0 | 1 to skip writes (audit only) |
| CURATOR_GOVERNANCE_REPLACES_STRENGTHEN | 0 | 1 to skip standalone strengthen when governance is on |
Other
| Variable | Default | Description |
|---|---|---|
| CURATOR_ASYNC_INGEST | 0 | 1 = fire-and-forget background ingest |
| CURATOR_CONFLICT_STRATEGY | auto | auto / local / external / human |
| CURATOR_CB_ENABLED | 1 | Circuit breaker (0 to disable) |
| CURATOR_CACHE_ENABLED | 0 | Search result cache |
| CURATOR_FEEDBACK_WEIGHT | 0.10 | Feedback score adjustment (max delta) |
| CURATOR_JSON_LOGGING | 0 | 1 = JSON structured log output |
| CURATOR_CHAT_RETRY_MAX | 3 | LLM retry attempts |
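To illustrate what a "max delta" feedback weight means for ranking, here is a purely hypothetical scoring rule. The formula, the `rerank` name, and the (ups, downs) tuple format are inventions for this example and are not taken from feedback_store.py; "adopt" feedback is left out for brevity. The point is only that a bounded adjustment lets feedback reorder close results while the OV score stays dominant.

```python
def rerank(results, feedback, weight=0.10):
    """Reorder (uri, ov_score) pairs with a feedback delta bounded by +/- weight."""

    def adjusted(item):
        uri, score = item
        ups, downs = feedback.get(uri, (0, 0))
        total = ups + downs
        # Net feedback normalized to [-1, 1]; no feedback means no adjustment.
        signal = (ups - downs) / total if total else 0.0
        return score + weight * signal

    return sorted(results, key=adjusted, reverse=True)


results = [("viking://resources/a", 0.80), ("viking://resources/b", 0.75), ("viking://resources/c", 0.50)]
feedback = {
    "viking://resources/a": (0, 3),  # only downvotes
    "viking://resources/b": (3, 0),  # only upvotes
    "viking://resources/c": (5, 0),  # upvoted, but far behind on OV score
}
print([uri for uri, _ in rerank(results, feedback)])
```

Here b overtakes a because their OV scores are within the 0.10 budget of each other, while c stays last despite its upvotes.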
Maintenance
# Weak topic analysis
python3 scripts/analyze_weak.py --top 10
# Proactive strengthening
python3 scripts/strengthen.py --top 5
# Freshness scan
python3 scripts/freshness_scan.py --limit 50 # URL reachability
python3 scripts/freshness_scan.py --act # Auto-refresh stale
# TTL rebalance
python3 scripts/ttl_rebalance.py # Report
python3 scripts/ttl_rebalance.py --json # JSON export
# Async job management
python3 scripts/async_job_cli.py list # Overview
python3 scripts/async_job_cli.py list --failed # Failed jobs
python3 scripts/async_job_cli.py replay <job_id> # Re-queue a job
# Governance (automated knowledge maintenance)
python3 -m curator.governance_cli report # View latest report
python3 -m curator.governance_cli report --format json
python3 -m curator.governance_cli report --format html > report.html
python3 -m curator.governance_cli flags # Pending flags
python3 -m curator.governance_cli flags --all # All flags
python3 -m curator.governance_cli show <flag_id> # Flag details
python3 -m curator.governance_cli keep <flag_id> # Mark: keep resource
python3 -m curator.governance_cli delete <flag_id> # Mark: approve deletion
python3 -m curator.governance_cli adjust <flag_id> # Mark: needs adjustment
python3 -m curator.governance_cli ignore <flag_id> # Mark: ignore this flag
python3 -m curator.governance_cli run # Trigger full cycle
python3 -m curator.governance_cli run --dry # Dry run (no writes)
python3 -m curator.governance_cli run --mode team # Team mode (full audit)
Or enable the background scheduler (CURATOR_SCHEDULER_ENABLED=1) to run freshness scans and strengthening automatically. Add CURATOR_GOVERNANCE_ENABLED=1 for automated governance cycles.
Project structure
curator/
pipeline_v2.py # Main pipeline orchestrator
config.py # Config + HTTP client with retry
settings.py # Pydantic Settings v2 (typed, validated)
backend.py # KnowledgeBackend ABC
backend_ov.py # OpenViking backend (embedded + HTTP)
backend_memory.py # In-memory backend (testing)
session_manager.py # Dual-mode OV client
retrieval_v2.py # L0→L1→L2 retrieval + coverage
search.py # External search + cross-validation
search_providers.py # Pluggable provider registry
review.py # LLM judge + ingest + conflict
router.py # Rule-based routing (JSON config)
freshness.py # URI time-decay scoring
usage_ttl.py # Usage-based TTL tiers
dedup.py # Duplicate scanning
decision_report.py # ASCII / JSON / HTML reports
feedback_store.py # Up/down/adopt feedback
domain_filter.py # Domain whitelist/blacklist
circuit_breaker.py # 3-state circuit breaker
search_cache.py # LRU + dual-TTL cache
async_jobs.py # Background job tracking
governance.py # Automated governance cycle (6 phases)
governance_cli.py # Governance CLI (report, flags, run)
governance_report.py # Governance report (ASCII/JSON/HTML)
interest_analyzer.py # User interest extraction + proactive queries
nlp_utils.py # Topic extraction + keyword utils
scheduler.py # APScheduler periodic jobs
logging_setup.py # structlog configuration
file_lock.py # Shared flock utilities
legacy/ # Archived v1
curator_query.py # CLI entry point
mcp_server.py # MCP server (stdio JSON-RPC)
scripts/ # Maintenance scripts
tests/ # 554 tests
Testing
# All tests (uses InMemoryBackend, no OV dependency)
uv run pytest tests/ -v
# Single file
uv run pytest tests/test_core.py -v
# Type checking
uv run mypy curator/ --ignore-missing-imports --exclude curator/legacy/
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Missing required env vars | .env not configured | Fill in CURATOR_OAI_BASE, CURATOR_OAI_KEY, CURATOR_GROK_KEY |
| OV not available | OpenViking not reachable | Check OPENVIKING_CONFIG_FILE (embedded) or OV_BASE_URL (HTTP) |
| 401 Unauthorized | Wrong API key | Check keys in .env |
| Timeout on search | Endpoint unreachable | Check URL and service status |
| Coverage always 0.0 | OV is empty | Ingest some content first, or lower CURATOR_THRESHOLD_COV_SUFFICIENT |
| External always triggered | Thresholds too high | Lower coverage thresholds in .env |
| Judge returns low trust | Weak LLM model | Try a stronger model in CURATOR_JUDGE_MODELS |
Roadmap
- KnowledgeBackend abstraction (OV-agnostic interface)
- Conflict detection + bidirectional resolution
- Pydantic-validated judge output
- Quality feedback loop (feedback → retrieval ranking)
- Enhanced dedup (URL hash + Jaccard)
- Decision report (ASCII + JSON + HTML)
- Async ingest with job tracking + recovery CLI
- Auto-generate L0/L1 summaries on ingest
- Multi-provider search (Grok + DuckDuckGo + Tavily)
- Domain filtering (whitelist / blacklist)
- Usage-based TTL (hot / warm / cold tiers)
- Circuit breaker + search cache
- Structured logging (structlog + JSON mode)
- Background scheduler (freshness + strengthen)
- Docker support (embedded + HTTP OV modes)
- mypy + pre-commit (ruff + ruff-format)
- uv dependency management
- Automated governance (audit, flag, proactive search, report)
- Interest-based proactive search (query log + feedback analysis)
- Hybrid async governance (sync budget + background thread + trace harvest)
- Coverage auto-tuning (dynamic thresholds from query log)
License