OpenViking Curator

Knowledge governance plugin for OpenViking. Curator sits on top of your OV knowledge base — it decides when local knowledge is enough, when to search externally, reviews what comes back, and ingests the good stuff. Your knowledge base grows with every question.

How it works

flowchart TD
    Q[Query] --> R[Route]
    R --> OV[Retrieve from OV]
    OV --> L[Load L0 → L1 → L2]
    L --> COV{Coverage OK?}
    COV -- yes --> OUT1[Return local context]
    COV -- no --> EXT[External search]
    EXT --> F{Need fresh?}
    F -- yes --> CV[Cross-validate]
    F -- no --> J
    CV --> J[Judge + conflict]
    J --> P{Pass?}
    P -- no --> OUT1
    P -- yes --> C{Conflict?}
    C -- blocked --> OUT1
    C -- ok --> ING[Ingest + verify]
    ING --> OUT2[Return merged context]

LLM call strategy:

  • Coverage sufficient → 0 LLM calls, return immediately
  • External search triggered → 1 LLM call (judge + conflict combined)
  • Need freshness validation → 2 LLM calls (+ cross-validate)
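
The call budget above can be written down as a tiny function. This is only a sketch of the three documented cases; `expected_llm_calls` and its parameter names are hypothetical and not part of Curator's API.

```python
def expected_llm_calls(coverage_sufficient: bool, need_fresh: bool) -> int:
    """Return how many LLM calls one run is expected to make (illustrative only)."""
    if coverage_sufficient:
        return 0  # local context is returned immediately
    calls = 1  # combined judge + conflict call after external search
    if need_fresh:
        calls += 1  # extra cross-validation call
    return calls


if __name__ == "__main__":
    for cov, fresh in [(True, False), (False, False), (False, True)]:
        print(f"coverage_ok={cov} need_fresh={fresh} -> {expected_llm_calls(cov, fresh)} call(s)")
```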

Features

| Feature | Description | Module |
| --- | --- | --- |
| Rule-based routing | Domain, keywords, freshness detection. No LLM needed. JSON-configurable. | router.py |
| Dual-path retrieval | find (vector) + search (LLM intent). URI dedup. | retrieval_v2.py |
| On-demand loading | L0 (abstract) → L1 (overview) → L2 (full). Only deeper when needed. | retrieval_v2.py |
| Coverage assessment | Score gap + keyword overlap signals. 0 LLM calls. | retrieval_v2.py |
| External search | Pluggable providers: Grok, DuckDuckGo, Tavily. Fallback chain or concurrent. | search_providers.py |
| Domain filtering | Whitelist / blacklist external search results by domain. | domain_filter.py |
| Cross-validation | When need_fresh=true. Flags risky/outdated claims. | search.py |
| Judge + conflict | Single LLM call: trust 0-10, freshness, pass/fail, contradiction. Pydantic validated. | review.py |
| Conflict resolution | Configurable: auto / local / external / human. Bidirectional scoring. | pipeline_v2.py |
| Ingest | Writes to OV with metadata (source URLs, version, TTL, quality feedback). | review.py |
| Async ingest | Fire-and-forget background thread. Observable via async job tracker. | pipeline_v2.py + async_jobs.py |
| Auto-summarize | Generates L0/L1 summaries on ingest for new resources. | pipeline_v2.py |
| Dedup scanning | URL hash (exact match) + Jaccard word similarity. Reports only. | dedup.py |
| Freshness scoring | URI timestamp → decay score. Configurable thresholds. | freshness.py |
| Usage-based TTL | Hot / warm / cold tiers. Frequently used → longer TTL. | usage_ttl.py |
| Feedback reranking | up/down/adopt per URI. Score-weighted, rank-aware. OV score stays dominant. | feedback_store.py |
| Decision report | ASCII box, single-line, JSON, HTML export. Always present in run() output. | decision_report.py |
| Session tracking | Records queries + used URIs. Commits to extract long-term memory. | session_manager.py |
| Query logging | Every query → query_log.jsonl with coverage, reasons, LLM calls. | pipeline_v2.py |
| Circuit breaker | 3-state breaker wrapping LLM + search calls. Auto-recovery. | circuit_breaker.py |
| Search cache | LRU + dual TTL. File-locked JSON persistence. | search_cache.py |
| Automated governance | Weekly cycle: audit, flag, proactive search, report. Hybrid async. No auto-deletion. | governance.py |
| Interest analysis | Extract user interests from query log + feedback. Generate proactive search queries. | interest_analyzer.py |
| Background scheduler | APScheduler: periodic freshness scan + weak topic strengthening + governance. | scheduler.py |
| Structured logging | structlog with JSON mode. Per-run context binding (run_id, query). | logging_setup.py |
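
To make the dedup row concrete, here is a minimal sketch of the two signals it names: an exact URL-hash match and Jaccard similarity over word sets. It is not the code in dedup.py. The function names, the trailing-slash normalization, the 0.8 threshold, and the input shape are all assumptions made for illustration. Like the feature itself, it only reports candidate pairs and changes nothing.

```python
import hashlib
import re
from itertools import combinations


def url_hash(url: str) -> str:
    """Hash a lightly normalized URL (assumption: only whitespace and a trailing slash are ignored)."""
    normalized = url.strip().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two texts."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_duplicates(
    docs: dict[str, dict[str, str]], threshold: float = 0.8
) -> list[tuple[str, str, str]]:
    """Report (id_a, id_b, reason) for likely duplicates; nothing is deleted."""
    report = []
    for (id_a, a), (id_b, b) in combinations(docs.items(), 2):
        if url_hash(a["url"]) == url_hash(b["url"]):
            report.append((id_a, id_b, "url"))
        elif jaccard(a["text"], b["text"]) >= threshold:
            report.append((id_a, id_b, "jaccard"))
    return report


if __name__ == "__main__":
    sample = {
        "a": {"url": "https://x.org/p/", "text": "deploy redis in docker"},
        "b": {"url": "https://x.org/p", "text": "totally different words here"},
        "c": {"url": "https://y.org/q", "text": "Deploy Redis in Docker"},
    }
    print(find_duplicates(sample))
```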

What Curator does NOT do

  • Vector search / indexing → OpenViking handles this
  • Answer generation → your LLM; Curator returns structured context, not answers
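
Because Curator stops at structured context, the answer step is left to the caller. The sketch below shows one way to turn a result dictionary into a prompt for your own LLM. It relies only on the `context_text` and `external_text` keys shown in the Python API section; `build_prompt` and the prompt wording are hypothetical.

```python
def build_prompt(question: str, result: dict) -> str:
    """Assemble an LLM prompt from a Curator-style result dict (illustrative helper)."""
    parts = []
    local = (result.get("context_text") or "").strip()
    external = (result.get("external_text") or "").strip()
    if local:
        parts.append("Local knowledge:\n" + local)
    if external:
        parts.append("External findings:\n" + external)
    context = "\n\n".join(parts) if parts else "(no context found)"
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # A hand-written stand-in for the dict returned by run(); values are made up.
    fake_result = {"context_text": "Use the official redis image.", "external_text": ""}
    print(build_prompt("How to deploy Redis in Docker?", fake_result))
```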

Quick Start

Prerequisites

  • Python 3.10+
  • A working OpenViking setup (embedded or HTTP mode)
  • An OpenAI-compatible API endpoint (for LLM review)
  • A search API (Grok recommended, or DuckDuckGo/Tavily)

Install

git clone https://github.com/ponsde/OpenViking_Curator.git
cd OpenViking_Curator

# Recommended: uv (fast, reproducible)
uv sync
source .venv/bin/activate

# Alternative: pip
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env   # then fill in your keys

Configure

Edit .env:

# OpenViking config (embedded mode)
OPENVIKING_CONFIG_FILE=/path/to/your/ov.conf

# LLM for review & routing (any OpenAI-compatible endpoint)
CURATOR_OAI_BASE=https://your-llm-api.com/v1
CURATOR_OAI_KEY=sk-your-key

# External search (Grok recommended)
CURATOR_GROK_BASE=https://your-grok-endpoint/v1
CURATOR_GROK_KEY=your-grok-key

Three endpoints, three keys (LLM, search, and the embedding API configured in ov.conf). That's it.

Run

python3 curator_query.py --status                         # Health check
python3 curator_query.py "How to deploy Redis in Docker?" # Query
python3 curator_query.py --review "sensitive topic"       # Review mode (no auto-ingest)
python3 mcp_server.py                                     # MCP server (stdio JSON-RPC)

Docker

# Embedded mode: OV runs in-process
cp ov.conf.example ov.conf   # fill in your embedding API keys
cp .env.example .env         # fill in LLM + search keys
docker compose build
docker compose run --rm curator curator_query.py --status
docker compose run --rm curator curator_query.py "your question"

# HTTP mode: OV runs as a separate service
cp ov.conf.example ov.conf   # keep as-is (not read in HTTP mode)
echo "OV_BASE_URL=http://your-ov-host:8080" >> .env
docker compose build
docker compose run --rm curator curator_query.py --status

Python API

from curator.pipeline_v2 import run

result = run("Nginx reverse proxy with SSL?")
print(result["context_text"])         # local context
print(result["external_text"])        # external (if any)
print(result["coverage"])             # 0.0 ~ 1.0
print(result["meta"]["ingested"])     # True if new content stored
print(result["conflict"])             # conflict detection
print(result["decision_report"])      # ASCII decision report

# Reusable pipeline instance (shares session + backend)
from curator.pipeline_v2 import CuratorPipeline
pipeline = CuratorPipeline()
r1 = pipeline.run("how to deploy?")
r2 = pipeline.run("what is RAG?")

# Decision report in other formats
from curator.decision_report import format_report_json, format_report_html
print(format_report_json(result))
print(format_report_html(result))

# Feedback (adjusts retrieval ranking next time)
from curator.feedback_store import apply
apply("viking://resources/doc-id", "up")    # mark helpful
apply("viking://resources/doc-id", "down")  # mark unhelpful
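
How feedback might influence ranking can be sketched as a bounded score adjustment. This is not the formula in feedback_store.py. The function names and the ratio formula are assumptions. The only values taken from this README are the default weight of 0.10 (CURATOR_FEEDBACK_WEIGHT, the maximum delta) and the rule that the OV score stays dominant.

```python
def feedback_delta(up: int, down: int, adopt: int = 0, weight: float = 0.10) -> float:
    """Bounded adjustment in (-weight, +weight); assumes adopt counts like an upvote."""
    total = up + down + adopt
    if total == 0:
        return 0.0
    net = up + adopt - down
    return weight * net / (total + 1)  # |net| <= total, so the ratio stays inside (-1, 1)


def rerank(
    hits: list[tuple[str, float]],
    feedback: dict[str, tuple[int, int, int]],
    weight: float = 0.10,
) -> list[tuple[str, float]]:
    """Re-sort (uri, ov_score) pairs after applying the feedback delta."""
    adjusted = []
    for uri, score in hits:
        up, down, adopt = feedback.get(uri, (0, 0, 0))
        adjusted.append((uri, score + feedback_delta(up, down, adopt, weight)))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    hits = [("viking://resources/a", 0.70), ("viking://resources/b", 0.68)]
    votes = {"viking://resources/b": (4, 0, 0)}  # (up, down, adopt)
    for uri, score in rerank(hits, votes):
        print(f"{score:.3f}  {uri}")
```

With a small weight, feedback can reorder near-ties but cannot lift a weak match over a much stronger one.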

Configuration

All via .env (git-ignored). See .env.example for a full template.

Core (required)

| Variable | Description |
| --- | --- |
| OPENVIKING_CONFIG_FILE | Path to ov.conf (embedded mode) |
| CURATOR_OAI_BASE | OpenAI-compatible API base URL |
| CURATOR_OAI_KEY | API key for above |
| CURATOR_GROK_KEY | Grok API key (external search) |

OV mode

| Variable | Default | Description |
| --- | --- | --- |
| OV_BASE_URL | (empty) | Set to use HTTP mode. Empty = embedded mode. |
| OV_DATA_PATH | ./data | OV data directory (embedded mode) |

Search

| Variable | Default | Description |
| --- | --- | --- |
| CURATOR_SEARCH_PROVIDERS | grok | Comma-separated: grok,duckduckgo,tavily (fallback chain) |
| CURATOR_SEARCH_CONCURRENT | 0 | 1 = fire all providers in parallel |
| CURATOR_SEARCH_TIMEOUT | 60 | Global search timeout (seconds) |
| CURATOR_TAVILY_KEY | (empty) | Tavily API key (if using tavily provider) |
| CURATOR_ALLOWED_DOMAINS | (empty) | Whitelist (comma-separated) |
| CURATOR_BLOCKED_DOMAINS | (empty) | Blacklist (comma-separated) |
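
The whitelist and blacklist settings can be pictured with a small filter over result URLs. This is a sketch, not the logic in domain_filter.py. The helper names are hypothetical, and three behaviors are assumptions: subdomains match their parent domain, the blacklist wins over the whitelist, and an empty whitelist allows everything.

```python
from urllib.parse import urlparse


def parse_domains(raw: str) -> set[str]:
    """Split a comma-separated setting such as 'python.org, redis.io' into a set."""
    return {d.strip().lower() for d in raw.split(",") if d.strip()}


def domain_matches(host: str, domains: set[str]) -> bool:
    """True if host equals a listed domain or is one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_urls(urls: list[str], allowed: str = "", blocked: str = "") -> list[str]:
    """Keep URLs that pass the blacklist and, if one is set, the whitelist."""
    allow, block = parse_domains(allowed), parse_domains(blocked)
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if domain_matches(host, block):
            continue  # assumption: blacklist takes priority
        if allow and not domain_matches(host, allow):
            continue
        kept.append(url)
    return kept


if __name__ == "__main__":
    results = [
        "https://docs.python.org/3/",
        "https://spam.example.com/x",
        "https://redis.io/docs",
    ]
    print(filter_urls(results, blocked="example.com"))
```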

Thresholds

| Variable | Default | Effect |
| --- | --- | --- |
| CURATOR_THRESHOLD_COV_SUFFICIENT | 0.55 | Above = skip external search |
| CURATOR_THRESHOLD_COV_MARGINAL | 0.45 | Above = marginal (still searches) |
| CURATOR_THRESHOLD_COV_LOW | 0.35 | Below = definitely search |
| CURATOR_THRESHOLD_L0_SUFFICIENT | 0.62 | L0 score to skip L1 |
| CURATOR_THRESHOLD_L1_SUFFICIENT | 0.50 | L1 score to skip L2 |
| CURATOR_MAX_L2_DEPTH | 2 | Max full-text reads per run |
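
One way to read the three coverage thresholds is as bands on the 0.0 to 1.0 coverage score. The sketch below uses the documented defaults. The class and function names, the "weak" label for the band between 0.35 and 0.45, and the inclusive lower bounds are assumptions rather than Curator's actual behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageThresholds:
    sufficient: float = 0.55  # CURATOR_THRESHOLD_COV_SUFFICIENT
    marginal: float = 0.45    # CURATOR_THRESHOLD_COV_MARGINAL
    low: float = 0.35         # CURATOR_THRESHOLD_COV_LOW


def classify_coverage(
    score: float, t: CoverageThresholds = CoverageThresholds()
) -> tuple[str, bool]:
    """Return (band, search_externally) for a coverage score."""
    if score >= t.sufficient:
        return "sufficient", False  # local context is enough
    if score >= t.marginal:
        return "marginal", True     # close, but external search still runs
    if score >= t.low:
        return "weak", True         # hypothetical label for the unnamed middle band
    return "low", True              # definitely search


if __name__ == "__main__":
    for score in (0.60, 0.50, 0.40, 0.10):
        print(score, classify_coverage(score))
```

Lowering `sufficient` makes Curator trust local knowledge more often, which trades fewer external searches for a higher risk of thin answers.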

Background scheduler

| Variable | Default | Description |
| --- | --- | --- |
| CURATOR_SCHEDULER_ENABLED | 0 | 1 to activate background jobs |
| CURATOR_FRESHNESS_INTERVAL_HOURS | 24 | Freshness scan interval |
| CURATOR_STRENGTHEN_INTERVAL_HOURS | 168 | Weak topic strengthening interval (7 days) |
| CURATOR_STRENGTHEN_TOP_N | 3 | Number of weak topics per run |

Governance

| Variable | Default | Description |
| --- | --- | --- |
| CURATOR_GOVERNANCE_ENABLED | 0 | 1 to enable governance cycle |
| CURATOR_GOVERNANCE_INTERVAL_HOURS | 168 | Cycle interval (default 7 days) |
| CURATOR_GOVERNANCE_MODE | normal | normal or team (team adds full audit trail) |
| CURATOR_GOVERNANCE_MAX_PROACTIVE | 5 | Max proactive search queries per cycle |
| CURATOR_GOVERNANCE_SYNC_BUDGET | 0 | Sync queries before async (0 = fully async) |
| CURATOR_GOVERNANCE_LOOKBACK_DAYS | 30 | Query log analysis window |
| CURATOR_GOVERNANCE_DRY_RUN | 0 | 1 to skip writes (audit only) |
| CURATOR_GOVERNANCE_REPLACES_STRENGTHEN | 0 | 1 to skip standalone strengthen when governance is on |

Other

| Variable | Default | Description |
| --- | --- | --- |
| CURATOR_ASYNC_INGEST | 0 | 1 = fire-and-forget background ingest |
| CURATOR_CONFLICT_STRATEGY | auto | auto / local / external / human |
| CURATOR_CB_ENABLED | 1 | Circuit breaker (0 to disable) |
| CURATOR_CACHE_ENABLED | 0 | Search result cache |
| CURATOR_FEEDBACK_WEIGHT | 0.10 | Feedback score adjustment (max delta) |
| CURATOR_JSON_LOGGING | 0 | 1 = JSON structured log output |
| CURATOR_CHAT_RETRY_MAX | 3 | LLM retry attempts |
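
For readers unfamiliar with the circuit breaker that CURATOR_CB_ENABLED toggles, here is a generic three-state (closed, open, half-open) sketch of the pattern. It is not the class in circuit_breaker.py. The class shape, the failure threshold of 3, and the 30-second recovery window are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Generic closed -> open -> half_open breaker (illustrative, not Curator's)."""

    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_seconds:
                raise RuntimeError("circuit open: call skipped")
            self.state = "half_open"  # recovery window elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self.state = "closed"
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, recovery_seconds=5.0)
    print(breaker.call(lambda: "search ok"), breaker.state)
```

While the breaker is open, calls fail fast instead of waiting on a dead endpoint. The first call after the recovery window acts as a probe that either closes the circuit or reopens it.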

Maintenance

# Weak topic analysis
python3 scripts/analyze_weak.py --top 10

# Proactive strengthening
python3 scripts/strengthen.py --top 5

# Freshness scan
python3 scripts/freshness_scan.py --limit 50       # URL reachability
python3 scripts/freshness_scan.py --act             # Auto-refresh stale

# TTL rebalance
python3 scripts/ttl_rebalance.py                    # Report
python3 scripts/ttl_rebalance.py --json             # JSON export

# Async job management
python3 scripts/async_job_cli.py list               # Overview
python3 scripts/async_job_cli.py list --failed      # Failed jobs
python3 scripts/async_job_cli.py replay <job_id>    # Re-queue a job

# Governance (automated knowledge maintenance)
python3 -m curator.governance_cli report             # View latest report
python3 -m curator.governance_cli report --format json
python3 -m curator.governance_cli report --format html > report.html
python3 -m curator.governance_cli flags              # Pending flags
python3 -m curator.governance_cli flags --all        # All flags
python3 -m curator.governance_cli show <flag_id>     # Flag details
python3 -m curator.governance_cli keep <flag_id>     # Mark: keep resource
python3 -m curator.governance_cli delete <flag_id>   # Mark: approve deletion
python3 -m curator.governance_cli adjust <flag_id>   # Mark: needs adjustment
python3 -m curator.governance_cli ignore <flag_id>   # Mark: ignore this flag
python3 -m curator.governance_cli run                # Trigger full cycle
python3 -m curator.governance_cli run --dry          # Dry run (no writes)
python3 -m curator.governance_cli run --mode team    # Team mode (full audit)

Or enable the background scheduler (CURATOR_SCHEDULER_ENABLED=1) to run freshness scans and strengthening automatically. Add CURATOR_GOVERNANCE_ENABLED=1 for automated governance cycles.

Project structure

curator/
  pipeline_v2.py       # Main pipeline orchestrator
  config.py            # Config + HTTP client with retry
  settings.py          # Pydantic Settings v2 (typed, validated)
  backend.py           # KnowledgeBackend ABC
  backend_ov.py        # OpenViking backend (embedded + HTTP)
  backend_memory.py    # In-memory backend (testing)
  session_manager.py   # Dual-mode OV client
  retrieval_v2.py      # L0→L1→L2 retrieval + coverage
  search.py            # External search + cross-validation
  search_providers.py  # Pluggable provider registry
  review.py            # LLM judge + ingest + conflict
  router.py            # Rule-based routing (JSON config)
  freshness.py         # URI time-decay scoring
  usage_ttl.py         # Usage-based TTL tiers
  dedup.py             # Duplicate scanning
  decision_report.py   # ASCII / JSON / HTML reports
  feedback_store.py    # Up/down/adopt feedback
  domain_filter.py     # Domain whitelist/blacklist
  circuit_breaker.py   # 3-state circuit breaker
  search_cache.py      # LRU + dual-TTL cache
  async_jobs.py        # Background job tracking
  governance.py        # Automated governance cycle (6 phases)
  governance_cli.py    # Governance CLI (report, flags, run)
  governance_report.py # Governance report (ASCII/JSON/HTML)
  interest_analyzer.py # User interest extraction + proactive queries
  nlp_utils.py         # Topic extraction + keyword utils
  scheduler.py         # APScheduler periodic jobs
  logging_setup.py     # structlog configuration
  file_lock.py         # Shared flock utilities
  legacy/              # Archived v1
curator_query.py       # CLI entry point
mcp_server.py          # MCP server (stdio JSON-RPC)
scripts/               # Maintenance scripts
tests/                 # 554 tests

Testing

# All tests (uses InMemoryBackend, no OV dependency)
uv run pytest tests/ -v

# Single file
uv run pytest tests/test_core.py -v

# Type checking
uv run mypy curator/ --ignore-missing-imports --exclude curator/legacy/

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Missing required env vars | .env not configured | Fill in CURATOR_OAI_BASE, CURATOR_OAI_KEY, CURATOR_GROK_KEY |
| OV not available | OpenViking not reachable | Check OPENVIKING_CONFIG_FILE (embedded) or OV_BASE_URL (HTTP) |
| 401 Unauthorized | Wrong API key | Check keys in .env |
| Timeout on search | Endpoint unreachable | Check URL and service status |
| Coverage always 0.0 | OV is empty | Ingest some content first, or lower CURATOR_THRESHOLD_COV_SUFFICIENT |
| External always triggered | Thresholds too high | Lower coverage thresholds in .env |
| Judge returns low trust | Weak LLM model | Try a stronger model in CURATOR_JUDGE_MODELS |

Roadmap

  • KnowledgeBackend abstraction (OV-agnostic interface)
  • Conflict detection + bidirectional resolution
  • Pydantic-validated judge output
  • Quality feedback loop (feedback → retrieval ranking)
  • Enhanced dedup (URL hash + Jaccard)
  • Decision report (ASCII + JSON + HTML)
  • Async ingest with job tracking + recovery CLI
  • Auto-generate L0/L1 summaries on ingest
  • Multi-provider search (Grok + DuckDuckGo + Tavily)
  • Domain filtering (whitelist / blacklist)
  • Usage-based TTL (hot / warm / cold tiers)
  • Circuit breaker + search cache
  • Structured logging (structlog + JSON mode)
  • Background scheduler (freshness + strengthen)
  • Docker support (embedded + HTTP OV modes)
  • mypy + pre-commit (ruff + ruff-format)
  • uv dependency management
  • Automated governance (audit, flag, proactive search, report)
  • Interest-based proactive search (query log + feedback analysis)
  • Hybrid async governance (sync budget + background thread + trace harvest)
  • Coverage auto-tuning (dynamic thresholds from query log)

License

MIT
