
Benchmark 9 retrieval architectures on your documentation — find which KB architecture fits your data


KB Arena

Should you use Graph RAG, Vector RAG, or Hybrid? KB Arena tells you — empirically, on your own docs.

Nine retrieval architectures. Your documentation. One winner.

KB Arena is the only open-source benchmark that runs nine architecturally distinct retrieval strategies — naive vector, contextual vector, Q&A pairs, knowledge graph, hybrid (RRF-fused), RAPTOR, PageIndex, BM25, and rerank-vector (cross-encoder reranking) — head-to-head on your own corpus, with auto-generated questions across 5 difficulty tiers, IR metrics (Recall@k, MRR, NDCG@k), RAGAS metrics, ELO arena voting, a CI gate, and a strategy plugin system.

Embeddings: pluggable across OpenAI, Voyage-3, Cohere, Gemini, BGE (local), Ollama (local) via KB_ARENA_EMBEDDING_PROVIDER. Rerankers: BGE-v2-m3 (local), Cohere Rerank, Voyage Rerank via KB_ARENA_RERANKER_BACKEND.


KB Arena Demo


Try It in 10 Seconds (no API keys)

pip install kb-arena
kb-arena demo

This launches the dashboard with pre-computed results from the AWS Compute corpus (75 questions, 9 strategies, 5 difficulty tiers). The demo runs in read-only mode — chat, arena, and tools endpoints stay disabled until you set an API key. No Docker, no Neo4j, no surprises.

Dashboard walkthrough

To enable live chat / arena voting / tools, set KB_ARENA_ANTHROPIC_API_KEY (or KB_ARENA_OPENAI_API_KEY, or use KB_ARENA_LLM_PROVIDER=ollama for free local inference).

No-API-Keys Quick Start (Ollama)

# Free local inference — no Anthropic/OpenAI keys needed
ollama pull llama3.1:8b
export KB_ARENA_LLM_PROVIDER=ollama
kb-arena init-corpus my-docs && cp ~/my-docs/*.md datasets/my-docs/raw/
kb-arena run --corpus my-docs        # one command, all stages, resumable
kb-arena serve

How KB Arena Differs from Other RAG Evaluation Tools

Most RAG evaluation tools answer "how well does my pipeline work?" KB Arena answers a different question: "which retrieval architecture works best for my docs?"

| | KB Arena | RAGAS | MTEB / BEIR | GraphRAG | DeepEval |
| --- | --- | --- | --- | --- | --- |
| Compares multiple architectures | Yes - 9 strategies | No - evaluates your existing pipeline | No - compares embedding models | No - only their own approach | No |
| Works on your own docs | Yes | Yes | No - fixed public datasets | No - fixed datasets | Yes |
| Includes graph + vector + hybrid | Yes | Vector/hybrid only | Embeddings only | Graph only | Any |
| Auto-generates benchmark questions | Yes - 5 difficulty tiers | Manual | Fixed | Fixed | Manual |
| Interactive comparison UI | Yes - chatbot + benchmark explorer | No | Leaderboard only | No | Dashboard |
| Chatbot per strategy | Yes | No | No | No | No |
| Standard IR metrics (NDCG, MRR) | Yes - v0.5.0 Retriever Lab | Yes | Yes | Partial | No |

If you want to know whether a knowledge graph, Q&A pairs, or plain vector search is the right architecture for your documentation, that's what KB Arena is for.


What's New in v0.5.0 — Retriever Lab

Classical IR metrics computed at the chunk level. See exactly which chunks each strategy surfaced, which it missed, and why one strategy beats another at a metric level — not just at the answer level.

Retriever Lab Demo

Metrics

Recall@k, Precision@k, Hit@k, MRR, NDCG@k — computed for every benchmark query, aggregated per strategy, rendered in the Markdown report.
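
For orientation, here is what these metrics compute for a single query, using the standard binary-relevance definitions (illustrative only; KB Arena's own implementation may differ in details such as hierarchical chunk-id matching):

import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of relevant chunks that appear in the top-k results.
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant chunk (0 if none retrieved).
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Binary-gain DCG normalized by the best possible ordering.
    dcg = sum(1.0 / math.log2(i + 2) for i, c in enumerate(retrieved[:k]) if c in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0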

kb-arena retriever-lab

Retrieval-only benchmark. Skips LLM generation, runs ~10x cheaper than kb-arena benchmark. Streams a live Rich table of metrics as each strategy completes. Writes per-question chunk-level results to results/run_{id}/retriever_lab.json.

kb-arena label-chunks --corpus aws-compute     # Generate ground truth (BM25 + Haiku judge)
kb-arena retriever-lab --corpus aws-compute    # Live IR metrics, no LLM cost

/retriever-lab web page

Aggregate metrics card per strategy plus per-question drill-down. Click a question, see the chunks each strategy surfaced with rank, score, and HIT/MISS badges so you can tell at a glance where retrieval breaks down.

Retriever Lab UI

Real numbers — aws-compute corpus, run 855aac4e

35 of 75 questions have chunk-level ground truth (the corpus only covers Lambda, API Gateway, ECS Fargate; the other 40 questions reference services not in the demo corpus, so their metrics fall to 0 — a useful coverage signal in itself).

| Strategy | Recall@5 | Precision@5 | Hit@5 | MRR | NDCG@5 |
| --- | --- | --- | --- | --- | --- |
| contextual_vector | 35.5% | 24.5% | 46.7% | 0.433 | 0.388 |
| naive_vector | 35.2% | 23.2% | 46.7% | 0.414 | 0.367 |
| raptor | 35.2% | 23.2% | 46.7% | 0.414 | 0.367 |
| bm25 | 27.5% | 17.1% | 44.0% | 0.352 | 0.278 |
| hybrid | 8.0% | 4.8% | 9.3% | 0.093 | 0.086 |
| pageindex | 6.1% | 5.0% | 14.7% | 0.111 | 0.076 |
| qna_pairs | 0.0% | 0.0% | 0.0% | 0.000 | 0.000 |
| knowledge_graph | 0.0% | 0.0% | 0.0% | 0.000 | 0.000 |

Contextual Vector edges out Naive Vector on ranking quality (MRR / NDCG) thanks to heading-path prefixes; Hybrid drops because the knowledge_graph leg is mocked when Neo4j isn't connected; QnA Pairs operates on Q-A identity, not section identity, so it needs doc-level labels (see docs/retriever-lab.md for interpretation).

Roadmap

  • v1.1: reranker comparison (cross-encoder vs. cohere-rerank vs. bge-reranker)

What's New in v0.4.0

RAGAS Metrics

Industry-standard evaluation metrics alongside the existing LLM judge. Enable with --ragas or KB_ARENA_BENCHMARK_ENABLE_RAGAS=true.

RAGAS Metrics

Adds four metrics per question: faithfulness (answer grounded in context), context precision (retrieved chunks are relevant), context recall (context covers the reference), and answer relevancy (answer addresses the question).

Reference-Free Evaluation

Benchmark without pre-written ground truth -- useful for quick evaluation of new corpora before investing in question generation.

Reference-Free Evaluation

Scores on faithfulness and answer relevancy only (no accuracy/completeness since there's no reference to compare against).

Strategy Plugin System

Bring your own retrieval strategy without forking. Your module exports a single Strategy subclass with build_index() and query() methods.

Strategy Plugin
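
A plugin module could look roughly like this; the import path, method signatures, and return shape below are assumptions (the docs only guarantee a Strategy subclass with build_index() and query()):

# my_strategy.py  (point KB Arena at it with --strategy-module my_strategy)
from kb_arena.strategies import Strategy  # import path is an assumption

class KeywordOverlapStrategy(Strategy):
    name = "keyword_overlap"

    def build_index(self, documents):
        # Index time: store chunk texts keyed by id for later scoring (document shape assumed).
        self._chunks = {d.id: d.text for d in documents}

    def query(self, question: str, top_k: int = 5):
        # Query time: score chunks by naive keyword overlap and return the best ids.
        terms = set(question.lower().split())
        scored = sorted(
            self._chunks.items(),
            key=lambda kv: len(terms & set(kv[1].lower().split())),
            reverse=True,
        )
        return [chunk_id for chunk_id, _ in scored[:top_k]]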

CI/CD Eval Command

Gate merges on retrieval quality. Exits non-zero if any strategy falls below thresholds. Pair with --format json for machine-readable output.

CI/CD Eval

Cost Cap

Halt a benchmark run automatically if cumulative cost exceeds your budget. Set via KB_ARENA_BENCHMARK_COST_CAP_USD.

Cost Cap

Dry-Run Cost Estimates

Preview query counts, estimated cost, and estimated time before committing to a full benchmark run.

Dry-Run Estimates

Debug Endpoint

Trace the full retrieval pipeline -- intent classification, retrieved sources, latency breakdown, and cost -- without generating a final answer.

Debug Endpoint

Readiness Probe

The /ready endpoint returns 503 if Neo4j is configured but unreachable. Use as a k8s readiness probe or Docker healthcheck.

Ready Endpoint
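
A healthcheck built on it can be as small as this (sketch; assumes the API is on the default port 8000 and treats any non-200 status as not ready):

# ready_check.py -- exit 0 when /ready returns 200, non-zero otherwise
import sys
import httpx

try:
    status = httpx.get("http://localhost:8000/ready", timeout=5.0).status_code
except httpx.HTTPError:
    status = 0
sys.exit(0 if status == 200 else 1)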

Side-by-Side Strategy Comparison

New "Compare" view in the benchmark UI lets you pick two strategies and see tier-by-tier accuracy, latency, and cost differences side by side.

Compare View

Other Reliability Improvements

  • Exponential backoff -- benchmark retries use 1s, 2s, 4s instead of linear 1s, 2s, 3s
  • Embedding retry -- OpenAI embedding API calls retry 3x with exponential backoff and 30s timeout
  • Eval memoization -- identical answer+reference pairs are scored once and cached
  • Arena JSONL -- append-only vote log at results/arena_votes.jsonl survives state resets
  • Corpus validation -- tightened from denylist to regex allowlist ^[a-zA-Z0-9_-]+$
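
The corpus-name allowlist in the last bullet amounts to a plain regex check; a minimal sketch (not the project's actual validator):

import re

CORPUS_NAME = re.compile(r"^[a-zA-Z0-9_-]+$")

def is_valid_corpus_name(name: str) -> bool:
    # Rejects path separators, dots, spaces -- anything outside the allowlist.
    return CORPUS_NAME.match(name) is not None

assert is_valid_corpus_name("aws-compute")
assert not is_valid_corpus_name("../etc/passwd")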

What's New in v0.3.1

Production Hardening

Session management, error handling, and API configuration improvements for real deployments:

  • Session ID support -- pass X-Session-ID header instead of relying on IP-based sessions. Fixes shared proxy and network-switching issues.
  • Session TTL -- idle sessions are automatically evicted (default 30 min, configurable via KB_ARENA_SESSION_TTL_MINUTES)
  • CORS configuration -- set allowed origins via KB_ARENA_CORS_ORIGINS env var instead of hardcoded localhost
  • Corpus validation -- graph build API validates corpus exists with processed documents before starting
  • Specific exception handling -- Neo4j connection errors, graph extraction failures, and stream errors now catch specific types instead of bare except Exception

Streaming Cost Tracking

OpenAI and Ollama providers now capture token usage after streaming completes -- previously only Anthropic tracked streaming costs. The chatbot demo now reports accurate cost_usd for all three providers.

Faster QnA Index Building

Q&A pair generation during build-vectors is now parallelized with asyncio.gather() (5 concurrent). Building QnA indexes on large corpora is up to 5x faster.
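
The concurrency pattern is the usual bounded asyncio.gather(); a sketch with a hypothetical per-section generator (the 5-way limit matches the text above, everything else is illustrative):

import asyncio

async def qa_pairs_for_section(section: str) -> list[dict]:
    # Stand-in for the real LLM call that writes Q&A pairs for one section.
    await asyncio.sleep(0.1)
    return [{"question": f"What does '{section}' cover?", "answer": "..."}]

async def build_qna_index(sections: list[str], max_concurrent: int = 5):
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(section: str):
        async with sem:
            return await qa_pairs_for_section(section)

    # gather() runs all sections concurrently; the semaphore caps in-flight LLM calls at 5.
    return await asyncio.gather(*(bounded(s) for s in sections))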

Custom Exception Hierarchy

New kb_arena.exceptions module with typed exceptions (IngestError, GraphError, StrategyError, EvaluationError, LLMError) for better error handling and debugging.
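
Callers can catch these instead of a bare Exception; a sketch (the module and class names come from this release note, the raising function is hypothetical):

from kb_arena.exceptions import LLMError, StrategyError

def answer(question: str) -> str:
    # Hypothetical stand-in for a strategy query that may raise typed errors.
    raise StrategyError("vector index not built for corpus 'my-docs'")

try:
    answer("How does Lambda scale?")
except StrategyError as exc:
    print(f"retrieval failed: {exc}")
except LLMError as exc:
    print(f"LLM call failed: {exc}")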

Frontend Error Boundary

React error boundary wraps all page content -- API failures and render errors now show a recovery UI instead of a blank page.

Graph Schema Cleanup

Removed dead Cypher templates that referenced non-existent relationship types (DEPRECATED_BY, INHERITS, REQUIRES, EXAMPLE_OF). Remaining templates now use only valid universal schema types.


What's New in v0.3.0

Multi-LLM Provider Support

No longer locked to Anthropic. Choose your LLM backend:

# Anthropic (default)
export KB_ARENA_LLM_PROVIDER=anthropic
export KB_ARENA_ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
export KB_ARENA_LLM_PROVIDER=openai
export KB_ARENA_OPENAI_API_KEY=sk-...

# Ollama (free local inference)
export KB_ARENA_LLM_PROVIDER=ollama

Each provider has its own model mapping -- GPT-4o for generation, GPT-4o-mini for classification when using OpenAI; any local model when using Ollama.

Strategy Arena - Blind A/B Comparison

A new Arena mode for blind head-to-head strategy battles. Ask a question, two random strategies answer it, you vote for the better response. ELO ratings emerge over time.

kb-arena serve  # then open /arena in your browser

Arena Mode
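
The rating update after each vote follows the standard ELO formula; a sketch assuming a K-factor of 32 (the project's actual constants aren't documented here):

def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    # Expected score of the winner under the logistic ELO model.
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    return winner + k * (1.0 - expected), loser - k * (1.0 - expected)

# One vote between two strategies that both start at 1500:
print(elo_update(1500.0, 1500.0))  # (1516.0, 1484.0)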

BM25 Baseline Strategy

Strategy #8: classic BM25 keyword matching. The pre-neural baseline that answers "do I even need embeddings for my docs?" Uses BM25Okapi scoring with LLM answer generation.
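
Under the hood this is ordinary BM25 ranking over tokenized chunks followed by LLM generation over the top hits; a sketch of the retrieval half using the rank_bm25 package, which provides BM25Okapi (the project's exact wiring may differ):

from rank_bm25 import BM25Okapi

chunks = [
    "Lambda functions scale automatically with request volume.",
    "API Gateway can invoke Lambda through a proxy integration.",
    "ECS Fargate runs containers without managing servers.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])

query_tokens = "how does lambda scale".split()
top_chunks = bm25.get_top_n(query_tokens, chunks, n=2)  # keyword-ranked context for the LLM
print(top_chunks)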

Parallel Benchmark Execution

Strategies now run concurrently instead of sequentially. A full 8-strategy benchmark that took 60-90 minutes now completes in 15-25 minutes.

kb-arena benchmark --corpus my-docs              # parallel by default
kb-arena benchmark --corpus my-docs --no-parallel # sequential if needed

Accurate Token Counting

Replaced whitespace tokenization with tiktoken (cl100k_base BPE). Chunk sizes are now measured in real tokens, not word counts. Previous "512-token chunks" were actually ~370 real tokens - now they're exactly 512.
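
Measuring the difference is one line of tiktoken (standard usage, shown for illustration):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Configure the Lambda function's reserved concurrency before load testing."
print(len(text.split()), "words ->", len(enc.encode(text)), "cl100k_base tokens")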

Cost Tracking Fixed

Fixed cost propagation across all multi-call strategies:

  • Knowledge graph: Text-to-Cypher generation cost now tracked
  • PageIndex: beam traversal LLM cost now accumulated
  • Streaming: token usage captured via get_final_message()
  • Latency decomposition: retrieval vs generation timing now measured separately

Run Comparison

Benchmark runs now have unique IDs and timestamps. Results are preserved across runs instead of overwritten:

kb-arena benchmark --corpus my-docs
# Run ID: a1b2c3d4
# Results: results/run_a1b2c3d4/my-docs_naive_vector.json

kb-arena report --run-id a1b2c3d4

CI/CD Integration

Fail your pipeline if retrieval quality drops:

kb-arena benchmark --corpus my-docs --fail-below 0.7
# Exit code 1 if any strategy's accuracy falls below 70%

Export Formats

Generate reports in CSV or self-contained HTML:

kb-arena report --format csv    # spreadsheet-ready
kb-arena report --format html   # shareable dashboard

Bundled Frontend

kb-arena demo now serves a complete dashboard -- no separate Next.js dev server needed. The static frontend is bundled with the pip package.

Changelog

| Version | Date | Changes |
| --- | --- | --- |
| 0.5.0 | 2026-04-26 | Retriever Lab — classical IR metrics (Recall@k, Precision@k, Hit@k, MRR, NDCG@k) computed per query, RetrievalTrace exposes retrieved chunks per strategy with rank+score, kb-arena retriever-lab retrieval-only command (~10x cheaper than benchmark), kb-arena label-chunks BM25+Haiku ground-truth generator, --top-k flag on benchmark, /retriever-lab web page with HIT/MISS drill-down, hierarchical chunk-id matching, 558 tests |
| 0.4.0 | 2026-04-04 | RAGAS metrics (faithfulness, context precision/recall, answer relevancy), reference-free eval mode, LLM eval memoization, cost cap enforcement, strategy plugin system (--strategy-module), CI/CD eval command (kb-arena eval --ci), debug/explain endpoint, /ready health probe, exponential backoff on retries, embedding API retry+timeout, arena ELO JSONL persistence, side-by-side strategy comparison UI, benchmark dry-run cost estimates, tightened corpus validation |
| 0.3.1 | 2026-03-26 | Production hardening (session IDs, TTL, CORS config, corpus validation), streaming cost for OpenAI/Ollama, parallel QnA build, custom exceptions, error boundary, graph schema cleanup, 494 tests |
| 0.3.0 | 2026-03-20 | Multi-LLM providers (Anthropic/OpenAI/Ollama), Strategy Arena, BM25 strategy, parallel benchmarks, tiktoken chunking, cost tracking fixes, run comparison, CI fail-below, CSV/HTML export, bundled frontend |
| 0.2.1 | 2026-03-03 | PageIndex strategy, verbose mode, retry logic, parallel extraction, eval independence |
| 0.2.0 | 2026-02-28 | RAPTOR strategy, hybrid improvements, 7 strategies total |
| 0.1.0 | 2026-02-20 | Initial release: 5 strategies, benchmark engine, chatbot demo |

Quick Start -- I Just Have My Docs

You have documentation files (markdown, HTML, text, PDFs). You want to know which retrieval strategy works best. Here's everything from zero.

Prerequisites

  1. Python 3.11+ and pip
  2. Docker (for Neo4j — the knowledge graph strategy needs it)
  3. An API key for your LLM provider (Anthropic or OpenAI; Ollama needs none) and an OpenAI key for the default embeddings (text-embedding-3-large)

That's it. No graph database expertise required. KB Arena handles the schema, extraction, and querying.

Step 1: Install

pip install kb-arena

# Optional: install format-specific parsers
pip install kb-arena[pdf]        # PDF support (PyMuPDF)
pip install kb-arena[docx]       # Word documents (mammoth)
pip install kb-arena[web]        # Web scraping (httpx)
pip install kb-arena[all-formats] # All of the above

Step 2: Set API keys

Create a .env file or export directly:

export KB_ARENA_ANTHROPIC_API_KEY=sk-ant-...    # Claude for generation + evaluation
export KB_ARENA_OPENAI_API_KEY=sk-...           # OpenAI for text-embedding-3-large

Step 3: Start Neo4j

KB Arena uses Neo4j for the knowledge graph strategy. One command:

docker compose up neo4j -d

This starts Neo4j on localhost:7687 with default credentials (neo4j / kbarena1). No configuration needed — KB Arena creates the schema automatically.

If you don't have the docker-compose.yml, create one:

services:
  neo4j:
    image: neo4j:5-community
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/kbarena1
      - NEO4J_PLUGINS=["apoc"]
    volumes:
      - neo4j_data:/data

volumes:
  neo4j_data:

Don't want Docker? KB Arena still works — the vector strategies, RAPTOR, and PageIndex run without Neo4j. Only the knowledge graph and hybrid strategies need it.

Step 4: Run the pipeline

# Scaffold a new corpus
kb-arena init-corpus my-docs

# Drop your docs into the raw/ directory
cp ~/my-documentation/*.md datasets/my-docs/raw/
# Supports: .md, .html, .txt, .pdf, .docx, .csv, .tsv — auto-detected

# Parse into the unified Document model (JSONL intermediate files)
kb-arena ingest datasets/my-docs/raw/ --corpus my-docs

# Or ingest directly from a URL or GitHub repo
kb-arena ingest https://docs.example.com --corpus my-docs
kb-arena ingest github:owner/repo --corpus my-docs

# Build the knowledge graph in Neo4j (entities + relationships)
kb-arena build-graph --corpus my-docs

# Build vector indexes in ChromaDB (local, no server needed)
kb-arena build-vectors --corpus my-docs

# Auto-generate benchmark questions from your docs (10 per difficulty tier)
kb-arena generate-questions --corpus my-docs --count 50

# Run the benchmark (each question x 8 strategies, 4-pass evaluation)
kb-arena benchmark --corpus my-docs

# Launch the web UI to explore results
kb-arena serve

Open http://localhost:8000 for the API, http://localhost:3000 for the dashboard.

Step 5: Read the results

The benchmark produces:

  • Accuracy by tier — which strategy handles simple lookups vs multi-hop architecture questions
  • Latency percentiles — p50, p95, p99 per strategy
  • Cost per query — token usage and API cost
  • Composite ranking — 0.5 * accuracy + 0.3 * reliability + 0.2 * latency

Results are saved to results/ as JSON and displayed in the web dashboard.
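
The composite ranking above is a plain weighted sum; a minimal sketch, assuming accuracy, reliability, and a latency score are each already normalized to 0-1 with higher meaning better (KB Arena's exact normalization isn't documented here):

def composite_score(accuracy: float, reliability: float, latency_score: float) -> float:
    # Weights from the list above: 0.5 accuracy, 0.3 reliability, 0.2 latency.
    return 0.5 * accuracy + 0.3 * reliability + 0.2 * latency_score

print(composite_score(accuracy=0.79, reliability=0.95, latency_score=0.60))  # illustrative inputs -> 0.800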


Full Stack with Docker Compose

Run everything — Neo4j, the API server, and the frontend — in one command:

# Set your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

# Start all services
docker compose up -d

# Open the dashboard
open http://localhost:3000

The compose file starts Neo4j (port 7474/7687), the FastAPI backend (port 8000), and the Next.js frontend (port 3000).


Using the Built-in AWS Example

The AWS Compute corpus ships ready to use (75 questions across 5 difficulty tiers):

kb-arena ingest ./datasets/aws-compute/raw/ --corpus aws-compute
kb-arena build-graph --corpus aws-compute
kb-arena build-vectors --corpus aws-compute
kb-arena benchmark --corpus aws-compute
kb-arena label-chunks --corpus aws-compute        # v0.5.0: ground truth for IR metrics
kb-arena retriever-lab --corpus aws-compute       # v0.5.0: classical IR metrics, no LLM cost
kb-arena serve

Screenshots

Home — Overview of the 8 strategies, difficulty tiers, and evaluation methodology.

Home page

Strategy comparison — Ask the same question to all 8 strategies simultaneously. Compare answers, sources, latency, and cost side-by-side.

Strategy comparison demo

Benchmark results — Accuracy table by tier with grouped bar chart.

Benchmark results

Knowledge graph — Interactive force-directed visualization of entities extracted from your docs.

Knowledge graph viewer

Live graph build — Watch entities and relationships stream in as the extractor runs.

Live graph animation


Documentation Tools

Beyond benchmarking, KB Arena includes three standalone tools that work on any documentation corpus.

All three tools are available as CLI commands and through the web UI at /tools.

Q&A Generator

Generate Q&A pairs from your docs — use them for chatbot training, FAQ pages, or search indexes. Only needs an Anthropic key (no embeddings, no vector DB).

kb-arena generate-qa --corpus my-docs
# Outputs: datasets/my-docs/qa-pairs/qa_pairs.jsonl

CLI

Q&A Generator CLI

Web UI

Q&A Generator Web UI

Docs Gap Analyzer

Find what's missing in your documentation before your users complain about it. Generates Q&A pairs per section, self-evaluates them, and classifies each section as strong (>=70%), weak (30-70%), or gap (<30%).

kb-arena audit --corpus my-docs

CLI

Docs Audit CLI

Web UI

Docs Audit Web UI

Fix My Docs

Get actionable recommendations with draft content to improve your docs. Runs the audit internally, then generates prioritized fixes for weak and gap sections.

kb-arena fix --corpus my-docs --max-fixes 5

CLI

Fix My Docs CLI

Web UI

Fix My Docs Web UI

Pipeline: generate-qa → audit → fix — each command builds on the previous. Or run them independently via CLI or the web UI.


Benchmark Results (AWS Compute Corpus)

Real numbers from 75 questions across 5 difficulty tiers, evaluated with a 4-pass system (structural checks + LLM-as-judge):

| Strategy | Overall | T1 Lookup | T2 How-To | T3 Comparison | T4 Integration | T5 Architecture | Avg Latency | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Q&A Pairs | 79.2% | 79% | 85% | 83% | 84% | 66% | 9.0s | $0.48 |
| Knowledge Graph | 71.5% | 72% | 69% | 61% | 77% | 79% | 20.3s | $1.37 |
| Hybrid | 64.7% | 39% | 81% | 61% | 80% | 62% | 41.5s | $3.02 |
| RAPTOR | 25.3% | 30% | 16% | 15% | 36% | 30% | 7.2s | $0.69 |
| Naive Vector | 20.7% | 27% | 15% | 14% | 26% | 22% | 6.4s | $0.33 |
| Contextual Vector | 16.5% | 25% | 11% | 9% | 26% | 11% | 5.1s | $0.29 |
| BM25 | 14.0% | 24% | 11% | 9% | 16% | 10% | 4.5s | $0.26 |
| PageIndex | 14.3% | 19% | 12% | 7% | 21% | 12% | 10.9s | $0.29 |

Key takeaway: Q&A pairs dominate overall because pre-generated answers sidestep retrieval failures. Knowledge graph leads on architecture questions (T5: 79%) where structured graph traversal shines. The hybrid strategy adapts per question type but pays a latency/cost premium.

BM25 -- the pre-neural keyword baseline -- scores 14.0% overall at the lowest cost ($0.26), confirming that embeddings add meaningful value for this corpus. PageIndex -- the vectorless, reasoning-based approach -- scores comparably to contextual vector at $0.29, demonstrating that LLM tree traversal is a viable alternative to embeddings on well-structured docs. RAPTOR's hierarchical retrieval shows strength at T4/T5 but needs a larger corpus. Pure vector RAG -- what most teams ship -- scores under 21%. Cost ranges from $0.26 (BM25) to $3.02 (hybrid) for the full 75-question benchmark.

These are results from the built-in AWS Compute corpus. Your mileage will vary — that's the whole point of running it on your own docs.


The 8 Strategies

| # | Strategy | How it works | Best at |
| --- | --- | --- | --- |
| 1 | Naive Vector | Chunk → embed → cosine similarity → generate | Fast lookups, simple factoid questions |
| 2 | Contextual Vector | Chunk + parent context → embed → rank | Disambiguating domain-specific terms |
| 3 | Q&A Pairs | LLM pre-generates Q&A at index time → match | Common questions with known answers |
| 4 | Knowledge Graph | Entities → Neo4j → Cypher templates → generate | Multi-hop dependencies, cross-topic queries |
| 5 | Hybrid | Intent routing → vector or graph or both (RRF) | Adapts per question type |
| 6 | RAPTOR | Cluster chunks → LLM topic summaries → recursive tree → query all levels | Cross-document synthesis, broad topic questions |
| 7 | PageIndex | Build tree index from doc structure → LLM beam search traversal → no vectors | Well-structured docs, reasoning over hierarchy |
| 8 | BM25 | Classic keyword matching (BM25Okapi) → LLM generation | Pre-neural baseline — "do I even need embeddings?" |

Question Tiers

Questions are organized into 5 difficulty tiers:

| Tier | Type | Hops | What it tests |
| --- | --- | --- | --- |
| 1 | Lookup | 1 | Single-fact lookup from one document |
| 2 | How-To | 1-2 | Multi-step processes, configuration sequences |
| 3 | Comparison | 2-3 | Comparing alternatives, trade-offs between options |
| 4 | Integration | 3-4 | Dependencies and connections between concepts |
| 5 | Architecture | 3-5 | Cross-document synthesis, transitive reasoning |

Use kb-arena generate-questions to auto-generate questions from your docs, or write them by hand in YAML.


Supported Formats

| Format | Extensions / Input | Optional Dep | Notes |
| --- | --- | --- | --- |
| Markdown | .md, .markdown, .rst | | Heading hierarchy, code blocks, tables |
| HTML | .html, .htm | | Strips nav/footer, extracts structure |
| Plain text | .txt, .text | | ALL CAPS heading detection |
| PDF | .pdf | kb-arena[pdf] | Font-size heading detection, table extraction |
| Word | .docx | kb-arena[docx] | Converts to HTML, then extracts structure |
| CSV / TSV | .csv, .tsv | | Auto-detects delimiter, groups rows into sections |
| Web URL | https://... | kb-arena[web] | Crawls same-domain pages; uses /llms.txt if available |
| GitHub | github:owner/repo | | Shallow clones and ingests all doc files |
| SEC EDGAR | --format sec-edgar | | 10-K/10-Q filing parser |

Universal Documentation Schema

KB Arena extracts entities and relationships using a universal schema that works for any documentation domain:

5 node types: Topic, Component, Process, Config, Constraint
7 relationship types: DEPENDS_ON, CONTAINS, CONNECTS_TO, TRIGGERS, CONFIGURES, ALTERNATIVE_TO, EXTENDS

No per-domain configuration needed. The LLM maps your documentation concepts to these types automatically.


CLI Reference

| Command | Description |
| --- | --- |
| demo | Launch dashboard with pre-computed results (no API keys needed) |
| init-corpus <name> | Scaffold datasets/{name}/ directories |
| ingest <path> | Parse docs into JSONL. Accepts files, dirs, URLs, github:owner/repo. Options: --corpus, --format, --dry-run |
| build-graph | Extract entities/rels into Neo4j. Options: --corpus |
| build-vectors | Build vector indexes + PageIndex tree. Options: --corpus, --strategy |
| generate-questions | Auto-generate benchmark questions. Options: --corpus, --count |
| benchmark | Run evaluation. Options: --corpus, --strategy, --tier, --dry-run |
| generate-qa | Generate Q&A pairs from your docs as JSONL. Options: --corpus, --output |
| audit | Find documentation gaps — classifies sections as strong/weak/gap. Options: --corpus, --output, --max-sections |
| fix | Generate fix recommendations with draft content. Options: --corpus, --max-fixes, --output |
| report | Generate report. Options: --corpus, --output, --format (rich/json) |
| serve | Launch API + frontend. Options: --host, --port |
| health | Pipeline status. Options: --format (rich/json) |

All commands are independently re-runnable. Each stage writes to disk so you can re-run any step without repeating earlier ones.

CLI Features

Dry run — Preview what a command will do before committing to expensive LLM calls:

kb-arena ingest datasets/my-docs/raw/ --corpus my-docs --dry-run
# Shows: file count by extension, parser assignment, output path

kb-arena benchmark --corpus my-docs --dry-run
# Shows: question count, strategy list, total queries, concurrency settings

Dry Run Preview

JSON output — Pipe structured data to jq, scripts, or CI pipelines:

kb-arena report --corpus my-docs --format json | jq '.corpora'
kb-arena health --format json | jq '.services'

JSON Output

Pipeline hints — After every command, see what to run next:

$ kb-arena ingest datasets/my-docs/raw/ --corpus my-docs
Done. 12 documents, 47 sections → datasets/my-docs/processed/documents.jsonl

Next: kb-arena build-graph --corpus my-docs && kb-arena build-vectors --corpus my-docs

Progress bars — Every long-running command shows real-time progress (extraction sections, Neo4j batch loading, vector index building, question generation tiers).

Cost tracking — Benchmark runs display cumulative API cost in the progress bar and print per-strategy cost/accuracy summaries after completion.

Verbose mode — Add --verbose / -v to any command for debug logging:

kb-arena benchmark --corpus my-docs --verbose

Reliability and Performance

LLM retry with backoff — All Anthropic API calls retry up to 3 times with exponential backoff (1s, 2s, 4s) on rate limits, timeouts, and server errors. Every call has a 60-second hard timeout.
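
The schedule is the standard doubling backoff; a sketch of the pattern (illustrative, not the project's client code):

import time

def call_with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    # Initial attempt plus up to 3 retries, sleeping 1s, 2s, 4s between them.
    # Each underlying API call would also carry its own hard timeout (60s here).
    for attempt in range(retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)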

Parallel graph extraction — Entity/relationship extraction from document sections runs up to 5 sections concurrently (5-10x faster than sequential for large corpora).

Parallel hybrid reranking — The hybrid strategy's passage re-ranking runs all scoring calls concurrently using Haiku instead of sequential Sonnet calls (~50s to ~5s on procedural queries).

Evaluation independence — The LLM-as-judge evaluator uses a different model (JUDGE_MODEL, defaults to Opus) than the generation model (Sonnet) to avoid same-model scoring bias.

Cypher safety — LLM-generated Cypher queries are checked for write operations (CREATE, MERGE, DELETE, etc.) and blocked before execution. Only read queries reach Neo4j.
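
A guard like the following captures the idea (the keyword list extends the examples above; it is illustrative, not the project's actual check):

import re

WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b",
    re.IGNORECASE,
)

def is_read_only(cypher: str) -> bool:
    # Reject any query containing a write/DDL clause before it reaches Neo4j.
    return WRITE_CLAUSES.search(cypher) is None

assert is_read_only("MATCH (c:Component)-[:DEPENDS_ON]->(t:Topic) RETURN c, t")
assert not is_read_only("MERGE (c:Component {name: 'Lambda'})")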


Environment Variables

All prefixed with KB_ARENA_. Loaded from .env or environment.

| Variable | Default | Required | Description |
| --- | --- | --- | --- |
| ANTHROPIC_API_KEY | | Yes | Claude for generation, evaluation, extraction |
| OPENAI_API_KEY | | Yes | OpenAI for text-embedding-3-large |
| NEO4J_URI | bolt://localhost:7687 | No | Neo4j connection |
| NEO4J_USER | neo4j | No | Neo4j username |
| NEO4J_PASSWORD | | No | Neo4j password (set to match NEO4J_AUTH in docker-compose) |
| JUDGE_MODEL | claude-opus-4-6 | No | Model used for LLM-as-judge evaluation (default differs from generate model to avoid self-evaluation bias) |
| CHROMA_PATH | ./chroma_data | No | ChromaDB storage path |
| EMBEDDING_MODEL | text-embedding-3-large | No | OpenAI embedding model |
| EMBEDDING_DIMENSIONS | 3072 | No | Embedding vector dimensions |
| GENERATE_MODEL | claude-sonnet-4-6 | No | Generation model |
| FAST_MODEL | claude-haiku-4-5-20251001 | No | Classification model |
| HOST | 0.0.0.0 | No | Server bind address |
| PORT | 8000 | No | Server port |
| DEBUG | false | No | Debug mode |
| BENCHMARK_TEMPERATURE | 0.0 | No | LLM temperature for benchmarks |
| BENCHMARK_MAX_CONCURRENT | 5 | No | Parallel benchmark queries |
| BENCHMARK_QUERY_TIMEOUT_S | 120 | No | Per-query timeout (seconds) |
| BENCHMARK_MAX_RETRIES | 2 | No | Retry count on failures |
| PAGEINDEX_BEAM_WIDTH | 3 | No | Branches to explore per tree level |
| PAGEINDEX_MAX_DEPTH | 4 | No | Maximum tree traversal depth |
| DATASETS_PATH | ./datasets | No | Datasets directory |
| RESULTS_PATH | ./results | No | Results output directory |

Development

# Install with dev dependencies
pip install -e '.[dev]'

# Run tests
pytest tests/ -v --ignore=tests/live  # 514 tests

# Lint + format
ruff check . && ruff format --check .

# Frontend
cd web && npm install && npx next build

Tech Stack

| Component | Technology |
| --- | --- |
| LLM | Claude Sonnet 4.6 (generation) + Haiku 4.5 (classification) |
| Embeddings | text-embedding-3-large (3072-dim) |
| Vector store | ChromaDB 0.5 (local, no server) |
| Graph store | Neo4j 5 Community |
| Backend | FastAPI + SSE streaming |
| Frontend | Next.js 14 + Tailwind + Recharts |
| Models | Pydantic v2 |
| CLI | Typer + Rich |
| Testing | pytest (514 tests) |

License

MIT
