Benchmark retrieval strategies on your documentation — find which KB architecture fits your data
KB Arena
Which retrieval architecture works best for your documentation?
KB Arena benchmarks 7 retrieval strategies — naive vector, contextual vector, Q&A pairs, knowledge graph, hybrid, RAPTOR, and PageIndex — on your documentation. Bring your docs in any format, run the pipeline, get empirical results. Ships with an AWS Compute corpus (75 questions across 5 difficulty tiers) as a built-in example.
Try It in 10 Seconds
No API keys. No Docker. Just explore real benchmark results:
pip install kb-arena
kb-arena demo
This launches the dashboard with pre-computed results from the AWS Compute corpus (75 questions, 7 strategies, 5 difficulty tiers).
How KB Arena Differs from Other RAG Evaluation Tools
Most RAG evaluation tools answer "how well does my pipeline work?" KB Arena answers a different question: "which retrieval architecture works best for my docs?"
| | KB Arena | RAGAS | MTEB / BEIR | GraphRAG | DeepEval |
|---|---|---|---|---|---|
| Compares multiple architectures | Yes - 7 strategies | No - evaluates your existing pipeline | No - compares embedding models | No - only their own approach | No |
| Works on your own docs | Yes | Yes | No - fixed public datasets | No - fixed datasets | Yes |
| Includes graph + vector + hybrid | Yes | Vector/hybrid only | Embeddings only | Graph only | Any |
| Auto-generates benchmark questions | Yes - 5 difficulty tiers | Manual | Fixed | Fixed | Manual |
| Interactive comparison UI | Yes - chatbot + benchmark explorer | No | Leaderboard only | No | Dashboard |
| Chatbot per strategy | Yes | No | No | No | No |
| Standard IR metrics (NDCG, MRR) | Roadmap | Yes | Yes | Partial | No |
If you want to know whether a knowledge graph, Q&A pairs, or plain vector search is the right architecture for your documentation, that's what KB Arena is for.
Quick Start — I Just Have My Docs
You have documentation files (markdown, HTML, text, PDFs). You want to know which retrieval strategy works best. Here's everything from zero.
Prerequisites
- Python 3.11+ and pip
- Docker (for Neo4j — the knowledge graph strategy needs it)
- API keys for Anthropic (LLM) and OpenAI (embeddings)
That's it. No Neo4j or graph database experience is required: KB Arena handles the schema, extraction, and querying.
Step 1: Install
pip install kb-arena
# Optional: install format-specific parsers
pip install kb-arena[pdf] # PDF support (PyMuPDF)
pip install kb-arena[docx] # Word documents (mammoth)
pip install kb-arena[web] # Web scraping (httpx)
pip install kb-arena[all-formats] # All of the above
Step 2: Set API keys
Create a .env file or export directly:
export KB_ARENA_ANTHROPIC_API_KEY=sk-ant-... # Claude for generation + evaluation
export KB_ARENA_OPENAI_API_KEY=sk-... # OpenAI for text-embedding-3-large
Step 3: Start Neo4j
KB Arena uses Neo4j for the knowledge graph strategy. One command:
docker compose up neo4j -d
This starts Neo4j on localhost:7687 with default credentials (neo4j / kbarena1). No configuration needed — KB Arena creates the schema automatically.
If you don't have the docker-compose.yml, create one:
services:
  neo4j:
    image: neo4j:5-community
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/kbarena1
      - NEO4J_PLUGINS=["apoc"]
    volumes:
      - neo4j_data:/data
volumes:
  neo4j_data:
Don't want Docker? KB Arena still works — the vector strategies, RAPTOR, and PageIndex run without Neo4j. Only the knowledge graph and hybrid strategies need it.
Step 4: Run the pipeline
# Scaffold a new corpus
kb-arena init-corpus my-docs
# Drop your docs into the raw/ directory
cp ~/my-documentation/*.md datasets/my-docs/raw/
# Supports: .md, .html, .txt, .pdf, .docx, .csv, .tsv — auto-detected
# Parse into the unified Document model (JSONL intermediate files)
kb-arena ingest datasets/my-docs/raw/ --corpus my-docs
# Or ingest directly from a URL or GitHub repo
kb-arena ingest https://docs.example.com --corpus my-docs
kb-arena ingest github:owner/repo --corpus my-docs
# Build the knowledge graph in Neo4j (entities + relationships)
kb-arena build-graph --corpus my-docs
# Build vector indexes in ChromaDB (local, no server needed)
kb-arena build-vectors --corpus my-docs
# Auto-generate benchmark questions from your docs (10 per difficulty tier)
kb-arena generate-questions --corpus my-docs --count 50
# Run the benchmark (each question x 7 strategies, 4-pass evaluation)
kb-arena benchmark --corpus my-docs
# Launch the web UI to explore results
kb-arena serve
Open http://localhost:8000 for the API, http://localhost:3000 for the dashboard.
Step 5: Read the results
The benchmark produces:
- Accuracy by tier — which strategy handles simple lookups vs multi-hop architecture questions
- Latency percentiles — p50, p95, p99 per strategy
- Cost per query — token usage and API cost
- Composite ranking — 0.5 * accuracy + 0.3 * reliability + 0.2 * latency
Results are saved to results/ as JSON and displayed in the web dashboard.
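The composite ranking formula can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KB Arena's implementation: it assumes all three components are already normalized to the 0..1 range and that the latency term is a score where higher means faster. The names `StrategyScore`, `composite`, and `rank`, and the sample numbers, are hypothetical.

```python
from dataclasses import dataclass

# Weights quoted in the composite ranking formula.
W_ACCURACY, W_RELIABILITY, W_LATENCY = 0.5, 0.3, 0.2


@dataclass(frozen=True)
class StrategyScore:
    name: str
    accuracy: float       # fraction of questions judged correct, 0..1
    reliability: float    # assumption: fraction of queries that completed, 0..1
    latency_score: float  # assumption: 1.0 = fastest, 0.0 = slowest


def composite(s: StrategyScore) -> float:
    """Weighted sum of the three normalized components."""
    return (W_ACCURACY * s.accuracy
            + W_RELIABILITY * s.reliability
            + W_LATENCY * s.latency_score)


def rank(scores: list[StrategyScore]) -> list[StrategyScore]:
    """Order strategies from highest to lowest composite score."""
    return sorted(scores, key=composite, reverse=True)


# Hypothetical inputs, for illustration only.
fast_but_wrong = StrategyScore("fast_but_wrong", 0.20, 1.00, 1.00)
slow_but_right = StrategyScore("slow_but_right", 0.80, 0.95, 0.10)
for s in rank([fast_but_wrong, slow_but_right]):
    print(f"{s.name}: {composite(s):.3f}")
```

With accuracy weighted at 0.5, a slow strategy that answers correctly can still outrank a fast one that does not, which is the trade-off the weighting is meant to express.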
Full Stack with Docker Compose
Run everything — Neo4j, the API server, and the frontend — in one command:
# Set your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
# Start all services
docker compose up -d
# Open the dashboard
open http://localhost:3000
The compose file starts Neo4j (port 7474/7687), the FastAPI backend (port 8000), and the Next.js frontend (port 3000).
Using the Built-in AWS Example
The AWS Compute corpus ships ready to use (75 questions across 5 difficulty tiers):
kb-arena ingest ./datasets/aws-compute/raw/ --corpus aws-compute
kb-arena build-graph --corpus aws-compute
kb-arena build-vectors --corpus aws-compute
kb-arena benchmark --corpus aws-compute
kb-arena serve
Screenshots
Home — Overview of the 7 strategies, difficulty tiers, and evaluation methodology.
Strategy comparison — Ask the same question to all 7 strategies simultaneously. Compare answers, sources, latency, and cost side-by-side.
Benchmark results — Accuracy table by tier with grouped bar chart.
Knowledge graph — Interactive force-directed visualization of entities extracted from your docs.
Live graph build — Watch entities and relationships stream in as the extractor runs.
Documentation Tools
Beyond benchmarking, KB Arena includes three standalone tools that work on any documentation corpus.
All three tools are available as CLI commands and through the web UI at /tools.
Q&A Generator
Generate Q&A pairs from your docs — use them for chatbot training, FAQ pages, or search indexes. Only needs an Anthropic key (no embeddings, no vector DB).
kb-arena generate-qa --corpus my-docs
# Outputs: datasets/my-docs/qa-pairs/qa_pairs.jsonl
Docs Gap Analyzer
Find what's missing in your documentation before your users complain about it. Generates Q&A pairs per section, self-evaluates them, and classifies each section as strong (>=70%), weak (30-70%), or gap (<30%).
kb-arena audit --corpus my-docs
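The strong/weak/gap thresholds can be expressed as a small function. This is a sketch, not the tool's code: it assumes each section's score is a pass rate between 0 and 1 from the self-evaluation step, and it treats exactly 30% as weak, since gap is defined as strictly below 30%. The name `classify_section` is hypothetical.

```python
def classify_section(pass_rate: float) -> str:
    """Map a section's self-evaluation pass rate (0..1) to an audit label."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if pass_rate >= 0.70:
        return "strong"
    if pass_rate >= 0.30:
        return "weak"
    return "gap"


# Hypothetical section scores, for illustration only.
for section, rate in {"Install": 0.9, "Networking": 0.5, "Billing": 0.1}.items():
    print(section, classify_section(rate))
```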
Fix My Docs
Get actionable recommendations with draft content to improve your docs. Runs the audit internally, then generates prioritized fixes for weak and gap sections.
kb-arena fix --corpus my-docs --max-fixes 5
Pipeline: generate-qa → audit → fix — each command builds on the previous. Or run them independently via CLI or the web UI.
Benchmark Results (AWS Compute Corpus)
Real numbers from 75 questions across 5 difficulty tiers, evaluated with a 4-pass system (structural checks + LLM-as-judge):
| Strategy | Overall | T1 Lookup | T2 How-To | T3 Comparison | T4 Integration | T5 Architecture | Avg Latency | Cost |
|---|---|---|---|---|---|---|---|---|
| Q&A Pairs | 79.2% | 79% | 85% | 83% | 84% | 66% | 9.0s | $0.48 |
| Knowledge Graph | 71.5% | 72% | 69% | 61% | 77% | 79% | 20.3s | $1.37 |
| Hybrid | 64.7% | 39% | 81% | 61% | 80% | 62% | 41.5s | $3.02 |
| RAPTOR | 25.3% | 30% | 16% | 15% | 36% | 30% | 7.2s | $0.69 |
| Naive Vector | 20.7% | 27% | 15% | 14% | 26% | 22% | 6.4s | $0.33 |
| Contextual Vector | 16.5% | 25% | 11% | 9% | 26% | 11% | 5.1s | $0.29 |
| PageIndex | 14.3% | 19% | 12% | 7% | 21% | 12% | 10.9s | $0.29 |
Key takeaway: Q&A pairs lead overall (79.2%) because pre-generated answers sidestep retrieval failures. Knowledge graph leads on architecture questions (T5: 79%), where structured graph traversal helps most. The hybrid strategy adapts per question type but pays for it in latency (41.5s) and cost ($3.02). PageIndex, the vectorless, reasoning-based approach, scores close to contextual vector (14.3% vs 16.5%) at the same $0.29, which suggests that LLM tree traversal can stand in for embeddings on well-structured docs. RAPTOR's hierarchical retrieval scores highest at T4 (36%), with T5 at 30%, but needs a larger corpus. Pure vector RAG, which is what most teams ship, scores under 21%. Cost for the full 75-question benchmark ranges from $0.29 (contextual vector, PageIndex) to $3.02 (hybrid).
These are results from the built-in AWS Compute corpus. Your mileage will vary — that's the whole point of running it on your own docs.
The 7 Strategies
| # | Strategy | How it works | Best at |
|---|---|---|---|
| 1 | Naive Vector | Chunk → embed → cosine similarity → generate | Fast lookups, simple factoid questions |
| 2 | Contextual Vector | Chunk + parent context → embed → rank | Disambiguating domain-specific terms |
| 3 | Q&A Pairs | LLM pre-generates Q&A at index time → match | Common questions with known answers |
| 4 | Knowledge Graph | Entities → Neo4j → Cypher templates → generate | Multi-hop dependencies, cross-topic queries |
| 5 | Hybrid | Intent routing → vector or graph or both (RRF) | Adapts per question type |
| 6 | RAPTOR | Cluster chunks → LLM topic summaries → recursive tree → query all levels | Cross-document synthesis, broad topic questions |
| 7 | PageIndex | Build tree index from doc structure → LLM beam search traversal → no vectors | Well-structured docs, reasoning over hierarchy |
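The hybrid row mentions RRF (reciprocal rank fusion) for the case where both vector and graph results are used. The sketch below shows the standard RRF formula; how KB Arena wires it into the hybrid strategy is not described here, so the function name, the document IDs, and the choice of `k=60` (the constant commonly used for RRF) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector retriever and a graph retriever.
vector_hits = ["a", "b", "c"]
graph_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([vector_hits, graph_hits]))  # ['a', 'c', 'b', 'd']
```

RRF only needs ranks, not scores, which makes it a convenient way to merge retrievers whose scores are not comparable (cosine similarity versus graph query results).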
Question Tiers
Questions are organized into 5 difficulty tiers:
| Tier | Type | Hops | What it tests |
|---|---|---|---|
| 1 | Lookup | 1 | Single-fact lookup from one document |
| 2 | How-To | 1-2 | Multi-step processes, configuration sequences |
| 3 | Comparison | 2-3 | Comparing alternatives, trade-offs between options |
| 4 | Integration | 3-4 | Dependencies and connections between concepts |
| 5 | Architecture | 3-5 | Cross-document synthesis, transitive reasoning |
Use kb-arena generate-questions to auto-generate questions from your docs, or write them by hand in YAML.
Supported Formats
| Format | Extensions / Input | Optional Dep | Notes |
|---|---|---|---|
| Markdown | .md, .markdown, .rst | - | Heading hierarchy, code blocks, tables |
| HTML | .html, .htm | - | Strips nav/footer, extracts structure |
| Plain text | .txt, .text | - | ALL CAPS heading detection |
| PDF | .pdf | kb-arena[pdf] | Font-size heading detection, table extraction |
| Word | .docx | kb-arena[docx] | Converts to HTML, then extracts structure |
| CSV / TSV | .csv, .tsv | - | Auto-detects delimiter, groups rows into sections |
| Web URL | https://... | kb-arena[web] | Crawls same-domain pages; uses /llms.txt if available |
| GitHub | github:owner/repo | - | Shallow clones and ingests all doc files |
| SEC EDGAR | --format sec-edgar | - | 10-K/10-Q filing parser |
Universal Documentation Schema
KB Arena extracts entities and relationships using a universal schema that works for any documentation domain:
- 5 node types: Topic, Component, Process, Config, Constraint
- 7 relationship types: DEPENDS_ON, CONTAINS, CONNECTS_TO, TRIGGERS, CONFIGURES, ALTERNATIVE_TO, EXTENDS
No per-domain configuration needed. The LLM maps your documentation concepts to these types automatically.
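To make the schema concrete, here is one way the node and relationship types could be modeled in Python. Only the type names come from the schema above; the `Entity` and `Relationship` classes, the `dangling` helper, and the sample entities are hypothetical and are not KB Arena's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(str, Enum):
    TOPIC = "Topic"
    COMPONENT = "Component"
    PROCESS = "Process"
    CONFIG = "Config"
    CONSTRAINT = "Constraint"


class RelType(str, Enum):
    DEPENDS_ON = "DEPENDS_ON"
    CONTAINS = "CONTAINS"
    CONNECTS_TO = "CONNECTS_TO"
    TRIGGERS = "TRIGGERS"
    CONFIGURES = "CONFIGURES"
    ALTERNATIVE_TO = "ALTERNATIVE_TO"
    EXTENDS = "EXTENDS"


@dataclass(frozen=True)
class Entity:
    name: str
    type: NodeType


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    type: RelType


def dangling(entities: list[Entity], rels: list[Relationship]) -> list[Relationship]:
    """Return relationships whose source or target is not a declared entity."""
    known = {e.name for e in entities}
    return [r for r in rels if r.source not in known or r.target not in known]


# Hypothetical extraction output, for illustration only.
entities = [Entity("Web Server", NodeType.COMPONENT), Entity("Timeout", NodeType.CONFIG)]
rels = [
    Relationship("Timeout", "Web Server", RelType.CONFIGURES),
    Relationship("Web Server", "Database", RelType.DEPENDS_ON),  # "Database" undeclared
]
print([f"{r.source} -> {r.target}" for r in dangling(entities, rels)])
```

A check like `dangling` is useful before loading into a graph store, because an LLM extractor can emit relationships that point at entities it never declared.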
CLI Reference
| Command | Description |
|---|---|
| demo | Launch dashboard with pre-computed results (no API keys needed) |
| init-corpus <name> | Scaffold datasets/{name}/ directories |
| ingest <path> | Parse docs into JSONL. Accepts files, dirs, URLs, github:owner/repo. Options: --corpus, --format, --dry-run |
| build-graph | Extract entities/rels into Neo4j. Options: --corpus |
| build-vectors | Build vector indexes + PageIndex tree. Options: --corpus, --strategy |
| generate-questions | Auto-generate benchmark questions. Options: --corpus, --count |
| benchmark | Run evaluation. Options: --corpus, --strategy, --tier, --dry-run |
| generate-qa | Generate Q&A pairs from your docs as JSONL. Options: --corpus, --output |
| audit | Find documentation gaps; classifies sections as strong/weak/gap. Options: --corpus, --output, --max-sections |
| fix | Generate fix recommendations with draft content. Options: --corpus, --max-fixes, --output |
| report | Generate report. Options: --corpus, --output, --format (rich/json) |
| serve | Launch API + frontend. Options: --host, --port |
| health | Pipeline status. Options: --format (rich/json) |
All commands are independently re-runnable. Each stage writes to disk so you can re-run any step without repeating earlier ones.
CLI Features
Dry run — Preview what a command will do before committing to expensive LLM calls:
kb-arena ingest datasets/my-docs/raw/ --corpus my-docs --dry-run
# Shows: file count by extension, parser assignment, output path
kb-arena benchmark --corpus my-docs --dry-run
# Shows: question count, strategy list, total queries, concurrency settings
JSON output — Pipe structured data to jq, scripts, or CI pipelines:
kb-arena report --corpus my-docs --format json | jq '.corpora'
kb-arena health --format json | jq '.services'
Pipeline hints — After every command, see what to run next:
$ kb-arena ingest datasets/my-docs/raw/ --corpus my-docs
Done. 12 documents, 47 sections → datasets/my-docs/processed/documents.jsonl
Next: kb-arena build-graph --corpus my-docs && kb-arena build-vectors --corpus my-docs
Progress bars — Every long-running command shows real-time progress (extraction sections, Neo4j batch loading, vector index building, question generation tiers).
Cost tracking — Benchmark runs display cumulative API cost in the progress bar and print per-strategy cost/accuracy summaries after completion.
Environment Variables
All prefixed with KB_ARENA_. Loaded from .env or environment.
| Variable | Default | Required | Description |
|---|---|---|---|
| ANTHROPIC_API_KEY | - | Yes | Claude for generation, evaluation, extraction |
| OPENAI_API_KEY | - | Yes | OpenAI for text-embedding-3-large |
| NEO4J_URI | bolt://localhost:7687 | No | Neo4j connection |
| NEO4J_USER | neo4j | No | Neo4j username |
| NEO4J_PASSWORD | - | No | Neo4j password (set to match NEO4J_AUTH in docker-compose) |
| JUDGE_MODEL | claude-opus-4-6 | No | Model used for LLM-as-judge evaluation (default differs from generate model to avoid self-evaluation bias) |
| CHROMA_PATH | ./chroma_data | No | ChromaDB storage path |
| EMBEDDING_MODEL | text-embedding-3-large | No | OpenAI embedding model |
| EMBEDDING_DIMENSIONS | 3072 | No | Embedding vector dimensions |
| GENERATE_MODEL | claude-sonnet-4-6 | No | Generation model |
| FAST_MODEL | claude-haiku-4-5-20251001 | No | Classification model |
| HOST | 0.0.0.0 | No | Server bind address |
| PORT | 8000 | No | Server port |
| DEBUG | false | No | Debug mode |
| BENCHMARK_TEMPERATURE | 0.0 | No | LLM temperature for benchmarks |
| BENCHMARK_MAX_CONCURRENT | 5 | No | Parallel benchmark queries |
| BENCHMARK_QUERY_TIMEOUT_S | 120 | No | Per-query timeout (seconds) |
| BENCHMARK_MAX_RETRIES | 2 | No | Retry count on failures |
| PAGEINDEX_BEAM_WIDTH | 3 | No | Branches to explore per tree level |
| PAGEINDEX_MAX_DEPTH | 4 | No | Maximum tree traversal depth |
| DATASETS_PATH | ./datasets | No | Datasets directory |
| RESULTS_PATH | ./results | No | Results output directory |
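The prefix convention means a variable listed as `PORT` is read from `KB_ARENA_PORT`. The sketch below illustrates that lookup with a handful of the defaults from the table. It is a stdlib-only illustration, not KB Arena's settings code: the `Settings` class, `load_settings`, and the set of strings accepted as boolean true are all assumptions.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping

PREFIX = "KB_ARENA_"


@dataclass(frozen=True)
class Settings:
    # A subset of the variables in the table, with their documented defaults.
    neo4j_uri: str = "bolt://localhost:7687"
    neo4j_user: str = "neo4j"
    port: int = 8000
    debug: bool = False
    benchmark_temperature: float = 0.0
    benchmark_max_concurrent: int = 5
    pageindex_beam_width: int = 3


def _convert(raw: str, target: type):
    """Convert an environment string to the type of the field's default."""
    if target is bool:
        # Assumption: these spellings count as true; everything else is false.
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return target(raw)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from KB_ARENA_-prefixed variables, falling back to defaults."""
    overrides = {}
    for f in fields(Settings):
        raw = env.get(PREFIX + f.name.upper())
        if raw is not None:
            overrides[f.name] = _convert(raw, type(f.default))
    return Settings(**overrides)


# Example with an explicit mapping instead of the real environment.
print(load_settings({"KB_ARENA_PORT": "9000", "KB_ARENA_DEBUG": "true"}))
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.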
Development
# Install with dev dependencies
pip install -e '.[dev]'
# Run tests
pytest tests/ -v --ignore=tests/live # 454 tests
# Lint + format
ruff check . && ruff format --check .
# Frontend
cd web && npm install && npx next build
Tech Stack
| Component | Technology |
|---|---|
| LLM | Claude Sonnet 4.6 (generation) + Haiku 4.5 (classification) |
| Embeddings | text-embedding-3-large (3072-dim) |
| Vector store | ChromaDB 0.5 (local, no server) |
| Graph store | Neo4j 5 Community |
| Backend | FastAPI + SSE streaming |
| Frontend | Next.js 14 + Tailwind + Recharts |
| Models | Pydantic v2 |
| CLI | Typer + Rich |
| Testing | pytest (454 tests) |
License
MIT