munch-bench
Retrieval + Inference benchmark for LLM-powered codebase Q&A.
munch-bench measures how well different LLMs answer questions about real codebases when given retrieval-augmented context from jCodeMunch.
What it measures
| Metric | Description |
|---|---|
| Retrieval P@5 / P@10 | Fraction of top-k retrieved symbols that match ground truth |
| Retrieval Recall | Fraction of ground-truth symbols found in retrieved context |
| LLM Judge Score | 0–1 accuracy rating by an LLM judge comparing answers to ground truth |
| Exact Match Rate | Whether the answer contains key content from ground truth |
| Wall Time | End-to-end latency (retrieval + inference) |
| Cost | Estimated API cost per run |
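The retrieval metrics above can be sketched in a few lines. This is illustrative only; the real implementation lives in `evaluate.py` and may differ, and the symbol names below are made-up examples:

```python
# Illustrative implementations of retrieval P@k and recall, as described
# in the metrics table. Symbols are compared as plain strings.

def precision_at_k(retrieved: list[str], ground_truth: set[str], k: int) -> float:
    """Fraction of the top-k retrieved symbols that match ground truth."""
    if k <= 0:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for s in top_k if s in ground_truth) / k

def recall(retrieved: list[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth symbols found in the retrieved context."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & set(retrieved)) / len(ground_truth)

retrieved = ["Flask.route", "Blueprint", "url_for", "g", "session"]
truth = {"Flask.route", "url_for"}
print(precision_at_k(retrieved, truth, k=5))  # 0.4
print(recall(retrieved, truth))               # 1.0
```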
Corpus
141 questions across 14 repos spanning Python, JavaScript, TypeScript, Go, Rust, Java, and C++:
| Repo | Questions | Languages |
|---|---|---|
| pallets/flask | 11 | Python |
| tiangolo/fastapi | 10 | Python |
| django/django | 10 | Python |
| psf/requests | 10 | Python |
| langchain-ai/langchain | 10 | Python |
| pytorch/pytorch | 10 | Python/C++ |
| expressjs/express | 10 | JavaScript |
| facebook/react | 10 | JavaScript |
| vercel/next.js | 10 | TypeScript |
| vuejs/core | 10 | TypeScript |
| gin-gonic/gin | 10 | Go |
| tokio-rs/axum | 10 | Rust |
| spring-projects/spring-boot | 10 | Java |
| jgravelle/jcodemunch-mcp | 10 | Python |
Questions are categorized by difficulty (easy/medium/hard) and type (api/architecture/debugging/refactoring).
Quick start
    pip install munch-bench
Prerequisites

1. Index the repos you want to benchmark against using jCodeMunch:

        jcodemunch-mcp index pallets/flask
        jcodemunch-mcp index tiangolo/fastapi
        # ... etc

2. Set API keys for your chosen provider:

        export GROQ_API_KEY=gsk_...          # for Groq
        export OPENAI_API_KEY=sk-...         # for OpenAI
        export ANTHROPIC_API_KEY=sk-ant-...  # for Anthropic
Run a benchmark
    # Run with Groq (default — fastest + cheapest)
    munch-bench run --provider groq

    # Run with a specific model
    munch-bench run --provider groq --model llama-3.3-70b-versatile

    # Run with OpenAI
    munch-bench run --provider openai --model gpt-4o-mini

    # Run with Anthropic
    munch-bench run --provider anthropic --model claude-sonnet-4-6

    # Filter to specific repos or difficulty
    munch-bench run --provider groq --repo pallets/flask --difficulty hard

    # Custom token budget for retrieval
    munch-bench run --provider groq --token-budget 16000 -v
Results are saved to `results/<provider>_<model>_<date>.json`.
Compare runs and generate leaderboard
    # Compare multiple runs
    munch-bench compare results/*.json -o leaderboard.html

    # View corpus statistics
    munch-bench corpus-stats
Leaderboard
The leaderboard is a static HTML page with interactive Chart.js visualizations, deployed to GitHub Pages on every push to main.
Architecture
    munch-bench/
      corpus/               # YAML question files (one per repo)
      src/munch_bench/
        cli.py              # CLI entrypoint: run, compare, corpus-stats
        corpus.py           # YAML corpus loader + filtering
        retrieval.py        # jCodeMunch retrieval wrapper
        inference.py        # Provider dispatch (Groq, OpenAI, Anthropic)
        evaluate.py         # Metrics: P@k, recall, exact match, LLM judge
        runner.py           # Orchestrator with rich progress
        leaderboard.py      # Static HTML + Chart.js generator
      tests/
      results/              # JSON benchmark outputs (gitignored)
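The provider dispatch in `inference.py` can be pictured as a name-to-callable mapping. The sketch below is hypothetical and uses stubs in place of real API calls (the actual code would wrap the Groq, OpenAI, and Anthropic SDKs):

```python
# Hypothetical provider-dispatch pattern: map a provider name to a callable
# with a common (model, prompt) -> answer signature. The bodies are stubs.
from typing import Callable

def _groq(model: str, prompt: str) -> str:
    return f"[groq/{model}] <answer>"       # stub; real code calls the Groq API

def _openai(model: str, prompt: str) -> str:
    return f"[openai/{model}] <answer>"     # stub

def _anthropic(model: str, prompt: str) -> str:
    return f"[anthropic/{model}] <answer>"  # stub

PROVIDERS: dict[str, Callable[[str, str], str]] = {
    "groq": _groq,
    "openai": _openai,
    "anthropic": _anthropic,
}

def infer(provider: str, model: str, prompt: str) -> str:
    """Dispatch an inference request to the named provider backend."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDERS[provider](model, prompt)

print(infer("groq", "llama-3.3-70b-versatile", "How does routing work?"))
```

Keeping every backend behind one signature is what lets `--provider` and `--model` stay uniform flags on the CLI.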
Adding questions
Create a YAML file in corpus/:
    repo: owner/name
    questions:
      - id: unique-id-001
        question: "How does X work?"
        ground_truth_answer: "X works by..."
        ground_truth_symbols: ["function_name", "ClassName"]
        difficulty: medium        # easy, medium, hard
        category: architecture    # api, architecture, debugging, refactoring
        tags: [optional, tags]
License
Apache 2.0
Powered by jCodeMunch + Groq
File details
Details for the file munch_bench-0.2.0.tar.gz.
File metadata
- Download URL: munch_bench-0.2.0.tar.gz
- Upload date:
- Size: 18.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `dd256810b987789c3df9e65ab3f30b6a94f2629f6f76190adf33a45017e038bd` |
| MD5 | `b5ffab573a8ff69e6c969a36b9e501b1` |
| BLAKE2b-256 | `c2c78bb68ce1f29c61d643e1d8301c859a9b47690039e2d02ebcf572da134393` |
File details
Details for the file munch_bench-0.2.0-py3-none-any.whl.
File metadata
- Download URL: munch_bench-0.2.0-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cfb004c6489f81e51ebd968b02c8a0c32b8906ca9a6f144de2cfc4b837a6f2fe` |
| MD5 | `ecee31650ae097105b835f54ab93d743` |
| BLAKE2b-256 | `4dc81238878e6b7fec397a303d321843bf3e84ebd0f2d6f0eae10d918895a79f` |