
Project description

munch-bench

Retrieval + Inference benchmark for LLM-powered codebase Q&A.

munch-bench measures how well different LLMs answer questions about real codebases when given retrieval-augmented context from jCodeMunch.

What it measures

| Metric | Description |
| --- | --- |
| Retrieval P@5 / P@10 | Fraction of the top-k retrieved symbols that match ground truth |
| Retrieval Recall | Fraction of ground-truth symbols found in the retrieved context |
| LLM Judge Score | 0–1 accuracy rating from an LLM judge comparing answers to ground truth |
| Exact Match Rate | Whether the answer contains key content from the ground truth |
| Wall Time | End-to-end latency (retrieval + inference) |
| Cost | Estimated API cost per run |
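The two retrieval metrics above are standard set-overlap measures over symbol names. A minimal sketch (illustrative only, not the actual munch-bench implementation):

```python
def precision_at_k(retrieved, ground_truth, k):
    """Fraction of the top-k retrieved symbols that appear in ground truth."""
    if k <= 0:
        return 0.0
    truth = set(ground_truth)
    hits = sum(1 for s in retrieved[:k] if s in truth)
    return hits / k

def recall(retrieved, ground_truth):
    """Fraction of ground-truth symbols found anywhere in the retrieved set."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(set(retrieved) & truth) / len(truth)

# Hypothetical example: 2 of the top 5 retrieved symbols are correct,
# and 2 of the 3 ground-truth symbols were retrieved.
retrieved = ["Flask.route", "Blueprint", "url_for", "render_template", "g"]
truth = ["Flask.route", "url_for", "current_app"]
print(precision_at_k(retrieved, truth, 5))  # 2/5 -> 0.4
print(recall(retrieved, truth))             # 2/3 -> 0.666...
```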

Corpus

141 questions across 14 repos spanning Python, JavaScript, TypeScript, Go, Rust, Java, and C++:

| Repo | Questions | Languages |
| --- | --- | --- |
| pallets/flask | 11 | Python |
| tiangolo/fastapi | 10 | Python |
| django/django | 10 | Python |
| psf/requests | 10 | Python |
| langchain-ai/langchain | 10 | Python |
| pytorch/pytorch | 10 | Python/C++ |
| expressjs/express | 10 | JavaScript |
| facebook/react | 10 | JavaScript |
| vercel/next.js | 10 | TypeScript |
| vuejs/core | 10 | TypeScript |
| gin-gonic/gin | 10 | Go |
| tokio-rs/axum | 10 | Rust |
| spring-projects/spring-boot | 10 | Java |
| jgravelle/jcodemunch-mcp | 10 | Python |

Questions are categorized by difficulty (easy/medium/hard) and type (api/architecture/debugging/refactoring).

Quick start

pip install munch-bench

Prerequisites

  1. Index the repos you want to benchmark against using jCodeMunch:

    jcodemunch-mcp index pallets/flask
    jcodemunch-mcp index tiangolo/fastapi
    # ... etc
    
  2. Set API keys for your chosen provider:

    export GROQ_API_KEY=gsk_...        # for Groq
    export OPENAI_API_KEY=sk-...       # for OpenAI
    export ANTHROPIC_API_KEY=sk-ant-...  # for Anthropic
    

Run a benchmark

# Run with Groq (default — fastest + cheapest)
munch-bench run --provider groq

# Run with a specific model
munch-bench run --provider groq --model llama-3.3-70b-versatile

# Run with OpenAI
munch-bench run --provider openai --model gpt-4o-mini

# Run with Anthropic
munch-bench run --provider anthropic --model claude-sonnet-4-6

# Filter to specific repos or difficulty
munch-bench run --provider groq --repo pallets/flask --difficulty hard

# Custom token budget for retrieval
munch-bench run --provider groq --token-budget 16000 -v

Results are saved to results/<provider>_<model>_<date>.json.
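Result files are plain JSON, so they are easy to post-process. A hedged sketch of aggregating one run; the field names (`results`, `judge_score`, `wall_time_s`) are assumptions for illustration, so check the schema your run actually produces:

```python
import json
from statistics import mean

# Stand-in for a loaded results file; in practice you would do
# data = json.loads(Path("results/groq_<model>_<date>.json").read_text())
sample = {
    "provider": "groq",
    "model": "llama-3.3-70b-versatile",
    "results": [
        {"id": "flask-001", "judge_score": 0.9, "wall_time_s": 2.1},
        {"id": "flask-002", "judge_score": 0.6, "wall_time_s": 3.4},
    ],
}
data = json.loads(json.dumps(sample))

scores = [r["judge_score"] for r in data["results"]]
print(f"{data['model']}: mean judge score {mean(scores):.2f} "
      f"over {len(scores)} questions")
```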

Compare runs and generate leaderboard

# Compare multiple runs
munch-bench compare results/*.json -o leaderboard.html

# View corpus statistics
munch-bench corpus-stats

Leaderboard

The leaderboard is a static HTML page with interactive Chart.js visualizations, deployed to GitHub Pages on every push to main.

Architecture

munch-bench/
  corpus/              # YAML question files (one per repo)
  src/munch_bench/
    cli.py             # CLI entrypoint: run, compare, corpus-stats
    corpus.py          # YAML corpus loader + filtering
    retrieval.py       # jCodeMunch retrieval wrapper
    inference.py       # Provider dispatch (Groq, OpenAI, Anthropic)
    evaluate.py        # Metrics: P@k, recall, exact match, LLM judge
    runner.py          # Orchestrator with rich progress
    leaderboard.py     # Static HTML + Chart.js generator
  tests/
  results/             # JSON benchmark outputs (gitignored)
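The provider dispatch that `inference.py` suggests can be sketched as a lookup table from provider name to a call function. This is a hypothetical outline, not munch-bench's actual code; the clients are stubbed, and real code would use each provider's SDK:

```python
from typing import Callable, Dict

def _call_groq(model: str, prompt: str) -> str:
    # Stub: real code would invoke the Groq SDK here.
    return f"[groq:{model}] answer"

def _call_openai(model: str, prompt: str) -> str:
    # Stub: real code would invoke the OpenAI SDK here.
    return f"[openai:{model}] answer"

def _call_anthropic(model: str, prompt: str) -> str:
    # Stub: real code would invoke the Anthropic SDK here.
    return f"[anthropic:{model}] answer"

PROVIDERS: Dict[str, Callable[[str, str], str]] = {
    "groq": _call_groq,
    "openai": _call_openai,
    "anthropic": _call_anthropic,
}

def run_inference(provider: str, model: str, prompt: str) -> str:
    """Route a prompt to the chosen provider, failing loudly on typos."""
    try:
        fn = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return fn(model, prompt)

print(run_inference("groq", "llama-3.3-70b-versatile", "How does routing work?"))
```

The table-of-callables shape keeps adding a provider to a one-line change, which matches the CLI's `--provider` flag.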

Adding questions

Create a YAML file in corpus/:

repo: owner/name
questions:
  - id: unique-id-001
    question: "How does X work?"
    ground_truth_answer: "X works by..."
    ground_truth_symbols: ["function_name", "ClassName"]
    difficulty: medium  # easy, medium, hard
    category: architecture  # api, architecture, debugging, refactoring
    tags: [optional, tags]
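Once parsed (e.g. with PyYAML's `safe_load`), a corpus file is a plain dict, and the `--difficulty`-style filtering can be sketched as below. The dict mirrors the YAML schema above; the loader in munch-bench itself lives in `corpus.py`, so details here are illustrative:

```python
# Stand-in for a parsed corpus YAML file.
corpus = {
    "repo": "owner/name",
    "questions": [
        {"id": "q-001", "difficulty": "easy", "category": "api"},
        {"id": "q-002", "difficulty": "hard", "category": "architecture"},
        {"id": "q-003", "difficulty": "hard", "category": "debugging"},
    ],
}

def filter_questions(corpus, difficulty=None, category=None):
    """Return questions matching the given difficulty and/or category."""
    out = []
    for q in corpus["questions"]:
        if difficulty and q["difficulty"] != difficulty:
            continue
        if category and q["category"] != category:
            continue
        out.append(q)
    return out

hard = filter_questions(corpus, difficulty="hard")
print([q["id"] for q in hard])  # ['q-002', 'q-003']
```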

License

Apache 2.0


Powered by jCodeMunch + Groq

Download files

Source Distribution: munch_bench-0.1.0.tar.gz (18.2 kB)

Built Distribution: munch_bench-0.1.0-py3-none-any.whl (17.3 kB)

File details

munch_bench-0.1.0.tar.gz

  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | fd4f911b500193f4572bd4a88404de3d75e83133007aae32e0c278f32a3ac6b7 |
| MD5 | 6a3b13ff6659fcfd310bb481a440f913 |
| BLAKE2b-256 | 46cf20bd9e1184e9d99c85d1aa48811099360b8a412cbc2c66c07e879ba80939 |

File details

munch_bench-0.1.0-py3-none-any.whl

  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 85d548053179cc4ab597faf2a4a0e219155bfde695ed3bf0c5107ead16b3e823 |
| MD5 | 458af5ae66ee6dd38f87b2e9cdcf23b7 |
| BLAKE2b-256 | 181e9f2175379162eaea4e3a7ba1840e2a3a1c325da0e3f934546c5ed5fabdbc |
