
Project description

munch-bench

Retrieval + Inference benchmark for LLM-powered codebase Q&A.

munch-bench measures how well different LLMs answer questions about real codebases when given retrieval-augmented context from jCodeMunch.

What it measures

| Metric | Description |
| --- | --- |
| Retrieval P@5 / P@10 | Fraction of the top-k retrieved symbols that match ground truth |
| Retrieval Recall | Fraction of ground-truth symbols found in the retrieved context |
| LLM Judge Score | 0–1 accuracy rating from an LLM judge comparing answers to ground truth |
| Exact Match Rate | Whether the answer contains key content from the ground truth |
| Wall Time | End-to-end latency (retrieval + inference) |
| Cost | Estimated API cost per run |
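The two retrieval metrics above are standard set-overlap measures over symbol names. A minimal sketch (illustrative only, not the actual munch-bench implementation):

```python
def precision_at_k(retrieved, ground_truth, k):
    """Fraction of the top-k retrieved symbols that appear in ground truth."""
    if k <= 0:
        return 0.0
    truth = set(ground_truth)
    hits = sum(1 for s in retrieved[:k] if s in truth)
    return hits / k

def recall(retrieved, ground_truth):
    """Fraction of ground-truth symbols found anywhere in the retrieved set."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(set(retrieved) & truth) / len(truth)

# Hypothetical example: 2 of the top 5 retrieved symbols are correct,
# and 2 of the 3 ground-truth symbols were retrieved.
retrieved = ["Flask.route", "Blueprint", "url_for", "render_template", "g"]
truth = ["Flask.route", "url_for", "current_app"]
print(precision_at_k(retrieved, truth, 5))  # 2/5 -> 0.4
print(recall(retrieved, truth))             # 2/3 -> 0.666...
```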

Corpus

141 questions across 14 repos spanning Python, JavaScript, TypeScript, Go, Rust, Java, and C++:

| Repo | Questions | Languages |
| --- | --- | --- |
| pallets/flask | 11 | Python |
| tiangolo/fastapi | 10 | Python |
| django/django | 10 | Python |
| psf/requests | 10 | Python |
| langchain-ai/langchain | 10 | Python |
| pytorch/pytorch | 10 | Python/C++ |
| expressjs/express | 10 | JavaScript |
| facebook/react | 10 | JavaScript |
| vercel/next.js | 10 | TypeScript |
| vuejs/core | 10 | TypeScript |
| gin-gonic/gin | 10 | Go |
| tokio-rs/axum | 10 | Rust |
| spring-projects/spring-boot | 10 | Java |
| jgravelle/jcodemunch-mcp | 10 | Python |

Questions are categorized by difficulty (easy/medium/hard) and type (api/architecture/debugging/refactoring).

Quick start

pip install munch-bench

Prerequisites

  1. Index the repos you want to benchmark against using jCodeMunch:

    jcodemunch-mcp index pallets/flask
    jcodemunch-mcp index tiangolo/fastapi
    # ... etc
    
  2. Set API keys for your chosen provider:

    export GROQ_API_KEY=gsk_...        # for Groq
    export OPENAI_API_KEY=sk-...       # for OpenAI
    export ANTHROPIC_API_KEY=sk-ant-...  # for Anthropic
    

Run a benchmark

# Run with Groq (default — fastest + cheapest)
munch-bench run --provider groq

# Run with a specific model
munch-bench run --provider groq --model llama-3.3-70b-versatile

# Run with OpenAI
munch-bench run --provider openai --model gpt-4o-mini

# Run with Anthropic
munch-bench run --provider anthropic --model claude-sonnet-4-6

# Filter to specific repos or difficulty
munch-bench run --provider groq --repo pallets/flask --difficulty hard

# Custom token budget for retrieval
munch-bench run --provider groq --token-budget 16000 -v

Results are saved to results/<provider>_<model>_<date>.json.
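Result files are plain JSON, so they are easy to post-process. A hedged sketch of aggregating one run; the field names (`results`, `judge_score`, `wall_time_s`) are assumptions for illustration, so check the schema your run actually produces:

```python
import json
from statistics import mean

# Stand-in for a loaded results file; in practice you would do
# data = json.loads(Path("results/groq_<model>_<date>.json").read_text())
sample = {
    "provider": "groq",
    "model": "llama-3.3-70b-versatile",
    "results": [
        {"id": "flask-001", "judge_score": 0.9, "wall_time_s": 2.1},
        {"id": "flask-002", "judge_score": 0.6, "wall_time_s": 3.4},
    ],
}
data = json.loads(json.dumps(sample))

scores = [r["judge_score"] for r in data["results"]]
print(f"{data['model']}: mean judge score {mean(scores):.2f} "
      f"over {len(scores)} questions")
```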

Compare runs and generate leaderboard

# Compare multiple runs
munch-bench compare results/*.json -o leaderboard.html

# View corpus statistics
munch-bench corpus-stats

Leaderboard

The leaderboard is a static HTML page with interactive Chart.js visualizations, deployed to GitHub Pages on every push to main.

Architecture

munch-bench/
  corpus/              # YAML question files (one per repo)
  src/munch_bench/
    cli.py             # CLI entrypoint: run, compare, corpus-stats
    corpus.py          # YAML corpus loader + filtering
    retrieval.py       # jCodeMunch retrieval wrapper
    inference.py       # Provider dispatch (Groq, OpenAI, Anthropic)
    evaluate.py        # Metrics: P@k, recall, exact match, LLM judge
    runner.py          # Orchestrator with rich progress
    leaderboard.py     # Static HTML + Chart.js generator
  tests/
  results/             # JSON benchmark outputs (gitignored)
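The provider dispatch that `inference.py` suggests can be sketched as a lookup table from provider name to a call function. This is a hypothetical outline, not munch-bench's actual code; the clients are stubbed, and real code would use each provider's SDK:

```python
from typing import Callable, Dict

def _call_groq(model: str, prompt: str) -> str:
    # Stub: real code would invoke the Groq SDK here.
    return f"[groq:{model}] answer"

def _call_openai(model: str, prompt: str) -> str:
    # Stub: real code would invoke the OpenAI SDK here.
    return f"[openai:{model}] answer"

def _call_anthropic(model: str, prompt: str) -> str:
    # Stub: real code would invoke the Anthropic SDK here.
    return f"[anthropic:{model}] answer"

PROVIDERS: Dict[str, Callable[[str, str], str]] = {
    "groq": _call_groq,
    "openai": _call_openai,
    "anthropic": _call_anthropic,
}

def run_inference(provider: str, model: str, prompt: str) -> str:
    """Route a prompt to the chosen provider, failing loudly on typos."""
    try:
        fn = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return fn(model, prompt)

print(run_inference("groq", "llama-3.3-70b-versatile", "How does routing work?"))
```

The table-of-callables shape keeps adding a provider to a one-line change, which matches the CLI's `--provider` flag.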

Adding questions

Create a YAML file in corpus/:

repo: owner/name
questions:
  - id: unique-id-001
    question: "How does X work?"
    ground_truth_answer: "X works by..."
    ground_truth_symbols: ["function_name", "ClassName"]
    difficulty: medium  # easy, medium, hard
    category: architecture  # api, architecture, debugging, refactoring
    tags: [optional, tags]
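Once parsed (e.g. with PyYAML's `safe_load`), a corpus file is a plain dict, and the `--difficulty`-style filtering can be sketched as below. The dict mirrors the YAML schema above; the loader in munch-bench itself lives in `corpus.py`, so details here are illustrative:

```python
# Stand-in for a parsed corpus YAML file.
corpus = {
    "repo": "owner/name",
    "questions": [
        {"id": "q-001", "difficulty": "easy", "category": "api"},
        {"id": "q-002", "difficulty": "hard", "category": "architecture"},
        {"id": "q-003", "difficulty": "hard", "category": "debugging"},
    ],
}

def filter_questions(corpus, difficulty=None, category=None):
    """Return questions matching the given difficulty and/or category."""
    out = []
    for q in corpus["questions"]:
        if difficulty and q["difficulty"] != difficulty:
            continue
        if category and q["category"] != category:
            continue
        out.append(q)
    return out

hard = filter_questions(corpus, difficulty="hard")
print([q["id"] for q in hard])  # ['q-002', 'q-003']
```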

License

Apache 2.0


Powered by jCodeMunch + Groq

Download files

Source Distribution: munch_bench-0.1.0.tar.gz (18.2 kB)

Built Distribution: munch_bench-0.1.0-py3-none-any.whl (17.3 kB)

File details

munch_bench-0.1.0.tar.gz

  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | fd4f911b500193f4572bd4a88404de3d75e83133007aae32e0c278f32a3ac6b7 |
| MD5 | 6a3b13ff6659fcfd310bb481a440f913 |
| BLAKE2b-256 | 46cf20bd9e1184e9d99c85d1aa48811099360b8a412cbc2c66c07e879ba80939 |

File details

munch_bench-0.1.0-py3-none-any.whl

  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 85d548053179cc4ab597faf2a4a0e219155bfde695ed3bf0c5107ead16b3e823 |
| MD5 | 458af5ae66ee6dd38f87b2e9cdcf23b7 |
| BLAKE2b-256 | 181e9f2175379162eaea4e3a7ba1840e2a3a1c325da0e3f934546c5ed5fabdbc |
