munch-bench

Retrieval + Inference benchmark for LLM-powered codebase Q&A.

munch-bench measures how well different LLMs answer questions about real codebases when given retrieval-augmented context from jCodeMunch.

What it measures

| Metric | Description |
| --- | --- |
| Retrieval P@5 / P@10 | Fraction of top-k retrieved symbols that match ground truth |
| Retrieval Recall | Fraction of ground-truth symbols found in retrieved context |
| LLM Judge Score | 0–1 accuracy rating by an LLM judge comparing answers to ground truth |
| Exact Match Rate | Whether the answer contains key content from ground truth |
| Wall Time | End-to-end latency (retrieval + inference) |
| Cost | Estimated API cost per run |
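The retrieval metrics above can be computed directly from a ranked list of retrieved symbols and a ground-truth set. Here is a minimal sketch of those definitions (illustrative only; munch-bench's own implementation lives in evaluate.py):

```python
def precision_at_k(retrieved: list[str], ground_truth: set[str], k: int) -> float:
    """Fraction of the top-k retrieved symbols that appear in the ground truth."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for s in top_k if s in ground_truth) / len(top_k)

def recall(retrieved: list[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth symbols found anywhere in the retrieved context."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & set(retrieved)) / len(ground_truth)

# Toy example: 2 of the top 5 retrieved symbols are correct (P@5 = 0.4),
# and 2 of the 3 ground-truth symbols were retrieved (recall ~ 0.667).
retrieved = ["Flask.route", "Blueprint", "url_for", "render_template", "g"]
truth = {"Flask.route", "url_for", "current_app"}
print(precision_at_k(retrieved, truth, 5))  # -> 0.4
print(recall(retrieved, truth))             # -> about 0.667
```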

Corpus

141 questions across 14 repos spanning Python, C++, JavaScript, TypeScript, Go, Rust, and Java:

| Repo | Questions | Languages |
| --- | --- | --- |
| pallets/flask | 11 | Python |
| tiangolo/fastapi | 10 | Python |
| django/django | 10 | Python |
| psf/requests | 10 | Python |
| langchain-ai/langchain | 10 | Python |
| pytorch/pytorch | 10 | Python/C++ |
| expressjs/express | 10 | JavaScript |
| facebook/react | 10 | JavaScript |
| vercel/next.js | 10 | TypeScript |
| vuejs/core | 10 | TypeScript |
| gin-gonic/gin | 10 | Go |
| tokio-rs/axum | 10 | Rust |
| spring-projects/spring-boot | 10 | Java |
| jgravelle/jcodemunch-mcp | 10 | Python |

Questions are categorized by difficulty (easy/medium/hard) and type (api/architecture/debugging/refactoring).

Quick start

```
pip install munch-bench
```

Prerequisites

1. Index the repos you want to benchmark against using jCodeMunch:

   ```
   jcodemunch-mcp index pallets/flask
   jcodemunch-mcp index tiangolo/fastapi
   # ... etc
   ```

2. Set API keys for your chosen provider:

   ```
   export GROQ_API_KEY=gsk_...          # for Groq
   export OPENAI_API_KEY=sk-...         # for OpenAI
   export ANTHROPIC_API_KEY=sk-ant-...  # for Anthropic
   ```

Run a benchmark

```
# Run with Groq (the default; fastest and cheapest)
munch-bench run --provider groq

# Run with a specific model
munch-bench run --provider groq --model llama-3.3-70b-versatile

# Run with OpenAI
munch-bench run --provider openai --model gpt-4o-mini

# Run with Anthropic
munch-bench run --provider anthropic --model claude-sonnet-4-6

# Filter to specific repos or difficulty
munch-bench run --provider groq --repo pallets/flask --difficulty hard

# Custom token budget for retrieval
munch-bench run --provider groq --token-budget 16000 -v
```

Results are saved to `results/<provider>_<model>_<date>.json`.
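The exact schema of a saved results file is not documented here, so the field names in the post-processing sketch below (`questions`, `llm_judge_score`, `wall_time_s`) are assumptions for illustration, not the real format:

```python
import statistics

# Hypothetical shape of one saved run; the field names are assumptions,
# not munch-bench's actual JSON schema.
run = {
    "provider": "groq",
    "model": "llama-3.3-70b-versatile",
    "questions": [
        {"id": "flask-001", "llm_judge_score": 0.9, "wall_time_s": 3.1},
        {"id": "flask-002", "llm_judge_score": 0.6, "wall_time_s": 4.4},
    ],
}

# Aggregate per-question judge scores into a single run-level number.
scores = [q["llm_judge_score"] for q in run["questions"]]
print(f'{run["provider"]}/{run["model"]}: mean judge score {statistics.mean(scores):.2f}')
```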

Compare runs and generate leaderboard

```
# Compare multiple runs
munch-bench compare results/*.json -o leaderboard.html

# View corpus statistics
munch-bench corpus-stats
```

Leaderboard

The leaderboard is a static HTML page with interactive Chart.js visualizations, deployed to GitHub Pages on every push to main.

Architecture

```
munch-bench/
  corpus/              # YAML question files (one per repo)
  src/munch_bench/
    cli.py             # CLI entrypoint: run, compare, corpus-stats
    corpus.py          # YAML corpus loader + filtering
    retrieval.py       # jCodeMunch retrieval wrapper
    inference.py       # Provider dispatch (Groq, OpenAI, Anthropic)
    evaluate.py        # Metrics: P@k, recall, exact match, LLM judge
    runner.py          # Orchestrator with rich progress
    leaderboard.py     # Static HTML + Chart.js generator
  tests/
  results/             # JSON benchmark outputs (gitignored)
```
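The "provider dispatch" that inference.py is described as doing can be pictured as a registry mapping provider names to callables. The sketch below illustrates that pattern only; the names are assumptions and the stub functions stand in for real SDK calls, not the package's actual code:

```python
from typing import Callable

# (prompt, model) -> answer text. Real implementations would call the
# Groq/OpenAI/Anthropic SDKs; these stubs just echo their inputs.
Provider = Callable[[str, str], str]

def _call_groq(prompt: str, model: str) -> str:
    return f"[groq:{model}] {prompt}"

def _call_openai(prompt: str, model: str) -> str:
    return f"[openai:{model}] {prompt}"

PROVIDERS: dict[str, Provider] = {
    "groq": _call_groq,
    "openai": _call_openai,
}

def infer(provider: str, model: str, prompt: str) -> str:
    """Dispatch to the registered provider, failing loudly on unknown names."""
    try:
        backend = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return backend(prompt, model)

print(infer("groq", "llama-3.3-70b-versatile", "How does routing work?"))
```

Keeping the registry as plain data makes adding a provider a one-line change and lets the CLI validate `--provider` against `PROVIDERS` up front.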

Adding questions

Create a YAML file in corpus/:

```yaml
repo: owner/name
questions:
  - id: unique-id-001
    question: "How does X work?"
    ground_truth_answer: "X works by..."
    ground_truth_symbols: ["function_name", "ClassName"]
    difficulty: medium      # easy, medium, hard
    category: architecture  # api, architecture, debugging, refactoring
    tags: [optional, tags]
```
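A new question entry can be sanity-checked against the fields shown above. This validator is a sketch written for this page (munch-bench's real loader is corpus.py) and operates on an already-parsed entry:

```python
# Allowed values taken from the field comments in the YAML example above.
ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}
ALLOWED_CATEGORY = {"api", "architecture", "debugging", "refactoring"}
REQUIRED = ("id", "question", "ground_truth_answer", "ground_truth_symbols")

def validate_question(q: dict) -> list[str]:
    """Return a list of problems with one parsed question entry (empty = OK)."""
    errors = [f"missing '{field}'" for field in REQUIRED if field not in q]
    if q.get("difficulty") not in ALLOWED_DIFFICULTY:
        errors.append(f"bad difficulty: {q.get('difficulty')!r}")
    if q.get("category") not in ALLOWED_CATEGORY:
        errors.append(f"bad category: {q.get('category')!r}")
    return errors

ok = {
    "id": "unique-id-001",
    "question": "How does X work?",
    "ground_truth_answer": "X works by...",
    "ground_truth_symbols": ["function_name", "ClassName"],
    "difficulty": "medium",
    "category": "architecture",
}
print(validate_question(ok))  # -> []
```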

License

Apache 2.0


Powered by jCodeMunch + Groq
