Benchmark AI coding agents against your own codebase. Mine real tasks from repo history, run agents, interpret results.

codeprobe

Mine real tasks from your repo history, run agents against them, and find out which setup actually works best for YOUR code — not someone else's benchmark suite.

Why codeprobe?

Existing benchmarks such as SWE-bench and HumanEval use fixed, public task sets that AI models may have memorized from training data. codeprobe mines tasks from your private repo history, so the resulting benchmark is very unlikely to appear in any model's training data.

Quick Start

pip install codeprobe              # Core (mine + run + interpret)
pip install "codeprobe[stats]"     # + statistical tests (scipy)
pip install "codeprobe[tokens]"    # + exact Copilot token counting (tiktoken)
pip install "codeprobe[all]"       # Everything (quotes keep shells like zsh from expanding the brackets)

cd /path/to/your/repo

codeprobe init          # What do you want to learn?
codeprobe mine .        # Extract tasks from repo history
codeprobe run .         # Run agents against tasks
codeprobe interpret .   # Get recommendations

Commands

Command               Purpose
codeprobe init        Interactive wizard: choose what to compare
codeprobe mine        Mine eval tasks from merged PRs/MRs
codeprobe run         Execute tasks against AI agents
codeprobe interpret   Analyze results, rank configurations
codeprobe assess      Score a codebase's benchmarking potential
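
To make the idea behind the mine step concrete, here is a minimal sketch of how merge commits could be pulled from local history as task candidates. It is not codeprobe's implementation: the MergeCandidate type and both function names are hypothetical, and only standard git log options are used.

```python
import subprocess
from dataclasses import dataclass

FIELD_SEP = "\x1f"  # ASCII unit separator, unlikely to appear in a subject line


@dataclass(frozen=True)
class MergeCandidate:
    """Hypothetical record for one merged change; not a codeprobe type."""
    sha: str      # the merge commit itself
    base: str     # first parent: the target branch just before the merge
    subject: str  # commit subject line


def parse_merge_log(text: str) -> list[MergeCandidate]:
    """Parse lines holding sha, parent shas, and subject joined by FIELD_SEP."""
    candidates = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        sha, parents, subject = line.split(FIELD_SEP, 2)
        parent_list = parents.split()
        if len(parent_list) < 2:
            continue  # fewer than two parents means this is not a merge commit
        candidates.append(MergeCandidate(sha, parent_list[0], subject))
    return candidates


def list_merges(repo: str, limit: int = 50) -> list[MergeCandidate]:
    """Ask git for recent merge commits on the main line of history."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--merges", "--first-parent",
         f"--max-count={limit}", "--format=%H%x1f%P%x1f%s"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_merge_log(out)
```

Diffing base against sha would give the change a task could be built from. Squash-merged or rebased PRs leave no merge commit, so a real miner would also need the hosting platform's PR/MR data to find them.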

Supported Agents

  • Claude Code (--agent claude)
  • GitHub Copilot (--agent copilot)
  • Custom agents via the AgentAdapter protocol
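
The AgentAdapter protocol is named here but its shape is not shown. The sketch below illustrates how a structural protocol of that kind could look using typing.Protocol. The method name run, its parameters, the AgentResult fields, and the EchoAgent class are all assumptions made for illustration, so check the package's own definition before writing a real adapter.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical result record; the field names are assumptions."""
    output: str
    exit_code: int


@runtime_checkable
class AgentAdapter(Protocol):
    """Illustrative stand-in, NOT the protocol shipped by codeprobe."""
    name: str

    def run(self, prompt: str, workdir: Path) -> AgentResult: ...


class EchoAgent:
    """Toy adapter that satisfies the protocol without calling any model."""
    name = "echo"

    def run(self, prompt: str, workdir: Path) -> AgentResult:
        # A real adapter would invoke an agent CLI or API inside workdir.
        return AgentResult(output=f"[{workdir.name}] {prompt}", exit_code=0)


agent: AgentAdapter = EchoAgent()
result = agent.run("fix the failing test", Path("/tmp/repo"))
print(result.output)
```

Because the protocol is structural, EchoAgent never inherits from AgentAdapter; it only needs matching attributes, which is what lets third-party agents plug in without importing a base class.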

Supported Git Hosts

GitHub, GitLab, Bitbucket, Azure DevOps, Gitea/Forgejo, and local repos.

Configuration

Create a .evalrc.yaml in your repo root:

name: my-experiment
agents: [claude, copilot]
models: [claude-sonnet-4-6, claude-opus-4-6]
tasks_dir: .codeprobe/tasks
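
As a rough illustration of what a configuration like this implies, the sketch below mirrors the four keys in a plain Python dict and enumerates agent/model pairs. Whether codeprobe runs the full cross product, or skips pairs an agent cannot serve, is an assumption here. expand_matrix is a hypothetical helper, not part of the package, and treating all four keys as required is also an assumption.

```python
from itertools import product

# Mirrors the YAML above as a plain dict so the sketch has no dependencies.
config = {
    "name": "my-experiment",
    "agents": ["claude", "copilot"],
    "models": ["claude-sonnet-4-6", "claude-opus-4-6"],
    "tasks_dir": ".codeprobe/tasks",
}

# Assumed to be required for the purposes of this sketch only.
REQUIRED_KEYS = ("name", "agents", "models", "tasks_dir")


def expand_matrix(cfg: dict) -> list[tuple[str, str]]:
    """Return every (agent, model) pair implied by the config."""
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    return list(product(cfg["agents"], cfg["models"]))


for agent, model in expand_matrix(config):
    print(f"{config['name']}: {agent} + {model}")
```

With two agents and two models this yields four candidate configurations for each mined task, which is worth keeping in mind when estimating run time and cost.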

License

Apache-2.0
