Benchmark AI coding agents against your own codebase. Mine real tasks from repo history, run agents, interpret results.
codeprobe
Benchmark AI coding agents against your own codebase.
Mine real tasks from your repo history, run agents against them, and find out which setup actually works best for YOUR code — not someone else's benchmark suite.
Why codeprobe?
Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from their training data. codeprobe mines tasks from your private repo history instead, so the resulting benchmarks cannot have been seen during training as long as that history has never been public.
Quick Start
```bash
pip install codeprobe              # Core (mine + run + interpret)
pip install "codeprobe[stats]"     # + statistical tests (scipy)
pip install "codeprobe[tokens]"    # + exact Copilot token counting (tiktoken)
pip install "codeprobe[all]"       # Everything

cd /path/to/your/repo
codeprobe init                     # What do you want to learn?
codeprobe mine .                   # Extract tasks from repo history
codeprobe run .                    # Run agents against tasks
codeprobe interpret .              # Get recommendations
```
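`codeprobe mine` extracts tasks from merged PRs/MRs. Its internals are not documented here, so the following is only a rough sketch of the idea, not codeprobe's implementation: it reads `git log --merges --pretty=format:%H%x09%s` output and keeps GitHub-style merge commits (subject `Merge pull request #N from owner/branch`), each of which could seed a candidate task. `CandidateTask`, `parse_merge_log`, and `read_merge_log` are hypothetical names.

```python
import re
import subprocess
from dataclasses import dataclass

# GitHub's default merge-commit subject: "Merge pull request #123 from owner/branch"
MERGE_SUBJECT = re.compile(r"^Merge pull request #(\d+) from (\S+)$")


@dataclass(frozen=True)
class CandidateTask:
    """Hypothetical record for one mined task candidate (not a codeprobe type)."""
    merge_sha: str
    pr_number: int
    source_branch: str


def parse_merge_log(log_text: str) -> list[CandidateTask]:
    """Turn "<sha>\t<subject>" lines into candidates, skipping non-PR merges."""
    tasks = []
    for line in log_text.splitlines():
        sha, _, subject = line.partition("\t")
        match = MERGE_SUBJECT.match(subject.strip())
        if match:  # ignore merges that don't follow the GitHub PR convention
            tasks.append(CandidateTask(sha, int(match.group(1)), match.group(2)))
    return tasks


def read_merge_log(repo: str = ".") -> str:
    """Run git in `repo`; %x09 makes git emit a tab between hash and subject."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--merges", "--pretty=format:%H%x09%s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


sample = (
    "a1b2c3d\tMerge pull request #42 from octo/fix-parser\n"
    "e4f5a6b\tMerge branch 'main' into feature\n"
)
print(parse_merge_log(sample))
# [CandidateTask(merge_sha='a1b2c3d', pr_number=42, source_branch='octo/fix-parser')]
```

A real miner would also have to cope with squash merges (which `--merges` does not list) and with other hosts' conventions, such as GitLab's default `Merge branch '...' into '...'` subject; this sketch ignores both.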
Commands
| Command | Purpose |
|---|---|
| `codeprobe init` | Interactive wizard: choose what to compare |
| `codeprobe mine` | Mine eval tasks from merged PRs/MRs |
| `codeprobe run` | Execute tasks against AI agents |
| `codeprobe interpret` | Analyze results, rank configurations |
| `codeprobe assess` | Score a codebase's benchmarking potential |
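`codeprobe interpret` ranks configurations, and the `[stats]` extra adds scipy for statistical tests, but which tests it runs is not stated here. One reasonable approach, shown below as an assumption rather than codeprobe's method: rank by pass rate, then, because two configurations attempt the same tasks, treat their outcomes as paired and apply an exact McNemar test to the tasks where they disagree. `rank_by_pass_rate` and `mcnemar_exact_p` are hypothetical helpers using only the standard library.

```python
from math import comb


def rank_by_pass_rate(results: dict[str, dict[str, bool]]) -> list[tuple[str, float]]:
    """Rank configurations by the fraction of their tasks that passed, best first.

    `results` maps a configuration label to {task_id: passed}.
    """
    rates = {
        config: sum(outcomes.values()) / len(outcomes)
        for config, outcomes in results.items()
        if outcomes  # skip configurations with no recorded tasks
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def mcnemar_exact_p(a: dict[str, bool], b: dict[str, bool]) -> float:
    """Two-sided exact McNemar p-value over the tasks both configurations ran."""
    shared = a.keys() & b.keys()
    only_a = sum(1 for task in shared if a[task] and not b[task])  # A passed, B failed
    only_b = sum(1 for task in shared if b[task] and not a[task])  # B passed, A failed
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant tasks, so no evidence either way
    # Under H0 each discordant task favors A or B with probability 0.5.
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


results = {
    "config-a": {"t1": True, "t2": True, "t3": False, "t4": True},
    "config-b": {"t1": True, "t2": False, "t3": False, "t4": False},
}
print(rank_by_pass_rate(results))  # [('config-a', 0.75), ('config-b', 0.25)]
print(mcnemar_exact_p(results["config-a"], results["config-b"]))  # 0.5
```

The p-value of 0.5 is the point of the exercise: a 75% vs 25% pass rate over four tasks rests on only two disagreements and is not yet evidence that one setup is better. With scipy installed, `scipy.stats.binomtest(k, n, 0.5).pvalue` should give the same two-sided binomial p-value.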
Supported Agents
- Claude Code (`--agent claude`)
- GitHub Copilot (`--agent copilot`)
- Custom agents via the `AgentAdapter` protocol
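The methods of codeprobe's actual `AgentAdapter` protocol are not described here, so the sketch below does not reproduce it. It only illustrates the general `typing.Protocol` pattern such an adapter relies on: any class with matching members satisfies the protocol without inheriting from it. `AgentLike`, its `name` attribute and `solve(prompt, workdir)` method, `AgentResult`, and `EchoAgent` are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class AgentResult:
    """Hypothetical outcome of one agent attempt."""
    output: str
    exit_code: int


@runtime_checkable
class AgentLike(Protocol):
    """Hypothetical stand-in for an adapter protocol (not codeprobe's real one)."""
    name: str

    def solve(self, prompt: str, workdir: Path) -> AgentResult:
        """Attempt the task described by `prompt` inside `workdir`."""
        ...


class EchoAgent:
    """Toy adapter: matches AgentLike structurally, without subclassing it."""
    name = "echo"

    def solve(self, prompt: str, workdir: Path) -> AgentResult:
        # A real adapter would launch the agent's CLI here, e.g. via subprocess.
        return AgentResult(output=f"{self.name} saw: {prompt}", exit_code=0)


def run_agent(agent: AgentLike, prompt: str, workdir: Path) -> AgentResult:
    """Harness code depends only on the protocol, never on concrete classes."""
    return agent.solve(prompt, workdir)


agent = EchoAgent()
print(isinstance(agent, AgentLike))  # True: checks that `name` and `solve` exist
print(run_agent(agent, "fix the failing test", Path(".")).output)  # echo saw: fix the failing test
```

Note that `runtime_checkable` only checks that the members exist, not their signatures; a static type checker is what enforces the method types.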
Supported Git Hosts
GitHub, GitLab, Bitbucket, Azure DevOps, Gitea/Forgejo, and local repos.
Configuration
Create a `.evalrc.yaml` in your repo root:

```yaml
name: my-experiment
agents: [claude, copilot]
models: [claude-sonnet-4-6, claude-opus-4-6]
tasks_dir: .codeprobe/tasks
```
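Once the YAML is parsed (for example with PyYAML's `yaml.safe_load`, which returns a plain dict), a typed view catches misspelled keys early. This is a hypothetical sketch: the `EvalConfig` class, treating all four keys as required, and the assumption that every agent is paired with every model are illustrative, not documented codeprobe behavior.

```python
from dataclasses import dataclass
from itertools import product

# Keys taken from the example .evalrc.yaml; treating all four as required is an assumption.
REQUIRED_KEYS = {"name", "agents", "models", "tasks_dir"}


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical typed view of a parsed .evalrc.yaml (not a codeprobe class)."""
    name: str
    agents: tuple[str, ...]
    models: tuple[str, ...]
    tasks_dir: str

    @classmethod
    def from_dict(cls, raw: dict) -> "EvalConfig":
        missing = REQUIRED_KEYS - set(raw)
        unknown = set(raw) - REQUIRED_KEYS
        if missing or unknown:  # fail loudly on typos such as "agent:" for "agents:"
            raise ValueError(f"missing keys {sorted(missing)}, unknown keys {sorted(unknown)}")
        return cls(
            name=str(raw["name"]),
            agents=tuple(raw["agents"]),
            models=tuple(raw["models"]),
            tasks_dir=str(raw["tasks_dir"]),
        )

    def configurations(self) -> list[tuple[str, str]]:
        # ASSUMPTION: every agent is paired with every model.
        return list(product(self.agents, self.models))


raw = {  # what yaml.safe_load would return for the example file
    "name": "my-experiment",
    "agents": ["claude", "copilot"],
    "models": ["claude-sonnet-4-6", "claude-opus-4-6"],
    "tasks_dir": ".codeprobe/tasks",
}
config = EvalConfig.from_dict(raw)
print(len(config.configurations()))  # 4
```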
License
Apache-2.0
Download files
File details
Details for the file codeprobe-0.1.0a1.tar.gz.
File metadata
- Download URL: codeprobe-0.1.0a1.tar.gz
- Upload date:
- Size: 175.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71f3db4bd13d39c730428b564f7778ae8cf0004edebc202cbf84fefc066ca7a6 |
| MD5 | 62e5c80e429bb217a6b22c1da35215dd |
| BLAKE2b-256 | f45d7e9b32fcbc7cf9b1b8d96036dd7b011ad3bbeb0caf0d2f9c2f1fa0627697 |
File details
Details for the file codeprobe-0.1.0a1-py3-none-any.whl.
File metadata
- Download URL: codeprobe-0.1.0a1-py3-none-any.whl
- Upload date:
- Size: 132.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3470803c337d34c50952f1c0110c5eaae678d3755b120dc1fa4c2df51b311aba |
| MD5 | c4385ccc4c8a9f3dc087eb27b09bcdf2 |
| BLAKE2b-256 | 2ed5e1e20745da01b8511594cb65de6eb69d275700fd49748ce00bc1a11978f2 |