Rigorous evaluation tool for comparing LLM coding agents
VibeLab
⚠️ ALPHA RELEASE - USE WITH CAUTION
This project is in alpha and under active development. Breaking changes are expected. Use at your own risk.
A rigorous evaluation tool for comparing LLM coding agents.
Overview
VibeLab helps software engineers evaluate and compare LLM coding agents (Claude Code, OpenAI Codex, Cursor, Gemini CLI) through controlled experiments. Instead of ad-hoc "vibe checks," get reproducible, comparable results across different agent configurations.
Features
- Comparative Runs: Test the same task across multiple agents side-by-side
- Datasets: Organize scenarios into collections for batch evaluation
- Result Tracking: Persistent storage of all runs with code diffs, logs, and metrics
- Human Feedback: Add notes and quality scores (Perfect/Good/Workable/Bad) to evaluate run fitness
- LLM Judges: Automatic graders that mimic human scores using few-shot examples
- Judgements: LLM-generated assessments with alignment scores showing judge-human correlation
- Web Dashboard: Visual comparison interface with diff viewer and dataset analytics
- Extensible: Add new agent harnesses by implementing a simple protocol
Installation
VibeLab can be run directly with uvx without installation:
# Run commands directly with uvx (no installation needed)
uvx vibelab start start-cmd
Or install it permanently:
# Install with uv
uv tool install vibelab
# Or with pip
pip install vibelab
Prerequisites
- Python 3.11+
- Git
- Agent CLIs you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code
# OpenAI Codex
npm install -g @openai/codex
Quick Start
Run a comparison
# Compare Claude Code and Codex on the same task
uvx vibelab run run-cmd \
--code github:owner/repo@main \
--prompt "Add input validation to the login form" \
--executor claude-code:anthropic:sonnet \
--executor openai-codex:openai:gpt-4o
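If you want to script comparisons from Python, one option is to assemble the same command line and hand it to `subprocess`. The sketch below only uses flags shown in this README; the helper name `build_run_command` is hypothetical, and VibeLab's own Python API (if any) is not assumed.

```python
import shlex


def build_run_command(code, prompt, executors, timeout=None, driver=None):
    """Build the argv list for `uvx vibelab run run-cmd` (hypothetical helper)."""
    argv = ["uvx", "vibelab", "run", "run-cmd", "--code", code, "--prompt", prompt]
    for spec in executors:
        # --executor is repeatable: one flag per harness:provider:model spec.
        argv += ["--executor", spec]
    if timeout is not None:
        argv += ["--timeout", str(timeout)]
    if driver is not None:
        argv += ["--driver", driver]
    return argv


if __name__ == "__main__":
    argv = build_run_command(
        "github:owner/repo@main",
        "Add input validation to the login form",
        ["claude-code:anthropic:sonnet", "openai-codex:openai:gpt-4o"],
    )
    # Print the shell-quoted command; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Passing a list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.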
View results
# List recent results
uvx vibelab result list
# View a specific result
uvx vibelab result get <result-id>
# View the code diff
uvx vibelab result diff <result-id>
Launch the web UI
# Production mode (serves built frontend from package)
uvx vibelab start start-cmd
# Development mode (starts frontend dev server)
uvx vibelab start start-cmd --dev
CLI Reference
vibelab run
Execute a scenario against one or more executors.
uvx vibelab run run-cmd \
--code <CODE_REF> \ # github:owner/repo@ref or local:/path
--prompt <TEXT> \ # Task instructions
--executor <SPEC> \ # harness:provider:model (repeatable)
[--timeout <SECONDS>] \ # Default: 1800
[--driver <DRIVER>] # local (default), docker, modal
Options:
- --code: Repository reference. Formats:
  - github:owner/repo - Latest default branch
  - github:owner/repo@branch - Specific branch
  - github:owner/repo#commit - Specific commit
  - local:/path/to/repo - Local directory
- --prompt: Task instructions for the agent
- --executor: Agent specification as harness:provider:model (can be repeated)
- --timeout: Maximum execution time per agent in seconds (default: 1800)
- --driver: Execution driver: local, docker, or modal
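To make the `--code` formats concrete, here is a minimal parsing sketch. `CodeRef` and `parse_code_ref` are hypothetical names for illustration; VibeLab's internal parser may behave differently, especially on edge cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeRef:
    """Hypothetical container for a parsed --code value (not VibeLab's own type)."""
    kind: str                      # "github" or "local"
    target: str                    # "owner/repo" or a filesystem path
    branch: Optional[str] = None
    commit: Optional[str] = None


def parse_code_ref(ref: str) -> CodeRef:
    """Split a --code reference into its parts, following the documented formats."""
    kind, sep, rest = ref.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected '<kind>:<target>', got {ref!r}")
    if kind == "local":
        return CodeRef("local", rest)
    if kind != "github":
        raise ValueError(f"unsupported reference kind: {kind!r}")
    if "#" in rest:                # github:owner/repo#commit
        repo, commit = rest.split("#", 1)
        return CodeRef("github", repo, commit=commit)
    if "@" in rest:                # github:owner/repo@branch
        repo, branch = rest.split("@", 1)
        return CodeRef("github", repo, branch=branch)
    return CodeRef("github", rest)  # latest default branch


if __name__ == "__main__":
    for ref in ("github:owner/repo", "github:owner/repo@main", "local:/path/to/repo"):
        print(parse_code_ref(ref))
```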
vibelab scenario
Manage scenarios (code + prompt combinations).
uvx vibelab scenario create --code <REF> --prompt <TEXT>
uvx vibelab scenario list [--limit N]
uvx vibelab scenario get <ID>
vibelab dataset
Manage datasets (collections of scenarios for batch evaluation).
uvx vibelab dataset create --name <NAME> [--description <TEXT>]
uvx vibelab dataset list [--limit N]
uvx vibelab dataset get <ID>
uvx vibelab dataset delete <ID>
uvx vibelab dataset add-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset remove-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset run --dataset <ID> --executor <SPEC> [--trials N] [--minimal]
Options:
- --trials: Number of runs per scenario-executor pair (default: 1)
- --minimal: Only run scenario-executor pairs that don't have completed results
- --executor: Agent specification as harness:provider:model (can be repeated)
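The interaction between `--trials` and `--minimal` can be pictured as a filter over the scenario-executor matrix. The sketch below is one reading of the option descriptions, not VibeLab's scheduler: it assumes that with `--minimal` a pair with any completed result is skipped entirely. The function name `pending_runs` is hypothetical.

```python
from itertools import product


def pending_runs(scenarios, executors, completed, trials=1, minimal=False):
    """Return the (scenario, executor) pairs to launch, one entry per trial.

    `completed` is a set of (scenario, executor) pairs that already have a
    completed result. With minimal=True those pairs are skipped (an assumption
    about how --minimal behaves).
    """
    runs = []
    for pair in product(scenarios, executors):
        if minimal and pair in completed:
            continue
        runs.extend([pair] * trials)
    return runs


if __name__ == "__main__":
    scenarios = [1, 2]
    executors = ["claude-code:anthropic:sonnet", "openai-codex:openai:gpt-4o"]
    done = {(1, "claude-code:anthropic:sonnet")}
    for run in pending_runs(scenarios, executors, done, minimal=True):
        print(run)
```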
vibelab result
View and filter results, update notes and quality scores.
uvx vibelab result list [--scenario ID] [--executor SPEC] [--status STATUS]
uvx vibelab result get <ID>
uvx vibelab result diff <ID>
uvx vibelab result update-notes <ID> [--notes TEXT] [--clear]
uvx vibelab result update-quality <ID> [--quality 1-4] [--clear]
uvx vibelab result update <ID> [--notes TEXT] [--quality 1-4] [--clear-notes] [--clear-quality]
Filter options:
- --scenario: Filter by scenario ID
- --executor: Filter by executor spec (partial match)
- --status: Filter by status: queued, running, completed, failed, timeout
Update commands:
- update-notes: Add or update notes for a result. Use --notes "-" to read from stdin, or --clear to remove notes.
- update-quality: Set quality score (1=Bad, 2=Workable, 3=Good, 4=Perfect). Use --clear to remove score.
- update: Update both notes and quality in one command.
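When post-processing results in your own scripts, the 1-4 quality scale maps naturally onto an enum. This is a small sketch based only on the score labels above; `Quality` and `parse_quality` are hypothetical names, not part of VibeLab.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Quality scale as documented for `result update-quality`."""
    BAD = 1
    WORKABLE = 2
    GOOD = 3
    PERFECT = 4


def parse_quality(value: str) -> Quality:
    """Accept either a number ("3") or a label ("good"), case-insensitively."""
    text = value.strip()
    if text.isdigit():
        return Quality(int(text))  # raises ValueError outside 1-4
    try:
        return Quality[text.upper()]
    except KeyError:
        raise ValueError(f"unknown quality: {value!r}") from None


if __name__ == "__main__":
    score = parse_quality("good")
    print(int(score), score.name.title())
```

Because `IntEnum` members compare as integers, scores can be sorted or averaged directly.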
vibelab executor
List available agent configurations.
uvx vibelab executor list
uvx vibelab executor list --harness claude-code
uvx vibelab executor list --harness claude-code --provider anthropic
vibelab start
Launch the web server.
uvx vibelab start start-cmd [--port 8000] [--host 127.0.0.1] [--frontend-port 5173] [--dev/--no-dev]
Options:
- --port: Backend server port (default: 8000)
- --host: Backend server host (default: 127.0.0.1)
- --frontend-port: Frontend dev server port (default: 5173, only used with --dev)
- --dev/--no-dev: Development mode with frontend dev server, or production mode serving static files (default: --no-dev, production mode)
Configuration
VibeLab stores data in ~/.vibelab/ by default.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| VIBELAB_HOME | ~/.vibelab | Data directory |
| VIBELAB_DRIVER | local | Default execution driver |
| VIBELAB_TIMEOUT | 1800 | Default timeout (seconds) |
| VIBELAB_LOG_LEVEL | INFO | Logging verbosity |
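For scripts that need to locate VibeLab's data or mirror its defaults, the variables above can be resolved as follows. This is a sketch using the documented names and defaults; the `Settings` class and `load_settings` function are hypothetical, and VibeLab itself may resolve or validate these values differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical holder for the documented environment settings."""
    home: Path
    driver: str
    timeout: int
    log_level: str


def load_settings(env=None) -> Settings:
    """Resolve settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        home=Path(env.get("VIBELAB_HOME", "~/.vibelab")).expanduser(),
        driver=env.get("VIBELAB_DRIVER", "local"),
        timeout=int(env.get("VIBELAB_TIMEOUT", "1800")),
        log_level=env.get("VIBELAB_LOG_LEVEL", "INFO"),
    )


if __name__ == "__main__":
    print(load_settings())
```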
API Keys
Configure API keys for the agents you want to use:
# Claude Code
export ANTHROPIC_API_KEY=sk-ant-...
# OpenAI Codex
export OPENAI_API_KEY=sk-...
# Cursor
export CURSOR_API_KEY=your-cursor-api-key
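Before launching a long comparison, it can help to confirm the relevant keys are set. The sketch below pairs each harness name with the variable shown above; that pairing and the helper name `missing_keys` are assumptions made for illustration.

```python
import os

# Assumed pairing of harness name to API key variable, based on the exports above.
API_KEY_VARS = {
    "claude-code": "ANTHROPIC_API_KEY",
    "openai-codex": "OPENAI_API_KEY",
    "cursor": "CURSOR_API_KEY",
}


def missing_keys(harnesses, env=None):
    """Return the API key variables that are unset or empty for the given harnesses."""
    env = os.environ if env is None else env
    missing = []
    for harness in harnesses:
        var = API_KEY_VARS.get(harness)
        if var is None:
            raise ValueError(f"no known API key variable for harness {harness!r}")
        if not env.get(var):
            missing.append(var)
    return missing


if __name__ == "__main__":
    unset = missing_keys(["claude-code", "openai-codex"])
    print("missing:", ", ".join(unset) if unset else "none")
```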
Supported Agents
| Harness | Provider | Models |
|---|---|---|
| claude-code | anthropic | opus, sonnet, haiku |
| openai-codex | openai | gpt-4o, o3, o4-mini |
| cursor | cursor | composer-1 |
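An executor spec is a harness:provider:model triple, so it can be checked against the table above before a run is queued. The sketch below hard-codes the table as it appears in this README; `SUPPORTED`, `ExecutorSpec`, and `parse_executor` are hypothetical names, and the authoritative list comes from `vibelab executor list`.

```python
from typing import NamedTuple

# Harness/provider/model combinations copied from the Supported Agents table.
SUPPORTED = {
    ("claude-code", "anthropic"): {"opus", "sonnet", "haiku"},
    ("openai-codex", "openai"): {"gpt-4o", "o3", "o4-mini"},
    ("cursor", "cursor"): {"composer-1"},
}


class ExecutorSpec(NamedTuple):
    harness: str
    provider: str
    model: str


def parse_executor(spec: str) -> ExecutorSpec:
    """Parse 'harness:provider:model' and check it against SUPPORTED."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected harness:provider:model, got {spec!r}")
    parsed = ExecutorSpec(*parts)
    models = SUPPORTED.get((parsed.harness, parsed.provider))
    if models is None or parsed.model not in models:
        raise ValueError(f"unsupported executor: {spec!r}")
    return parsed


if __name__ == "__main__":
    print(parse_executor("claude-code:anthropic:sonnet"))
```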
Data Layout
~/.vibelab/
├── data.db                  # SQLite database
└── results/
    └── {result_id}/
        ├── patch.diff       # Git patch of changes
        ├── stdout.log
        ├── stderr.log
        └── harness/         # Harness-specific artifacts
            └── trajectory.json
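Because results are plain files on disk, they can be inspected without going through the CLI. The sketch below builds paths from the layout above; the helper names are hypothetical, and it makes no assumption about the schema of `data.db` or the contents of `trajectory.json`.

```python
from pathlib import Path


def result_files(home, result_id: str) -> dict:
    """Map artifact names to their paths under <home>/results/<result_id>/."""
    base = Path(home) / "results" / result_id
    return {
        "patch": base / "patch.diff",
        "stdout": base / "stdout.log",
        "stderr": base / "stderr.log",
        "trajectory": base / "harness" / "trajectory.json",
    }


def read_patch(home, result_id: str) -> str:
    """Return the patch text, or an empty string if no patch file exists."""
    patch = result_files(home, result_id)["patch"]
    return patch.read_text(encoding="utf-8") if patch.is_file() else ""


if __name__ == "__main__":
    home = Path("~/.vibelab").expanduser()
    # "example-id" is a placeholder; substitute an ID from `vibelab result list`.
    print(read_patch(home, "example-id") or "(no patch found)")
```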
Web UI
The web interface provides:
- Dashboard: Recent scenarios and quick actions
- Scenarios: Table view of all scenarios with metrics
- Datasets: Collections of scenarios for batch evaluation
- Dataset Analytics: Matrix view showing scenario-executor status across all combinations
- Runs: Table view of all runs across scenarios with quality scores
- Executors: Single table view of all executor tuples (harness:provider:model) with filtering, selection, and quick run creation
- Run Creation: Form to configure and launch new comparisons
- Scenario Detail: Table view of all results for a scenario with comparison, judge management, and judgement display
- Compare Mode: Side-by-side comparison of two results
- Result Detail: Full diff viewer, logs, metrics, human notes/quality, and LLM judge judgements
- Judgements: Centralized view of all LLM judge assessments and pending judgements
Launch with:
uvx vibelab start start-cmd
Documentation
- SPEC.md - Product requirements
- PLAN.md - Implementation plan
- DEVELOPMENT.md - Development setup
- AGENTS.md - Instructions for AI coding agents
License
PolyForm Noncommercial 1.0.0 - Free to use for personal and non-commercial purposes. Commercial use and resale are not permitted.