Skip to main content

Rigorous evaluation tool for comparing LLM coding agents

Project description

VibeLab

⚠️ ALPHA RELEASE - USE WITH CAUTION
This project is in alpha and under active development. Breaking changes are expected and will occur. Use at your own risk.

A rigorous evaluation tool for comparing LLM coding agents.

Overview

VibeLab helps software engineers evaluate and compare LLM coding agents (Claude Code, OpenAI Codex, Cursor, Gemini CLI) through controlled experiments. Instead of ad-hoc "vibe checks," get reproducible, comparable results across different agent configurations.

Features

  • Comparative Runs: Test the same task across multiple agents side-by-side
  • Datasets: Organize scenarios into collections for batch evaluation
  • Result Tracking: Persistent storage of all runs with code diffs, logs, and metrics
  • Human Feedback: Add notes and quality scores (Perfect/Good/Workable/Bad) to evaluate run fitness
  • LLM Judges: Automatic graders that mimic human scores using few-shot examples
  • Judgements: LLM-generated assessments with alignment scores showing judge-human correlation
  • Web Dashboard: Visual comparison interface with diff viewer and dataset analytics
  • Extensible: Add new agent harnesses by implementing a simple protocol

Installation

VibeLab can be run directly with uvx without installation:

# Run commands directly with uvx (no installation needed)
uvx vibelab start start-cmd

Or install it permanently:

# Install with uv
uv tool install vibelab

# Or with pip
pip install vibelab

Prerequisites

  • Python 3.11+
  • Git
  • Agent CLIs you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

Quick Start

Run a comparison

# Compare Claude Code and Codex on the same task
uvx vibelab run run-cmd \
  --code github:owner/repo@main \
  --prompt "Add input validation to the login form" \
  --executor claude-code:anthropic:sonnet \
  --executor openai-codex:openai:gpt-4o

View results

# List recent results
uvx vibelab result list

# View a specific result
uvx vibelab result get <result-id>

# View the code diff
uvx vibelab result diff <result-id>

Launch the web UI

# Production mode (serves built frontend from package)
uvx vibelab start start-cmd

# Development mode (starts frontend dev server)
uvx vibelab start start-cmd --dev

# Run multiple worker processes (default: 1)
uvx vibelab start start-cmd --workers 2

# Verbose mode (API access logs + frontend logs)
uvx vibelab start start-cmd --verbose

CLI Reference

vibelab run

Execute a scenario against one or more executors.

uvx vibelab run run-cmd \
  --code <CODE_REF> \           # github:owner/repo@ref or local:/path
  --prompt <TEXT> \             # Task instructions
  --executor <SPEC> \           # harness:provider:model (repeatable)
  [--timeout <SECONDS>] \       # Default: 1800
  [--driver <DRIVER>]           # local (default), docker, modal

Options:

  • --code: Repository reference. Formats:
    • github:owner/repo - Latest default branch
    • github:owner/repo@branch - Specific branch
    • github:owner/repo#commit - Specific commit
    • local:/path/to/repo - Local directory
  • --prompt: Task instructions for the agent
  • --executor: Agent specification as harness:provider:model (can be repeated)
  • --timeout: Maximum execution time per agent in seconds (default: 1800)
  • --driver: Execution driver: local, docker, or modal

vibelab scenario

Manage scenarios (code + prompt combinations).

uvx vibelab scenario create --code <REF> --prompt <TEXT>
uvx vibelab scenario list [--limit N]
uvx vibelab scenario get <ID>

vibelab dataset

Manage datasets (collections of scenarios for batch evaluation).

uvx vibelab dataset create --name <NAME> [--description <TEXT>]
uvx vibelab dataset list [--limit N]
uvx vibelab dataset get <ID>
uvx vibelab dataset delete <ID>
uvx vibelab dataset add-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset remove-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset run --dataset <ID> --executor <SPEC> [--trials N] [--minimal]

Options:

  • --trials: Number of runs per scenario-executor pair (default: 1)
  • --minimal: Only run scenario-executor pairs that don't have completed results
  • --executor: Agent specification as harness:provider:model (can be repeated)

vibelab result

View and filter results, update notes and quality scores.

uvx vibelab result list [--scenario ID] [--executor SPEC] [--status STATUS]
uvx vibelab result get <ID>
uvx vibelab result diff <ID>
uvx vibelab result update-notes <ID> [--notes TEXT] [--clear]
uvx vibelab result update-quality <ID> [--quality 1-4] [--clear]
uvx vibelab result update <ID> [--notes TEXT] [--quality 1-4] [--clear-notes] [--clear-quality]

Filter options:

  • --scenario: Filter by scenario ID
  • --executor: Filter by executor spec (partial match)
  • --status: Filter by status: queued, running, completed, failed, timeout

Update commands:

  • update-notes: Add or update notes for a result. Use --notes "-" to read from stdin, or --clear to remove notes.
  • update-quality: Set quality score (1=Bad, 2=Workable, 3=Good, 4=Perfect). Use --clear to remove score.
  • update: Update both notes and quality in one command.

vibelab executor

List available agent configurations.

uvx vibelab executor list
uvx vibelab executor list --harness claude-code
uvx vibelab executor list --harness claude-code --provider anthropic

vibelab start

Launch the web server.

uvx vibelab start start-cmd [--port 8000] [--host 127.0.0.1] [--frontend-port 5173] [--dev/--no-dev] [--workers 1] [--verbose]

Options:

  • --port: Backend server port (default: 8000)
  • --host: Backend server host (default: 127.0.0.1)
  • --frontend-port: Frontend dev server port (default: 5173, only used with --dev)
  • --dev/--no-dev: Development mode with frontend dev server, or production mode serving static files (default: --no-dev, production mode)
  • --workers: Number of background worker processes (default: 1)
  • --verbose: Verbose logging (includes API request logs and frontend dev-server logs)

Configuration

VibeLab stores data in ~/.vibelab/ by default.

Environment Variables

Variable Default Description
VIBELAB_HOME ~/.vibelab Data directory
VIBELAB_DRIVER local Default execution driver
VIBELAB_TIMEOUT 1800 Default timeout (seconds)
VIBELAB_LOG_LEVEL INFO Logging verbosity
VIBELAB_SQLITE_BUSY_TIMEOUT_MS 5000 SQLite busy timeout in ms

Task Queue

VibeLab uses a durable SQLite-backed task queue so that server restarts do not lose queued work.

  • Inspect tasks:
    • GET /api/tasks
    • GET /api/tasks/stats

Drivers (execution environments)

VibeLab supports multiple execution drivers:

  • local (default): git worktree isolation on the host
  • docker: OCI container execution via Docker
  • orbstack: OCI container execution via OrbStack (macOS)
  • modal: cloud execution via Modal

Installing driver dependencies

Drivers are registered conditionally (graceful degradation). Install extras to enable them:

uv sync --extra dev --extra docker
uv sync --extra dev --extra modal
# or all drivers:
uv sync --extra dev --extra all-drivers

Driver configuration env vars

Variable Example Description
VIBELAB_OCI_RUNTIME docker / orbstack / podman Force runtime selection for OCI drivers
MODAL_TOKEN_ID ... Modal auth (optional if ~/.modal/token.json exists)
MODAL_TOKEN_SECRET ... Modal auth
VIBELAB_CLAUDE_CODE_IMAGE ghcr.io/me/claude-code:latest Override container image
VIBELAB_OPENAI_CODEX_IMAGE ghcr.io/me/openai-codex:latest Override container image
VIBELAB_CURSOR_IMAGE ghcr.io/me/cursor:latest Override container image
VIBELAB_GEMINI_IMAGE ghcr.io/me/gemini:latest Override container image

Container images

Default image names are vibelab/<harness>:latest. These may not exist in your registry; build and tag them yourself (see dockerfiles/) or override via the env vars above.

API Keys

Configure API keys for the agents you want to use:

# Claude Code
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI Codex
export OPENAI_API_KEY=sk-...

# Cursor
export CURSOR_API_KEY=your-cursor-api-key

Supported Agents

Harness Provider Models
claude-code anthropic opus, sonnet, haiku
openai-codex openai gpt-4o, o3, o4-mini
cursor cursor composer-1

Data Layout

~/.vibelab/
├── data.db                    # SQLite database
└── results/
    └── {result_id}/
        ├── patch.diff         # Git patch of changes
        ├── stdout.log
        ├── stderr.log
        └── harness/           # Harness-specific artifacts
            └── trajectory.json

Web UI

The web interface provides:

  • Dashboard: Recent scenarios and quick actions
  • Scenarios: Table view of all scenarios with metrics
  • Datasets: Collections of scenarios for batch evaluation
  • Dataset Analytics: Matrix view showing scenario-executor status across all combinations
  • Runs: Table view of all runs across scenarios with quality scores
  • Executors: Single table view of all executor tuples (harness:provider:model) with filtering, selection, and quick run creation
  • Run Creation: Form to configure and launch new comparisons
  • Scenario Detail: Table view of all results for a scenario with comparison, judge management, and judgement display
  • Compare Mode: Side-by-side comparison of two results
  • Result Detail: Full diff viewer, logs, metrics, human notes/quality, and LLM judge judgements
  • Judgements: Centralized view of all LLM judge assessments and pending judgements

Launch with:

uvx vibelab start start-cmd

Documentation

License

PolyForm Noncommercial 1.0.0 - Free to use for personal and non-commercial purposes. Commercial use and resale are not permitted.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vibelab-0.0.2.tar.gz (26.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vibelab-0.0.2-py3-none-any.whl (111.6 kB view details)

Uploaded Python 3

File details

Details for the file vibelab-0.0.2.tar.gz.

File metadata

  • Download URL: vibelab-0.0.2.tar.gz
  • Upload date:
  • Size: 26.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.2.tar.gz
Algorithm Hash digest
SHA256 83af61693c2381864f8aa21c8eac5e88fbb6d32f1c8bff9d3bdd38605d47634f
MD5 97ae543cc108de497d05da56cd1e7215
BLAKE2b-256 9a45ad3a968f76901df8f03db65702f4635b7b32ae514810ea36b14ab60a9fd6

See more details on using hashes here.

Provenance

The following attestation bundles were made for vibelab-0.0.2.tar.gz:

Publisher: release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vibelab-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: vibelab-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 111.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 71a6aa2743754d72b22fc432c9de8a568e3d801ae050a0f88b71274e968a93cb
MD5 1201d4c7dc95161fbecc76403b4562e7
BLAKE2b-256 1396598c8e1634fb2e8aaf6fbc0eb915caf8dcdf81b95921cf3077f4ebb0bc20

See more details on using hashes here.

Provenance

The following attestation bundles were made for vibelab-0.0.2-py3-none-any.whl:

Publisher: release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page