Skip to main content

Rigorous evaluation tool for comparing LLM coding agents

Project description

VibeLab

⚠️ ALPHA RELEASE - USE WITH CAUTION
This project is in alpha and under active development. Breaking changes are expected and will occur. Use at your own risk.

A rigorous evaluation tool for comparing LLM coding agents.

Overview

VibeLab helps software engineers evaluate and compare LLM coding agents (Claude Code, OpenAI Codex, Cursor, Gemini CLI) through controlled experiments. Instead of ad-hoc "vibe checks," get reproducible, comparable results across different agent configurations.

Features

  • Comparative Runs: Test the same task across multiple agents side-by-side
  • Datasets: Organize scenarios into collections for batch evaluation
  • Result Tracking: Persistent storage of all runs with code diffs, logs, and metrics
  • Human Feedback: Add notes and quality scores (Perfect/Good/Workable/Bad) to evaluate run fitness
  • LLM Judges: Automatic graders that mimic human scores using few-shot examples
  • Judgements: LLM-generated assessments with alignment scores showing judge-human correlation
  • Web Dashboard: Visual comparison interface with diff viewer and dataset analytics
  • Extensible: Add new agent harnesses by implementing a simple protocol

Installation

VibeLab can be run directly with uvx without installation:

# Run commands directly with uvx (no installation needed)
uvx vibelab start start-cmd

Or install it permanently:

# Install with uv
uv tool install vibelab

# Or with pip
pip install vibelab

Prerequisites

  • Python 3.11+
  • Git
  • Agent CLIs you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

Quick Start

Run a comparison

# Compare Claude Code and Codex on the same task
uvx vibelab run run-cmd \
  --code github:owner/repo@main \
  --prompt "Add input validation to the login form" \
  --executor claude-code:anthropic:sonnet \
  --executor openai-codex:openai:gpt-4o

View results

# List recent results
uvx vibelab result list

# View a specific result
uvx vibelab result get <result-id>

# View the code diff
uvx vibelab result diff <result-id>

Launch the web UI

# Production mode (serves built frontend from package)
uvx vibelab start start-cmd

# Development mode (starts frontend dev server)
uvx vibelab start start-cmd --dev

# Run multiple worker processes (default: 1)
uvx vibelab start start-cmd --workers 2

# Verbose mode (API access logs + frontend logs)
uvx vibelab start start-cmd --verbose

CLI Reference

vibelab run

Execute a scenario against one or more executors.

uvx vibelab run run-cmd \
  --code <CODE_REF> \           # github:owner/repo@ref or local:/path
  --prompt <TEXT> \             # Task instructions
  --executor <SPEC> \           # harness:provider:model (repeatable)
  [--timeout <SECONDS>] \       # Default: 1800
  [--driver <DRIVER>]           # local (default), docker, modal

Options:

  • --code: Repository reference. Formats:
    • github:owner/repo - Latest default branch
    • github:owner/repo@branch - Specific branch
    • github:owner/repo#commit - Specific commit
    • local:/path/to/repo - Local directory
  • --prompt: Task instructions for the agent
  • --executor: Agent specification as harness:provider:model (can be repeated)
  • --timeout: Maximum execution time per agent in seconds (default: 1800)
  • --driver: Execution driver: local, docker, or modal

vibelab scenario

Manage scenarios (code + prompt combinations).

uvx vibelab scenario create --code <REF> --prompt <TEXT>
uvx vibelab scenario list [--limit N]
uvx vibelab scenario get <ID>

vibelab dataset

Manage datasets (collections of scenarios for batch evaluation).

uvx vibelab dataset create --name <NAME> [--description <TEXT>]
uvx vibelab dataset list [--limit N]
uvx vibelab dataset get <ID>
uvx vibelab dataset delete <ID>
uvx vibelab dataset add-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset remove-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset run --dataset <ID> --executor <SPEC> [--trials N] [--minimal]

Options:

  • --trials: Number of runs per scenario-executor pair (default: 1)
  • --minimal: Only run scenario-executor pairs that don't have completed results
  • --executor: Agent specification as harness:provider:model (can be repeated)

vibelab result

View and filter results, update notes and quality scores.

uvx vibelab result list [--scenario ID] [--executor SPEC] [--status STATUS]
uvx vibelab result get <ID>
uvx vibelab result diff <ID>
uvx vibelab result update-notes <ID> [--notes TEXT] [--clear]
uvx vibelab result update-quality <ID> [--quality 1-4] [--clear]
uvx vibelab result update <ID> [--notes TEXT] [--quality 1-4] [--clear-notes] [--clear-quality]

Filter options:

  • --scenario: Filter by scenario ID
  • --executor: Filter by executor spec (partial match)
  • --status: Filter by status: queued, running, completed, failed, timeout

Update commands:

  • update-notes: Add or update notes for a result. Use --notes "-" to read from stdin, or --clear to remove notes.
  • update-quality: Set quality score (1=Bad, 2=Workable, 3=Good, 4=Perfect). Use --clear to remove score.
  • update: Update both notes and quality in one command.

vibelab executor

List available agent configurations.

uvx vibelab executor list
uvx vibelab executor list --harness claude-code
uvx vibelab executor list --harness claude-code --provider anthropic

vibelab start

Launch the web server.

uvx vibelab start start-cmd [--port 8000] [--host 127.0.0.1] [--frontend-port 5173] [--dev/--no-dev] [--workers 1] [--verbose]

Options:

  • --port: Backend server port (default: 8000)
  • --host: Backend server host (default: 127.0.0.1)
  • --frontend-port: Frontend dev server port (default: 5173, only used with --dev)
  • --dev/--no-dev: Development mode with frontend dev server, or production mode serving static files (default: --no-dev, production mode)
  • --workers: Number of background worker processes (default: 1)
  • --verbose: Verbose logging (includes API request logs and frontend dev-server logs)

Configuration

VibeLab stores data in ~/.vibelab/ by default.

Environment Variables

Variable Default Description
VIBELAB_HOME ~/.vibelab Data directory
VIBELAB_DRIVER local Default execution driver
VIBELAB_TIMEOUT 1800 Default timeout (seconds)
VIBELAB_LOG_LEVEL INFO Logging verbosity
VIBELAB_SQLITE_BUSY_TIMEOUT_MS 5000 SQLite busy timeout in ms

Task Queue

VibeLab uses a durable SQLite-backed task queue so that server restarts do not lose queued work.

  • Inspect tasks:
    • GET /api/tasks
    • GET /api/tasks/stats

Drivers (execution environments)

VibeLab supports multiple execution drivers:

  • local (default): git worktree isolation on the host
  • docker: OCI container execution via Docker
  • orbstack: OCI container execution via OrbStack (macOS)
  • modal: cloud execution via Modal

Installing driver dependencies

Drivers are registered conditionally (graceful degradation). Install extras to enable them:

uv sync --extra dev --extra docker
uv sync --extra dev --extra modal
# or all drivers:
uv sync --extra dev --extra all-drivers

Driver configuration env vars

Variable Example Description
VIBELAB_OCI_RUNTIME docker / orbstack / podman Force runtime selection for OCI drivers
MODAL_TOKEN_ID ... Modal auth (optional if ~/.modal/token.json exists)
MODAL_TOKEN_SECRET ... Modal auth
VIBELAB_CLAUDE_CODE_IMAGE ghcr.io/me/claude-code:latest Override container image
VIBELAB_OPENAI_CODEX_IMAGE ghcr.io/me/openai-codex:latest Override container image
VIBELAB_CURSOR_IMAGE ghcr.io/me/cursor:latest Override container image
VIBELAB_GEMINI_IMAGE ghcr.io/me/gemini:latest Override container image

Container images

Default image names are vibelab/<harness>:latest. These may not exist in your registry; build and tag them yourself (see dockerfiles/) or override via the env vars above.

API Keys

Configure API keys for the agents you want to use:

# Claude Code
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI Codex
export OPENAI_API_KEY=sk-...

# Cursor
export CURSOR_API_KEY=your-cursor-api-key

Supported Agents

Harness Provider Models
claude-code anthropic opus, sonnet, haiku
openai-codex openai gpt-4o, o3, o4-mini
cursor cursor composer-1

Data Layout

~/.vibelab/
├── data.db                    # SQLite database
└── results/
    └── {result_id}/
        ├── patch.diff         # Git patch of changes
        ├── stdout.log
        ├── stderr.log
        └── harness/           # Harness-specific artifacts
            └── trajectory.json

Web UI

The web interface provides:

  • Dashboard: Recent scenarios and quick actions
  • Scenarios: Table view of all scenarios with metrics
  • Datasets: Collections of scenarios for batch evaluation
  • Dataset Analytics: Matrix view showing scenario-executor status across all combinations
  • Runs: Table view of all runs across scenarios with quality scores
  • Executors: Single table view of all executor tuples (harness:provider:model) with filtering, selection, and quick run creation
  • Run Creation: Form to configure and launch new comparisons
  • Scenario Detail: Table view of all results for a scenario with comparison, judge management, and judgement display
  • Compare Mode: Side-by-side comparison of two results
  • Result Detail: Full diff viewer, logs, metrics, human notes/quality, and LLM judge judgements
  • Judgements: Centralized view of all LLM judge assessments and pending judgements

Launch with:

uvx vibelab start start-cmd

Documentation

License

PolyForm Noncommercial 1.0.0 - Free to use for personal and non-commercial purposes. Commercial use and resale are not permitted.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vibelab-0.0.7.tar.gz (27.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vibelab-0.0.7-py3-none-any.whl (430.1 kB view details)

Uploaded Python 3

File details

Details for the file vibelab-0.0.7.tar.gz.

File metadata

  • Download URL: vibelab-0.0.7.tar.gz
  • Upload date:
  • Size: 27.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.7.tar.gz
Algorithm Hash digest
SHA256 256ad76f0c18884f6bc9b2420ae760b0eaafcb8dbb8566c3f6b02c9c22f49f9c
MD5 acfe542a92c38890af93dea803c615e5
BLAKE2b-256 c8be26e4a3c5ba1dfd089c4dff93aa785fb8974594091afa12ee04bde7325419

See more details on using hashes here.

Provenance

The following attestation bundles were made for vibelab-0.0.7.tar.gz:

Publisher: bump-and-release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vibelab-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: vibelab-0.0.7-py3-none-any.whl
  • Upload date:
  • Size: 430.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.7-py3-none-any.whl
Algorithm Hash digest
SHA256 5f5a60d8b2fbb60363b5e93785000b4e6d86bf26619c0a4789feb0e4da34d421
MD5 ded61f1afef2af160c65db020f131048
BLAKE2b-256 556179bc154bddbe6869fd0f617f85162a3858f37e35e368362ab6a85ccb7d05

See more details on using hashes here.

Provenance

The following attestation bundles were made for vibelab-0.0.7-py3-none-any.whl:

Publisher: bump-and-release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page