Rigorous evaluation tool for comparing LLM coding agents

VibeLab

⚠️ ALPHA RELEASE - USE WITH CAUTION
This project is in alpha and under active development. Breaking changes are expected and will occur. Use at your own risk.

A rigorous evaluation tool for comparing LLM coding agents.

Overview

VibeLab helps software engineers evaluate and compare LLM coding agents (Claude Code, OpenAI Codex, Cursor, Gemini CLI) through controlled experiments. Instead of ad-hoc "vibe checks," get reproducible, comparable results across different agent configurations.

Features

  • Comparative Runs: Test the same task across multiple agents side-by-side
  • Datasets: Organize scenarios into collections for batch evaluation
  • Result Tracking: Persistent storage of all runs with code diffs, logs, and metrics
  • Human Feedback: Add notes and quality scores (Perfect/Good/Workable/Bad) to evaluate run fitness
  • LLM Judges: Automatic graders that mimic human scores using few-shot examples
  • Judgements: LLM-generated assessments with alignment scores showing judge-human correlation
  • Web Dashboard: Visual comparison interface with diff viewer and dataset analytics
  • Extensible: Add new agent harnesses by implementing a simple protocol

Installation

VibeLab can be run directly with uvx without installation:

# Run commands directly with uvx (no installation needed)
uvx vibelab start start-cmd

Or install it permanently:

# Install with uv
uv tool install vibelab

# Or with pip
pip install vibelab

Prerequisites

  • Python 3.11+
  • Git
  • Agent CLIs you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

Quick Start

Run a comparison

# Compare Claude Code and Codex on the same task
uvx vibelab run run-cmd \
  --code github:owner/repo@main \
  --prompt "Add input validation to the login form" \
  --executor claude-code:anthropic:sonnet \
  --executor openai-codex:openai:gpt-4o
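When comparisons are launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; `build_run_command` is a hypothetical helper and not part of VibeLab.

```python
import shlex


def build_run_command(code: str, prompt: str, executors: list[str]) -> list[str]:
    """Build the argv for the `vibelab run run-cmd` invocation shown above."""
    if not executors:
        raise ValueError("at least one executor spec is required")
    argv = ["uvx", "vibelab", "run", "run-cmd", "--code", code, "--prompt", prompt]
    for spec in executors:
        argv += ["--executor", spec]  # --executor may be repeated
    return argv


argv = build_run_command(
    "github:owner/repo@main",
    "Add input validation to the login form",
    ["claude-code:anthropic:sonnet", "openai-codex:openai:gpt-4o"],
)
print(shlex.join(argv))  # shell-quoted form, handy for logging
# To execute it: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the prompt contains spaces or quotes.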

View results

# List recent results
uvx vibelab result list

# View a specific result
uvx vibelab result get <result-id>

# View the code diff
uvx vibelab result diff <result-id>

Launch the web UI

# Production mode (serves built frontend from package)
uvx vibelab start start-cmd

# Development mode (starts frontend dev server)
uvx vibelab start start-cmd --dev

CLI Reference

vibelab run

Execute a scenario against one or more executors.

#   --code      github:owner/repo@ref or local:/path
#   --prompt    Task instructions
#   --executor  harness:provider:model (repeatable)
#   --timeout   Default: 1800
#   --driver    local (default), docker, modal
uvx vibelab run run-cmd \
  --code <CODE_REF> \
  --prompt <TEXT> \
  --executor <SPEC> \
  [--timeout <SECONDS>] \
  [--driver <DRIVER>]

Options:

  • --code: Repository reference. Formats:
    • github:owner/repo - Latest default branch
    • github:owner/repo@branch - Specific branch
    • github:owner/repo#commit - Specific commit
    • local:/path/to/repo - Local directory
  • --prompt: Task instructions for the agent
  • --executor: Agent specification as harness:provider:model (can be repeated)
  • --timeout: Maximum execution time per agent in seconds (default: 1800)
  • --driver: Execution driver: local, docker, or modal

vibelab scenario

Manage scenarios (code + prompt combinations).

uvx vibelab scenario create --code <REF> --prompt <TEXT>
uvx vibelab scenario list [--limit N]
uvx vibelab scenario get <ID>

vibelab dataset

Manage datasets (collections of scenarios for batch evaluation).

uvx vibelab dataset create --name <NAME> [--description <TEXT>]
uvx vibelab dataset list [--limit N]
uvx vibelab dataset get <ID>
uvx vibelab dataset delete <ID>
uvx vibelab dataset add-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset remove-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset run --dataset <ID> --executor <SPEC> [--trials N] [--minimal]

Options:

  • --trials: Number of runs per scenario-executor pair (default: 1)
  • --minimal: Only run scenario-executor pairs that don't have completed results
  • --executor: Agent specification as harness:provider:model (can be repeated)

vibelab result

View and filter results, update notes and quality scores.

uvx vibelab result list [--scenario ID] [--executor SPEC] [--status STATUS]
uvx vibelab result get <ID>
uvx vibelab result diff <ID>
uvx vibelab result update-notes <ID> [--notes TEXT] [--clear]
uvx vibelab result update-quality <ID> [--quality 1-4] [--clear]
uvx vibelab result update <ID> [--notes TEXT] [--quality 1-4] [--clear-notes] [--clear-quality]

Filter options:

  • --scenario: Filter by scenario ID
  • --executor: Filter by executor spec (partial match)
  • --status: Filter by status: queued, running, completed, failed, timeout

Update commands:

  • update-notes: Add or update notes for a result. Use --notes "-" to read from stdin, or --clear to remove notes.
  • update-quality: Set quality score (1=Bad, 2=Workable, 3=Good, 4=Perfect). Use --clear to remove score.
  • update: Update both notes and quality in one command.
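Scripts that post-process results may find it convenient to give the 1-4 quality scale named values. A minimal sketch, using the mapping stated above; the `Quality` enum and `parse_quality` are hypothetical, and accepting labels such as "good" is a convenience of this example, not a documented CLI feature.

```python
from enum import IntEnum


class Quality(IntEnum):
    BAD = 1
    WORKABLE = 2
    GOOD = 3
    PERFECT = 4


def parse_quality(value: str) -> Quality:
    """Accept either a number ("3") or a label ("good")."""
    text = value.strip()
    if text.isdigit():
        return Quality(int(text))  # raises ValueError outside 1-4
    try:
        return Quality[text.upper()]
    except KeyError:
        raise ValueError(f"unknown quality: {value!r}") from None
```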

vibelab executor

List available agent configurations.

uvx vibelab executor list
uvx vibelab executor list --harness claude-code
uvx vibelab executor list --harness claude-code --provider anthropic

vibelab start

Launch the web server.

uvx vibelab start start-cmd [--port 8000] [--host 127.0.0.1] [--frontend-port 5173] [--dev/--no-dev]

Options:

  • --port: Backend server port (default: 8000)
  • --host: Backend server host (default: 127.0.0.1)
  • --frontend-port: Frontend dev server port (default: 5173, only used with --dev)
  • --dev/--no-dev: Development mode with frontend dev server, or production mode serving static files (default: --no-dev, production mode)

Configuration

VibeLab stores data in ~/.vibelab/ by default.

Environment Variables

Variable            Default       Description
VIBELAB_HOME        ~/.vibelab    Data directory
VIBELAB_DRIVER      local         Default execution driver
VIBELAB_TIMEOUT     1800          Default timeout (seconds)
VIBELAB_LOG_LEVEL   INFO          Logging verbosity
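A helper script that needs the same settings can resolve them from the environment with the documented defaults. This is a sketch under assumptions: `Settings` and `load_settings` are hypothetical names, and VibeLab itself may read these variables differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    home: Path
    driver: str
    timeout: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the VIBELAB_* variables, falling back to the documented defaults."""
    return Settings(
        home=Path(env.get("VIBELAB_HOME", "~/.vibelab")).expanduser(),
        driver=env.get("VIBELAB_DRIVER", "local"),
        timeout=int(env.get("VIBELAB_TIMEOUT", "1800")),  # seconds
        log_level=env.get("VIBELAB_LOG_LEVEL", "INFO"),
    )
```

Taking the environment as a parameter keeps the function easy to test with a plain dictionary.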

API Keys

Configure API keys for the agents you want to use:

# Claude Code
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI Codex
export OPENAI_API_KEY=sk-...

# Cursor
export CURSOR_API_KEY=your-cursor-api-key

Supported Agents

Harness        Provider    Models
claude-code    anthropic   opus, sonnet, haiku
openai-codex   openai      gpt-4o, o3, o4-mini
cursor         cursor      composer-1
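An executor spec in `harness:provider:model` form can be checked against the table above before launching a run. The sketch below is illustrative: `parse_executor` is a hypothetical helper, and the `SUPPORTED` mapping is copied from this table, so the tool's real registry (see `vibelab executor list`) may differ.

```python
from typing import NamedTuple

# Copied from the table above; the tool's real registry may differ.
SUPPORTED = {
    ("claude-code", "anthropic"): {"opus", "sonnet", "haiku"},
    ("openai-codex", "openai"): {"gpt-4o", "o3", "o4-mini"},
    ("cursor", "cursor"): {"composer-1"},
}


class ExecutorSpec(NamedTuple):
    harness: str
    provider: str
    model: str


def parse_executor(spec: str) -> ExecutorSpec:
    """Split and validate a harness:provider:model string."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected harness:provider:model, got {spec!r}")
    parsed = ExecutorSpec(*parts)
    models = SUPPORTED.get((parsed.harness, parsed.provider))
    if models is None or parsed.model not in models:
        raise ValueError(f"unsupported executor: {spec!r}")
    return parsed
```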

Data Layout

~/.vibelab/
├── data.db                    # SQLite database
└── results/
    └── {result_id}/
        ├── patch.diff         # Git patch of changes
        ├── stdout.log
        ├── stderr.log
        └── harness/           # Harness-specific artifacts
            └── trajectory.json

Web UI

The web interface provides:

  • Dashboard: Recent scenarios and quick actions
  • Scenarios: Table view of all scenarios with metrics
  • Datasets: Collections of scenarios for batch evaluation
  • Dataset Analytics: Matrix view showing scenario-executor status across all combinations
  • Runs: Table view of all runs across scenarios with quality scores
  • Executors: Single table view of all executor tuples (harness:provider:model) with filtering, selection, and quick run creation
  • Run Creation: Form to configure and launch new comparisons
  • Scenario Detail: Table view of all results for a scenario with comparison, judge management, and judgement display
  • Compare Mode: Side-by-side comparison of two results
  • Result Detail: Full diff viewer, logs, metrics, human notes/quality, and LLM judge judgements
  • Judgements: Centralized view of all LLM judge assessments and pending judgements

Launch with:

uvx vibelab start start-cmd

License

PolyForm Noncommercial 1.0.0 - Free to use for personal and non-commercial purposes. Commercial use and resale are not permitted.
