Rigorous evaluation tool for comparing LLM coding agents

VibeLab

⚠️ ALPHA RELEASE - USE WITH CAUTION
This project is in alpha and under active development. Expect breaking changes between releases. Use at your own risk.

A rigorous evaluation tool for comparing LLM coding agents.

Overview

VibeLab helps software engineers evaluate and compare LLM coding agents (Claude Code, OpenAI Codex, Cursor, Gemini CLI) through controlled experiments. Instead of ad-hoc "vibe checks," get reproducible, comparable results across different agent configurations.

Features

Comparative Runs: Test the same task across multiple agents side-by-side
Datasets: Organize scenarios into collections for batch evaluation
Result Tracking: Persistent storage of all runs with code diffs, logs, and metrics
Human Feedback: Add notes and quality scores (Perfect/Good/Workable/Bad) to evaluate run fitness
LLM Judges: Automatic graders that mimic human scores using few-shot examples
Judgements: LLM-generated assessments with alignment scores showing judge-human correlation
Web Dashboard: Visual comparison interface with diff viewer and dataset analytics
Extensible: Add new agent harnesses by implementing a simple protocol

Installation

VibeLab can be run directly with uvx without installation:

# Run commands directly with uvx (no installation needed)
uvx vibelab start start-cmd

Or install it permanently:

# Install with uv
uv tool install vibelab

# Or with pip
pip install vibelab

Prerequisites

Python 3.11+
Git
Agent CLIs you plan to use:

# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

Quick Start

Run a comparison

# Compare Claude Code and Codex on the same task
uvx vibelab run run-cmd \
  --code github:owner/repo@main \
  --prompt "Add input validation to the login form" \
  --executor claude-code:anthropic:sonnet \
  --executor openai-codex:openai:gpt-4o

View results

# List recent results
uvx vibelab result list

# View a specific result
uvx vibelab result get <result-id>

# View the code diff
uvx vibelab result diff <result-id>

Launch the web UI

# Production mode (serves built frontend from package)
uvx vibelab start start-cmd

# Development mode (starts frontend dev server)
uvx vibelab start start-cmd --dev

# Run multiple worker processes (default: 1)
uvx vibelab start start-cmd --workers 2

# Verbose mode (API access logs + frontend logs)
uvx vibelab start start-cmd --verbose

CLI Reference

`vibelab run`

Execute a scenario against one or more executors.

# <CODE_REF>: github:owner/repo@ref or local:/path
# <TEXT>: task instructions
# <SPEC>: harness:provider:model (repeat --executor for each agent)
# --timeout defaults to 1800 seconds; --driver is local (default), docker, or modal
uvx vibelab run run-cmd \
  --code <CODE_REF> \
  --prompt <TEXT> \
  --executor <SPEC> \
  [--timeout <SECONDS>] \
  [--driver <DRIVER>]

Options:

--code: Repository reference. Formats:
- github:owner/repo - Latest default branch
- github:owner/repo@branch - Specific branch
- github:owner/repo#commit - Specific commit
- local:/path/to/repo - Local directory
--prompt: Task instructions for the agent
--executor: Agent specification as harness:provider:model (can be repeated)
--timeout: Maximum execution time per agent in seconds (default: 1800)
--driver: Execution driver: local (default), docker, or modal (see "Drivers (execution environments)" for the full list)
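The `--code` formats listed above can be told apart with a few string checks. The following is a minimal sketch, not VibeLab's own parser: `CodeRef` and `parse_code_ref` are hypothetical names introduced here for illustration, and the sketch does no validation beyond the prefix and separator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeRef:
    """Hypothetical parsed form of a --code value (not a VibeLab type)."""
    kind: str                      # "github" or "local"
    location: str                  # "owner/repo" or a filesystem path
    branch: Optional[str] = None   # set for github:owner/repo@branch
    commit: Optional[str] = None   # set for github:owner/repo#commit


def parse_code_ref(ref: str) -> CodeRef:
    """Split a --code reference into its parts, following the documented formats."""
    kind, sep, rest = ref.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected '<kind>:<location>', got {ref!r}")
    if kind == "local":
        return CodeRef(kind="local", location=rest)
    if kind != "github":
        raise ValueError(f"unknown code reference kind: {kind!r}")
    if "#" in rest:
        repo, commit = rest.split("#", 1)
        return CodeRef("github", repo, commit=commit)
    if "@" in rest:
        repo, branch = rest.split("@", 1)
        return CodeRef("github", repo, branch=branch)
    # No suffix: the latest commit on the default branch.
    return CodeRef("github", rest)
```

For example, `parse_code_ref("github:owner/repo@main")` yields a `CodeRef` whose `branch` is `"main"` and whose `commit` is `None`.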

`vibelab scenario`

Manage scenarios (code + prompt combinations).

uvx vibelab scenario create --code <REF> --prompt <TEXT>
uvx vibelab scenario list [--limit N]
uvx vibelab scenario get <ID>

`vibelab dataset`

Manage datasets (collections of scenarios for batch evaluation).

uvx vibelab dataset create --name <NAME> [--description <TEXT>]
uvx vibelab dataset list [--limit N]
uvx vibelab dataset get <ID>
uvx vibelab dataset delete <ID>
uvx vibelab dataset add-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset remove-scenario --dataset <ID> --scenario <ID>
uvx vibelab dataset run --dataset <ID> --executor <SPEC> [--trials N] [--minimal]

Options:

--trials: Number of runs per scenario-executor pair (default: 1)
--minimal: Only run scenario-executor pairs that don't have completed results
--executor: Agent specification as harness:provider:model (can be repeated)
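To make the interaction between `--trials` and `--minimal` concrete, here is a rough sketch of how the set of runs could be planned. It is an illustration only: `plan_dataset_runs` is a hypothetical helper, and it assumes that with `--minimal` a pair is skipped as soon as it has any completed result, which the text above does not spell out.

```python
from itertools import product
from typing import AbstractSet, Iterable, List, Tuple

Pair = Tuple[str, str]  # (scenario_id, executor_spec)


def plan_dataset_runs(
    scenario_ids: Iterable[str],
    executor_specs: Iterable[str],
    trials: int = 1,
    minimal: bool = False,
    completed: AbstractSet[Pair] = frozenset(),
) -> List[Pair]:
    """Return one entry per run to launch (hypothetical planning logic)."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    planned: List[Pair] = []
    for pair in product(scenario_ids, executor_specs):
        # Assumption: --minimal skips pairs that already have a completed result.
        if minimal and pair in completed:
            continue
        planned.extend([pair] * trials)
    return planned
```

Under these assumptions, two scenarios, one executor, and `trials=2` produce four runs; with `minimal=True` and one of the pairs already completed, only the other pair is scheduled.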

`vibelab result`

View and filter results, update notes and quality scores.

uvx vibelab result list [--scenario ID] [--executor SPEC] [--status STATUS]
uvx vibelab result get <ID>
uvx vibelab result diff <ID>
uvx vibelab result update-notes <ID> [--notes TEXT] [--clear]
uvx vibelab result update-quality <ID> [--quality 1-4] [--clear]
uvx vibelab result update <ID> [--notes TEXT] [--quality 1-4] [--clear-notes] [--clear-quality]

Filter options:

--scenario: Filter by scenario ID
--executor: Filter by executor spec (partial match)
--status: Filter by status: queued, running, completed, failed, timeout

Update commands:

update-notes: Add or update notes for a result. Use --notes "-" to read from stdin, or --clear to remove notes.
update-quality: Set quality score (1=Bad, 2=Workable, 3=Good, 4=Perfect). Use --clear to remove score.
update: Update both notes and quality in one command.
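When scripting around these commands, the 1-4 quality scale is easy to mix up. The sketch below encodes the documented mapping as an enum. `Quality` and `parse_quality` are hypothetical names, and accepting text labels is a convenience of this sketch: the CLI itself is documented as taking the numbers 1-4.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Quality scores as documented for `result update-quality`."""
    BAD = 1
    WORKABLE = 2
    GOOD = 3
    PERFECT = 4


def parse_quality(value: str) -> Quality:
    """Accept either the number ("3") or the label ("good"), case-insensitively."""
    text = value.strip()
    if text.isdigit():
        return Quality(int(text))  # raises ValueError outside 1-4
    try:
        return Quality[text.upper()]
    except KeyError:
        raise ValueError(f"unknown quality: {value!r}") from None
```

Because `Quality` is an `IntEnum`, scores compare and sort numerically, so `Quality.BAD < Quality.PERFECT` holds and `int(Quality.GOOD)` gives the value to pass to `--quality`.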

`vibelab executor`

List available agent configurations.

uvx vibelab executor list
uvx vibelab executor list --harness claude-code
uvx vibelab executor list --harness claude-code --provider anthropic

`vibelab start`

Launch the web server.

uvx vibelab start start-cmd [--port 8000] [--host 127.0.0.1] [--frontend-port 5173] [--dev/--no-dev] [--workers 1] [--verbose]

Options:

--port: Backend server port (default: 8000)
--host: Backend server host (default: 127.0.0.1)
--frontend-port: Frontend dev server port (default: 5173, only used with --dev)
--dev/--no-dev: Development mode with frontend dev server, or production mode serving static files (default: --no-dev, production mode)
--workers: Number of background worker processes (default: 1)
--verbose: Verbose logging (includes API request logs and frontend dev-server logs)

Configuration

VibeLab stores data in ~/.vibelab/ by default.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `VIBELAB_HOME` | `~/.vibelab` | Data directory |
| `VIBELAB_DRIVER` | `local` | Default execution driver |
| `VIBELAB_TIMEOUT` | `1800` | Default timeout (seconds) |
| `VIBELAB_LOG_LEVEL` | `INFO` | Logging verbosity |
| `VIBELAB_SQLITE_BUSY_TIMEOUT_MS` | `5000` | SQLite busy timeout (milliseconds) |
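A script that works alongside VibeLab may want to resolve the same variables with the same defaults. The following is a minimal sketch of that lookup. `Settings` and `load_settings` are hypothetical names, not part of VibeLab, and the sketch assumes the variables are plain strings and integers as the table suggests.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    home: Path
    driver: str
    timeout_seconds: int
    log_level: str
    sqlite_busy_timeout_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable, falling back to the documented default."""
    return Settings(
        home=Path(env.get("VIBELAB_HOME", "~/.vibelab")).expanduser(),
        driver=env.get("VIBELAB_DRIVER", "local"),
        timeout_seconds=int(env.get("VIBELAB_TIMEOUT", "1800")),
        log_level=env.get("VIBELAB_LOG_LEVEL", "INFO"),
        sqlite_busy_timeout_ms=int(env.get("VIBELAB_SQLITE_BUSY_TIMEOUT_MS", "5000")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"VIBELAB_TIMEOUT": "60"})`.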

Task Queue

VibeLab uses a durable SQLite-backed task queue so that server restarts do not lose queued work.

Inspect tasks:
- GET /api/tasks
- GET /api/tasks/stats
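The idea of a durable SQLite-backed queue can be sketched in a few lines. This is not VibeLab's implementation: the table name, columns, and function names below are hypothetical, and the sketch only illustrates why tasks stored in an on-disk database survive a process restart. The `busy_timeout` pragma is a standard SQLite setting, and it is an assumption here that `VIBELAB_SQLITE_BUSY_TIMEOUT_MS` maps to it.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'queued'
)
"""


def connect(db_path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open the queue database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    # Wait up to busy_timeout_ms for a lock instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Persist a task; it stays on disk until a worker claims it."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Mark the oldest queued task as running and return (id, payload)."""
    while True:
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:
            # The status check makes the claim safe if another worker got there first.
            cur = conn.execute(
                "UPDATE tasks SET status = 'running' WHERE id = ? AND status = 'queued'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row
```

Closing the connection after `enqueue` and opening a new one on the same file stands in for a server restart: the queued task is still there for `claim_next` to pick up.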

Drivers (execution environments)

VibeLab supports multiple execution drivers:

local (default): git worktree isolation on the host
docker: OCI container execution via Docker
orbstack: OCI container execution via OrbStack (macOS)
modal: cloud execution via Modal

Installing driver dependencies

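Each driver is registered only if its dependencies are installed; a driver with missing dependencies is skipped rather than causing an error. From a source checkout, install the matching extras to enable it: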

uv sync --extra dev --extra docker
uv sync --extra dev --extra modal
# or all drivers:
uv sync --extra dev --extra all-drivers
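Conditional registration of this kind is commonly done by checking whether a driver's optional dependency can be imported. The sketch below shows the pattern under stated assumptions: the `DRIVER_REQUIREMENTS` mapping and `available_drivers` function are hypothetical, and the module names `docker` and `modal` are guesses at which Python packages the extras pull in.

```python
import importlib.util
from typing import Dict, List, Optional

# Hypothetical mapping from driver name to the module it depends on.
# None means the driver has no optional dependency.
DRIVER_REQUIREMENTS: Dict[str, Optional[str]] = {
    "local": None,
    "docker": "docker",
    "modal": "modal",
}


def available_drivers(
    requirements: Dict[str, Optional[str]] = DRIVER_REQUIREMENTS,
) -> List[str]:
    """Return the drivers whose dependency is importable, skipping the rest."""
    found: List[str] = []
    for name, module in requirements.items():
        # find_spec returns None for a top-level module that is not installed.
        if module is None or importlib.util.find_spec(module) is not None:
            found.append(name)
    return found
```

With neither extra installed, `available_drivers()` returns only `["local"]`; installing an extra makes the corresponding driver appear without any other change.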

Driver configuration env vars

| Variable | Example | Description |
| --- | --- | --- |
| `VIBELAB_OCI_RUNTIME` | `docker` / `orbstack` / `podman` | Force runtime selection for OCI drivers |
| `MODAL_TOKEN_ID` | `...` | Modal auth (optional if `~/.modal/token.json` exists) |
| `MODAL_TOKEN_SECRET` | `...` | Modal auth |
| `VIBELAB_CLAUDE_CODE_IMAGE` | `ghcr.io/me/claude-code:latest` | Override container image |
| `VIBELAB_OPENAI_CODEX_IMAGE` | `ghcr.io/me/openai-codex:latest` | Override container image |
| `VIBELAB_CURSOR_IMAGE` | `ghcr.io/me/cursor:latest` | Override container image |
| `VIBELAB_GEMINI_IMAGE` | `ghcr.io/me/gemini:latest` | Override container image |

Container images

Default image names are vibelab/<harness>:latest. These may not exist in your registry; build and tag them yourself (see dockerfiles/) or override via the env vars above.

API Keys

Configure API keys for the agents you want to use:

# Claude Code
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI Codex
export OPENAI_API_KEY=sk-...

# Cursor
export CURSOR_API_KEY=your-cursor-api-key

Supported Agents

| Harness | Provider | Models |
| --- | --- | --- |
| `claude-code` | `anthropic` | `opus`, `sonnet`, `haiku` |
| `openai-codex` | `openai` | `gpt-4o`, `o3`, `o4-mini` |
| `cursor` | `cursor` | `composer-1` |
Data Layout

~/.vibelab/
├── data.db                    # SQLite database
└── results/
    └── {result_id}/
        ├── patch.diff         # Git patch of changes
        ├── stdout.log
        ├── stderr.log
        └── harness/           # Harness-specific artifacts
            └── trajectory.json
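Given this layout, the artifacts for a result can be located with plain path arithmetic. The helper names below are hypothetical, and the sketch assumes every harness writes `trajectory.json`, which may not hold since that directory is harness-specific.

```python
from pathlib import Path
from typing import Dict


def result_artifacts(home: Path, result_id: str) -> Dict[str, Path]:
    """Build the expected artifact paths for one result under the data directory."""
    root = Path(home) / "results" / result_id
    return {
        "patch": root / "patch.diff",
        "stdout": root / "stdout.log",
        "stderr": root / "stderr.log",
        # Assumption: harness-specific; may be absent for some harnesses.
        "trajectory": root / "harness" / "trajectory.json",
    }


def read_patch(home: Path, result_id: str) -> str:
    """Return the stored diff, or an empty string if no patch file exists."""
    patch = result_artifacts(home, result_id)["patch"]
    return patch.read_text(encoding="utf-8") if patch.exists() else ""
```

For instance, `read_patch(Path.home() / ".vibelab", "<result-id>")` would return the same diff that `vibelab result diff <result-id>` is described as showing, assuming the default data directory.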

Web UI

The web interface provides:

Dashboard: Recent scenarios and quick actions
Scenarios: Table view of all scenarios with metrics
Datasets: Collections of scenarios for batch evaluation
Dataset Analytics: Matrix view showing scenario-executor status across all combinations
Runs: Table view of all runs across scenarios with quality scores
Executors: Single table view of all executor tuples (harness:provider:model) with filtering, selection, and quick run creation
Run Creation: Form to configure and launch new comparisons
Scenario Detail: Table view of all results for a scenario with comparison, judge management, and judgement display
Compare Mode: Side-by-side comparison of two results
Result Detail: Full diff viewer, logs, metrics, human notes/quality, and LLM judge judgements
Judgements: Centralized view of all LLM judge assessments and pending judgements

Launch with:

uvx vibelab start start-cmd

Documentation

SPEC.md - Product requirements
DEVELOPMENT.md - Development setup
AGENTS.md - Instructions for AI coding agents

License

PolyForm Noncommercial 1.0.0 - Free to use for personal and non-commercial purposes. Commercial use and resale are not permitted.

Release history

0.0.7 (this version): Jan 7, 2026
0.0.6: Dec 18, 2025
0.0.4: Dec 18, 2025
0.0.2: Dec 18, 2025
0.0.1: Dec 14, 2025

Download files

Source Distribution

vibelab-0.0.7.tar.gz (27.6 MB)

Uploaded Jan 7, 2026 Source

Built Distribution

vibelab-0.0.7-py3-none-any.whl (430.1 kB)

Uploaded Jan 7, 2026 Python 3

File details

Details for the file vibelab-0.0.7.tar.gz.

File metadata

Download URL: vibelab-0.0.7.tar.gz
Upload date: Jan 7, 2026
Size: 27.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`256ad76f0c18884f6bc9b2420ae760b0eaafcb8dbb8566c3f6b02c9c22f49f9c`
MD5	`acfe542a92c38890af93dea803c615e5`
BLAKE2b-256	`c8be26e4a3c5ba1dfd089c4dff93aa785fb8974594091afa12ee04bde7325419`

Provenance

The following attestation bundles were made for vibelab-0.0.7.tar.gz:

Publisher: bump-and-release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vibelab-0.0.7.tar.gz
- Subject digest: 256ad76f0c18884f6bc9b2420ae760b0eaafcb8dbb8566c3f6b02c9c22f49f9c
- Sigstore transparency entry: 799059648
- Sigstore integration time: Jan 7, 2026
Source repository:
- Permalink: tssweeney/vibelab@5cc39b6f7adb119985ebd40ff57352a3f9c459cb
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tssweeney
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: bump-and-release.yml@5cc39b6f7adb119985ebd40ff57352a3f9c459cb
- Trigger Event: workflow_dispatch

File details

Details for the file vibelab-0.0.7-py3-none-any.whl.

File metadata

Download URL: vibelab-0.0.7-py3-none-any.whl
Upload date: Jan 7, 2026
Size: 430.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibelab-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f5a60d8b2fbb60363b5e93785000b4e6d86bf26619c0a4789feb0e4da34d421`
MD5	`ded61f1afef2af160c65db020f131048`
BLAKE2b-256	`556179bc154bddbe6869fd0f617f85162a3858f37e35e368362ab6a85ccb7d05`

Provenance

The following attestation bundles were made for vibelab-0.0.7-py3-none-any.whl:

Publisher: bump-and-release.yml on tssweeney/vibelab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vibelab-0.0.7-py3-none-any.whl
- Subject digest: 5f5a60d8b2fbb60363b5e93785000b4e6d86bf26619c0a4789feb0e4da34d421
- Sigstore transparency entry: 799059653
- Sigstore integration time: Jan 7, 2026
Source repository:
- Permalink: tssweeney/vibelab@5cc39b6f7adb119985ebd40ff57352a3f9c459cb
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tssweeney
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: bump-and-release.yml@5cc39b6f7adb119985ebd40ff57352a3f9c459cb
- Trigger Event: workflow_dispatch
