simboba

A lightweight tool for generating annotated eval datasets and running LLM-as-judge evaluations

     ( )
   .-~~~-.
  /       \
  |  ===  |
  | ::::: |
  |_:::::_|
    '---'

Lightweight eval tracking with LLM-as-judge. Run evals as Python scripts, track results as git-friendly JSON files, view in a web UI. Designed for 1-click setup with your favourite AI coding tool.

Installation

pip install simboba

Quick Start

boba init          # Create boba-evals/ folder with templates
boba magic         # Prompt for your AI tool to set up and run your first eval
boba run           # Run your evals (handles Docker automatically)
boba baseline      # Save run as baseline for regression detection
boba serve         # View results at http://localhost:8787

Commands

Command                      Description
boba init                    Create boba-evals/ folder with starter templates
boba magic                   Print detailed prompt for AI coding assistant
boba run [script]            Run eval script (default: test_chat.py); handles Docker automatically
boba baseline                Save a run as baseline for regression detection
boba serve                   Start web UI to view results
boba datasets                List all datasets
boba generate "description"  Generate a dataset from a description
boba reset                   Clear run history (keeps datasets and baselines)

Writing Evals

Evals are Python scripts. Edit boba-evals/test_chat.py:

import requests

from simboba import Boba
from setup import get_context, cleanup

boba = Boba()

def agent(message: str) -> str:
    """Call your agent and return its response."""
    ctx = get_context()
    response = requests.post(
        "http://localhost:8000/api/chat",
        json={"user_id": ctx["user_id"], "message": message},
    )
    return response.json()["response"]

if __name__ == "__main__":
    try:
        # Option 1: Single eval
        boba.eval(
            input="Hello",
            output=agent("Hello"),
            expected="Should greet the user",
        )

        # Option 2: Run against a dataset
        # boba.run(agent, dataset="my-dataset")

        print("Done! Run 'boba serve' to view results.")
    finally:
        cleanup()

Regression Detection

Track regressions across code changes:

# Run evals and compare to baseline
boba run
# Output shows regressions: "REGRESSIONS: 2 cases now failing"

# Save current results as new baseline
boba baseline
# Commit to git for tracking
git add boba-evals/baselines/
git commit -m "Update eval baseline"
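Conceptually, a baseline comparison just diffs per-case pass/fail status between the saved baseline and the latest run. Here is a minimal sketch of that idea, not simboba's actual implementation (the function and the case ids are hypothetical):

```python
def find_regressions(baseline: dict, current: dict) -> list:
    """Return case ids that passed in the baseline but fail (or are missing) now."""
    return [
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    ]

# Toy data: one case flipped from passing to failing
baseline = {"greet": True, "refund": True, "escalate": False}
current = {"greet": True, "refund": False, "escalate": False}

regressions = find_regressions(baseline, current)
print(f"REGRESSIONS: {len(regressions)} cases now failing")
```

Because the baseline is a JSON file committed to git, a regression shows up both in the run output and in the diff history.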

Creating Datasets

Via CLI

boba generate "A customer support chatbot for an e-commerce site"

Via Web UI

  1. boba serve
  2. Click "New Dataset" -> "Generate with AI"
  3. Enter a description of your agent and we'll create test cases for you.

Via API

from simboba import Boba
boba = Boba()
boba.run(agent, dataset="my-dataset")  # Uses dataset created above
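Under the hood, running against a dataset amounts to calling your agent on each case and scoring the output with a judge. The sketch below shows the general shape only; the case fields, judge, and agent here are simplified stand-ins, not simboba's API:

```python
def run_dataset(agent, cases, judge):
    """Call the agent on each case and collect judged results."""
    results = []
    for case in cases:
        output = agent(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "passed": judge(output, case["expected"]),
        })
    return results

# Toy agent and judge, purely for illustration
def echo_agent(msg):
    return f"Hello! You said: {msg}"

def contains_judge(output, expected):
    return expected.lower() in output.lower()

cases = [{"input": "Hi", "expected": "hello"}]
results = run_dataset(echo_agent, cases, contains_judge)
print(results[0]["passed"])  # True
```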

Test Fixtures (setup.py)

Edit boba-evals/setup.py to create test data your agent needs:

def get_context():
    """Create test fixtures, return context dict."""
    # create_test_user is a placeholder for your app's own helper
    user = create_test_user(email="eval@test.com")
    return {
        "user_id": user.id,
        "api_token": user.generate_token(),
    }

def cleanup():
    """Clean up test data after evals."""
    # delete_test_users is a placeholder for your app's own helper
    delete_test_users()

Environment Variables

Boba loads .env automatically. Set your LLM API key for judging (Claude Haiku 4.5 is the default):

ANTHROPIC_API_KEY=sk-ant-...   # Required for default model (Claude)
OPENAI_API_KEY=sk-...          # For OpenAI models
GEMINI_API_KEY=...             # For Gemini models

Note: Without an API key, boba falls back to a simple keyword-matching judge which is less accurate.
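A keyword-matching judge can be as simple as checking what fraction of the expectation's words appear in the output. The sketch below illustrates the general idea only; simboba's actual fallback judge may differ (the function name and threshold are hypothetical):

```python
def keyword_judge(output: str, expected: str, threshold: float = 0.5) -> bool:
    """Pass if at least `threshold` of the expected keywords appear in the output."""
    # Ignore short filler words like "the" when extracting keywords
    keywords = [w for w in expected.lower().split() if len(w) > 3]
    if not keywords:
        return True
    hits = sum(1 for w in keywords if w in output.lower())
    return hits / len(keywords) >= threshold

print(keyword_judge("Hello! Greeting the user warmly.", "Should greet the user"))  # True
```

This is why an LLM judge is preferred: keyword matching passes outputs that merely echo the expected words and fails correct answers phrased differently.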

Project Structure

your-project/
├── boba-evals/
│   ├── datasets/           # Dataset JSON files (git tracked)
│   ├── baselines/          # Baseline results (git tracked)
│   ├── runs/               # Run history (gitignored)
│   ├── files/              # Uploaded attachments
│   ├── setup.py            # Test fixtures
│   ├── test_chat.py        # Your eval script
│   ├── settings.json       # Configuration
│   └── .boba.yaml          # Runtime config (docker vs local)
└── ...
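Datasets and baselines are stored as plain JSON so they diff cleanly in git. The exact schema is defined by simboba; this sketch only suggests the general shape of a dataset file, and every field name here is illustrative, not guaranteed:

```python
import json

# Hypothetical dataset shape -- field names are illustrative only
dataset = {
    "name": "my-dataset",
    "cases": [
        {"input": "Where is my order?", "expected": "Should offer to look up the order status"},
        {"input": "I want a refund", "expected": "Should explain the refund policy"},
    ],
}

print(json.dumps(dataset, indent=2))
```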

Future Updates

  • File Uploads - Allow uploads via UI to help create datasets
  • Eval methods - Built-in evaluation strategies beyond LLM-as-judge
  • Cloud storage - Sync datasets and runs to the cloud for team collaboration

Development

To work on the web UI:

cd frontend
npm install
npm run dev      # Dev server with hot reload (proxies to localhost:8787)
npm run build    # Build to simboba/static/

Run boba serve in another terminal to start the backend.

License

MIT
