A lightweight tool for generating annotated eval datasets and running LLM-as-judge evaluations

boba

     ( )
   .-~~~-.
  /       \
  |  ===  |
  | ::::: |
  |_:::::_|
    '---'

LLM eval datasets & judge. Generate test cases, run evals, and judge results with AI.

Installation

pip install simboba

Quick Start

boba init      # Create evals/ folder with template
boba serve     # Start web UI at http://localhost:8787

Commands

Command                                    Description
boba init                                  Create evals/ folder with starter template
boba serve                                 Start web UI (auto-loads evals from evals/)
boba serve --config path                   Load specific eval file or folder
boba test                                  Test connection to your agent
boba test --eval name                      Test a specific eval
boba evals                                 List loaded evals and show any errors
boba run --dataset name                    Run evals headlessly (for CI)
boba datasets                              List all datasets
boba generate "description"                Generate dataset from CLI
boba export --dataset name -o file.json    Export dataset
boba import -i file.json                   Import dataset
boba reset                                 Delete database (all data)

Testing Your Setup

Before running full evals, verify your eval function can connect to your agent:

# Test with default message
boba test

# Test a specific eval
boba test --eval my-agent

# Test with a custom message
boba test -m "What is the status of order 123?"

This sends a test message through your eval function and shows the response (or error). Use this to debug connection issues before running full evaluations.

You can also test from the web UI: click "New Run" and then "Test connection first →".


Generating Datasets

Datasets are collections of test cases. Each case has two fields (example below):

  • inputs: Conversation messages (user/assistant turns)
  • expected_outcome: What the agent should do
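
For example, a single case might look like this (a hypothetical sketch; the field names follow the list above, and the message shape matches what eval functions receive):

case = {
    "inputs": [
        {"role": "user", "message": "I need 20 bags of cement for site B", "attachments": []}
    ],
    "expected_outcome": "Agent checks inventory and creates a purchase order, "
                        "or suggests an alternative if the material is unavailable."
}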

How to Generate Good Datasets

The key to good test cases is understanding your app first. Before generating:

  1. Study your app's logic - Read the code, understand the flows
  2. Identify key user journeys - What are the main things users do?
  3. Note edge cases - What can go wrong? What are the limits?
  4. Describe clearly - The better your description, the better the test cases

Writing a Good Description

Bad:

A chatbot for construction sites

Good:

A WhatsApp-based AI assistant for construction site staff.

KEY FLOWS:
1. Daily site logs - User sends photos/voice notes describing work progress,
   weather, labor count, equipment. Agent should extract structured data and
   confirm what was logged.

2. Safety incidents - User reports an incident. Agent must collect: location,
   time, people involved, injuries, witnesses. Must escalate serious incidents.

3. Material requests - User requests materials. Agent checks inventory,
   suggests alternatives if unavailable, creates purchase order.

EDGE CASES:
- User sends unclear voice note - agent should ask for clarification
- User reports injury - agent must immediately escalate, not just log
- User requests unavailable material - agent should suggest alternatives

USERS: Site managers (tech-savvy), foremen (moderate), workers (basic phones)

Via UI

  1. Run boba serve
  2. Click "New Dataset"
  3. Choose "Generate with AI"
  4. Paste your detailed description
  5. Review and edit generated cases

Via CLI

boba generate "Your detailed description here"

Best Practices

  1. Study your code first - Read handlers, prompts, business logic
  2. Map user flows - List the main journeys step by step
  3. Include multi-turn conversations - Real users have back-and-forth dialogue (see the example after this list)
  4. Be specific in expected outcomes - "Should ask for order number" not "Should help"
  5. Test different user types - New users, experts, frustrated users
  6. Include failure cases - What should happen when the agent can't help?
  7. Cover edge cases - Invalid inputs, missing data, system limits

Writing Evals

Evals connect your agent/API to boba for testing. Create Python files in the evals/ folder.

Basic Structure

import requests

from simboba import Eval

def my_agent(messages):
    """
    Called for each test case.

    Args:
        messages: List of conversation messages
            [{"role": "user", "message": "...", "attachments": []}, ...]

    Returns:
        Agent's response as a string
    """
    # Call your API and return the response
    response = requests.post("http://localhost:8000/chat", json={
        "messages": [{"role": m["role"], "content": m["message"]} for m in messages]
    })
    return response.json()["response"]

# Register the eval
my_eval = Eval(name="my-agent", fn=my_agent)

Common Patterns

Direct Python Call (simplest):

from simboba import Eval

# Import your agent directly - no HTTP/auth needed
from my_project.agent import generate_response

def my_agent(messages):
    # Call your function directly
    return generate_response(messages[-1]["message"])

my_eval = Eval(name="my-agent", fn=my_agent)

HTTP API:

import requests

def my_agent(messages):
    resp = requests.post("http://localhost:8000/chat", json={"messages": messages})
    return resp.json()["response"]

Async/Webhook API:

import time

import requests

def my_agent(messages):
    # Trigger webhook
    job = requests.post("http://localhost:8000/webhook", json={"message": messages[-1]})
    job_id = job.json()["job_id"]

    # Poll for result
    for _ in range(30):
        time.sleep(1)
        result = requests.get(f"http://localhost:8000/jobs/{job_id}")
        if result.json()["status"] == "complete":
            return result.json()["response"]

    raise TimeoutError("Agent didn't respond")

With Auth/Setup:

import os

import requests

# Setup runs once when the file loads (but see the environment-variable note below)
API_KEY = os.environ["MY_API_KEY"]
TEST_USER = create_test_user()

def my_agent(messages):
    resp = requests.post(
        "http://localhost:8000/chat",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"user_id": TEST_USER["id"], "messages": messages}
    )
    return resp.json()["response"]

my_eval = Eval(name="my-agent", fn=my_agent)

Running Evals

Via UI

  1. Run boba serve
  2. Go to "Runs" tab
  3. Click "New Run"
  4. Select dataset and eval
  5. View results with pass/fail and reasoning

Via CLI (for CI)

boba run --dataset my-dataset
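
In CI this is just a shell step. A minimal sketch (assuming boba run signals failures through its exit code; check this against your version, and note that $CI_SECRET_KEY is a placeholder for however your CI injects secrets):

pip install simboba
export ANTHROPIC_API_KEY="$CI_SECRET_KEY"   # the judge needs a model key
boba run --dataset my-dataset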

How Judging Works

  1. Your eval function receives the test case inputs
  2. Your function calls your agent and returns its response
  3. An LLM judge compares the response against the expected outcome
  4. Judge returns pass/fail with reasoning
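
Conceptually, a run is a loop like this (an illustrative sketch, not boba's actual internals; dataset and judge_with_llm are placeholder names, and my_agent is the eval function from above):

for case in dataset:
    response = my_agent(case["inputs"])                           # steps 1-2
    verdict = judge_with_llm(response, case["expected_outcome"])  # step 3
    print(verdict["passed"], verdict["reasoning"])                # step 4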

Environment Variables

Boba automatically loads .env files from the current directory.

Variable             Description
ANTHROPIC_API_KEY    For Claude models (generation & judging)
OPENAI_API_KEY       For OpenAI models
GEMINI_API_KEY       For Gemini models
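
For example, a minimal .env for Claude-based generation and judging:

ANTHROPIC_API_KEY=sk-ant-...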

Environment Variables in Eval Files

Important: Validate environment variables inside your eval function, not at module level. This prevents silent import failures and gives clear error messages.

import os

# BAD - fails silently if env var is missing
API_KEY = os.environ["MY_API_KEY"]  # Module level = import error

def my_agent(messages):
    ...

# GOOD - clear error when the eval actually runs
def my_agent(messages):
    api_key = os.environ.get("MY_API_KEY")
    if not api_key:
        raise ValueError("MY_API_KEY not set")
    ...

If your eval files aren't showing up, run boba evals to see import errors.


Project Structure

After running boba init:

your-project/
├── evals/
│   ├── .gitignore     # Ignores database
│   ├── README.md      # This file
│   ├── example.py     # Starter template
│   └── simboba.db     # Database (created on first run, not tracked)
└── ...

License

MIT
