A lightweight tool for generating annotated eval datasets and running LLM-as-judge evaluations

boba

     ( )
   .-~~~-.
  /       \
  |  ===  |
  | ::::: |
  |_:::::_|
    '---'

LLM eval datasets & judge. Generate test cases, run evals, and judge results with AI.

Installation

pip install simboba

Quick Start

boba init      # Create evals/ folder with template
boba serve     # Start web UI at http://localhost:8787

Commands

Command                                    Description
boba init                                  Create evals/ folder with starter template
boba serve                                 Start web UI (auto-loads evals from evals/)
boba serve --config path                   Load specific eval file or folder
boba test                                  Test connection to your agent
boba test --eval name                      Test a specific eval
boba evals                                 List loaded evals and show any errors
boba run --dataset name                    Run evals headlessly (for CI)
boba datasets                              List all datasets
boba generate "description"                Generate dataset from CLI
boba export --dataset name -o file.json    Export dataset
boba import -i file.json                   Import dataset
boba reset                                 Delete database (all data)

Testing Your Setup

Before running full evals, verify your eval function can connect to your agent:

# Test with default message
boba test

# Test a specific eval
boba test --eval my-agent

# Test with a custom message
boba test -m "What is the status of order 123?"

This sends a test message through your eval function and shows the response (or error). Use this to debug connection issues before running full evaluations.

You can also test from the web UI: click "New Run" and then "Test connection first →".


Generating Datasets

Datasets are collections of test cases. Each case has two fields (example below):

  • inputs: Conversation messages (user/assistant turns)
  • expected_outcome: What the agent should do
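
For example, a single case might look like this (a hypothetical sketch; the field names follow the list above, and the message shape matches what eval functions receive):

case = {
    "inputs": [
        {"role": "user", "message": "I need 20 bags of cement for site B", "attachments": []}
    ],
    "expected_outcome": "Agent checks inventory and creates a purchase order, "
                        "or suggests an alternative if the material is unavailable."
}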

How to Generate Good Datasets

The key to good test cases is understanding your app first. Before generating:

  1. Study your app's logic - Read the code, understand the flows
  2. Identify key user journeys - What are the main things users do?
  3. Note edge cases - What can go wrong? What are the limits?
  4. Describe clearly - The better your description, the better the test cases

Writing a Good Description

Bad:

A chatbot for construction sites

Good:

A WhatsApp-based AI assistant for construction site staff.

KEY FLOWS:
1. Daily site logs - User sends photos/voice notes describing work progress,
   weather, labor count, equipment. Agent should extract structured data and
   confirm what was logged.

2. Safety incidents - User reports an incident. Agent must collect: location,
   time, people involved, injuries, witnesses. Must escalate serious incidents.

3. Material requests - User requests materials. Agent checks inventory,
   suggests alternatives if unavailable, creates purchase order.

EDGE CASES:
- User sends unclear voice note - agent should ask for clarification
- User reports injury - agent must immediately escalate, not just log
- User requests unavailable material - agent should suggest alternatives

USERS: Site managers (tech-savvy), foremen (moderate), workers (basic phones)

Via UI

  1. Run boba serve
  2. Click "New Dataset"
  3. Choose "Generate with AI"
  4. Paste your detailed description
  5. Review and edit generated cases

Via CLI

boba generate "Your detailed description here"

Best Practices

  1. Study your code first - Read handlers, prompts, business logic
  2. Map user flows - List the main journeys step by step
  3. Include multi-turn conversations - Real users have back-and-forth dialogue (see the example after this list)
  4. Be specific in expected outcomes - "Should ask for order number" not "Should help"
  5. Test different user types - New users, experts, frustrated users
  6. Include failure cases - What should happen when the agent can't help?
  7. Cover edge cases - Invalid inputs, missing data, system limits

Writing Evals

Evals connect your agent/API to boba for testing. Create Python files in the evals/ folder.

Basic Structure

import requests

from simboba import Eval

def my_agent(messages):
    """
    Called for each test case.

    Args:
        messages: List of conversation messages
            [{"role": "user", "message": "...", "attachments": []}, ...]

    Returns:
        Agent's response as a string
    """
    # Call your API and return the response
    response = requests.post("http://localhost:8000/chat", json={
        "messages": [{"role": m["role"], "content": m["message"]} for m in messages]
    })
    return response.json()["response"]

# Register the eval
my_eval = Eval(name="my-agent", fn=my_agent)

Common Patterns

Direct Python Call (simplest):

from simboba import Eval

# Import your agent directly - no HTTP/auth needed
from my_project.agent import generate_response

def my_agent(messages):
    # Call your function directly
    return generate_response(messages[-1]["message"])

my_eval = Eval(name="my-agent", fn=my_agent)

HTTP API:

import requests

def my_agent(messages):
    resp = requests.post("http://localhost:8000/chat", json={"messages": messages})
    return resp.json()["response"]

Async/Webhook API:

import time

import requests

def my_agent(messages):
    # Trigger webhook
    job = requests.post("http://localhost:8000/webhook", json={"message": messages[-1]})
    job_id = job.json()["job_id"]

    # Poll for result
    for _ in range(30):
        time.sleep(1)
        result = requests.get(f"http://localhost:8000/jobs/{job_id}")
        if result.json()["status"] == "complete":
            return result.json()["response"]

    raise TimeoutError("Agent didn't respond")

With Auth/Setup:

import os

import requests

# Setup runs once when the file loads (but see the environment-variable note below)
API_KEY = os.environ["MY_API_KEY"]
TEST_USER = create_test_user()

def my_agent(messages):
    resp = requests.post(
        "http://localhost:8000/chat",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"user_id": TEST_USER["id"], "messages": messages}
    )
    return resp.json()["response"]

my_eval = Eval(name="my-agent", fn=my_agent)

Running Evals

Via UI

  1. Run boba serve
  2. Go to "Runs" tab
  3. Click "New Run"
  4. Select dataset and eval
  5. View results with pass/fail and reasoning

Via CLI (for CI)

boba run --dataset my-dataset
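
In CI this is just a shell step. A minimal sketch (assuming boba run signals failures through its exit code; check this against your version, and note that $CI_SECRET_KEY is a placeholder for however your CI injects secrets):

pip install simboba
export ANTHROPIC_API_KEY="$CI_SECRET_KEY"   # the judge needs a model key
boba run --dataset my-dataset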

How Judging Works

  1. Your eval function receives the test case inputs
  2. Your function calls your agent and returns its response
  3. An LLM judge compares the response against the expected outcome
  4. Judge returns pass/fail with reasoning
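
Conceptually, a run is a loop like this (an illustrative sketch, not boba's actual internals; dataset and judge_with_llm are placeholder names, and my_agent is the eval function from above):

for case in dataset:
    response = my_agent(case["inputs"])                           # steps 1-2
    verdict = judge_with_llm(response, case["expected_outcome"])  # step 3
    print(verdict["passed"], verdict["reasoning"])                # step 4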

Environment Variables

Boba automatically loads .env files from the current directory.

Variable             Description
ANTHROPIC_API_KEY    For Claude models (generation & judging)
OPENAI_API_KEY       For OpenAI models
GEMINI_API_KEY       For Gemini models
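
For example, a minimal .env for Claude-based generation and judging:

ANTHROPIC_API_KEY=sk-ant-...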

Environment Variables in Eval Files

Important: Validate environment variables inside your eval function, not at module level. This prevents silent import failures and gives clear error messages.

import os

# BAD - fails silently if env var is missing
API_KEY = os.environ["MY_API_KEY"]  # Module level = import error

def my_agent(messages):
    ...

# GOOD - clear error when the eval actually runs
def my_agent(messages):
    api_key = os.environ.get("MY_API_KEY")
    if not api_key:
        raise ValueError("MY_API_KEY not set")
    ...

If your eval files aren't showing up, run boba evals to see import errors.


Project Structure

After running boba init:

your-project/
├── evals/
│   ├── .gitignore     # Ignores database
│   ├── README.md      # This file
│   ├── example.py     # Starter template
│   └── simboba.db     # Database (created on first run, not tracked)
└── ...

License

MIT
