# boba
```
   ( )
 .-~~~-.
/       \
|  ===  |
| ::::: |
|_:::::_|
  '---'
```
LLM eval datasets & judge. Generate test cases, run evals, and judge results with AI.
## Installation

```
pip install simboba
```
## Quick Start

```
boba init   # Create evals/ folder with template
boba serve  # Start web UI at http://localhost:8787
```
## Commands

| Command | Description |
|---|---|
| `boba init` | Create `evals/` folder with starter template |
| `boba serve` | Start web UI (auto-loads evals from `evals/`) |
| `boba serve --config path` | Load a specific eval file or folder |
| `boba test` | Test connection to your agent |
| `boba test --eval name` | Test a specific eval |
| `boba evals` | List loaded evals and show any errors |
| `boba run --dataset name` | Run evals headlessly (for CI) |
| `boba datasets` | List all datasets |
| `boba generate "description"` | Generate a dataset from the CLI |
| `boba export --dataset name -o file.json` | Export a dataset |
| `boba import -i file.json` | Import a dataset |
| `boba reset` | Delete the database (all data) |
## Testing Your Setup
Before running full evals, verify your eval function can connect to your agent:
```
# Test with the default message
boba test

# Test a specific eval
boba test --eval my-agent

# Test with a custom message
boba test -m "What is the status of order 123?"
```
This sends a test message through your eval function and shows the response (or error). Use this to debug connection issues before running full evaluations.
You can also test from the web UI: click "New Run" and then "Test connection first →".
## Generating Datasets

Datasets are collections of test cases. Each case has:

- `inputs`: Conversation messages (user/assistant turns)
- `expected_outcome`: What the agent should do
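A single case might look like the sketch below. Only the two fields above are documented; the exact storage format and message keys are assumptions modeled on the eval examples later in this README.

```python
# Illustrative test case - `inputs` and `expected_outcome` are the documented
# fields; the message shape ("role"/"message") mirrors the eval examples.
case = {
    "inputs": [
        {"role": "user", "message": "Where is my order?"},
        {"role": "assistant", "message": "Can you share your order number?"},
        {"role": "user", "message": "It's 123."},
    ],
    "expected_outcome": "Agent looks up order 123 and reports its shipping status.",
}
```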
### How to Generate Good Datasets
The key to good test cases is understanding your app first. Before generating:
- Study your app's logic - Read the code, understand the flows
- Identify key user journeys - What are the main things users do?
- Note edge cases - What can go wrong? What are the limits?
- Describe clearly - The better your description, the better the test cases
### Writing a Good Description
Bad:

```
A chatbot for construction sites
```

Good:

```
A WhatsApp-based AI assistant for construction site staff.

KEY FLOWS:
1. Daily site logs - User sends photos/voice notes describing work progress,
   weather, labor count, equipment. Agent should extract structured data and
   confirm what was logged.
2. Safety incidents - User reports an incident. Agent must collect: location,
   time, people involved, injuries, witnesses. Must escalate serious incidents.
3. Material requests - User requests materials. Agent checks inventory,
   suggests alternatives if unavailable, creates purchase order.

EDGE CASES:
- User sends unclear voice note - agent should ask for clarification
- User reports injury - agent must immediately escalate, not just log
- User requests unavailable material - agent should suggest alternatives

USERS: Site managers (tech-savvy), foremen (moderate), workers (basic phones)
```
### Via UI

1. Run `boba serve`
2. Click "New Dataset"
3. Choose "Generate with AI"
4. Paste your detailed description
5. Review and edit the generated cases
### Via CLI

```
boba generate "Your detailed description here"
```
### Best Practices
- Study your code first - Read handlers, prompts, business logic
- Map user flows - List the main journeys step by step
- Include multi-turn conversations - Real users have back-and-forth dialogue
- Be specific in expected outcomes - "Should ask for order number" not "Should help"
- Test different user types - New users, experts, frustrated users
- Include failure cases - What should happen when the agent can't help?
- Cover edge cases - Invalid inputs, missing data, system limits
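To make "be specific in expected outcomes" concrete, compare these two outcome strings (illustrative only):

```python
# Vague - an LLM judge can pass almost any polite reply against this
vague = "Agent should help the user."

# Specific - gives the judge a concrete, checkable behavior
specific = "Agent asks for the order number before attempting a lookup."
```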
## Writing Evals

Evals connect your agent/API to boba for testing. Create Python files in the `evals/` folder.
### Basic Structure

```python
import requests

from simboba import Eval

def my_agent(messages):
    """
    Called for each test case.

    Args:
        messages: List of conversation messages
            [{"role": "user", "message": "...", "attachments": []}, ...]

    Returns:
        Agent's response as a string
    """
    # Call your API and return the response
    response = requests.post("http://localhost:8000/chat", json={
        "messages": [{"role": m["role"], "content": m["message"]} for m in messages]
    })
    return response.json()["response"]

# Register the eval
my_eval = Eval(name="my-agent", fn=my_agent)
```
### Common Patterns
**Direct Python Call** (simplest):

```python
from simboba import Eval

# Import your agent directly - no HTTP/auth needed
from my_project.agent import generate_response

def my_agent(messages):
    # Call your function directly
    return generate_response(messages[-1]["message"])

my_eval = Eval(name="my-agent", fn=my_agent)
```
**HTTP API:**

```python
import requests

def my_agent(messages):
    resp = requests.post("http://localhost:8000/chat", json={"messages": messages})
    return resp.json()["response"]
```
**Async/Webhook API:**

```python
import time

import requests

def my_agent(messages):
    # Trigger the webhook
    job = requests.post("http://localhost:8000/webhook", json={"message": messages[-1]})
    job_id = job.json()["job_id"]
    # Poll for the result (up to 30 seconds)
    for _ in range(30):
        time.sleep(1)
        result = requests.get(f"http://localhost:8000/jobs/{job_id}")
        if result.json()["status"] == "complete":
            return result.json()["response"]
    raise TimeoutError("Agent didn't respond")
```
**With Auth/Setup:**

```python
import os

import requests
from simboba import Eval

# Setup runs once when the file loads
# (see "Environment Variables in Eval Files" below for a safer env var pattern)
API_KEY = os.environ["MY_API_KEY"]
TEST_USER = create_test_user()  # your own helper for provisioning a test user

def my_agent(messages):
    resp = requests.post(
        "http://localhost:8000/chat",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"user_id": TEST_USER["id"], "messages": messages},
    )
    return resp.json()["response"]

my_eval = Eval(name="my-agent", fn=my_agent)
```
## Running Evals
### Via UI

1. Run `boba serve`
2. Go to the "Runs" tab
3. Click "New Run"
4. Select a dataset and an eval
5. View results with pass/fail and reasoning
### Via CLI (for CI)

```
boba run --dataset my-dataset
```
### How Judging Works

1. Your eval function receives the test case inputs
2. Your function calls your agent and returns its response
3. An LLM judge compares the response against the expected outcome
4. The judge returns pass/fail with reasoning
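Conceptually, the judge step works like the sketch below. The real judge prompt and model call are internal to boba; the function name and wording here are illustrative only.

```python
# Illustrative sketch of an LLM-as-judge comparison - not boba's actual prompt.
def build_judge_prompt(response: str, expected_outcome: str) -> str:
    """Compose a pass/fail grading prompt for an LLM judge."""
    return (
        "You are grading an AI agent's response.\n"
        f"Expected outcome: {expected_outcome}\n"
        f"Actual response: {response}\n"
        "Answer PASS or FAIL, then explain your reasoning."
    )

prompt = build_judge_prompt(
    response="Order 123 shipped yesterday and arrives Friday.",
    expected_outcome="Agent reports the shipping status of order 123.",
)
```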
## Environment Variables

Boba automatically loads `.env` files from the current directory.

| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | For Claude models (generation & judging) |
| `OPENAI_API_KEY` | For OpenAI models |
| `GEMINI_API_KEY` | For Gemini models |
### Environment Variables in Eval Files

Important: Validate environment variables inside your eval function, not at module level. This prevents silent import failures and gives clear error messages.

```python
import os

# BAD - fails silently if the env var is missing
API_KEY = os.environ["MY_API_KEY"]  # Module level = import error

def my_agent(messages):
    ...

# GOOD - clear error when the eval actually runs
def my_agent(messages):
    api_key = os.environ.get("MY_API_KEY")
    if not api_key:
        raise ValueError("MY_API_KEY not set")
    ...
```
If your eval files aren't showing up, run `boba evals` to see import errors.
## Project Structure

After running `boba init`:

```
your-project/
├── evals/
│   ├── .gitignore   # Ignores database
│   ├── README.md    # This file
│   ├── example.py   # Starter template
│   └── simboba.db   # Database (created on first run, not tracked)
└── ...
```
## License
MIT