🧪 cobalt-python
Unit testing for AI Agents – Python port of Cobalt
Cobalt lets you write deterministic, repeatable tests for your LLM-powered agents and pipelines – the same way you'd write unit tests for regular code.
This is the Python port. The original TypeScript SDK lives at basalt-ai/cobalt.
Features
- Dataset loaders – JSON, JSONL, CSV, remote URL, Langfuse, Langsmith, Braintrust, Basalt
- Three evaluator types – LLM judge, custom function, semantic similarity
- Async-native runner – configurable concurrency + per-item timeout
- SQLite history – compare runs over time with `cobalt history` / `cobalt compare`
- Local dashboard – `cobalt ui` spins up a web UI with score charts, item drill-down, and run comparison
- CI-ready – declare score thresholds, get exit code 1 on regression
- Rich CLI – `cobalt run`, `cobalt init`, `cobalt history`, `cobalt compare`, `cobalt ui`, `cobalt clean`
- MCP server – `cobalt mcp` exposes 4 tools, 3 resources, 3 prompts to Claude and other MCP clients
- Full docs – docs/ matches the TypeScript SDK's structure and coverage
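The concurrency and per-item timeout behaviour described above can be sketched with standard asyncio primitives (the README notes `asyncio.Semaphore` is used internally); the function name and error handling here are illustrative, not Cobalt's actual internals:

```python
import asyncio

async def run_with_limits(items, run_item, concurrency=5, timeout=30):
    """Run `run_item` over all items with bounded concurrency and a
    per-item timeout. Illustrative sketch only."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with sem:  # at most `concurrency` items in flight
            try:
                return await asyncio.wait_for(run_item(item), timeout=timeout)
            except asyncio.TimeoutError:
                return None  # a real runner would record the timeout

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(i) for i in items))
```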
Install
pip install cobalt-ai
For development / from source:
git clone https://github.com/basalt-ai/cobalt-python
cd cobalt-python
pip install -e ".[dev]"
Quick start
# my_agent.cobalt.py
import asyncio
from cobalt import Dataset, Evaluator, EvalContext, EvalResult, ExperimentResult, experiment

async def my_agent(question: str) -> str:
    # Replace with your real LLM call
    return "The answer is 42"

dataset = Dataset.from_items([
    {"input": "What is 6 × 7?", "expected_output": "42"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

def exact_match(ctx: EvalContext) -> EvalResult:
    expected = str(ctx.item.get("expected_output", ""))
    score = 1.0 if expected in str(ctx.output) else 0.0
    return EvalResult(score=score, reason=f"Expected: {expected}")

async def runner(ctx) -> ExperimentResult:
    output = await my_agent(ctx.item["input"])
    return ExperimentResult(output=output)

async def main():
    await experiment(
        "my-agent",
        dataset,
        runner=runner,
        evaluators=[
            Evaluator(name="exact-match", type="function", fn=exact_match),
        ],
    )

asyncio.run(main())
cobalt run --file my_agent.cobalt.py
Evaluators
Function evaluator
def my_check(ctx: EvalContext) -> EvalResult:
    return EvalResult(score=1.0 if "yes" in str(ctx.output).lower() else 0.0)

Evaluator(name="contains-yes", type="function", fn=my_check)
LLM Judge
Evaluator(
    name="helpfulness",
    type="llm-judge",
    model="gpt-4o-mini",  # or claude-3-5-haiku, etc.
    scoring="boolean",    # "boolean" (PASS/FAIL) or "scale" (0–1)
    chain_of_thought=True,
    prompt="""
You are evaluating an AI assistant's response.
Question: {{input}}
Response: {{output}}
Is the response helpful and accurate? Reply PASS or FAIL.
""",
)
Similarity
Evaluator(
    name="semantic-similarity",
    type="similarity",
    field="expected_output",  # dataset field to compare against
    threshold=0.7,            # score = 1.0 if similarity >= threshold
)
Datasets
# From Python
ds = Dataset.from_items([{"input": "hello", "expected": "world"}])
# From files
ds = Dataset.from_file("data.csv")  # csv / json / jsonl – auto-detected
ds = Dataset.from_jsonl("data.jsonl")
ds = Dataset.from_json("data.json")
# Remote
ds = await Dataset.from_remote("https://example.com/data.jsonl")
# Platforms
ds = await Dataset.from_langfuse("my-dataset")
ds = await Dataset.from_langsmith("my-dataset")
ds = await Dataset.from_braintrust("my-project", "my-dataset")
ds = await Dataset.from_basalt("dataset-id")
# Transformations (chainable)
ds = ds.filter(lambda item, i: item["score"] > 0.5)
ds = ds.map(lambda item, i: {**item, "idx": i})
ds = ds.sample(100)
ds = ds.slice(0, 50)
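The chainable transformations above can be pictured as methods that each return a new dataset; this simplified stand-in class (not the real Dataset) shows the shape of that API, with callbacks receiving `(item, index)` as in the examples:

```python
import random

class MiniDataset:
    """Simplified stand-in for Dataset, illustrating chainable transforms."""

    def __init__(self, items: list[dict]):
        self.items = list(items)

    def filter(self, fn):
        # keep items for which fn(item, index) is truthy
        return MiniDataset([it for i, it in enumerate(self.items) if fn(it, i)])

    def map(self, fn):
        return MiniDataset([fn(it, i) for i, it in enumerate(self.items)])

    def sample(self, n: int):
        return MiniDataset(random.sample(self.items, min(n, len(self.items))))

    def slice(self, start: int, stop: int):
        return MiniDataset(self.items[start:stop])
```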
Configuration
Create cobalt.toml in your project root (or run cobalt init):
[judge]
model = "gpt-4o-mini"
provider = "openai"
# api_key = "sk-..." # or set OPENAI_API_KEY env var
[experiment]
concurrency = 5
timeout = 30
test_dir = "./experiments"
Dashboard
pip install 'cobalt-ai[dashboard]'
cobalt ui
# Opens http://localhost:4000
The local dashboard provides:
- Run history with colour-coded score pills
- Per-run score distribution chart (avg / p95 / min per evaluator)
- Item-level drill-down โ input, output, evaluator reasons
- Side-by-side run comparison
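The per-evaluator summary shown in the score chart (avg / p95 / min) reduces to descriptive statistics over each evaluator's scores; a sketch using nearest-rank p95, not necessarily the method in Cobalt's stats.py:

```python
import math

def summarize(scores: list[float]) -> dict:
    """Average, 95th percentile (nearest-rank), and minimum of scores."""
    if not scores:
        return {"avg": 0.0, "p95": 0.0, "min": 0.0}
    ordered = sorted(scores)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank index
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": ordered[rank],
        "min": ordered[0],
    }
```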
CLI
# Scaffold config + example experiment
cobalt init
# Run all *.cobalt.py files
cobalt run
# Run a specific file
cobalt run --file experiments/my-agent.cobalt.py
# CI mode – exit 1 if thresholds violated
cobalt run --ci
# List recent runs
cobalt history --limit 20
# Compare two runs
cobalt compare <run-id-1> <run-id-2>
# Local web dashboard
cobalt ui --port 4000
# Delete all stored results
cobalt clean
CI Integration
from cobalt.types import ThresholdConfig, ThresholdMetric

thresholds = ThresholdConfig(
    evaluators={
        "exact-match": ThresholdMetric(avg=0.9, p95=0.7),
        "helpfulness": ThresholdMetric(avg=0.8),
    }
)

report = await experiment(
    "my-agent", dataset, runner,
    evaluators=[...],
    thresholds=thresholds,
)
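Conceptually, the CI check compares each evaluator's summary statistics against its configured floors and fails the run on any violation; this is a hypothetical sketch (plain dicts instead of Cobalt's dataclasses), not the library's actual code path:

```python
def check_thresholds(stats: dict[str, dict], thresholds: dict[str, dict]) -> list[str]:
    """Return a list of violation messages; a non-empty list would map to exit code 1.

    `stats` maps evaluator name -> {"avg": ..., "p95": ...};
    `thresholds` maps evaluator name -> minimum required values.
    """
    violations = []
    for name, required in thresholds.items():
        actual = stats.get(name, {})
        for metric, floor in required.items():
            value = actual.get(metric, 0.0)
            if value < floor:
                violations.append(f"{name}.{metric}: {value:.2f} < {floor:.2f}")
    return violations
```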
# .github/workflows/eval.yml
- name: Run evaluations
  run: cobalt run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Architecture
src/cobalt/
├── __init__.py        # Public API surface
├── types.py           # All dataclasses
├── config.py          # cobalt.toml loader
├── dataset.py         # Dataset class
├── evaluator.py       # Evaluator + registry
├── experiment.py      # Core runner
├── evaluators/
│   ├── function.py    # Custom function evaluator
│   ├── llm_judge.py   # LLM-judge evaluator
│   └── similarity.py  # TF-IDF cosine similarity
├── storage/
│   ├── db.py          # SQLite history
│   └── results.py     # JSON result files
├── utils/
│   ├── stats.py       # Descriptive statistics
│   ├── template.py    # {{variable}} rendering
│   └── cost.py        # Token cost estimation
└── cli/
    └── main.py        # cobalt CLI (Typer)
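The `{{variable}}` rendering that template.py performs for judge prompts can be sketched as a regex substitution; the helper name and behaviour for unknown placeholders are assumptions, not the actual template.py API:

```python
import re

def render(template: str, variables: dict[str, object]) -> str:
    """Replace {{name}} placeholders with values; unknown names are left intact."""
    def sub(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)
```

Applied to the judge prompt earlier, `render(prompt, {"input": item["input"], "output": output})` would fill in the question and response.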
Development
# Install
pip install -e ".[dev]"
# Test
pytest tests/ -v
# Lint
ruff check src/ tests/
Relationship to TypeScript Cobalt
| Feature | TypeScript | Python |
|---|---|---|
| Dataset loaders | ✅ | ✅ |
| LLM judge | ✅ | ✅ |
| Function evaluator | ✅ | ✅ |
| Similarity | ✅ | ✅ (TF-IDF) |
| CLI | ✅ | ✅ |
| History / compare | ✅ | ✅ |
| SQLite storage | ✅ | ✅ |
| CI thresholds | ✅ | ✅ |
| Local dashboard | ✅ | ✅ (cobalt ui) |
| MCP integration | ✅ | ✅ (cobalt mcp) |
| Platform integrations | Langfuse, Langsmith, Braintrust, Basalt | ✅ same |
Python conventions used throughout: async/await, dataclasses, asyncio.Semaphore, typer, rich.
License
MIT – see LICENSE.
Built by Basalt AI.