
🧪 cobalt-python

Unit testing for AI Agents - Python port of Cobalt


Cobalt lets you write deterministic, repeatable tests for your LLM-powered agents and pipelines - the same way you'd write unit tests for regular code.

This is the Python port. The original TypeScript SDK lives at basalt-ai/cobalt.


Features

  • Dataset loaders - JSON, JSONL, CSV, remote URL, Langfuse, Langsmith, Braintrust, Basalt
  • Three evaluator types - LLM judge, custom function, semantic similarity
  • Async-native runner - configurable concurrency and per-item timeout
  • SQLite history - compare runs over time with cobalt history / cobalt compare
  • Local dashboard - cobalt ui spins up a web UI with score charts, item drill-down, and run comparison
  • CI-ready - declare score thresholds, get exit code 1 on regression
  • Rich CLI - cobalt run, cobalt init, cobalt history, cobalt compare, cobalt ui, cobalt clean
  • MCP server - cobalt mcp exposes 4 tools, 3 resources, and 3 prompts to Claude and other MCP clients
  • Full docs - docs/ matches the TypeScript SDK's structure and coverage

Install

pip install cobalt-ai

For development / from source:

git clone https://github.com/basalt-ai/cobalt-python
cd cobalt-python
pip install -e ".[dev]"

Quick start

# my_agent.cobalt.py
import asyncio
from cobalt import Dataset, Evaluator, EvalContext, EvalResult, ExperimentResult, experiment


async def my_agent(question: str) -> str:
    # Replace with your real LLM call
    return "The answer is 42"


dataset = Dataset.from_items([
    {"input": "What is 6 × 7?", "expected_output": "42"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])


def exact_match(ctx: EvalContext) -> EvalResult:
    expected = str(ctx.item.get("expected_output", ""))
    score = 1.0 if expected in str(ctx.output) else 0.0
    return EvalResult(score=score, reason=f"Expected: {expected}")


async def run_agent(ctx) -> ExperimentResult:
    # Adapt the agent to the runner signature: context in, ExperimentResult out
    output = await my_agent(ctx.item["input"])
    return ExperimentResult(output=output)


async def main():
    await experiment(
        "my-agent",
        dataset,
        runner=run_agent,
        evaluators=[
            Evaluator(name="exact-match", type="function", fn=exact_match),
        ],
    )


asyncio.run(main())

Then run it with the CLI:

cobalt run --file my_agent.cobalt.py

Evaluators

Function evaluator

def my_check(ctx: EvalContext) -> EvalResult:
    return EvalResult(score=1.0 if "yes" in ctx.output.lower() else 0.0)

Evaluator(name="contains-yes", type="function", fn=my_check)

LLM Judge

Evaluator(
    name="helpfulness",
    type="llm-judge",
    model="gpt-4o-mini",       # or claude-3-5-haiku, etc.
    scoring="boolean",          # "boolean" (PASS/FAIL) or "scale" (0–1)
    chain_of_thought=True,
    prompt="""
You are evaluating an AI assistant's response.

Question: {{input}}
Response: {{output}}

Is the response helpful and accurate? Reply PASS or FAIL.
""",
)

Similarity

Evaluator(
    name="semantic-similarity",
    type="similarity",
    field="expected_output",   # dataset field to compare against
    threshold=0.7,             # score = 1.0 if similarity >= threshold
)
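The similarity evaluator is based on TF-IDF cosine similarity (see Architecture). As a rough illustration of the idea - not Cobalt's actual implementation - the score can be computed over a two-document corpus like this:

```python
import math
from collections import Counter

def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using per-pair TF-IDF weights."""
    docs = [a.lower().split(), b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    # Inverse document frequency over the two-document "corpus"
    idf = {t: math.log(2 / sum(t in d for d in docs)) + 1 for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (tf[t] / len(d)) * idf[t] for t in vocab})
    dot = sum(vecs[0][t] * vecs[1][t] for t in vocab)
    norm = (math.sqrt(sum(v * v for v in vecs[0].values()))
            * math.sqrt(sum(v * v for v in vecs[1].values())))
    return dot / norm if norm else 0.0
```

With a `threshold` of 0.7, the evaluator would then map any similarity at or above 0.7 to a score of 1.0.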

Datasets

# From Python
ds = Dataset.from_items([{"input": "hello", "expected": "world"}])

# From files
ds = Dataset.from_file("data.csv")     # csv / json / jsonl - auto-detected
ds = Dataset.from_jsonl("data.jsonl")
ds = Dataset.from_json("data.json")

# Remote
ds = await Dataset.from_remote("https://example.com/data.jsonl")

# Platforms
ds = await Dataset.from_langfuse("my-dataset")
ds = await Dataset.from_langsmith("my-dataset")
ds = await Dataset.from_braintrust("my-project", "my-dataset")
ds = await Dataset.from_basalt("dataset-id")

# Transformations (chainable)
ds = ds.filter(lambda item, i: item["score"] > 0.5)
ds = ds.map(lambda item, i: {**item, "idx": i})
ds = ds.sample(100)
ds = ds.slice(0, 50)

Configuration

Create cobalt.toml in your project root (or run cobalt init):

[judge]
model = "gpt-4o-mini"
provider = "openai"
# api_key = "sk-..."  # or set OPENAI_API_KEY env var

[experiment]
concurrency = 5
timeout = 30
test_dir = "./experiments"
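The `concurrency` and `timeout` settings map onto the runner's asyncio internals (an asyncio.Semaphore cap plus a per-item timeout). A hedged sketch of that pattern - names here are illustrative, not Cobalt's actual code:

```python
import asyncio

async def run_all(items, runner, concurrency=5, timeout=30):
    """Run `runner(item)` over all items, at most `concurrency` at a time,
    cancelling any single item that exceeds `timeout` seconds."""
    sem = asyncio.Semaphore(concurrency)

    async def one(item):
        async with sem:
            try:
                return await asyncio.wait_for(runner(item), timeout)
            except asyncio.TimeoutError:
                return None  # a real runner would record a timeout error instead

    return await asyncio.gather(*(one(i) for i in items))

async def echo(item):
    await asyncio.sleep(0)  # stand-in for a real agent call
    return item

results = asyncio.run(run_all([1, 2, 3], echo, concurrency=2, timeout=1))
```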

Dashboard

pip install 'cobalt-ai[dashboard]'
cobalt ui
# Opens http://localhost:4000

The local dashboard provides:

  • Run history with colour-coded score pills
  • Per-run score distribution chart (avg / p95 / min per evaluator)
  • Item-level drill-down - input, output, evaluator reasons
  • Side-by-side run comparison

CLI

# Scaffold config + example experiment
cobalt init

# Run all *.cobalt.py files
cobalt run

# Run a specific file
cobalt run --file experiments/my-agent.cobalt.py

# CI mode โ€” exit 1 if thresholds violated
cobalt run --ci

# List recent runs
cobalt history --limit 20

# Compare two runs
cobalt compare <run-id-1> <run-id-2>

# Local web dashboard
cobalt ui --port 4000

# Delete all stored results
cobalt clean

CI Integration

from cobalt.types import ThresholdConfig, ThresholdMetric

thresholds = ThresholdConfig(
    evaluators={
        "exact-match": ThresholdMetric(avg=0.9, p95=0.7),
        "helpfulness":  ThresholdMetric(avg=0.8),
    }
)

report = await experiment(
    "my-agent", dataset, runner,
    evaluators=[...],
    thresholds=thresholds,
)
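Under the hood, CI mode reduces to comparing per-evaluator statistics against the declared thresholds. A rough sketch of the avg / p95 check (illustrative only; the nearest-rank percentile here is one possible definition, not necessarily the library's):

```python
def p95(scores):
    """95th percentile of the scores by nearest rank."""
    s = sorted(scores)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

def passes(scores, avg_min=None, p95_min=None):
    """True if the score list clears every declared threshold."""
    if avg_min is not None and sum(scores) / len(scores) < avg_min:
        return False
    if p95_min is not None and p95(scores) < p95_min:
        return False
    return True

# A run failing any threshold is what makes `cobalt run --ci` exit 1
ok = passes([1.0, 1.0, 0.8, 1.0], avg_min=0.9, p95_min=0.7)
```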
# .github/workflows/eval.yml
- name: Run evaluations
  run: cobalt run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Architecture

src/cobalt/
├── __init__.py          # Public API surface
├── types.py             # All dataclasses
├── config.py            # cobalt.toml loader
├── dataset.py           # Dataset class
├── evaluator.py         # Evaluator + registry
├── experiment.py        # Core runner
├── evaluators/
│   ├── function.py      # Custom function evaluator
│   ├── llm_judge.py     # LLM-judge evaluator
│   └── similarity.py    # TF-IDF cosine similarity
├── storage/
│   ├── db.py            # SQLite history
│   └── results.py       # JSON result files
├── utils/
│   ├── stats.py         # Descriptive statistics
│   ├── template.py      # {{variable}} rendering
│   └── cost.py          # Token cost estimation
└── cli/
    └── main.py          # cobalt CLI (Typer)
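The {{variable}} rendering in utils/template.py is what fills judge prompts with {{input}} and {{output}}. The substitution can be sketched in a few lines (an illustration, not the actual module):

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} with str(variables[name]); unknown names stay as-is."""
    def sub(m):
        name = m.group(1).strip()
        return str(variables[name]) if name in variables else m.group(0)
    return re.sub(r"\{\{(.*?)\}\}", sub, template)

rendered = render("Question: {{input}}\nResponse: {{output}}",
                  {"input": "What is 6 x 7?", "output": "42"})
```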

Development

# Install
pip install -e ".[dev]"

# Test
pytest tests/ -v

# Lint
ruff check src/ tests/

Relationship to TypeScript Cobalt

Feature                  TypeScript                                Python
Dataset loaders          ✅                                        ✅
LLM judge                ✅                                        ✅
Function evaluator       ✅                                        ✅
Similarity               ✅                                        ✅ (TF-IDF)
CLI                      ✅                                        ✅
History / compare        ✅                                        ✅
SQLite storage           ✅                                        ✅
CI thresholds            ✅                                        ✅
Local dashboard          ✅                                        ✅ (cobalt ui)
MCP integration          ✅                                        ✅ (cobalt mcp)
Platform integrations    Langfuse, Langsmith, Braintrust, Basalt   ✅ same

Python conventions used throughout: async/await, dataclasses, asyncio.Semaphore, typer, rich.


License

MIT - see LICENSE.

Built by Basalt AI.
