swecc-mesocosm

CLI and Python client for SWECC's benchmark and eval platform.

A mesocosm is a small, enclosed environment used for controlled experiments — which is exactly what this tool helps you build, register, and run evals against.

Install

pip install swecc-mesocosm
# or, with uv:
uv tool install swecc-mesocosm
# or, with pipx:
pipx install swecc-mesocosm

For local development against this monorepo:

pip install -e ./packages/swecc-mesocosm

Configure

The CLI reads MESOCOSM_BASE_URL from the environment (default: http://127.0.0.1:8010, matching BENCH_API_PORT in docker compose). You can also pass --base-url to any command.
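The lookup described above can be sketched in Python. This is a minimal illustration, not the CLI's actual code: the helper name `resolve_base_url` is hypothetical, and both the precedence of an explicit `--base-url` over the environment variable and the trailing-slash trimming are assumptions.

```python
import os

DEFAULT_BASE_URL = "http://127.0.0.1:8010"  # default stated in this README


def resolve_base_url(flag_value=None, environ=None):
    """Pick the bench-api base URL: --base-url value, then env var, then default.

    Hypothetical helper; the flag-over-env precedence is assumed, not documented.
    """
    env = os.environ if environ is None else environ
    url = flag_value or env.get("MESOCOSM_BASE_URL") or DEFAULT_BASE_URL
    # Assumed convenience: trim a trailing slash so joining "/v1/..." stays clean.
    return url.rstrip("/")


print(resolve_base_url(environ={}))
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.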

Production:

export MESOCOSM_BASE_URL=https://api.swecc.org/bench
mesocosm doctor   # verify health + openapi

See infra/mesocosm.env.example in the monorepo root.

export MESOCOSM_BASE_URL=http://127.0.0.1:8010   # docker compose
# or
export MESOCOSM_BASE_URL=https://api.swecc.org/bench

Commands

mesocosm --help

# connectivity check (bench-api health + openapi)
mesocosm doctor
mesocosm doctor --base-url https://api.swecc.org/bench

# inference + validation (no network)
mesocosm suggest "Wordle clone where the agent gets 6 guesses."
mesocosm validate ./my-domain.json

# domain CRUD
mesocosm register --id my-bench --name "My Bench" --owner-id me \
  --description "Trivia about Python." --env-url https://envs.example.com/mybench
mesocosm publish my-bench
mesocosm get my-bench --artifacts
mesocosm list --status published

# evals
mesocosm eval test --domain-id my-bench --vow-version 1.0.0 --model openai/gpt-4o-mini
mesocosm eval run  --domain-id my-bench --vow-version 1.0.0 --model openai/gpt-4o-mini \
  --num-episodes 20 --seed-set '[1,2,3]'

# results
mesocosm run get <run-id>
mesocosm run episodes <run-id> --traces

All commands print JSON to stdout (pretty-printed when stdout is a TTY, compact otherwise), so they pipe cleanly into jq:

mesocosm list --status published | jq '.[].id'
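For readers who post-process the output in Python rather than jq, the sketch below shows the same idea. It is illustrative only: `render_json` and `domain_ids` are hypothetical names, the indent width and separators are guesses at what "pretty" and "compact" mean here, and the sample data is made up. It assumes `mesocosm list` prints a JSON array of objects with an `id` key, as the jq filter above implies.

```python
import json
import sys


def render_json(value, is_tty=None):
    """Pretty-print for terminals, compact for pipes (formatting details assumed)."""
    if is_tty is None:
        is_tty = sys.stdout.isatty()
    if is_tty:
        return json.dumps(value, indent=2)
    return json.dumps(value, separators=(",", ":"))


def domain_ids(cli_output):
    """Rough Python counterpart of `jq '.[].id'` applied to `mesocosm list` output."""
    return [item["id"] for item in json.loads(cli_output)]


sample = [{"id": "my-bench"}, {"id": "other-bench"}]  # made-up data
compact = render_json(sample, is_tty=False)
print(compact)
print(domain_ids(compact))
```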

Local vs bench-api commands

Local means the CLI makes no HTTP calls to bench-api at MESOCOSM_BASE_URL (nothing under /v1/...). The distinction is about network access, not about whether an LLM is involved: model calls happen on the server when you use eval commands.

Bench-api means the command needs a reachable bench-api (MESOCOSM_BASE_URL or --base-url on the command).

Local (no bench-api)

Command                          What it does
──────────────────────────────   ─────────────────────────────────────
mesocosm --version / -V          Print the installed package version.
mesocosm suggest <description>   Apply regex heuristics to your text and print JSON defaults (benchmark_kind, scoring_source, max_steps, primary_metric, reasoning, tags). Preview only; does not register.
mesocosm validate <path>         Check a domain JSON payload against the shipped policy/constraints.json (pass - as the path to read from stdin). Exits 0 if valid, 1 otherwise.

These work without bench-api running.

Bench-api (HTTP)

Command                          API                               What it does
──────────────────────────────   ───────────────────────────────   ─────────────────────────────────────
mesocosm register                POST /v1/domains (409 → PATCH)    Build or load a payload, optionally run local validate, then upsert a draft domain.
mesocosm publish <id>            POST /v1/domains/{id}/publish     Publish a domain; print artifact SHA-256 digests.
mesocosm get <id>                GET /v1/domains/{id}              Fetch a domain; --artifacts adds synthesized contract files locally.
mesocosm list                    GET /v1/domains                   List domains (--status, --json for raw output).
mesocosm eval test               POST /v1/test/episode             One test episode (model + env on the server).
mesocosm eval run                GET domain + POST /v1/runs        Full eval run with aggregated scores.
mesocosm run get <run-id>        GET /v1/runs/{id} (+ episodes)    Run status and aggregate scores.
mesocosm run episodes <run-id>   GET /v1/runs/{id}/episodes        Episode list; --traces fetches traces too.

register is hybrid: inference and validate run locally; the upsert step needs bench-api.
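The "409 → PATCH" upsert used by register can be sketched as follows. This is a hedged illustration of the control flow only, not the package's implementation: `upsert_domain`, the `send(method, path, body)` transport, and the in-memory fake API are all hypothetical, and the PATCH target `/v1/domains/{id}` is an assumption, since the README does not spell out that path.

```python
def upsert_domain(send, payload):
    """Create a draft domain, falling back to PATCH if the id already exists.

    `send(method, path, body)` is a hypothetical transport that returns
    (status_code, parsed_json). The PATCH path is an assumption.
    """
    status, body = send("POST", "/v1/domains", payload)
    if status == 409:  # conflict: a domain with this id is already registered
        status, body = send("PATCH", f"/v1/domains/{payload['id']}", payload)
    if status >= 400:
        raise RuntimeError(f"upsert failed with HTTP {status}: {body}")
    return body


def make_fake_api():
    """In-memory stand-in for bench-api, used only to exercise the flow."""
    store = {}

    def send(method, path, body):
        if method == "POST":
            if body["id"] in store:
                return 409, {"detail": "exists"}
            store[body["id"]] = dict(body)
            return 201, dict(store[body["id"]])
        store[body["id"]].update(body)  # PATCH: merge fields into the draft
        return 200, dict(store[body["id"]])

    return send


send = make_fake_api()
first = upsert_domain(send, {"id": "my-bench", "name": "My Bench"})
second = upsert_domain(send, {"id": "my-bench", "name": "My Bench v2"})
print(first["name"], "->", second["name"])
```

Injecting the transport keeps the retry logic separate from any particular HTTP library, so it can be checked without a running bench-api.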

LOCAL                          BENCH-API
────────────────────────────   ─────────────────────────────────────
mesocosm --version             mesocosm register
mesocosm suggest "<desc>"      mesocosm publish <id>
mesocosm validate <file>       mesocosm get <id> [--artifacts]
                               mesocosm list [--status ...] [--json]
                               mesocosm eval test ...
                               mesocosm eval run ...
                               mesocosm run get <run-id>
                               mesocosm run episodes <run-id> [--traces]

Python client

import asyncio
from swecc_mesocosm import BenchClient

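async def main():
    # Point the client at bench-api; 8010 is the default port used by docker compose.
    c = BenchClient(base_url="http://127.0.0.1:8010")
    try:
        domains = await c.list_domains(published_only=True)
        print(len(domains), "published")
    finally:
        # Always close the client so its connections are released.
        await c.aclose()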

asyncio.run(main())

Policy / constraints

mesocosm validate reads swecc_mesocosm/policy/constraints.json shipped with the package — required register fields, allowed model prefixes, etc. Edit that file (or fork the package) to tune for your event.
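To make the kind of check concrete, here is a minimal sketch of constraint-based validation. Everything in it is assumed for illustration: the keys `required_fields` and `allowed_model_prefixes`, the payload field names, the `model` key, and the function `validate_domain` are hypothetical and may not match the real constraints.json schema or the package's code.

```python
import json

# Hypothetical constraints; the real keys in policy/constraints.json may differ.
EXAMPLE_CONSTRAINTS = {
    "required_fields": ["id", "name", "owner_id", "description", "env_url"],
    "allowed_model_prefixes": ["openai/"],
}


def validate_domain(payload, constraints):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in constraints.get("required_fields", []):
        if not payload.get(field):
            problems.append(f"missing required field: {field}")
    model = payload.get("model")  # assumed optional key
    prefixes = constraints.get("allowed_model_prefixes", [])
    if model and prefixes and not model.startswith(tuple(prefixes)):
        problems.append(f"model not allowed: {model}")
    return problems


payload = json.loads('{"id": "my-bench", "name": "My Bench", "owner_id": "me"}')
problems = validate_domain(payload, EXAMPLE_CONSTRAINTS)
exit_code = 0 if not problems else 1  # mirrors the documented 0/1 exit status
print(exit_code, problems)
```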
