
Lightweight evaluation platform for LLM experiments


Themis

Benchmark-first orchestration for reproducible LLM evaluation.

Python 3.12+ · MIT License

Themis now documents and supports one public authoring flow:

  • ProjectSpec for shared storage and execution policy
  • BenchmarkSpec for benchmark slices, prompt variants, parse pipelines, scores, and agent-style prompt flows
  • PluginRegistry for engines, parsers, metrics, judges, and hooks
  • Orchestrator for planning, execution, handoffs, and imports
  • BenchmarkResult for aggregation, paired comparison, artifact bundles, and timelines
  • generate_config_report(...) for reproducibility snapshots
  • themis-quickcheck for fast SQLite inspection by slice and benchmark dimension

Why Themis

  • Benchmark-native authoring instead of experiment-matrix bookkeeping
  • Query-aware dataset providers for subset, filter, and pushdown sampling
  • Explicit prompt variants and parse pipelines instead of payload hacks
  • Bootstrap prompt sequences, scripted follow-up turns, and first-class tool passing for agent-capable engines
  • OpenAI-hosted MCP server support for remote tools during evaluation runs
  • Projection-backed results with slice_id, prompt_variant_id, and semantic dimensions
  • Local-first storage and deterministic reuse of completed work
  • Seed-aware planning and per-candidate deterministic execution defaults

Installation

uv add themis-eval

Add extras only when needed:

  • stats for paired comparisons and richer report tooling
  • compression for compressed artifact storage
  • extractors for additional built-in parsing helpers
  • math for math-equivalence scoring via math-verify
  • datasets for dataset integrations
  • providers-openai, providers-litellm, providers-vllm for provider SDKs
  • telemetry for external observability callbacks
  • storage-postgres for Postgres-backed storage
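
Extras use the standard extras syntax and can be combined in a single install. For example, to pull in the paired-comparison tooling and math-equivalence scoring listed above (the extra names are as documented; pick whichever you need):

```shell
# Core package plus the stats and math extras
uv add "themis-eval[stats,math]"

# Equivalent with pip
pip install "themis-eval[stats,math]"
```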

Quick Start

Start with a zero-friction smoke evaluation:

themis quick-eval inline \
  --model demo-model \
  --provider demo \
  --input "2 + 2" \
  --expected "4" \
  --format json

That writes a SQLite store under:

.cache/themis/quick-eval/inline-demo-model-exact-match/themis.sqlite3
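
themis-quickcheck is the supported inspection tool, but the store is a plain SQLite file, so you can also peek at it directly. A minimal sketch, assuming only the path shown above (the store's schema is not documented here, so this just lists the tables present):

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


store = Path(".cache/themis/quick-eval/inline-demo-model-exact-match/themis.sqlite3")
if store.exists():
    print(list_tables(store))
```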

Initialize a real project scaffold when you want editable code and project files:

themis init starter-eval

Or start from a built-in benchmark definition:

themis quick-eval benchmark \
  --benchmark mmlu_pro \
  --model demo-model \
  --provider demo \
  --preview \
  --format json
themis init starter-mmlu --benchmark mmlu_pro

Math benchmarks are available as built-ins too:

themis quick-eval benchmark \
  --benchmark aime_2026 \
  --model demo-model \
  --provider demo \
  --preview \
  --format json

When you want the smallest code-first example, run the shipped hello-world script:

uv run python examples/01_hello_world.py

Expected output:

{'model_id': 'demo-model', 'slice_id': 'arithmetic', 'metric_id': 'exact_match', 'source': 'synthetic', 'prompt_variant_id': 'qa-default', 'mean': 1.0, 'count': 1}

That script shows the full benchmark-first loop:

  • define a DatasetProvider.scan(slice_spec, query)
  • register one engine and one metric
  • build a BenchmarkSpec
  • run orchestrator.run_benchmark(...)
  • inspect the returned BenchmarkResult

The complete script is embedded in docs/quick-start/index.md.
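
The shape of that loop can be sketched without the real API. The following is a self-contained illustration of the benchmark-first pattern only; the function names and toy engine are stand-ins, not themis identifiers:

```python
# Illustrative stand-ins for the benchmark-first loop; NOT the themis API.


def scan(slice_id):
    """Dataset provider: yield (input, expected) pairs for one slice."""
    if slice_id == "arithmetic":
        yield ("2 + 2", "4")


def engine(prompt):
    """Toy engine that answers arithmetic prompts; a real engine calls a model."""
    return str(eval(prompt))  # demo only


def exact_match(prediction, expected):
    """One metric: 1.0 on an exact string match, else 0.0."""
    return 1.0 if prediction == expected else 0.0


def run_benchmark(slice_id):
    """Generate, score, and aggregate one metric over one slice."""
    scores = [exact_match(engine(x), y) for x, y in scan(slice_id)]
    return {
        "slice_id": slice_id,
        "metric_id": "exact_match",
        "mean": sum(scores) / len(scores),
        "count": len(scores),
    }


print(run_benchmark("arithmetic"))
# {'slice_id': 'arithmetic', 'metric_id': 'exact_match', 'mean': 1.0, 'count': 1}
```

The real orchestrator adds planning, storage, and resume on top of this loop, but the data flow — scan, generate, parse, score, aggregate — is the same.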

Examples

Runnable examples live in examples/:

  • 01_hello_world.py
  • 02_project_file.py
  • 03_custom_extractor_metric.py
  • 04_compare_models.py
  • 05_resume_run.py
  • 06_hooks_and_timeline.py
  • 07_judge_metric.py
  • 08_external_stage_handoff.py
  • 09_experiment_evolution.py
  • 10_agent_eval.py
  • 11_quick_benchmark.py
  • 12_iter_and_estimate.py
  • 13_catalog_builtin_benchmark.py
  • 14_mcp_openai.py

10_agent_eval.py is the canonical advanced example for bootstrap prompts, follow-up turns, tool declaration and selection, and returned agent traces.

13_catalog_builtin_benchmark.py is the catalog-specific example for running a shipped builtin benchmark through themis.catalog.build_catalog_benchmark_project(...) with a local fixture dataset loader.

14_mcp_openai.py shows the OpenAI-first MCP path for exposing a remote MCP server to a benchmark run without using local ToolSpec handlers.

To list all shipped built-in benchmark IDs from Python:

from themis.catalog import list_catalog_benchmarks

print(list_catalog_benchmarks())

The canonical benchmark list and Python usage notes live in docs/guides/builtin-benchmarks.md.

examples/medical_reasoning_eval is intentionally left untouched as a handoff reference. It is not the recommended public authoring pattern after the benchmark-first redesign.

Documentation

Guides live under docs/ in this repository (for example docs/quick-start/index.md and docs/guides/builtin-benchmarks.md) and are built with mkdocs.
Development

uv sync --all-extras --dev
uv run pytest
uv run mkdocs build --strict
uv run ruff check

Contributing

See CONTRIBUTING.md.

Citation

If you use Themis in research, cite via CITATION.cff.

License

MIT. See LICENSE.

Download files

Download the file for your platform.

Source Distribution

themis_eval-3.1.0.tar.gz (235.8 kB)

Built Distribution


themis_eval-3.1.0-py3-none-any.whl (299.9 kB)

File details

Details for the file themis_eval-3.1.0.tar.gz.

File metadata

  • Download URL: themis_eval-3.1.0.tar.gz
  • Size: 235.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for themis_eval-3.1.0.tar.gz:

  • SHA256: 1dc0cd2801b029e820a0bb8ce3cd6aedb6062d6839b5f8bdbf6a5b8e60ef883d
  • MD5: a3483bac7c6d5e96d34ba64aeec9b312
  • BLAKE2b-256: 23ef3b75e5ebee396a9d438d6a9748348bde64d9ee59d08fc453dec10af95ce2


Provenance

The following attestation bundles were made for themis_eval-3.1.0.tar.gz:

Publisher: pypi.yaml on Pittawat2542/themis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file themis_eval-3.1.0-py3-none-any.whl.

File metadata

  • Download URL: themis_eval-3.1.0-py3-none-any.whl
  • Size: 299.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for themis_eval-3.1.0-py3-none-any.whl:

  • SHA256: bb8fceccf5142ab0f68a634c4d4d572fc7576aa40f3c29c8bc14b674d15adead
  • MD5: d786a040713905f8749d1d9509dbd104
  • BLAKE2b-256: f4185fb3bf48c76611595c8f813dae06833fabcbde770e869bb2aa7234e831bb


Provenance

The following attestation bundles were made for themis_eval-3.1.0-py3-none-any.whl:

Publisher: pypi.yaml on Pittawat2542/themis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
