# Themis
Benchmark-first orchestration for reproducible LLM evaluation.
Themis now documents and supports one public authoring flow:
- `ProjectSpec` for shared storage and execution policy
- `BenchmarkSpec` for benchmark slices, prompt variants, parse pipelines, scores, and agent-style prompt flows
- `PluginRegistry` for engines, parsers, metrics, judges, and hooks
- `Orchestrator` for planning, execution, handoffs, and imports
- `BenchmarkResult` for aggregation, paired comparison, artifact bundles, and timelines
- `generate_config_report(...)` for reproducibility snapshots
- `themis-quickcheck` for fast SQLite inspection by slice and benchmark dimension
## Why Themis
- Benchmark-native authoring instead of experiment-matrix bookkeeping
- Query-aware dataset providers for subset, filter, and pushdown sampling
- Explicit prompt variants and parse pipelines instead of payload hacks
- Bootstrap prompt sequences, scripted follow-up turns, and first-class tool passing for agent-capable engines
- OpenAI-hosted MCP server support for remote tools during evaluation runs
- Projection-backed results with `slice_id`, `prompt_variant_id`, and semantic dimensions
- Local-first storage and deterministic reuse of completed work
- Seed-aware planning and per-candidate deterministic execution defaults
## Installation

```shell
uv add themis-eval
```

Add extras only when needed:

- `stats` for paired comparisons and richer report tooling
- `compression` for compressed artifact storage
- `extractors` for additional built-in parsing helpers
- `math` for math-equivalence scoring via `math-verify`
- `datasets` for dataset integrations
- `providers-openai`, `providers-litellm`, `providers-vllm` for provider SDKs
- `telemetry` for external observability callbacks
- `storage-postgres` for Postgres-backed storage
## Quick Start

Start with a zero-friction smoke evaluation:

```shell
themis quick-eval inline \
  --model demo-model \
  --provider demo \
  --input "2 + 2" \
  --expected "4" \
  --format json
```

That writes a SQLite store under:

```
.cache/themis/quick-eval/inline-demo-model-exact-match/themis.sqlite3
```
Initialize a real project scaffold when you want editable code and project files:

```shell
themis init starter-eval
```

Or start from a built-in benchmark definition:

```shell
themis quick-eval benchmark \
  --benchmark mmlu_pro \
  --model demo-model \
  --provider demo \
  --preview \
  --format json

themis init starter-mmlu --benchmark mmlu_pro
```

Math benchmarks are available as built-ins too:

```shell
themis quick-eval benchmark \
  --benchmark aime_2026 \
  --model demo-model \
  --provider demo \
  --preview \
  --format json
```

Then run the shipped hello-world benchmark when you want the smallest code-first example:

```shell
uv run python examples/01_hello_world.py
```
Expected output:

```
{'model_id': 'demo-model', 'slice_id': 'arithmetic', 'metric_id': 'exact_match', 'source': 'synthetic', 'prompt_variant_id': 'qa-default', 'mean': 1.0, 'count': 1}
```
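Each summary record is printed as a plain Python dict literal, so a quick script can pick fields out of it with the standard library alone. This is only a convenience sketch; consuming the returned `BenchmarkResult` object directly is the first-class path:

```python
import ast

# One summary line as printed by examples/01_hello_world.py.
line = (
    "{'model_id': 'demo-model', 'slice_id': 'arithmetic', "
    "'metric_id': 'exact_match', 'source': 'synthetic', "
    "'prompt_variant_id': 'qa-default', 'mean': 1.0, 'count': 1}"
)

record = ast.literal_eval(line)  # safe: only evaluates Python literals
print(record["metric_id"], record["mean"])  # exact_match 1.0
```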
That script shows the full benchmark-first loop:
- define a `DatasetProvider.scan(slice_spec, query)`
- register one engine and one metric
- build a `BenchmarkSpec`
- run `orchestrator.run_benchmark(...)`
- inspect the returned `BenchmarkResult`
The complete script is embedded in docs/quick-start/index.md.
## Examples
Runnable examples live in examples/:
- `01_hello_world.py`
- `02_project_file.py`
- `03_custom_extractor_metric.py`
- `04_compare_models.py`
- `05_resume_run.py`
- `06_hooks_and_timeline.py`
- `07_judge_metric.py`
- `08_external_stage_handoff.py`
- `09_experiment_evolution.py`
- `10_agent_eval.py`
- `11_quick_benchmark.py`
- `12_iter_and_estimate.py`
- `13_catalog_builtin_benchmark.py`
- `14_mcp_openai.py`
`10_agent_eval.py` is the canonical advanced example for bootstrap prompts,
follow-up turns, tool declaration and selection, and returned agent traces.
`13_catalog_builtin_benchmark.py` is the catalog-specific example for running a
shipped builtin benchmark through `themis.catalog.build_catalog_benchmark_project(...)`
with a local fixture dataset loader.
`14_mcp_openai.py` shows the OpenAI-first MCP path for exposing a remote MCP
server to a benchmark run without using local `ToolSpec` handlers.
To discover all shipped builtin benchmark ids from Python, use:
```python
from themis.catalog import list_catalog_benchmarks

print(list_catalog_benchmarks())
```
The canonical benchmark list and Python usage notes live in docs/guides/builtin-benchmarks.md.
`examples/medical_reasoning_eval` is intentionally left untouched as a handoff
reference. It is not the recommended public authoring pattern after the
benchmark-first redesign.
## Documentation
- Docs site: https://pittawat2542.github.io/themis/
- Quick Start: docs/quick-start/index.md
- Tutorials: docs/tutorials/index.md
- Concepts: docs/concepts/index.md
- Guides: docs/guides/index.md
- API Reference: docs/api-reference/index.md
- FAQ: docs/faq/index.md
## Development

```shell
uv sync --all-extras --dev
uv run pytest
uv run mkdocs build --strict
uv run ruff check
```
## Contributing
See CONTRIBUTING.md.
## Citation
If you use Themis in research, cite via CITATION.cff.
## License
MIT. See LICENSE.
## File details: themis_eval-3.1.0.tar.gz

- Download URL: themis_eval-3.1.0.tar.gz
- Size: 235.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1dc0cd2801b029e820a0bb8ce3cd6aedb6062d6839b5f8bdbf6a5b8e60ef883d` |
| MD5 | `a3483bac7c6d5e96d34ba64aeec9b312` |
| BLAKE2b-256 | `23ef3b75e5ebee396a9d438d6a9748348bde64d9ee59d08fc453dec10af95ce2` |
### Provenance

The following attestation bundles were made for themis_eval-3.1.0.tar.gz:

- Publisher: pypi.yaml on Pittawat2542/themis
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: themis_eval-3.1.0.tar.gz
- Subject digest: 1dc0cd2801b029e820a0bb8ce3cd6aedb6062d6839b5f8bdbf6a5b8e60ef883d
- Sigstore transparency entry: 1159959264
- Permalink: Pittawat2542/themis@461b6c09ce007b60f5950433104b80310b7d4fa6
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/Pittawat2542
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@461b6c09ce007b60f5950433104b80310b7d4fa6
- Trigger Event: push
## File details: themis_eval-3.1.0-py3-none-any.whl

- Download URL: themis_eval-3.1.0-py3-none-any.whl
- Size: 299.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bb8fceccf5142ab0f68a634c4d4d572fc7576aa40f3c29c8bc14b674d15adead` |
| MD5 | `d786a040713905f8749d1d9509dbd104` |
| BLAKE2b-256 | `f4185fb3bf48c76611595c8f813dae06833fabcbde770e869bb2aa7234e831bb` |
### Provenance

The following attestation bundles were made for themis_eval-3.1.0-py3-none-any.whl:

- Publisher: pypi.yaml on Pittawat2542/themis
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: themis_eval-3.1.0-py3-none-any.whl
- Subject digest: bb8fceccf5142ab0f68a634c4d4d572fc7576aa40f3c29c8bc14b674d15adead
- Sigstore transparency entry: 1159959357
- Permalink: Pittawat2542/themis@461b6c09ce007b60f5950433104b80310b7d4fa6
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/Pittawat2542
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@461b6c09ce007b60f5950433104b80310b7d4fa6
- Trigger Event: push