Touchstone — the AI Adoption Lifecycle platform. AI with receipts: forge it in the Crucible, ship what proves.
Touchstone
Touchstone — AI with receipts. Forge it in the Crucible. Ship what proves.
Touchstone is the umbrella AI-certification platform. Crucible is its HELM-style benchmark engine: multi-leaderboard AI benchmarking for enterprise — legacy modernization, healthcare AI, frontier safety, agentic tool-use, code patch generation, and per-vertical / per-regulation evaluation, all under one FastAPI service with shared persistence, an HMAC-chained audit ledger, evidence-bundle export, and an Inspect-AI-compatible eval-log exporter.
40 leaderboards. 19+ metrics. 780+ tests green. Last sync: 2026-05-17 (Wave 9 Track A/B).
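To make the "HMAC-chained audit ledger" idea concrete, here is a minimal sketch of how such a chain can be built and verified. It is not Touchstone's actual ledger format: the entry fields (`prev`, `payload`, `mac`), the all-zero genesis value, and the function names are assumptions chosen for illustration. Only the general technique (each MAC covers the previous MAC plus the current payload) is taken from the description above.

```python
import hashlib
import hmac
import json

# Assumed placeholder standing in for the "previous MAC" of the first entry.
GENESIS = "0" * 64


def _mac(secret: bytes, prev_mac: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus the canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, (prev_mac + body).encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(ledger: list, payload: dict, secret: bytes) -> dict:
    """Append a payload, linking it to the MAC of the entry before it."""
    prev_mac = ledger[-1]["mac"] if ledger else GENESIS
    entry = {"prev": prev_mac, "payload": payload, "mac": _mac(secret, prev_mac, payload)}
    ledger.append(entry)
    return entry


def verify_chain(ledger: list, secret: bytes) -> bool:
    """Recompute every MAC in order.

    Detects edited, reordered, or mid-chain-deleted entries. Truncation of the
    tail is not detected by this sketch.
    """
    prev_mac = GENESIS
    for entry in ledger:
        expected = _mac(secret, prev_mac, entry["payload"])
        if entry["prev"] != prev_mac or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev_mac = entry["mac"]
    return True
```

In a JSONL ledger each entry would be one `json.dumps(entry)` line, and verification would stream the file top to bottom with the same loop.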
What you can do today
Source: `docs/diagrams/request-flow.mmd`. Render via `npx -y @mermaid-js/mermaid-cli -i docs/diagrams/request-flow.mmd -o request-flow.svg`.
flowchart LR
User["User / CI / dashboard"] --> API["FastAPI<br/>app/main.py"]
API --> Core["Crucible<br/>HELM-style eval-core<br/>app/eval/"]
API --> Legacy["Legacy<br/>GymSpec runtime<br/>app/engine.py"]
subgraph Boards["40 leaderboards (auto-discovered)"]
Modern["Modernization × 5<br/>classic / evidence / robustness<br/>agentic v1.2 / safety"]
SWE["SWE-Bench Verified<br/>(struct + Docker exec)"]
Med["MedHELM (10 RunSpecs)"]
Vert["Per-vertical × 17<br/>HCLS · retail · fin-svc<br/>hi-tech · industry · edu-K12"]
Comp["Compliance × 9<br/>HIPAA · PCI-DSS · SOX · GDPR<br/>CMS · HTI-1 · 3 state AI<br/>OWASP × 2 · MITRE ATLAS"]
Safe["Frontier safety × 2<br/>WMDP · medical_safety<br/>+ agent_threat_safety"]
end
Core --> Boards
Boards --> Snap["Postgres :25432<br/>leaderboard_snapshots"]
Boards --> Traces["traces/<run_id>/*.json"]
Boards --> Ledger["audit_ledger.jsonl<br/>(HMAC chained)"]
Snap --> Bundle["Evidence bundle ZIP<br/>/evidence_bundle"]
Traces --> Bundle
Ledger --> Bundle
Snap --> Inspect["Inspect AI EvalLog v2<br/>/inspect_log"]
Traces --> Inspect
- Run any of the 40 leaderboards against any model with hash-keyed request caching
- Stream HuggingFace datasets with `access_tier: public | gated | private` (set `HF_TOKEN` for gated, `MODELGYM_HF_OFFLINE=1` for airgap / CI)
- Score with 19+ metrics across 7 groups (core, evidence, HCLS, compliance, agentic, safety, code)
- Export an Inspect-AI-compatible EvalLog v2 JSON for any run
- Export an evidence bundle (snapshot + traces + audit slice + scenario fingerprints) per run
- Drive the legacy GymSpec onboarding/certification path on the same store
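The "hash-keyed request caching" in the list above can be pictured with a small sketch. This is an illustration under assumptions, not the project's cache implementation: the table layout, the `RequestCache` class, and the choice of SHA-256 over canonical JSON are all hypothetical. The only detail taken from this README is that the cache is keyed by a hash of the request and stored in SQLite.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def request_key(model: str, prompt: str, params: dict) -> str:
    """Stable SHA-256 digest of the request; key order in params does not matter."""
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RequestCache:
    """Hypothetical SQLite-backed cache: one row per distinct request hash."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get_or_call(self, model: str, prompt: str, params: dict, call: Callable[[], str]) -> str:
        """Return the cached response, invoking the model only on a miss."""
        key = request_key(model, prompt, params)
        row = self._db.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return row[0]
        response = call()
        self._db.execute("INSERT INTO cache (key, response) VALUES (?, ?)", (key, response))
        self._db.commit()
        return response
```

Because the key depends only on the request contents, re-running a leaderboard with identical prompts and parameters would reuse stored responses instead of calling the model again.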
Leaderboard catalog
| Category | ID prefix | Boards | Headline metric |
|---|---|---|---|
| Modernization (core) | `modernization_*` | 5 | varies per board |
| · classic | `modernization_classic` | 1 | exact_match + judge |
| · evidence | `modernization_evidence` | 1 | citation_f1 + hallucination_rate |
| · robustness | `modernization_robustness` | 1 | robustness_delta |
| · agentic | `modernization_agentic` v1.2 | 1 | step_success_rate + action_set_jaccard + task_completed_partial (TAU-Bench live mocks) |
| · safety | `modernization_safety` | 1 | wmdp_safety (inverted) |
| SWE-Bench Verified | `swe_bench_verified` v1.1 | 1 | patch_files_match_gold + swe_bench_resolved (opt-in Docker) |
| MedHELM | `medhelm_modernization` | 1 | jury_score + bertscore |
| Vertical | `vertical_*` | 17 | per-vertical jury_score |
| Compliance | `compliance_*` | 9 | control_match (HIPAA, PCI-DSS, SOX, GDPR, CMS-Interop, HTI-1, NYC LL 144, CA AI laws, IL AI laws + OWASP LLM Top 10, OWASP Agentic, MITRE ATLAS) |
| Frontier safety | `modernization_safety`, `medical_safety` | 2 | wmdp_safety, red_team_safety (inverted) |
| Agent threat | `agent_threat_safety` | 1 | atr_safety (ATR rule pattern coverage) |
For the full registry: uv run python scripts/export_registry.py.
Quick start
# install (uv is the toolchain)
uv sync
# start Postgres on 25432 + apply migrations
docker compose up -d db
uv run alembic upgrade head
# run the full eval test suite (offline by default)
MODELGYM_HF_OFFLINE=1 uv run pytest tests/eval/ -q
# boot the API + dashboard
uv run uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/leaderboards.html
# → http://localhost:8000/dashboard (legacy GymSpec UI)
Optional live data
# pull real HF datasets (MedHELM, WMDP, SWE-Bench Verified, etc.)
unset MODELGYM_HF_OFFLINE
export HF_TOKEN=hf_xxx # only needed for gated-tier scenarios
uv run pytest tests/eval/test_medhelm_real_data.py -q
Install
# Canonical package name (formerly model-gym):
pip install touchstone-platform
# Docker image:
docker pull ghcr.io/yadavilli-solutions/touchstone
See MIGRATION_TO_TOUCHSTONE.md if you are upgrading from model-gym.
Environment variables
| Var | New name (`TOUCHSTONE_*`) | Purpose | Required in prod |
|---|---|---|---|
| `DATABASE_URL` | (none) | Postgres DSN (default: `:25432/modelgym`) | yes |
| `AUDIT_HMAC_SECRET` | (none) | HMAC key for audit ledger chain | yes |
| `MODELGYM_AUDIT_LEDGER` (deprecated; use `TOUCHSTONE_AUDIT_LEDGER`) | `TOUCHSTONE_AUDIT_LEDGER` | Path to append-only ledger JSONL | yes |
| `MODELGYM_TRACE_ROOT` (deprecated; use `TOUCHSTONE_TRACE_ROOT`) | `TOUCHSTONE_TRACE_ROOT` | Root dir for per-instance trace JSON | yes |
| `MODELGYM_REQUEST_CACHE_PATH` (deprecated; use `TOUCHSTONE_REQUEST_CACHE_PATH`) | `TOUCHSTONE_REQUEST_CACHE_PATH` | SQLite path for hash-keyed model-call cache | yes |
| `MODELGYM_HF_OFFLINE` (deprecated; use `TOUCHSTONE_HF_OFFLINE`) | `TOUCHSTONE_HF_OFFLINE` | `1` to forbid HF network calls during tests/airgap | no |
| `HF_TOKEN` | (none) | HuggingFace token; required for `access_tier=gated` | when gated tier is used |
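Since several variables exist under both a deprecated `MODELGYM_*` name and a `TOUCHSTONE_*` name, a reader may wonder how the two interact. The sketch below shows one plausible resolution rule. The helper `read_setting` is hypothetical, and the precedence it applies (new name wins, old name accepted with a warning) is an assumption, not behavior confirmed by this README.

```python
import os
import warnings
from typing import Mapping, Optional


def read_setting(suffix: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return TOUCHSTONE_<suffix>, falling back to the deprecated MODELGYM_<suffix>.

    Assumed precedence: the new name wins when both are set.
    """
    env = os.environ if env is None else env
    new_name = f"TOUCHSTONE_{suffix}"
    old_name = f"MODELGYM_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        warnings.warn(
            f"{old_name} is deprecated; use {new_name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[old_name]
    return None


# Example: resolve the offline flag from the current process environment.
hf_offline = read_setting("HF_OFFLINE") == "1"
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to test.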
Adding a new benchmark
- Drop a scenario class in `app/eval/specs/<your_thing>.py` (subclass `HuggingFaceDatasetScenario` for HF, or implement `Scenario` directly).
- Drop a leaderboard registration in `app/eval/leaderboards/<your_thing>.py` (calls `REGISTRY.register(Leaderboard(...))` at import time).
- FastAPI auto-discovers it on next boot. Add a test under `tests/eval/test_<your_thing>.py`.
For a worked example, see app/eval/specs/wmdp_scenarios.py +
app/eval/leaderboards/modernization_safety.py + tests/eval/test_wmdp.py.
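The register-at-import-time pattern described in the steps above can be sketched in isolation. Everything in this block is a stand-in: the README does not show the real `Leaderboard` fields or the `REGISTRY` implementation, so the `id` and `headline_metric` attributes, the `Registry` class, and the `my_benchmark` entry are hypothetical and exist only to show the shape of the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaderboard:
    # Stand-in fields; the real Leaderboard signature is not shown in this README.
    id: str
    headline_metric: str


class Registry:
    """Hypothetical registry keyed by leaderboard id."""

    def __init__(self) -> None:
        self._boards = {}

    def register(self, board: Leaderboard) -> Leaderboard:
        """Add a board, rejecting duplicate ids so two modules cannot collide."""
        if board.id in self._boards:
            raise ValueError(f"duplicate leaderboard id: {board.id}")
        self._boards[board.id] = board
        return board

    def ids(self) -> list:
        return sorted(self._boards)


REGISTRY = Registry()

# In the layout described above, a call like this would sit at module top level in
# app/eval/leaderboards/<your_thing>.py, so that importing the module registers it.
REGISTRY.register(Leaderboard(id="my_benchmark", headline_metric="exact_match"))
```

Because registration happens as a side effect of import, discovery reduces to importing every module in the leaderboards package at startup.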
Documentation map
- `architecture.md`: full architectural truth (read this first when extending)
- `ROADMAP.md`: what shipped, what's next
- `MIGRATION_TO_TOUCHSTONE.md`: rename guide
- `docs/WAVE_LOG.md`: Plans 1-7 vs ship reality + Waves 1-5
- `docs/superpowers/plans/`: original written plans (superseded; WAVE_LOG is canonical)
- `PATH_TO_PROD.md`: known prod-readiness boundaries
Boundaries (not yet production-safe)
- `X-Touchstone-Role` header (formerly `X-Gym-Role`) is dev-mode only. Set `MODELGYM_ENV=production` (or `TOUCHSTONE_ENV=production`) to fail-close; Keycloak is then required (or `MODELGYM_ALLOW_DEV_AUTH=1` / `TOUCHSTONE_ALLOW_DEV_AUTH=1` for airgap deploys that issue their own local JWTs).
- SWE-Bench Docker execution (FAIL_TO_PASS / PASS_TO_PASS) is opt-in via `MODELGYM_SWE_BENCH_EXEC=1` (or `TOUCHSTONE_SWE_BENCH_EXEC=1`) + a docker daemon + the upstream `swebench/sweb.eval.*` per-instance images. Without it, scoring is structural-only.
- WMDP scoring remains structural (frontier-safety MCQ). TAU-Bench has live retail + airline mocks (Waves 7.4 + 7.5); 4 of 12 vendored tasks have `gold_final_state` for `TaskCompletionMetric`, and the remaining authoring is rolling content work.
- BERTScore falls back to `rouge_L` when the `bert_score` package isn't installed (set `MODELGYM_BERTSCORE_REQUIRE_REAL=1` / `TOUCHSTONE_BERTSCORE_REQUIRE_REAL=1` to fail-close).
- Inspect AI export emits both `.json` and a real in-tree `.eval` zipfile (Wave 7.1); no `inspect-ai` dep required.
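The BERTScore boundary above describes a fallback that can be turned into a hard failure. A minimal sketch of that decision, assuming a hypothetical helper `pick_bertscore_backend` and assuming the flag is read as the literal string `1`, might look like this. The package name `bert_score`, the fallback metric `rouge_L`, and the two environment variable names come from the text above; the rest is illustrative.

```python
import importlib.util
from typing import Mapping, Optional


def pick_bertscore_backend(
    env: Mapping[str, str],
    bert_score_installed: Optional[bool] = None,
) -> str:
    """Choose the scoring backend, failing closed when a real BERTScore is required."""
    if bert_score_installed is None:
        # find_spec returns None when a top-level package cannot be found.
        bert_score_installed = importlib.util.find_spec("bert_score") is not None

    require_real = (
        env.get("TOUCHSTONE_BERTSCORE_REQUIRE_REAL") == "1"
        or env.get("MODELGYM_BERTSCORE_REQUIRE_REAL") == "1"
    )

    if bert_score_installed:
        return "bert_score"
    if require_real:
        raise RuntimeError("bert_score is not installed and a real BERTScore is required")
    return "rouge_L"
```

The explicit `bert_score_installed` argument exists only so both branches can be exercised without installing or uninstalling the package.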
License
See LICENSE.md.