Touchstone — the AI Adoption Lifecycle platform. AI with receipts: forge it in the Crucible, ship what proves.

These details have not been verified by PyPI

Project description

Touchstone

Touchstone — AI with receipts. Forge it in the Crucible. Ship what proves.

Touchstone is the umbrella AI-certification platform. Crucible is its HELM-style benchmark engine, providing multi-leaderboard AI benchmarking for enterprise use cases: legacy modernization, healthcare AI, frontier safety, agentic tool-use, code patch generation, and per-vertical / per-regulation evaluation. Everything runs under one FastAPI service with shared persistence, an HMAC-chained audit ledger, evidence-bundle export, and an Inspect-AI-compatible eval-log exporter.

40 leaderboards. 19+ metrics. 780+ tests green. Last sync: 2026-05-17 (Wave 9 Track A/B).

What you can do today

Source: docs/diagrams/request-flow.mmd — render via npx -y @mermaid-js/mermaid-cli -i docs/diagrams/request-flow.mmd -o request-flow.svg

```mermaid
flowchart LR
    User["User / CI / dashboard"] --> API["FastAPI<br/>app/main.py"]

    API --> Core["Crucible<br/>HELM-style eval-core<br/>app/eval/"]
    API --> Legacy["Legacy<br/>GymSpec runtime<br/>app/engine.py"]

    subgraph Boards["40 leaderboards (auto-discovered)"]
      Modern["Modernization × 5<br/>classic / evidence / robustness<br/>agentic v1.2 / safety"]
      SWE["SWE-Bench Verified<br/>(struct + Docker exec)"]
      Med["MedHELM (10 RunSpecs)"]
      Vert["Per-vertical × 17<br/>HCLS · retail · fin-svc<br/>hi-tech · industry · edu-K12"]
      Comp["Compliance × 9<br/>HIPAA · PCI-DSS · SOX · GDPR<br/>CMS · HTI-1 · 3 state AI<br/>OWASP × 2 · MITRE ATLAS"]
      Safe["Frontier safety × 2<br/>WMDP · medical_safety<br/>+ agent_threat_safety"]
    end

    Core --> Boards

    Boards --> Snap["Postgres :25432<br/>leaderboard_snapshots"]
    Boards --> Traces["traces/<run_id>/*.json"]
    Boards --> Ledger["audit_ledger.jsonl<br/>(HMAC chained)"]

    Snap --> Bundle["Evidence bundle ZIP<br/>/evidence_bundle"]
    Traces --> Bundle
    Ledger --> Bundle

    Snap --> Inspect["Inspect AI EvalLog v2<br/>/inspect_log"]
    Traces --> Inspect
```

Run any of the 40 leaderboards against any model with hash-keyed request caching
Stream HuggingFace datasets with access_tier: public | gated | private (set HF_TOKEN for gated, MODELGYM_HF_OFFLINE=1 for airgap / CI)
Score with 19+ metrics across 7 groups (core, evidence, HCLS, compliance, agentic, safety, code)
Export an Inspect-AI-compatible EvalLog v2 JSON for any run
Export an evidence bundle (snapshot + traces + audit slice + scenario fingerprints) per run
Drive the legacy GymSpec onboarding/certification path on the same store
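
To make the "hash-keyed request caching" item concrete, here is a minimal sketch of the general idea: hash the request contents into a stable key, then look the key up in SQLite before calling the model. This is not Touchstone's implementation. `cache_key`, `RequestCache`, the table layout, and the `call_model` callback are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Derive a stable SHA-256 key from the request contents (hypothetical helper)."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RequestCache:
    """Tiny SQLite-backed cache; an illustration, not the real Touchstone class."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def get_or_call(
        self,
        model: str,
        prompt: str,
        params: dict,
        call_model: Callable[[str, str, dict], str],
    ) -> str:
        key = cache_key(model, prompt, params)
        row = self._db.execute(
            "SELECT body FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: the model is not called again
        body = call_model(model, prompt, params)
        self._db.execute(
            "INSERT INTO responses (key, body) VALUES (?, ?)", (key, body)
        )
        self._db.commit()
        return body
```

Because the key is derived from the serialized request, two calls with the same model, prompt, and parameters (in any key order) resolve to the same row, which is what makes repeated leaderboard runs cheap.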

Leaderboard catalog

| Category | ID prefix | Boards | Headline metric |
| --- | --- | --- | --- |
| Modernization (core) | `modernization_*` | 5 | varies per board |
| · classic | `modernization_classic` | 1 | exact_match + judge |
| · evidence | `modernization_evidence` | 1 | citation_f1 + hallucination_rate |
| · robustness | `modernization_robustness` | 1 | robustness_delta |
| · agentic | `modernization_agentic` v1.2 | 1 | step_success_rate + action_set_jaccard + task_completed_partial (TAU-Bench live mocks) |
| · safety | `modernization_safety` | 1 | wmdp_safety (inverted) |
| SWE-Bench Verified | `swe_bench_verified` v1.1 | 1 | patch_files_match_gold + swe_bench_resolved (opt-in Docker) |
| MedHELM | `medhelm_modernization` | 1 | jury_score + bertscore |
| Vertical | `vertical_*` | 17 | per-vertical jury_score |
| Compliance | `compliance_*` | 9 | control_match (HIPAA, PCI-DSS, SOX, GDPR, CMS-Interop, HTI-1, NYC LL 144, CA AI laws, IL AI laws + OWASP LLM Top 10, OWASP Agentic, MITRE ATLAS) |
| Frontier safety | `modernization_safety`, `medical_safety` | 2 | wmdp_safety, red_team_safety (inverted) |
| Agent threat | `agent_threat_safety` | 1 | atr_safety (ATR rule pattern coverage) |

For the full registry: uv run python scripts/export_registry.py.
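
The catalog names its metrics without defining them. As a rough illustration of what two of those names conventionally refer to, the sketch below computes a Jaccard overlap between two action sets and a set-based F1 between cited and gold sources. These are the textbook formulas, written under the assumption that `action_set_jaccard` and `citation_f1` follow them; Touchstone's actual scoring code may differ, and the function names here are hypothetical.

```python
from typing import Hashable, Set


def jaccard(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Intersection over union of two sets; two empty sets count as a perfect match."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)


def set_f1(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over set membership."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Example: the agent took two actions, one of which is in the gold set of two.
overlap = jaccard({"lookup_order", "refund"}, {"refund", "notify_user"})
```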

Quick start

```bash
# install (uv is the toolchain)
uv sync

# start Postgres on 25432 + apply migrations
docker compose up -d db
uv run alembic upgrade head

# run the full eval test suite (offline by default)
MODELGYM_HF_OFFLINE=1 uv run pytest tests/eval/ -q

# boot the API + dashboard
uv run uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/leaderboards.html
# → http://localhost:8000/dashboard         (legacy GymSpec UI)
```

Optional live data

```bash
# pull real HF datasets (MedHELM, WMDP, SWE-Bench Verified, etc.)
unset MODELGYM_HF_OFFLINE
export HF_TOKEN=hf_xxx      # only needed for gated-tier scenarios
uv run pytest tests/eval/test_medhelm_real_data.py -q
```

Install

```bash
# Canonical package name (formerly model-gym):
pip install touchstone-platform

# Docker image:
docker pull ghcr.io/yadavilli-solutions/touchstone
```

See MIGRATION_TO_TOUCHSTONE.md if you are upgrading from model-gym.

Environment variables

| Variable | New name (TOUCHSTONE_*) | Purpose | Required in production |
| --- | --- | --- | --- |
| `DATABASE_URL` | n/a | Postgres DSN (default: `:25432/modelgym`) | yes |
| `AUDIT_HMAC_SECRET` | n/a | HMAC key for audit ledger chain | yes |
| `MODELGYM_AUDIT_LEDGER` (deprecated; use `TOUCHSTONE_AUDIT_LEDGER`) | `TOUCHSTONE_AUDIT_LEDGER` | Path to append-only ledger JSONL | yes |
| `MODELGYM_TRACE_ROOT` (deprecated; use `TOUCHSTONE_TRACE_ROOT`) | `TOUCHSTONE_TRACE_ROOT` | Root dir for per-instance trace JSON | yes |
| `MODELGYM_REQUEST_CACHE_PATH` (deprecated; use `TOUCHSTONE_REQUEST_CACHE_PATH`) | `TOUCHSTONE_REQUEST_CACHE_PATH` | SQLite path for hash-keyed model-call cache | yes |
| `MODELGYM_HF_OFFLINE` (deprecated; use `TOUCHSTONE_HF_OFFLINE`) | `TOUCHSTONE_HF_OFFLINE` | `1` to forbid HF network calls during tests/airgap | no |
| `HF_TOKEN` | n/a | HuggingFace token, required for `access_tier=gated` | only when gated scenarios are used |
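
`AUDIT_HMAC_SECRET` keys the chained audit ledger. For readers unfamiliar with the technique, the sketch below shows one common way to HMAC-chain an append-only JSONL log: each entry's MAC covers the previous entry's MAC plus the new record, so editing or dropping a line invalidates every later entry. The entry fields (`record`, `prev`, `mac`), the all-zero genesis value, and the function names are assumptions for illustration, not Touchstone's actual ledger format.

```python
import hashlib
import hmac
import json
from typing import List

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev"


def chain_mac(secret: bytes, prev_mac: str, record: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus the canonical JSON of the record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    message = prev_mac.encode("ascii") + b"\n" + body.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def append_entry(lines: List[str], secret: bytes, record: dict) -> None:
    """Append one JSONL line linked to the line before it."""
    prev = json.loads(lines[-1])["mac"] if lines else GENESIS
    mac = chain_mac(secret, prev, record)
    lines.append(json.dumps({"record": record, "prev": prev, "mac": mac}, sort_keys=True))


def verify_chain(lines: List[str], secret: bytes) -> bool:
    """Recompute every MAC in order; any edit, reorder, or deletion fails."""
    prev = GENESIS
    for line in lines:
        entry = json.loads(line)
        expected = chain_mac(secret, prev, entry["record"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

In a real deployment the lines would be written to the ledger file and the secret read from the environment; the in-memory list keeps the example self-contained.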

Adding a new benchmark

Drop a scenario class in app/eval/specs/<your_thing>.py (subclass HuggingFaceDatasetScenario for HF, or implement Scenario directly).
Drop a leaderboard registration in app/eval/leaderboards/<your_thing>.py (calls REGISTRY.register(Leaderboard(...)) at import time).
FastAPI auto-discovers it on next boot. Add a test under tests/eval/test_<your_thing>.py.

For a worked example, see app/eval/specs/wmdp_scenarios.py + app/eval/leaderboards/modernization_safety.py + tests/eval/test_wmdp.py.
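
The register-at-import-time pattern described above can be sketched in isolation. The classes below are stand-ins: the real `Leaderboard` and `REGISTRY` live in the Touchstone codebase and almost certainly have different fields and methods, and `discover` is a hypothetical helper showing one way a service could import every module in a package so that each module's top-level `register` call runs.

```python
import importlib
import pkgutil
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Leaderboard:
    """Stand-in for the real class; these fields are hypothetical."""
    id: str
    version: str
    metrics: Tuple[str, ...]


class Registry:
    """Stand-in registry that rejects duplicate leaderboard ids."""

    def __init__(self) -> None:
        self._boards: Dict[str, Leaderboard] = {}

    def register(self, board: Leaderboard) -> Leaderboard:
        if board.id in self._boards:
            raise ValueError(f"duplicate leaderboard id: {board.id}")
        self._boards[board.id] = board
        return board

    def ids(self) -> List[str]:
        return sorted(self._boards)


REGISTRY = Registry()


def discover(package_name: str) -> List[str]:
    """Import every direct submodule of a package and return their names."""
    package = importlib.import_module(package_name)
    imported = []
    for info in pkgutil.iter_modules(package.__path__):
        full_name = f"{package_name}.{info.name}"
        importlib.import_module(full_name)  # top-level code runs here
        imported.append(full_name)
    return sorted(imported)


# In a leaderboard module this call would sit at module top level,
# so importing the module is what registers the board.
REGISTRY.register(Leaderboard(id="my_benchmark", version="0.1", metrics=("exact_match",)))
```

The design choice worth noting is that registration is a side effect of import: adding a file is enough, and no central list needs editing.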

Documentation map

architecture.md — full architectural truth (read this first when extending)
ROADMAP.md — what shipped, what's next
MIGRATION_TO_TOUCHSTONE.md — rename guide
docs/WAVE_LOG.md — Plans 1-7 vs ship reality + Waves 1-5
docs/superpowers/plans/ — original written plans (superseded; WAVE_LOG is canonical)
PATH_TO_PROD.md — known prod-readiness boundaries

Boundaries (not yet production-safe)

X-Touchstone-Role header (formerly X-Gym-Role) is dev-mode only. Set MODELGYM_ENV=production (or TOUCHSTONE_ENV=production) to fail-close — Keycloak is then required (or MODELGYM_ALLOW_DEV_AUTH=1 / TOUCHSTONE_ALLOW_DEV_AUTH=1 for airgap deploys that issue their own local JWTs).
SWE-Bench Docker execution (FAIL_TO_PASS / PASS_TO_PASS) is opt-in via MODELGYM_SWE_BENCH_EXEC=1 (or TOUCHSTONE_SWE_BENCH_EXEC=1) + a docker daemon + the upstream swebench/sweb.eval.* per-instance images. Without it, scoring is structural-only.
WMDP scoring remains structural (frontier-safety MCQ). TAU-Bench has live retail + airline mocks (Waves 7.4 + 7.5); 4 of 12 vendored tasks have gold_final_state for TaskCompletionMetric — remaining authoring is rolling content work.
BERTScore falls back to rouge_L when the bert_score package isn't installed (set MODELGYM_BERTSCORE_REQUIRE_REAL=1 / TOUCHSTONE_BERTSCORE_REQUIRE_REAL=1 to fail-close).
Inspect AI export emits both .json and a real in-tree .eval zipfile (Wave 7.1) — no inspect-ai dep required.
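
Several of the switches above accept both a `MODELGYM_*` and a `TOUCHSTONE_*` spelling. The sketch below shows one way such a dual-name lookup could work. It assumes the new `TOUCHSTONE_*` name wins when both are set and that the legacy name triggers a deprecation warning; neither behavior is confirmed by this README, and `resolve_setting` / `flag_enabled` are hypothetical helpers.

```python
import os
import warnings
from typing import Mapping, Optional


def resolve_setting(suffix: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Look up TOUCHSTONE_<suffix>, falling back to the legacy MODELGYM_<suffix>."""
    env = os.environ if env is None else env
    new_name = f"TOUCHSTONE_{suffix}"
    old_name = f"MODELGYM_{suffix}"
    if new_name in env:
        return env[new_name]  # assumed precedence: new name wins
    if old_name in env:
        warnings.warn(
            f"{old_name} is deprecated; use {new_name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[old_name]
    return None


def flag_enabled(suffix: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Treat only the literal value "1" as enabled, matching the examples above."""
    return resolve_setting(suffix, env) == "1"


# Example: opt in to SWE-Bench Docker execution using the new-style name.
docker_exec = flag_enabled("SWE_BENCH_EXEC", {"TOUCHSTONE_SWE_BENCH_EXEC": "1"})
```

Passing the environment as a mapping keeps the helper testable without mutating `os.environ`.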

License

See LICENSE.md.

Project details

This version

1.0.2

May 24, 2026

Download files

Source Distribution

touchstone_platform-1.0.2.tar.gz (578.5 kB)

Uploaded May 24, 2026 Source

Built Distribution

touchstone_platform-1.0.2-py3-none-any.whl (545.3 kB)

Uploaded May 24, 2026 Python 3

File details

Details for the file touchstone_platform-1.0.2.tar.gz.

File metadata

Download URL: touchstone_platform-1.0.2.tar.gz
Upload date: May 24, 2026
Size: 578.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_platform-1.0.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `1519396dabd6fe39ce2b9b6ab840999ec77d565454759746b12c06548a3a7642` |
| MD5 | `c9b3a7104f1d06762dbcd3c3d1213b7a` |
| BLAKE2b-256 | `1929178f6cc4acf3eb0c21cbfc4fde1ba7384c2320ef5fe9a4e4336ad7387fe0` |
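
The digests above can be checked locally after downloading a file. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the file path in the usage note assumes the sdist sits in the current directory.

```python
import hashlib
import hmac

# SHA256 digest listed above for touchstone_platform-1.0.2.tar.gz
EXPECTED_SHA256 = "1519396dabd6fe39ce2b9b6ab840999ec77d565454759746b12c06548a3a7642"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

Usage would look like `matches("touchstone_platform-1.0.2.tar.gz", EXPECTED_SHA256)`, which should return `True` for an unmodified download.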

Provenance

The following attestation bundles were made for touchstone_platform-1.0.2.tar.gz:

Publisher: release.yml on yadavilli-solutions/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: touchstone_platform-1.0.2.tar.gz
- Subject digest: 1519396dabd6fe39ce2b9b6ab840999ec77d565454759746b12c06548a3a7642
- Sigstore transparency entry: 1625090189
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: yadavilli-solutions/touchstone@af7df359a86f15cfa7aa5662a2d00686f520bb81
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/yadavilli-solutions
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@af7df359a86f15cfa7aa5662a2d00686f520bb81
- Trigger Event: push

File details

Details for the file touchstone_platform-1.0.2-py3-none-any.whl.

File metadata

Download URL: touchstone_platform-1.0.2-py3-none-any.whl
Upload date: May 24, 2026
Size: 545.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_platform-1.0.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8bca10e25cf730444d44d1a1f6cd953c9c24781f01b92691f4b8ec16728c90af` |
| MD5 | `579b574a9a3acdbe5b023e66c8095ed9` |
| BLAKE2b-256 | `667f6214ac9ed1e077ad018a254248f5b95b5dca0b755038199e4c1cffcc8e58` |

Provenance

The following attestation bundles were made for touchstone_platform-1.0.2-py3-none-any.whl:

Publisher: release.yml on yadavilli-solutions/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: touchstone_platform-1.0.2-py3-none-any.whl
- Subject digest: 8bca10e25cf730444d44d1a1f6cd953c9c24781f01b92691f4b8ec16728c90af
- Sigstore transparency entry: 1625090197
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: yadavilli-solutions/touchstone@af7df359a86f15cfa7aa5662a2d00686f520bb81
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/yadavilli-solutions
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@af7df359a86f15cfa7aa5662a2d00686f520bb81
- Trigger Event: push
