Touchstone — the AI Adoption Lifecycle platform. AI with receipts: forge it in the Crucible, ship what proves.

Touchstone

Touchstone — AI with receipts. Forge it in the Crucible. Ship what proves.

Touchstone is the umbrella AI-certification platform. Crucible is its HELM-style benchmark engine, providing multi-leaderboard AI benchmarking for enterprise use cases: legacy modernization, healthcare AI, frontier safety, agentic tool-use, code patch generation, and per-vertical / per-regulation evaluation. Everything runs under one FastAPI service with shared persistence, an HMAC-chained audit ledger, evidence-bundle export, and an Inspect-AI-compatible eval-log exporter.

40 leaderboards. 19+ metrics. 780+ tests green. Last sync: 2026-05-17 (Wave 9 Track A/B).


What you can do today

Source: docs/diagrams/request-flow.mmd — render via npx -y @mermaid-js/mermaid-cli -i docs/diagrams/request-flow.mmd -o request-flow.svg

flowchart LR
    User["User / CI / dashboard"] --> API["FastAPI<br/>app/main.py"]

    API --> Core["Crucible<br/>HELM-style eval-core<br/>app/eval/"]
    API --> Legacy["Legacy<br/>GymSpec runtime<br/>app/engine.py"]

    subgraph Boards["40 leaderboards (auto-discovered)"]
      Modern["Modernization × 5<br/>classic / evidence / robustness<br/>agentic v1.2 / safety"]
      SWE["SWE-Bench Verified<br/>(struct + Docker exec)"]
      Med["MedHELM (10 RunSpecs)"]
      Vert["Per-vertical × 17<br/>HCLS · retail · fin-svc<br/>hi-tech · industry · edu-K12"]
      Comp["Compliance × 9<br/>HIPAA · PCI-DSS · SOX · GDPR<br/>CMS · HTI-1 · 3 state AI<br/>OWASP × 2 · MITRE ATLAS"]
      Safe["Frontier safety × 2<br/>WMDP · medical_safety<br/>+ agent_threat_safety"]
    end

    Core --> Boards

    Boards --> Snap["Postgres :25432<br/>leaderboard_snapshots"]
    Boards --> Traces["traces/<run_id>/*.json"]
    Boards --> Ledger["audit_ledger.jsonl<br/>(HMAC chained)"]

    Snap --> Bundle["Evidence bundle ZIP<br/>/evidence_bundle"]
    Traces --> Bundle
    Ledger --> Bundle

    Snap --> Inspect["Inspect AI EvalLog v2<br/>/inspect_log"]
    Traces --> Inspect

  • Run any of 40 leaderboards against any model with hash-keyed request caching
  • Stream HuggingFace datasets with access_tier: public | gated | private (set HF_TOKEN for gated, MODELGYM_HF_OFFLINE=1 for airgap / CI)
  • Score with 19+ metrics across 7 groups (core, evidence, HCLS, compliance, agentic, safety, code)
  • Export an Inspect-AI-compatible EvalLog v2 JSON for any run
  • Export an evidence bundle (snapshot + traces + audit slice + scenario fingerprints) per run
  • Drive the legacy GymSpec onboarding/certification path on the same store
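
The hash-keyed request caching mentioned in the list above can be pictured with a minimal sketch. This is not the project's implementation: the `request_key` function, the `RequestCache` class, and the table layout are hypothetical names chosen for illustration. The sketch assumes a cache key is a SHA-256 digest over the canonicalized model name, prompt, and parameters, stored in SQLite.

```python
import hashlib
import json
import sqlite3


def request_key(model: str, prompt: str, params: dict) -> str:
    """Derive a stable SHA-256 key from the full request (hypothetical scheme)."""
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RequestCache:
    """Tiny SQLite-backed cache keyed by request hash (illustrative only)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get_or_call(self, model, prompt, params, call):
        """Return the cached response, or invoke `call` once and store it."""
        key = request_key(model, prompt, params)
        row = self._db.execute(
            "SELECT response FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        response = call(model, prompt, params)
        self._db.execute(
            "INSERT INTO cache (key, response) VALUES (?, ?)",
            (key, json.dumps(response)),
        )
        self._db.commit()
        return response
```

With a scheme like this, re-running a leaderboard against the same model and parameters would reuse stored responses, while changing any parameter produces a new key and a fresh model call.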

Leaderboard catalog

| Category | ID prefix | Boards | Headline metric |
| --- | --- | --- | --- |
| Modernization (core) | modernization_* | 5 | varies per board |
| · classic | modernization_classic | 1 | exact_match + judge |
| · evidence | modernization_evidence | 1 | citation_f1 + hallucination_rate |
| · robustness | modernization_robustness | 1 | robustness_delta |
| · agentic | modernization_agentic v1.2 | 1 | step_success_rate + action_set_jaccard + task_completed_partial (TAU-Bench live mocks) |
| · safety | modernization_safety | 1 | wmdp_safety (inverted) |
| SWE-Bench Verified | swe_bench_verified v1.1 | 1 | patch_files_match_gold + swe_bench_resolved (opt-in Docker) |
| MedHELM | medhelm_modernization | 1 | jury_score + bertscore |
| Vertical | vertical_* | 17 | per-vertical jury_score |
| Compliance | compliance_* | 9 | control_match (HIPAA, PCI-DSS, SOX, GDPR, CMS-Interop, HTI-1, NYC LL 144, CA AI laws, IL AI laws + OWASP LLM Top 10, OWASP Agentic, MITRE ATLAS) |
| Frontier safety | modernization_safety, medical_safety | 2 | wmdp_safety, red_team_safety (inverted) |
| Agent threat | agent_threat_safety | 1 | atr_safety (ATR rule pattern coverage) |

For the full registry: uv run python scripts/export_registry.py.


Quick start

# install (uv is the toolchain)
uv sync

# start Postgres on 25432 + apply migrations
docker compose up -d db
uv run alembic upgrade head

# run the full eval test suite (offline by default)
MODELGYM_HF_OFFLINE=1 uv run pytest tests/eval/ -q

# boot the API + dashboard
uv run uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/leaderboards.html
# → http://localhost:8000/dashboard         (legacy GymSpec UI)

Optional live data

# pull real HF datasets (MedHELM, WMDP, SWE-Bench Verified, etc.)
unset MODELGYM_HF_OFFLINE
export HF_TOKEN=hf_xxx      # only needed for gated-tier scenarios
uv run pytest tests/eval/test_medhelm_real_data.py -q

Install

# Canonical package name (formerly model-gym):
pip install touchstone-platform

# Docker image:
docker pull ghcr.io/yadavilli-solutions/touchstone

See MIGRATION_TO_TOUCHSTONE.md if you are upgrading from model-gym.


Environment variables

| Var | New name (TOUCHSTONE_*) | Purpose | Required in prod |
| --- | --- | --- | --- |
| DATABASE_URL | | Postgres DSN (default: :25432/modelgym) | yes |
| AUDIT_HMAC_SECRET | | HMAC key for audit ledger chain | yes |
| MODELGYM_AUDIT_LEDGER (deprecated; use TOUCHSTONE_AUDIT_LEDGER) | TOUCHSTONE_AUDIT_LEDGER | Path to append-only ledger JSONL | yes |
| MODELGYM_TRACE_ROOT (deprecated; use TOUCHSTONE_TRACE_ROOT) | TOUCHSTONE_TRACE_ROOT | Root dir for per-instance trace JSON | yes |
| MODELGYM_REQUEST_CACHE_PATH (deprecated; use TOUCHSTONE_REQUEST_CACHE_PATH) | TOUCHSTONE_REQUEST_CACHE_PATH | SQLite path for hash-keyed model-call cache | yes |
| MODELGYM_HF_OFFLINE (deprecated; use TOUCHSTONE_HF_OFFLINE) | TOUCHSTONE_HF_OFFLINE | 1 to forbid HF network calls during tests/airgap | no |
| HF_TOKEN | | HuggingFace token; required for access_tier=gated | when gated used |
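
To make the role of AUDIT_HMAC_SECRET concrete, here is a minimal sketch of how an HMAC-chained append-only ledger can work in general. It is an assumption-laden illustration, not the project's ledger format: the entry fields (`payload`, `prev`, `mac`), the genesis value, and the function names are hypothetical, and an in-memory list stands in for the JSONL file. Each entry's MAC covers the previous entry's MAC, so editing or deleting any earlier entry breaks verification of everything after it.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical anchor value for the first entry


def _mac(secret: bytes, prev_mac: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous MAC plus the canonical payload."""
    body = prev_mac + json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(ledger: list, secret: bytes, payload: dict) -> dict:
    """Append a payload, chaining it to the MAC of the last entry."""
    prev = ledger[-1]["mac"] if ledger else GENESIS
    entry = {"payload": payload, "prev": prev, "mac": _mac(secret, prev, payload)}
    ledger.append(entry)
    return entry


def verify_chain(ledger: list, secret: bytes) -> bool:
    """Recompute every MAC in order; any edit or reorder returns False."""
    prev = GENESIS
    for entry in ledger:
        expected = _mac(secret, prev, entry["payload"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

In a file-backed variant, each entry would be written as one JSON line, and verification would stream the file from the top. Without the secret, an attacker who rewrites an entry cannot recompute valid MACs for the rest of the chain.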

Adding a new benchmark

  1. Drop a scenario class in app/eval/specs/<your_thing>.py (subclass HuggingFaceDatasetScenario for HF, or implement Scenario directly).
  2. Drop a leaderboard registration in app/eval/leaderboards/<your_thing>.py (calls REGISTRY.register(Leaderboard(...)) at import time).
  3. FastAPI auto-discovers it on next boot. Add a test under tests/eval/test_<your_thing>.py.

For a worked example, see app/eval/specs/wmdp_scenarios.py + app/eval/leaderboards/modernization_safety.py + tests/eval/test_wmdp.py.
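
The shape of the three steps can be sketched in a self-contained form. Everything below is a hypothetical stand-in: the real `Scenario`, `Leaderboard`, and `REGISTRY` live in app/eval/ and their fields and signatures may differ, so treat this only as an illustration of the register-at-import-time pattern, with a toy scenario and an exact-match metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Instance:
    prompt: str
    reference: str


class Scenario:
    """Hypothetical stand-in for the project's Scenario base class."""

    name = "base"

    def instances(self) -> Iterable[Instance]:
        raise NotImplementedError


@dataclass
class Leaderboard:
    """Hypothetical stand-in; the real Leaderboard fields may differ."""

    id: str
    scenarios: List[Scenario]
    metric: Callable[[str, str], float]


class Registry:
    """Minimal registry: boards register themselves by unique id."""

    def __init__(self) -> None:
        self._boards: Dict[str, Leaderboard] = {}

    def register(self, board: Leaderboard) -> Leaderboard:
        if board.id in self._boards:
            raise ValueError(f"duplicate leaderboard id: {board.id}")
        self._boards[board.id] = board
        return board

    def get(self, board_id: str) -> Leaderboard:
        return self._boards[board_id]

    def ids(self) -> List[str]:
        return sorted(self._boards)


REGISTRY = Registry()


# Step 1: a scenario class (toy data instead of a HuggingFace dataset).
class CapitalsScenario(Scenario):
    name = "capitals"

    def instances(self) -> Iterable[Instance]:
        yield Instance("Capital of France?", "Paris")
        yield Instance("Capital of Japan?", "Tokyo")


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


# Step 2: registration runs at import time, so importing the module is enough.
REGISTRY.register(
    Leaderboard(id="demo_capitals", scenarios=[CapitalsScenario()], metric=exact_match)
)


# Step 3: what a test might exercise, scoring a model callable on the board.
def run_board(board_id: str, model: Callable[[str], str]) -> float:
    board = REGISTRY.get(board_id)
    scores = [
        board.metric(model(inst.prompt), inst.reference)
        for scenario in board.scenarios
        for inst in scenario.instances()
    ]
    return sum(scores) / len(scores)
```

Registering at import time is what makes auto-discovery possible: a loader only has to import every module under the leaderboards package, and each module adds itself to the shared registry.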


Documentation map


Boundaries (not yet production-safe)

  • The X-Touchstone-Role header (formerly X-Gym-Role) is dev-mode only. Set MODELGYM_ENV=production (or TOUCHSTONE_ENV=production) to fail closed: Keycloak is then required, unless MODELGYM_ALLOW_DEV_AUTH=1 (or TOUCHSTONE_ALLOW_DEV_AUTH=1) is set for airgap deploys that issue their own local JWTs.
  • SWE-Bench Docker execution (FAIL_TO_PASS / PASS_TO_PASS) is opt-in via MODELGYM_SWE_BENCH_EXEC=1 (or TOUCHSTONE_SWE_BENCH_EXEC=1) + a docker daemon + the upstream swebench/sweb.eval.* per-instance images. Without it, scoring is structural-only.
  • WMDP scoring remains structural (frontier-safety MCQ). TAU-Bench has live retail + airline mocks (Waves 7.4 + 7.5); 4 of 12 vendored tasks have gold_final_state for TaskCompletionMetric, and authoring the remaining 8 is ongoing content work.
  • BERTScore falls back to rouge_L when the bert_score package isn't installed (set MODELGYM_BERTSCORE_REQUIRE_REAL=1 or TOUCHSTONE_BERTSCORE_REQUIRE_REAL=1 to fail closed instead).
  • Inspect AI export emits both .json and a real in-tree .eval zipfile (Wave 7.1) — no inspect-ai dep required.
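
Several of the boundaries above hinge on a flag that exists under both a MODELGYM_ and a TOUCHSTONE_ name, and on fail-closed behavior. The sketch below illustrates that pattern for the BERTScore fallback. It is not the project's code: `read_flag` and `choose_similarity_metric` are hypothetical helpers, and the precedence rule (the TOUCHSTONE_ name wins when both are set) is an assumption, not something the README states.

```python
import importlib.util
import os
from typing import Mapping, Optional


def read_flag(suffix: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """True when TOUCHSTONE_<suffix> or, failing that, MODELGYM_<suffix> is "1".

    Assumption: the new TOUCHSTONE_ name takes precedence if both are set.
    """
    env = os.environ if env is None else env
    for prefix in ("TOUCHSTONE_", "MODELGYM_"):
        value = env.get(prefix + suffix)
        if value is not None:
            return value == "1"
    return False


def choose_similarity_metric(
    env: Optional[Mapping[str, str]] = None,
    bert_available: Optional[bool] = None,
) -> str:
    """Pick "bertscore" when bert_score is importable, else fall back or fail."""
    if bert_available is None:
        # Detect the optional dependency without importing it.
        bert_available = importlib.util.find_spec("bert_score") is not None
    if bert_available:
        return "bertscore"
    if read_flag("BERTSCORE_REQUIRE_REAL", env):
        # Fail closed: refuse to silently substitute a weaker metric.
        raise RuntimeError("bert_score is not installed and BERTSCORE_REQUIRE_REAL=1")
    return "rouge_L"
```

The `bert_available` parameter exists only so the decision can be exercised in tests without installing or uninstalling the package.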

License

See LICENSE.md.

Download files

Source Distribution

touchstone_platform-1.0.2.tar.gz (578.5 kB view details)

Uploaded Source

Built Distribution

touchstone_platform-1.0.2-py3-none-any.whl (545.3 kB view details)

Uploaded Python 3

File details

Details for the file touchstone_platform-1.0.2.tar.gz.

File metadata

  • Download URL: touchstone_platform-1.0.2.tar.gz
  • Size: 578.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_platform-1.0.2.tar.gz
Algorithm Hash digest
SHA256 1519396dabd6fe39ce2b9b6ab840999ec77d565454759746b12c06548a3a7642
MD5 c9b3a7104f1d06762dbcd3c3d1213b7a
BLAKE2b-256 1929178f6cc4acf3eb0c21cbfc4fde1ba7384c2320ef5fe9a4e4336ad7387fe0

Provenance

The following attestation bundles were made for touchstone_platform-1.0.2.tar.gz:

Publisher: release.yml on yadavilli-solutions/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file touchstone_platform-1.0.2-py3-none-any.whl.

File hashes

Hashes for touchstone_platform-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 8bca10e25cf730444d44d1a1f6cd953c9c24781f01b92691f4b8ec16728c90af
MD5 579b574a9a3acdbe5b023e66c8095ed9
BLAKE2b-256 667f6214ac9ed1e077ad018a254248f5b95b5dca0b755038199e4c1cffcc8e58

Provenance

The following attestation bundles were made for touchstone_platform-1.0.2-py3-none-any.whl:

Publisher: release.yml on yadavilli-solutions/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
