Skip to main content

Aletheia MBS: measurable structured agent outputs — compile schemas into contracts, validate, trace, gate, and report cost per safe decision

Project description

MBS

MBS makes structured agent behavior measurable.

It compiles schemas into minimal behavioral contracts, validates structured outputs, creates portable traces, reports cost per valid output, and provides a starter benchmark/test surface for structured-output reliability.

30-Second Demo

pip install -e .
mbs demo

Terminal output:

MBS YC demo: structured agent output check

Input prompt:
  Customer says: I think my account was taken over and I cannot sign in.

Check:
  status=FAIL
  failure_type=invalid_enum
  trace_id=mbs_trace_...
  reason=invalid_enum at action
  hint=joined_enum_values

MBS retry repair:
  {"action": "ESCALATE", "priority": "HIGH", "category": "SECURITY", ...}
  status=PASS

Cost/token comparison:
  MBS contract tokens: 62
  Verbose prompt tokens: 164
  Token savings: 62.2%

To regenerate the YC artifacts:

mbs demo --write-artifacts

This writes:

  • benchmarks/results/yc_sample_benchmark.json
  • benchmarks/results/yc_sample_benchmark.md
  • docs/mbs_yc_evidence_brief.md

What MBS Does

MBS is for teams building agents that must produce structured outputs before they call tools, update records, or trigger workflows.

For each structured output, MBS provides:

  • a compact behavioral contract from a schema
  • PASS / FAIL / REVIEW validation
  • exact failure reasons such as invalid_enum, missing_required_key, and wrong_type
  • a trace id with schema, contract, input, and output hashes
  • retry and cost-per-valid-output accounting
  • strict JSON-only diagnostics for models that wrap JSON in prose or emit reasoning text instead of JSON

YC Benchmark Sample

Deterministic local sample: 3 support-agent cases x 2 mock model adapters.

strategy cases models schema-valid semantic-correct avg retries cost / valid output
verbose prompt 6 2 0.500 0.500 0.000 385.0
MBS contract + retry 6 2 1.000 1.000 0.333 121.833

The sample is deliberately small so a reviewer can understand it quickly. The larger MN5/Leonardo benchmark reports are tracked separately.

Current MN5 GPU headline snapshot: 18 open instruction/chat/code/MoE models across 4 schemas and 7 MBS-Lang settings.

benchmark no-retry schema-valid retry schema-valid no-retry semantic retry semantic audit
schema/prompt 0.7321 0.9388 0.8370 0.9203 compare PASS; retry audit finds 1 Mixtral tool-call regression
MBS-Lang 0.9061 0.9603 0.8783 0.9550 PASS, 0 selected-attempt regressions

Reports:

  • benchmarks/results/mn5_eighteen_model_instruction/report_matrix_retry_summary.md
  • benchmarks/results/mn5_eighteen_model_instruction/report_lang_retry_summary.md

The same reports now include clean_json_rate and format_risk, which separate raw JSON compliance from JSON recovered by extraction. In the 18-model retry headline, clean JSON is 0.8810 for schema/prompt and 0.9021 for MBS-Lang. Nemotron 70B is a useful example: it reaches 1.0 schema validity after retry, but its clean-JSON rate is 0 because every output is recoverable JSON wrapped in prose, so MBS labels that as REVIEW instead of a clean PASS.

Broader MN5 stress snapshot: 21 models after adding Phi-4 Mini, DeepSeek R1 Distill Qwen 14B, and DeepSeek R1 Distill Qwen 32B. This wider set keeps weak-model failures visible instead of turning them into a single blended score.

benchmark no-retry schema-valid retry schema-valid no-retry semantic retry semantic audit
schema/prompt 0.6442 0.8535 0.7444 0.8419 PASS, 0 selected-attempt regressions
MBS-Lang 0.8095 0.8519 0.7817 0.8466 PASS, 0 selected-attempt regressions

Latest large-model additions:

model MBS result product signal
Mixtral 8x22B Instruct PASS retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0
OLMo 2 13B Instruct PASS schema/prompt retry is clean JSON 1.0; MBS-Lang is 1.0 / 1.0
Hermes 3 Llama 3.1 70B PASS schema/prompt retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0
Qwen3-30B-A3B Instruct PASS MoE Qwen row: schema/prompt retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0
Mistral Small 3.1 24B REVIEW schema/prompt retry reaches 1.0 schema-valid, but clean JSON remains low because outputs are prose-wrapped
Falcon3 10B Instruct REVIEW retry improves schema validity, but clean JSON is 0.0 because outputs are fenced/prose-wrapped
MiniCPM4 8B REVIEW MBS-Lang retry reaches 1.0 schema-valid, but clean JSON remains near zero
Phi-4 Reasoning Plus FAIL retry cannot recover clean structured output; reasoning prose keeps schema and semantic rates at 0.0
Qwen3 32B FAIL strict/retry improves extraction, but clean JSON remains 0.0 because reasoning/prose dominates
QwQ 32B FAIL retry improves some extracted schema validity, but clean JSON remains 0.0
DeepSeek R1 Distill Llama 70B FAIL schema retry reaches only 0.4583 schema-valid / 0.5833 semantic; clean JSON remains 0.0

Core Commands

mbs demo
mbs compile examples/fintech_transaction_risk/schema.json
mbs compile examples/fintech_transaction_risk/schema.json --format strict
mbs validate --schema examples/fintech_transaction_risk/schema.json --output examples/fintech_transaction_risk/output.json
mbs check --schema examples/fintech_transaction_risk/schema.json --input "Customer transfers 4800 EUR to a new beneficiary" --model mock
mbs bench --config benchmarks/models.yaml
mbs report --results benchmarks/results/*.json --exclude-infra --require-traces --summary-only
  • MBS Compiler: schema to minimal behavioral contract
  • MBS Validate: exact JSON/schema/enum/type failures
  • MBS Check: compile, run or mock, validate, trace, cost summary
  • MBS Trace: audit object for every structured output
  • MBS Cost: cost per valid structured output
  • MBS Bench: repeatable structured-output benchmark starter
  • MBS Test: CI-style structured-output regression command
  • MBS-Lang: hybrid contracts for multilingual structured-output workflows
  • MBS Report: aggregate benchmark JSON into Markdown tables, model scorecards, and failure summaries
  • MBS Compare: detect metric regressions against prior results
  • MBS Models: enforce broad model-suite coverage
  • MBS Triage: inspect remote result directories before scaling GPU runs, with issue summaries, capped terminal output, and case-level failure examples

Prompt styles include natural, progressive, full, and strict. Use strict when evaluating models that tend to emit analysis, markdown, or prose around JSON; MBS records prose_wrapped_json warnings and reasoning_prose failures so those behaviors remain visible in reports.

MBS is not a full agent runtime, an agent marketplace, or an AI operating system.

Docs

  • docs/mbs_yc_evidence_brief.md
  • docs/mbs_quickstart.md
  • docs/mbs_bench.md
  • docs/mbs_lang.md
  • docs/mbs_model_behavior_guidance.md
  • docs/ci.md
  • docs/mbs_hpc_compute_plan.md
  • docs/mbs_remote_stage1_runbook.md
  • docs/mbs_product_quality_plan.md

Positioning

Use PASS / FAIL / REVIEW, exact failure reasons, traces, cost per valid output, and benchmark evidence. Treat numeric reliability scores as experimental until they are calibrated.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aletheia_mbs-0.2.0.tar.gz (52.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aletheia_mbs-0.2.0-py3-none-any.whl (54.5 kB view details)

Uploaded Python 3

File details

Details for the file aletheia_mbs-0.2.0.tar.gz.

File metadata

  • Download URL: aletheia_mbs-0.2.0.tar.gz
  • Upload date:
  • Size: 52.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for aletheia_mbs-0.2.0.tar.gz
Algorithm Hash digest
SHA256 7abfc945cea4b80088f1a376b3bfa41e21bba49495ada71910ec82b4a01976d3
MD5 60a720573f6eeff5d01336186b34bafb
BLAKE2b-256 1ac96488689aa615c4a3c052fcfd3b9b2c2f28c3284e67486ff2952d3a917a0f

See more details on using hashes here.

File details

Details for the file aletheia_mbs-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: aletheia_mbs-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 54.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for aletheia_mbs-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 08277a639cbbe5b47d64148c88b41da2e192e476d81540177863d3af548993ee
MD5 4717ce73c2c2dcedcaa8da41997e17de
BLAKE2b-256 0969a970f488515e7aedb3e67cff7f59b7e61b00eeda0e6eeb468562234e6c57

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page