Aletheia MBS: measurable structured agent outputs — compile schemas into contracts, validate, trace, gate, and report cost per safe decision
MBS
MBS makes structured agent behavior measurable.
It compiles schemas into minimal behavioral contracts, validates structured outputs, creates portable traces, reports cost per valid output, and provides a starter benchmark/test surface for structured-output reliability.
30-Second Demo
pip install -e .
mbs demo
Terminal output:
MBS YC demo: structured agent output check
Input prompt:
Customer says: I think my account was taken over and I cannot sign in.
Check:
status=FAIL
failure_type=invalid_enum
trace_id=mbs_trace_...
reason=invalid_enum at action
hint=joined_enum_values
MBS retry repair:
{"action": "ESCALATE", "priority": "HIGH", "category": "SECURITY", ...}
status=PASS
Cost/token comparison:
MBS contract tokens: 62
Verbose prompt tokens: 164
Token savings: 62.2%
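The savings figure in the demo output can be checked by hand. The sketch below assumes savings is measured relative to the verbose prompt; the function name `token_savings` is illustrative and not part of the MBS API.

```python
def token_savings(contract_tokens: int, verbose_tokens: int) -> float:
    """Fraction of prompt tokens saved relative to the verbose prompt."""
    return (verbose_tokens - contract_tokens) / verbose_tokens


# Token counts taken from the demo output above.
savings = token_savings(contract_tokens=62, verbose_tokens=164)
print(f"Token savings: {savings:.1%}")  # prints "Token savings: 62.2%"
```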
To regenerate the YC artifacts:
mbs demo --write-artifacts
This writes:
- benchmarks/results/yc_sample_benchmark.json
- benchmarks/results/yc_sample_benchmark.md
- docs/mbs_yc_evidence_brief.md
What MBS Does
MBS is for teams building agents that must produce structured outputs before they call tools, update records, or trigger workflows.
For each structured output, MBS provides:
- a compact behavioral contract from a schema
- PASS / FAIL / REVIEW validation
- exact failure reasons such as invalid_enum, missing_required_key, and wrong_type
- a trace id with schema, contract, input, and output hashes
- retry and cost-per-valid-output accounting
- strict JSON-only diagnostics for models that wrap JSON in prose or emit reasoning text instead of JSON
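To make the failure reasons in the list above concrete, here is a minimal sketch of a validator that reports the first failure it finds. It does not use the MBS API: the `SCHEMA` layout, the `validate` function, and the result keys are hypothetical, and the real validator may check more (or report differently).

```python
from typing import Any

# Hypothetical mini-schema: every key is required; each has a Python type
# and, optionally, a list of allowed enum values.
SCHEMA = {
    "action": {"type": str, "enum": ["ESCALATE", "REPLY", "CLOSE"]},
    "priority": {"type": str, "enum": ["LOW", "MEDIUM", "HIGH"]},
    "retries": {"type": int},
}


def validate(output: dict[str, Any], schema: dict[str, dict]) -> dict[str, str]:
    """Return a PASS/FAIL verdict with the first failure reason found."""
    for key, rule in schema.items():
        if key not in output:
            return {"status": "FAIL", "failure_type": "missing_required_key", "at": key}
        value = output[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_for_int = rule["type"] is int and isinstance(value, bool)
        if not isinstance(value, rule["type"]) or is_bool_for_int:
            return {"status": "FAIL", "failure_type": "wrong_type", "at": key}
        if "enum" in rule and value not in rule["enum"]:
            return {"status": "FAIL", "failure_type": "invalid_enum", "at": key}
    return {"status": "PASS", "failure_type": "", "at": ""}


# A model that joins two enum values into one string fails at "action".
bad = {"action": "ESCALATE|REPLY", "priority": "HIGH", "retries": 0}
print(validate(bad, SCHEMA))
# prints {'status': 'FAIL', 'failure_type': 'invalid_enum', 'at': 'action'}
```

Reporting the failing key alongside the failure type is what lets a retry prompt target one field instead of restating the whole schema.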
YC Benchmark Sample
Deterministic local sample: 3 support-agent cases x 2 mock model adapters.
| strategy | cases | models | schema-valid | semantic-correct | avg retries | cost / valid output |
|---|---|---|---|---|---|---|
| verbose prompt | 6 | 2 | 0.500 | 0.500 | 0.000 | 385.0 |
| MBS contract + retry | 6 | 2 | 1.000 | 1.000 | 0.333 | 121.833 |
The sample is deliberately small so a reviewer can understand it quickly. The larger MN5/Leonardo benchmark reports are tracked separately.
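As a rough guide to reading the last two columns of the sample table, the sketch below assumes that cost per valid output is the total tokens spent across all attempts divided by the number of cases that end in a valid output, and that average retries is the number of extra attempts per case. These definitions, the `Attempt` type, and the numbers are illustrative assumptions, not taken from the MBS source or the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    tokens: int  # tokens spent on this attempt (prompt plus completion)
    valid: bool  # whether the output passed validation


def cost_per_valid_output(cases: list[list[Attempt]]) -> float:
    """Total tokens over all attempts, divided by cases whose last attempt is valid."""
    total_tokens = sum(a.tokens for attempts in cases for a in attempts)
    valid_cases = sum(1 for attempts in cases if attempts and attempts[-1].valid)
    if valid_cases == 0:
        return float("inf")  # nothing valid was produced, so the cost is unbounded
    return total_tokens / valid_cases


def avg_retries(cases: list[list[Attempt]]) -> float:
    """Average number of extra attempts per case."""
    return sum(len(attempts) - 1 for attempts in cases) / len(cases)


# Illustrative data: three cases, one of which needed a single retry.
cases = [
    [Attempt(100, True)],
    [Attempt(100, False), Attempt(120, True)],
    [Attempt(100, True)],
]
print(cost_per_valid_output(cases))  # prints 140.0
print(round(avg_retries(cases), 3))  # prints 0.333
```

Under this definition a failed attempt still counts toward cost, which is why a strategy with no retries can end up more expensive per valid output than one that retries.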
Current MN5 GPU headline snapshot: 18 open instruction/chat/code/MoE models across 4 schemas and 7 MBS-Lang settings.
| benchmark | no-retry schema-valid | retry schema-valid | no-retry semantic | retry semantic | audit |
|---|---|---|---|---|---|
| schema/prompt | 0.7321 | 0.9388 | 0.8370 | 0.9203 | compare PASS; retry audit finds 1 Mixtral tool-call regression |
| MBS-Lang | 0.9061 | 0.9603 | 0.8783 | 0.9550 | PASS, 0 selected-attempt regressions |
Reports:
- benchmarks/results/mn5_eighteen_model_instruction/report_matrix_retry_summary.md
- benchmarks/results/mn5_eighteen_model_instruction/report_lang_retry_summary.md
The same reports now include clean_json_rate and format_risk, which separate
raw JSON compliance from JSON recovered by extraction. In the 18-model retry
headline, clean JSON is 0.8810 for schema/prompt and 0.9021 for MBS-Lang.
Nemotron 70B is a useful example: it reaches 1.0 schema validity after retry,
but its clean-JSON rate is 0 because every output is recoverable JSON wrapped
in prose, so MBS labels that as REVIEW instead of a clean PASS.
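The distinction between clean JSON and JSON recovered by extraction can be sketched as follows. This is an assumption about how such a check might work, not the MBS implementation: `classify_json` and its three labels are hypothetical names, and the extraction strategy (decode from the first `{` that parses) is deliberately simple.

```python
import json


def classify_json(raw: str) -> str:
    """Label raw model text as 'clean', 'extracted', or 'none'."""
    text = raw.strip()
    # Clean: the whole response is a single JSON object and nothing else.
    try:
        if isinstance(json.loads(text), dict):
            return "clean"
    except json.JSONDecodeError:
        pass
    # Extracted: a JSON object can be decoded from somewhere inside the text.
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            decoder.raw_decode(text, start)
            return "extracted"
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return "none"


def clean_json_rate(responses: list[str]) -> float:
    """Share of responses that are clean JSON without any extraction."""
    return sum(classify_json(r) == "clean" for r in responses) / len(responses)


responses = [
    '{"action": "ESCALATE"}',
    'Sure, here is the result: {"action": "ESCALATE"} Let me know if you need more.',
    "I think this should be escalated.",
]
print([classify_json(r) for r in responses])  # prints ['clean', 'extracted', 'none']
```

A model whose every response lands in the `extracted` bucket can still reach full schema validity after extraction while its clean-JSON rate stays at 0, which matches the Nemotron 70B pattern described above.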
Broader MN5 stress snapshot: 21 models after adding Phi-4 Mini, DeepSeek R1 Distill Qwen 14B, and DeepSeek R1 Distill Qwen 32B. This wider set keeps weak-model failures visible instead of turning them into a single blended score.
| benchmark | no-retry schema-valid | retry schema-valid | no-retry semantic | retry semantic | audit |
|---|---|---|---|---|---|
| schema/prompt | 0.6442 | 0.8535 | 0.7444 | 0.8419 | PASS, 0 selected-attempt regressions |
| MBS-Lang | 0.8095 | 0.8519 | 0.7817 | 0.8466 | PASS, 0 selected-attempt regressions |
Latest large-model additions:
| model | MBS result | product signal |
|---|---|---|
| Mixtral 8x22B Instruct | PASS | retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0 |
| OLMo 2 13B Instruct | PASS | schema/prompt retry is clean JSON 1.0; MBS-Lang is 1.0 / 1.0 |
| Hermes 3 Llama 3.1 70B | PASS | schema/prompt retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0 |
| Qwen3-30B-A3B Instruct | PASS | MoE Qwen row: schema/prompt retry reaches 1.0 schema-valid and clean JSON; MBS-Lang is 1.0 / 1.0 |
| Mistral Small 3.1 24B | REVIEW | schema/prompt retry reaches 1.0 schema-valid, but clean JSON remains low because outputs are prose-wrapped |
| Falcon3 10B Instruct | REVIEW | retry improves schema validity, but clean JSON is 0.0 because outputs are fenced/prose-wrapped |
| MiniCPM4 8B | REVIEW | MBS-Lang retry reaches 1.0 schema-valid, but clean JSON remains near zero |
| Phi-4 Reasoning Plus | FAIL | retry cannot recover clean structured output; reasoning prose keeps schema and semantic rates at 0.0 |
| Qwen3 32B | FAIL | strict/retry improves extraction, but clean JSON remains 0.0 because reasoning/prose dominates |
| QwQ 32B | FAIL | retry improves some extracted schema validity, but clean JSON remains 0.0 |
| DeepSeek R1 Distill Llama 70B | FAIL | schema retry reaches only 0.4583 schema-valid / 0.5833 semantic; clean JSON remains 0.0 |
Core Commands
mbs demo
mbs compile examples/fintech_transaction_risk/schema.json
mbs compile examples/fintech_transaction_risk/schema.json --format strict
mbs validate --schema examples/fintech_transaction_risk/schema.json --output examples/fintech_transaction_risk/output.json
mbs check --schema examples/fintech_transaction_risk/schema.json --input "Customer transfers 4800 EUR to a new beneficiary" --model mock
mbs bench --config benchmarks/models.yaml
mbs report --results benchmarks/results/*.json --exclude-infra --require-traces --summary-only
- MBS Compiler: schema to minimal behavioral contract
- MBS Validate: exact JSON/schema/enum/type failures
- MBS Check: compile, run or mock, validate, trace, cost summary
- MBS Trace: audit object for every structured output
- MBS Cost: cost per valid structured output
- MBS Bench: repeatable structured-output benchmark starter
- MBS Test: CI-style structured-output regression command
- MBS-Lang: hybrid contracts for multilingual structured-output workflows
- MBS Report: aggregate benchmark JSON into Markdown tables, model scorecards, and failure summaries
- MBS Compare: detect metric regressions against prior results
- MBS Models: enforce broad model-suite coverage
- MBS Triage: inspect remote result directories before scaling GPU runs, with issue summaries, capped terminal output, and case-level failure examples
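For the trace component, the following sketch shows one way an audit object with schema, contract, input, and output hashes could be built. The field names, the `example_trace_` prefix, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the actual MBS trace format may differ.

```python
import hashlib
import json


def sha256_hex(value) -> str:
    """Hash a JSON-serializable value using a canonical (sorted-key) encoding."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_trace(schema: dict, contract: str, input_text: str, output: dict) -> dict:
    """Build a hypothetical audit object for one structured output."""
    hashes = {
        "schema_hash": sha256_hex(schema),
        "contract_hash": sha256_hex(contract),
        "input_hash": sha256_hex(input_text),
        "output_hash": sha256_hex(output),
    }
    # Derive the id from the four hashes so identical runs yield the same id.
    trace_id = "example_trace_" + sha256_hex(hashes)[:16]
    return {"trace_id": trace_id, **hashes}


trace = make_trace(
    schema={"action": ["ESCALATE", "REPLY"]},
    contract="action: ESCALATE or REPLY",
    input_text="I cannot sign in.",
    output={"action": "ESCALATE"},
)
print(sorted(trace))
# prints ['contract_hash', 'input_hash', 'output_hash', 'schema_hash', 'trace_id']
```

Hashing a canonical encoding rather than the raw text means two outputs that differ only in key order share the same output hash, which keeps traces comparable across runs.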
Prompt styles include natural, progressive, full, and strict. Use
strict when evaluating models that tend to emit analysis, markdown, or prose
around JSON; MBS records prose_wrapped_json warnings and reasoning_prose
failures so those behaviors remain visible in reports.
MBS is not a full agent runtime, an agent marketplace, or an AI operating system.
Docs
- docs/mbs_yc_evidence_brief.md
- docs/mbs_quickstart.md
- docs/mbs_bench.md
- docs/mbs_lang.md
- docs/mbs_model_behavior_guidance.md
- docs/ci.md
- docs/mbs_hpc_compute_plan.md
- docs/mbs_remote_stage1_runbook.md
- docs/mbs_product_quality_plan.md
Positioning
Use PASS / FAIL / REVIEW, exact failure reasons, traces, cost per valid output, and benchmark evidence. Treat numeric reliability scores as experimental until they are calibrated.
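One way a caller might act on the three verdicts is a small gate in front of the tool call. This is a hypothetical sketch: the `gate` function and the routing labels are not part of MBS, and the policy for REVIEW and FAIL is an assumption that each team would set for itself.

```python
from typing import Callable


def gate(status: str, run_tool: Callable[[], str]) -> str:
    """Run the tool only on PASS; hold REVIEW for a human; block FAIL."""
    if status == "PASS":
        return run_tool()
    if status == "REVIEW":
        return "queued_for_review"  # assumed policy: a person confirms first
    if status == "FAIL":
        return "blocked"  # assumed policy: never act on an invalid output
    raise ValueError(f"unknown status: {status!r}")


print(gate("PASS", lambda: "ticket_escalated"))    # prints ticket_escalated
print(gate("REVIEW", lambda: "ticket_escalated"))  # prints queued_for_review
print(gate("FAIL", lambda: "ticket_escalated"))    # prints blocked
```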