Aletheia MBS: measurable structured agent outputs — compile schemas into contracts, validate, trace, gate, and report cost per safe decision

MBS

MBS makes structured agent behavior measurable.

It compiles schemas into minimal behavioral contracts, validates structured outputs, creates portable traces, reports cost per valid output, and provides a starter benchmark/test surface for structured-output reliability.

30-Second Demo

pip install -e .
mbs demo

Terminal output:

MBS YC demo: structured agent output check

Input prompt:
  Customer says: I think my account was taken over and I cannot sign in.

Check:
  status=FAIL
  failure_type=invalid_enum
  trace_id=mbs_trace_...
  reason=invalid_enum at action
  hint=joined_enum_values

MBS retry repair:
  {"action": "ESCALATE", "priority": "HIGH", "category": "SECURITY", ...}
  status=PASS

Cost/token comparison:
  MBS contract tokens: 62
  Verbose prompt tokens: 164
  Token savings: 62.2%
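The savings figure follows directly from the two token counts in the demo output. A minimal check of the arithmetic, where the helper name `token_savings` is illustrative and not part of MBS:

```python
def token_savings(contract_tokens: int, verbose_tokens: int) -> float:
    """Return the fraction of prompt tokens saved by the compact contract."""
    if verbose_tokens <= 0:
        raise ValueError("verbose_tokens must be positive")
    return (verbose_tokens - contract_tokens) / verbose_tokens


# Token counts taken from the demo output above.
savings = token_savings(contract_tokens=62, verbose_tokens=164)
print(f"Token savings: {savings:.1%}")  # prints "Token savings: 62.2%"
```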

To regenerate the YC artifacts:

mbs demo --write-artifacts

This writes:

benchmarks/results/yc_sample_benchmark.json
benchmarks/results/yc_sample_benchmark.md
docs/mbs_yc_evidence_brief.md

What MBS Does

MBS is for teams building agents that must produce structured outputs before they call tools, update records, or trigger workflows.

For each structured output, MBS provides:

a compact behavioral contract from a schema
PASS / FAIL / REVIEW validation
exact failure reasons such as invalid_enum, missing_required_key, and wrong_type
a trace id with schema, contract, input, and output hashes
retry and cost-per-valid-output accounting
strict JSON-only diagnostics for models that wrap JSON in prose or emit reasoning text instead of JSON
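To make the failure reasons listed above concrete, here is a minimal sketch of a validator that returns a verdict and one exact reason. It is not the MBS implementation: the `SCHEMA` layout, the `validate` function, and every enum value other than ESCALATE, HIGH, and SECURITY are assumptions made for illustration.

```python
from typing import Any

# Hypothetical schema layout: field name -> (expected type, allowed values or None).
# Only ESCALATE, HIGH, and SECURITY appear in the demo; the rest are made up.
SCHEMA = {
    "action": (str, {"ESCALATE", "REPLY", "CLOSE"}),
    "priority": (str, {"LOW", "MEDIUM", "HIGH"}),
    "category": (str, {"SECURITY", "BILLING", "GENERAL"}),
}


def validate(output: dict[str, Any], schema: dict = SCHEMA) -> dict[str, Any]:
    """Return a PASS/FAIL verdict with the first failure found, if any."""
    for key, (expected_type, allowed) in schema.items():
        if key not in output:
            return {"status": "FAIL", "failure_type": "missing_required_key", "key": key}
        value = output[key]
        if not isinstance(value, expected_type):
            return {"status": "FAIL", "failure_type": "wrong_type", "key": key}
        if allowed is not None and value not in allowed:
            return {"status": "FAIL", "failure_type": "invalid_enum", "key": key}
    return {"status": "PASS", "failure_type": None, "key": None}


# One plausible reading of an enum failure: several values joined into one string.
bad = {"action": "ESCALATE|REPLY", "priority": "HIGH", "category": "SECURITY"}
good = {"action": "ESCALATE", "priority": "HIGH", "category": "SECURITY"}
print(validate(bad))
print(validate(good))
```

Returning the first failure keeps the reason exact and easy to feed into a retry prompt. A real validator might instead collect every failure in one pass.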

YC Benchmark Sample

Deterministic local sample: 3 support-agent cases x 2 mock model adapters.

strategy	cases	models	schema-valid	semantic-correct	avg retries	cost / valid output
verbose prompt	6	2	0.500	0.500	0.000	385.0
MBS contract + retry	6	2	1.000	1.000	0.333	121.833
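The README does not define how the cost column is computed. The sketch below assumes the common definition (total spend, retries included, divided by the number of valid outputs) and then compares the two rows of the table. Both function names are hypothetical, and the cost units are whatever the table reports.

```python
def cost_per_valid_output(total_cost: float, valid_outputs: int) -> float:
    """Assumed definition: total spend (retries included) per valid output."""
    if valid_outputs == 0:
        return float("inf")  # nothing valid was produced, so no finite cost
    return total_cost / valid_outputs


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` undercuts `baseline`."""
    return (baseline - candidate) / baseline


# Cost figures copied from the sample table above.
reduction = relative_reduction(baseline=385.0, candidate=121.833)
print(f"Cost per valid output is {reduction:.1%} lower")  # prints 68.4% lower
```

Under this definition a retry raises total cost but can still lower the metric, because it also raises the count of valid outputs.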

The sample is deliberately small so a reviewer can understand it quickly. The larger MN5/Leonardo benchmark reports are tracked separately.

Current MN5 GPU headline snapshot: 18 open instruction/chat/code/MoE models across 4 schemas and 7 MBS-Lang settings.

benchmark	no-retry schema-valid	retry schema-valid	no-retry semantic	retry semantic	audit
schema/prompt	0.7321	0.9388	0.8370	0.9203	compare PASS; retry audit finds 1 Mixtral tool-call regression
MBS-Lang	0.9061	0.9603	0.8783	0.9550	PASS, 0 selected-attempt regressions

Reports:

benchmarks/results/mn5_eighteen_model_instruction/report_matrix_retry_summary.md
benchmarks/results/mn5_eighteen_model_instruction/report_lang_retry_summary.md

The same reports now include clean_json_rate and format_risk, which separate raw JSON compliance from JSON recovered by extraction. In the 18-model retry headline, the clean-JSON rate is 0.8810 for schema/prompt and 0.9021 for MBS-Lang. Nemotron 70B is a useful example: it reaches 1.0 schema validity after retry, but its clean-JSON rate is 0 because every output is valid JSON wrapped in prose that has to be extracted. MBS therefore labels it REVIEW instead of a clean PASS.
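The gap between schema validity and clean JSON can be sketched as two rates computed over the same attempts. The `Attempt` record and the labeling rule below are assumptions chosen to reproduce the Nemotron 70B case described above. They are not the thresholds MBS actually applies.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    schema_valid: bool  # the parsed output satisfies the schema
    clean_json: bool    # the raw reply was JSON with nothing around it


def summarize(attempts: list[Attempt]) -> dict[str, object]:
    """Compute both rates and a coarse label (simplified, assumed rule)."""
    n = len(attempts)
    if n == 0:
        raise ValueError("no attempts to summarize")
    schema_valid_rate = sum(a.schema_valid for a in attempts) / n
    clean_json_rate = sum(a.clean_json for a in attempts) / n
    if schema_valid_rate == 1.0 and clean_json_rate == 1.0:
        label = "PASS"
    elif schema_valid_rate == 1.0:
        label = "REVIEW"  # valid, but only after extraction from prose
    else:
        label = "FAIL"
    return {
        "schema_valid_rate": schema_valid_rate,
        "clean_json_rate": clean_json_rate,
        "label": label,
    }


# Every reply is valid once extracted, yet none is clean JSON.
prose_wrapped = [Attempt(schema_valid=True, clean_json=False) for _ in range(4)]
print(summarize(prose_wrapped))
```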

Broader MN5 stress snapshot: 21 models after adding Phi-4 Mini, DeepSeek R1 Distill Qwen 14B, and DeepSeek R1 Distill Qwen 32B. This wider set keeps weak-model failures visible instead of turning them into a single blended score.

benchmark	no-retry schema-valid	retry schema-valid	no-retry semantic	retry semantic	audit
schema/prompt	0.6442	0.8535	0.7444	0.8419	PASS, 0 selected-attempt regressions
MBS-Lang	0.8095	0.8519	0.7817	0.8466	PASS, 0 selected-attempt regressions

Latest large-model additions:

model	MBS result	product signal
Mixtral 8x22B Instruct	PASS	retry reaches `1.0` schema-valid and clean JSON; MBS-Lang is `1.0` / `1.0`
OLMo 2 13B Instruct	PASS	schema/prompt retry is clean JSON `1.0`; MBS-Lang is `1.0` / `1.0`
Hermes 3 Llama 3.1 70B	PASS	schema/prompt retry reaches `1.0` schema-valid and clean JSON; MBS-Lang is `1.0` / `1.0`
Qwen3-30B-A3B Instruct	PASS	MoE Qwen row: schema/prompt retry reaches `1.0` schema-valid and clean JSON; MBS-Lang is `1.0` / `1.0`
Mistral Small 3.1 24B	REVIEW	schema/prompt retry reaches `1.0` schema-valid, but clean JSON remains low because outputs are prose-wrapped
Falcon3 10B Instruct	REVIEW	retry improves schema validity, but clean JSON is `0.0` because outputs are fenced/prose-wrapped
MiniCPM4 8B	REVIEW	MBS-Lang retry reaches `1.0` schema-valid, but clean JSON remains near zero
Phi-4 Reasoning Plus	FAIL	retry cannot recover clean structured output; reasoning prose keeps schema and semantic rates at `0.0`
Qwen3 32B	FAIL	strict/retry improves extraction, but clean JSON remains `0.0` because reasoning/prose dominates
QwQ 32B	FAIL	retry improves some extracted schema validity, but clean JSON remains `0.0`
DeepSeek R1 Distill Llama 70B	FAIL	schema retry reaches only `0.4583` schema-valid / `0.5833` semantic; clean JSON remains `0.0`

Core Commands

mbs demo
mbs compile examples/fintech_transaction_risk/schema.json
mbs compile examples/fintech_transaction_risk/schema.json --format strict
mbs validate --schema examples/fintech_transaction_risk/schema.json --output examples/fintech_transaction_risk/output.json
mbs check --schema examples/fintech_transaction_risk/schema.json --input "Customer transfers 4800 EUR to a new beneficiary" --model mock
mbs bench --config benchmarks/models.yaml
mbs report --results benchmarks/results/*.json --exclude-infra --require-traces --summary-only

MBS Compiler: schema to minimal behavioral contract
MBS Validate: exact JSON/schema/enum/type failures
MBS Check: compile, run or mock, validate, trace, cost summary
MBS Trace: audit object for every structured output
MBS Cost: cost per valid structured output
MBS Bench: repeatable structured-output benchmark starter
MBS Test: CI-style structured-output regression command
MBS-Lang: hybrid contracts for multilingual structured-output workflows
MBS Report: aggregate benchmark JSON into Markdown tables, model scorecards, and failure summaries
MBS Compare: detect metric regressions against prior results
MBS Models: enforce broad model-suite coverage
MBS Triage: inspect remote result directories before scaling GPU runs, with issue summaries, capped terminal output, and case-level failure examples
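As an illustration of the audit object that MBS Trace describes, the sketch below hashes the schema, contract, input, and output with SHA-256 over a canonical JSON encoding. The field names, the `build_trace` helper, and the `example_trace_` id scheme are all hypothetical. The README does not show the real trace format.

```python
import hashlib
import json


def sha256_of(obj) -> str:
    """Hash a JSON-serializable object using a canonical, sorted-key encoding."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_trace(schema: dict, contract: str, input_text: str, output: dict) -> dict:
    """Return a portable record that ties one output to what produced it."""
    hashes = {
        "schema_hash": sha256_of(schema),
        "contract_hash": sha256_of(contract),
        "input_hash": sha256_of(input_text),
        "output_hash": sha256_of(output),
    }
    # Hypothetical id scheme: a prefix plus a short digest of the four hashes.
    trace_id = "example_trace_" + sha256_of(hashes)[:16]
    return {"trace_id": trace_id, **hashes}


trace = build_trace(
    schema={"action": ["ESCALATE"]},         # toy schema for the example
    contract="action: ESCALATE",             # toy contract text
    input_text="I cannot sign in.",
    output={"action": "ESCALATE"},
)
print(trace["trace_id"])
```

Sorting keys before hashing means two outputs that differ only in key order get the same hash, which keeps traces comparable across runs.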

Prompt styles include natural, progressive, full, and strict. Use strict when evaluating models that tend to emit analysis, markdown, or prose around JSON; MBS records prose_wrapped_json warnings and reasoning_prose failures so those behaviors remain visible in reports.
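The distinction between clean JSON, prose-wrapped JSON, and prose with no usable JSON can be sketched with the standard `json` module. The function name and the `no_json` label are assumptions. Only `prose_wrapped_json` is a name taken from the text above.

```python
import json


def classify_raw_output(raw: str) -> str:
    """Return 'clean_json', 'prose_wrapped_json', or 'no_json' for a model reply."""
    text = raw.strip()
    try:
        if isinstance(json.loads(text), dict):
            return "clean_json"  # the whole reply is one JSON object
    except json.JSONDecodeError:
        pass

    # Otherwise look for a JSON object embedded somewhere in the reply.
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return "prose_wrapped_json"
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return "no_json"


print(classify_raw_output('{"action": "ESCALATE"}'))
print(classify_raw_output('Sure, here it is: {"action": "ESCALATE"} Hope that helps.'))
print(classify_raw_output("I think this should be escalated."))
```

A recovered object is still worth validating, but reporting it separately keeps the model's formatting behavior visible.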

MBS is not a full agent runtime, an agent marketplace, or an AI operating system.

Docs

docs/mbs_yc_evidence_brief.md
docs/mbs_quickstart.md
docs/mbs_bench.md
docs/mbs_lang.md
docs/mbs_model_behavior_guidance.md
docs/ci.md
docs/mbs_hpc_compute_plan.md
docs/mbs_remote_stage1_runbook.md
docs/mbs_product_quality_plan.md

Positioning

Describe MBS results in terms of PASS / FAIL / REVIEW verdicts, exact failure reasons, traces, cost per valid output, and benchmark evidence. Treat numeric reliability scores as experimental until they are calibrated.

Release history

1.0.1, Jun 27, 2026
1.0.0, Jun 27, 2026
0.2.0, Jun 10, 2026 (this version)

Download files

Source Distribution

aletheia_mbs-0.2.0.tar.gz (52.4 kB)

Uploaded Jun 10, 2026 Source

Built Distribution

aletheia_mbs-0.2.0-py3-none-any.whl (54.5 kB)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file aletheia_mbs-0.2.0.tar.gz.

File metadata

Download URL: aletheia_mbs-0.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 52.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for aletheia_mbs-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7abfc945cea4b80088f1a376b3bfa41e21bba49495ada71910ec82b4a01976d3`
MD5	`60a720573f6eeff5d01336186b34bafb`
BLAKE2b-256	`1ac96488689aa615c4a3c052fcfd3b9b2c2f28c3284e67486ff2952d3a917a0f`

File details

Details for the file aletheia_mbs-0.2.0-py3-none-any.whl.

File metadata

Download URL: aletheia_mbs-0.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 54.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for aletheia_mbs-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`08277a639cbbe5b47d64148c88b41da2e192e476d81540177863d3af548993ee`
MD5	`4717ce73c2c2dcedcaa8da41997e17de`
BLAKE2b-256	`0969a970f488515e7aedb3e67cff7f59b7e61b00eeda0e6eeb468562234e6c57`

aletheia-mbs 0.2.0
