Deterministic VM for LLM program execution
Project description
Execution governance runtime with deterministic receipts and replayable traces.
LLMs are signal generators. Execution authority belongs to the runtime.
What nano-vm produces
Most AI frameworks answer: how do you coordinate agents?
Almost nobody answers: what proof do you have that execution happened correctly?
nano-vm answers the second question by producing verifiable artifacts at every layer:
| Artifact | What it is | How it's produced |
|---|---|---|
| Trace | Step-by-step execution record with SHA-256 Merkle chain | Runtime — one entry per step |
| ExecutionReceipt | Minimal decision state derived from Trace | TraceAnalyzer.receipt() — deterministic projection |
| GovernanceEnvelope | Per-step policy hash + canonical state snapshot | Runtime — stored in SQLite WAL |
The core contract:
Receipt = f(Trace)
The receipt is a deterministic, recomputable projection of the trace. It never contains information that isn't already in the trace. It never requires the LLM to generate it.
Program
↓
Validator ← static analysis before execution
↓
ExecutionVM ← FSM transition authority
↓
Trace ← authoritative execution record
↓
TraceAnalyzer ← post-hoc interpretation
↓
Receipt ← minimal state for continuation decisions
Install
pip install llm-nano-vm
pip install llm-nano-vm[litellm] # for LLM provider support
Quick Start — Tool Pipeline
LLMs are not required. nano-vm runs as a pure deterministic workflow engine.
from nano_vm import ExecutionVM, Program
program = Program.from_dict({
"name": "payment_flow",
"steps": [
{"id": "reserve", "type": "tool", "tool": "reserve_funds"},
{"id": "capture", "type": "tool", "tool": "capture_payment"},
{"id": "receipt", "type": "tool", "tool": "send_receipt"},
]
})
vm = ExecutionVM(tools={
"reserve_funds": reserve_funds,
"capture_payment": capture_payment,
"send_receipt": send_receipt,
})
trace = await vm.run(program)
print(trace.status) # SUCCESS
The runtime still guarantees: deterministic ordering, replayable execution, trace visibility, transition enforcement, idempotent re-execution across restarts.
Quick Start — LLM Pipeline
from nano_vm import ExecutionVM, Program
from nano_vm.adapters import LiteLLMAdapter
program = Program.from_dict({
"name": "customer_refund",
"steps": [
{
"id": "analyze",
"type": "llm",
"prompt": "Is this a valid refund request? Reply 'yes' or 'no'.\nRequest: $user_input",
"output_key": "decision",
"allowed_outputs": ["yes", "no"], # runtime enum gate — not a prompt hint
},
{
"id": "guardrail",
"type": "condition",
"condition": "'yes' in '$decision'",
"then": "process_refund",
"otherwise": "reject",
},
{"id": "process_refund", "type": "tool", "tool": "issue_refund", "is_terminal": True},
{"id": "reject", "type": "tool", "tool": "send_rejection", "is_terminal": True},
],
})
vm = ExecutionVM(
llm=LiteLLMAdapter("openai/gpt-4o-mini"),
tools={"issue_refund": ..., "send_rejection": ...},
)
trace = await vm.run(program, context={"user_input": "I was charged twice"})
print(trace.status) # SUCCESS
print(trace.total_cost_usd()) # e.g. 0.000034
The guardrail step cannot be skipped, reordered, or overridden by the model.
Suspend / Resume
Return "PENDING" from any tool to suspend execution:
async def initiate_payment(**kwargs) -> str:
await register_webhook(kwargs["order_id"])
return "PENDING" # FSM → SUSPENDED, cursor persisted
FSM transition: RUNNING → SUSPENDED → RUNNING → SUCCESS
This enables: payment settlement, courier confirmation, approval systems, human-in-the-loop, webhook orchestration. The process can restart. The cursor survives.
trace = await vm.run(program, context={"order_id": "123"})
assert trace.status.name == "SUSPENDED"
trace = await vm.resume_with_program(
program=program,
trace_id=trace.trace_id,
webhook_event={"type": "payment.confirmed", "order_id": "123"},
)
assert trace.status.name == "SUCCESS"
Note:
"PENDING"is a reserved FSM sentinel. Use"REQUIRES_ACTION"or"AWAITING_3DS"for domain-specific states.
LLM Output Enforcement — allowed_outputs
Validates the model's raw output against an explicit enum before it enters the FSM context. This is a runtime gate, not a prompt hint.
{
"id": "classify",
"type": "llm",
"prompt": "Classify the request. Reply ONLY with: refund / query / other",
"output_key": "category",
"allowed_outputs": ["refund", "query", "other"],
"on_error": "skip", # output → "refund" (first element) on mismatch
}
on_error |
On mismatch |
|---|---|
fail (default) |
VMError → trace.FAILED |
skip |
output replaced with allowed_outputs[0] |
retry |
retry up to max_retries; VMError if exhausted |
Per-step LLM timeout:
{
"id": "classify",
"type": "llm",
"timeout_seconds": 10.0,
"on_timeout": "fail", # or "fallback" → allowed_outputs[0] or ''
}
ProgramValidator — Pre-flight Static Analysis
Validates a program before execution. Catches structural issues that would cause runtime failures.
from nano_vm import ProgramValidator
validator = ProgramValidator(program)
report = validator.validate()
print(report.is_valid()) # False if any ERROR-severity issue found
for issue in report.issues:
print(issue.severity, issue.code, issue.message)
Checks performed:
| Check | Severity | Description |
|---|---|---|
missing_targets |
ERROR | branch targets that reference non-existent steps |
unreachable_steps |
ERROR | steps unreachable from any execution path |
cycle_detection |
ERROR | cycles that would cause infinite loops |
no_failure_terminal |
WARNING | no reachable terminal step with a failure outcome |
is_valid() returns True only when no ERROR-severity issues exist. WARNING is informational — simple linear programs without failure terminals are valid.
TraceAnalyzer — Post-hoc Execution Analysis
Analyzes a completed trace. Pure post-processing — no changes to the runtime state.
from nano_vm import TraceAnalyzer
analyzer = TraceAnalyzer(trace)
report = analyzer.report() # TraceHealthReport — lazy, cached
print(report.rollback_density) # 0.0 – 1.0
print(report.tool_churn_rate) # 0.0 – 1.0
print(report.path_variance) # 0.0 – 1.0
print(report.transition_entropy) # bits
print(report.invariant_violation_rate) # 0.0 – 1.0
Alert thresholds (informational — do not interrupt FSM):
| Metric | Alert threshold | Signal |
|---|---|---|
rollback_density |
> 0.3 | excessive compensating transitions |
tool_churn_rate |
> 0.4 | unstable tool selection |
path_variance |
> 0.5 | non-deterministic branching |
transition_entropy |
> 2.5 bits | ~5+ unique transition pairs |
invariant_violation_rate |
> 0.2 | repeated constraint failures |
Alerts are warnings, not errors. The FSM is never interrupted by the analyzer.
ExecutionReceipt — Receipt = f(Trace)
The receipt is a deterministic projection of the trace. It contains the minimal state needed for a continuation decision — whether to resume, replay, or escalate.
receipt = analyzer.receipt() # lazy, cached; recomputable at any time
print(receipt.trace_id)
print(receipt.final_status) # TraceStatus
print(receipt.resumable) # bool
print(receipt.replayable) # bool
print(receipt.failed_steps) # int
print(receipt.retried_steps) # int
print(receipt.rejected_transitions) # tuple[RejectedTransition, ...]
print(receipt.health) # TraceHealthReport
RejectedTransition captures each failed step with its reason and timestamp:
for rt in receipt.rejected_transitions:
print(rt.step_id, rt.reason, rt.timestamp)
Contract invariants:
Receipt ⊆ Trace— the receipt never contains information outside the trace- No LLM generation — the receipt is computed, never summarized
- No operator-specific fields — the receipt is infrastructure, not application logic
- Recomputable — given the same trace,
receipt()always returns the same result
Evaluator Blindness by Design
Benchmark research across 10 frontier models found that every model changes behavior when it detects it's being monitored. The best model scored 84% on "acts the same whether watched or not."
The ProjectionLayer gives the LLM only a target=LLM projection of state. Governance metadata — GovernanceEnvelope, canonical_hash, policy_hash, retry counters — never reaches the prompt. The model cannot observe its own audit trail.
Evaluator blindness is structural, not configured.
FSM Transition Model
| Current state | Event | Next state |
|---|---|---|
RUNNING |
tool success | RUNNING |
RUNNING |
tool returns "PENDING" |
SUSPENDED |
RUNNING |
tool error (on_error=fail) |
FAILED |
RUNNING |
tool error (on_error=skip) |
RUNNING (output=None) |
RUNNING |
condition branch taken | RUNNING (jump to then/otherwise) |
RUNNING |
max_steps / max_tokens exceeded |
BUDGET_EXCEEDED |
RUNNING |
max_stalled_steps exceeded |
STALLED |
RUNNING |
no more steps | SUCCESS |
SUSPENDED |
resume_with_program() called |
RUNNING (from cursor) |
| terminal | — | absorbing (no further transitions) |
Terminal states: SUCCESS, FAILED, BUDGET_EXCEEDED, STALLED.
Failure states are first-class outcomes. FAILED, NEED_HELP, INSUFFICIENT_DATA, POLICY_BLOCKED are legitimate terminal states equivalent to SUCCESS — not exceptions to be swallowed.
Program DSL
Four step types:
| Type | Purpose |
|---|---|
llm |
call the model; result stored in output_key |
tool |
call a Python function; return "PENDING" to suspend |
condition |
branch on an expression; then / otherwise |
parallel |
run independent sub-steps concurrently via asyncio.gather |
Step fields:
| Field | Default | Description |
|---|---|---|
on_error |
fail |
fail · skip · retry |
max_retries |
3 |
total attempts; backoff: 1s, 2s, 4s… cap 30s |
max_concurrency |
None |
parallel blocks only |
is_terminal |
False |
return SUCCESS after this step (leaf nodes) |
next_step |
None |
jump to named step instead of returning SUCCESS |
allowed_outputs |
None |
LLM-only — accepted output enum; ValidationError if empty |
timeout_seconds |
None |
LLM-only — asyncio.wait_for timeout in seconds |
on_timeout |
'fail' |
'fail' · 'fallback' (→ allowed_outputs[0] or '') |
Program budget options:
| Option | Default | Description |
|---|---|---|
max_steps |
None |
BUDGET_EXCEEDED if exceeded |
max_stalled_steps |
None |
STALLED after N consecutive no-op fingerprints |
max_tokens |
None |
BUDGET_EXCEEDED when total tokens exceed limit |
Variable interpolation
| Syntax | Resolves to |
|---|---|
$key |
value from initial context (typed — int/dict/list preserved) |
$step_id.output |
output of a previous step |
$step_id.output.field |
field within a step's dict output |
Condition expressions
⚠ ASTEngine replaces
eval(). Conditions are parsed into a validated JSON AST and evaluated by a pure, sandboxed interpreter. No Python builtins are accessible.
Supported: ==, !=, >, <, in, not in, and, or, not, contains, dotted-path $var.field.
Not supported: method calls (.lower(), .strip()), arithmetic, parentheses grouping. Using an unsupported form raises ASTEvalError at parse time.
# ❌ WRONG — method call raises ASTEvalError
{"condition": "'yes' in '$decision'.lower()"}
# ✅ CORRECT
{"condition": "'yes' in '$decision'"}
MCP Integration
nano-vm pairs with nano-vm-mcp — an MCP gateway
that exposes run_program, get_trace, list_programs, get_program, delete_program
over stdio or SSE transport with bearer auth, SQLite WAL persistence, and GovernanceEnvelope audit trail.
Claude Code / MCP Client
↓
nano-vm-mcp ← decides how execution is allowed to proceed
↓
ExecutionVM ← transition authority
↓
GovernanceEnvelope ← proves it happened
GovernanceEnvelope
Each successful execution step produces a GovernanceEnvelope stored in SQLite WAL:
| Field | Description |
|---|---|
execution_id |
Session / trace identifier |
step_id |
Step index within the execution |
policy_hash |
SHA-256 of the active PolicySnapshot |
canonical_snapshot_hash |
Merkle/delta hash of CanonicalState |
payload |
Projected (sanitized) step output |
CapabilityRef and GDPR Tombstoning
Sensitive values are stored as CapabilityRef tokens (vault://secret/<id>).
On a GDPR erasure event, the ref is tombstoned. All subsequent projections return
[REDACTED_TOMBSTONE], preserving the hash chain. Forensic auditability survives erasure.
Observability
trace.trace_id # UUID4 — stable for OTel propagation
trace.status # TraceStatus.SUCCESS | FAILED | SUSPENDED | BUDGET_EXCEEDED | STALLED
trace.final_output
trace.total_tokens() # O(1) incremental accumulator
trace.total_cost_usd() # requires LiteLLMAdapter
trace.state_snapshots # list[(step_index, sha256_hex)]
for step in trace.steps:
print(step.step_id, step.status, step.duration_ms, step.usage)
Testing — Deterministic by Design
from nano_vm import ExecutionVM, Program, TraceStatus
from nano_vm.adapters import MockLLMAdapter
vm = ExecutionVM(llm=MockLLMAdapter("yes")) # always returns "yes"
# Per-call sequence
vm = ExecutionVM(llm=MockLLMAdapter(["SAFE", "yes"]))
# Per-prompt substring mapping
vm = ExecutionVM(llm=MockLLMAdapter({
"Classify": "SAFE",
"eligible": "yes",
"__default__": "ok",
}))
trace = await vm.run(program, context={"user_input": "refund"})
assert trace.status == TraceStatus.SUCCESS
Same input → same step sequence. No API key required.
State Determinism vs Semantic Determinism: nano-vm guarantees step execution order, no skipping, reproducible trace structure — regardless of LLM output. It does not guarantee that LLM text is identical across runs. Use MockLLMAdapter for both.
Performance
| Suite | Result |
|---|---|
| FSM invariant suite | 13/13 · 1,020,000 ops · 0 violations |
| Integration suite | 10/10 · 1,096,500 ops · 0 violations |
| 10k stress | 14,327 graphs/sec · 0.70 s/run |
| MoMo PoC v4 | 9/9 PASS |
| Stripe PoC v1 | 9/9 PASS |
| ID | Scenario | Mean TPS | p95 avg |
|---|---|---|---|
| BM-INT-01 | Refund pipeline | 2,300/s | 0.66 ms |
| BM-INT-02 | Double-execution guard | 2,400/s | 0.67 ms |
| BM-INT-03 | Budget enforcement | 1,100/s | 331 ms |
| BM-INT-04 | Parallel throughput | 436/s | 542 ms |
| BM-INT-05 | MCP store round-trip | 3,000/s | 0.42 ms |
| BM-INT-06 | GovernanceEnvelope | 1,300/s | 171 ms |
| BM-INT-07 | Crash consistency | 7/s | 233 ms |
| BM-INT-08 | Replay equivalence | 1,300/s | 1.30 ms |
| BM-INT-09 | Adversarial retries | 2,400/s | 0.64 ms |
| BM-INT-10 | Long-horizon | 30/s | 3,606 ms |
Environment: QEMU/KVM · Intel Xeon E5-2697A v4 · 2 cores · Python 3.12 · Mock adapter.
Comparison
| LangChain | CrewAI | Temporal | nano-vm | |
|---|---|---|---|---|
| LLM-native | ✅ | ✅ | ❌ | ✅ |
| Deterministic execution | ❌ | ❌ | ✅ | ✅ |
| Replayable traces | partial | minimal | ✅ | ✅ |
| Deterministic receipt | ❌ | ❌ | ❌ | ✅ |
| Suspend/resume | partial | partial | ✅ | ✅ |
| Runtime guardrails | ❌ | ❌ | partial | ✅ |
| LLM output enforcement | ❌ | ❌ | ❌ | ✅ |
| Pre-flight validator | ❌ | ❌ | ❌ | ✅ |
| Evaluator blindness | ❌ | ❌ | ❌ | ✅ |
| Lightweight / embedded | ❌ | ❌ | ❌ | ✅ |
vs Marvin / DSPy: those optimize what the LLM produces. nano-vm controls when and whether steps run — orthogonal concerns, composable.
vs Temporal / Cadence: Temporal solves durable execution for distributed systems. nano-vm solves governed execution for LLM workflows — embedded, no infrastructure, Python-native.
When to Use
Use nano-vm when:
- workflow structure is known in advance
- correctness and auditability matter (fintech, compliance, enterprise)
- you need a reproducible trace for debugging or audit
- guardrails must be enforced at the system level, not in the prompt
- async orchestration with suspend/resume is required
- you need a deterministic receipt to prove what happened
Do not use when:
- workflow must be discovered fully at runtime
- the task is open-ended creative reasoning
- fully autonomous multi-agent coordination is required
Roadmap
Done:
- Deterministic FSM runtime (v0.1)
-
parallelsteps —asyncio.gather(v0.2.0) -
retrypolicy +max_concurrency(v0.3.0) - Budget guards:
max_steps,max_stalled_steps,max_tokens(v0.4.0) -
state_snapshots— sha256 fingerprint per step (v0.4.0) -
Planner— intent → Program in 1 LLM call (v0.5.0) - FSM invariant stress suite — 13/13 · 1,020,000 ops (v0.6.0)
-
suspend / resume—"PENDING"sentinel +CursorRepository(v0.7.0) -
BudgetInterrupt+_emit_interrupt()hook (v0.7.0) -
Trace.trace_id— UUID4, OTel-ready (v0.7.0) -
erase()— GDPR tombstoning with hash-chain preservation (v0.7.0) -
ASTEngine—eval()removed; sandboxed condition evaluator (v0.7.0) - Integration benchmark suite — 10/10 · 1,096,500 ops (v0.7.3)
-
Step.is_terminal,Step.next_step— branch semantics (v0.7.4) - ASTEngine METHOD_CALL guard —
ASTEvalErrorat parse time (v0.7.5) -
py.typedmarker — PEP 561 (v0.7.4) - MCP server —
nano-vm-mcpwith GovernanceEnvelope, CapabilityRef, SSE + stdio -
Step.allowed_outputs— LLM output validation against enum (v0.8.0) -
Step.timeout_seconds+on_timeout— per-step LLM timeout (v0.8.0) -
inspect.iscoroutinefunction— Python 3.14 deprecation fix (v0.8.2) -
TraceAnalyzer— rollback density, tool churn rate, path variance, transition entropy, invariant violation rate (v0.8.3) -
LLMAdapter.complete()—str | tuple[str, dict | None]Protocol (v0.8.4) -
ProgramValidator— missing targets, unreachable steps, cycle detection, no_failure_terminal (v0.8.5) -
ExecutionReceipt+RejectedTransition—Receipt = f(Trace)contract (v0.8.5)
Upcoming:
- OpenTelemetry span per FSM step
-
nano-vm-mcp:GovernedToolExecutorcircuit breaker -
vm.step()incremental execution endpoint -
depends_on+TopologicalSorter— declarative dependency graph
Contact & Support
Author: @ale007xd on Telegram · @ale007xd on X
USDT (TON): UQCakyytrEGBikOi3eYMpveGHXDB1-fd6lcuQC9VvKqMrI-9
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file llm_nano_vm-0.8.5.tar.gz.
File metadata
- Download URL: llm_nano_vm-0.8.5.tar.gz
- Upload date:
- Size: 155.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d270b0eec060caa6903f94bf8e5316cffa3b79f1976703324c2ae0fa0ee16f82
|
|
| MD5 |
6a6d153a3349380ce945f518b35ec55f
|
|
| BLAKE2b-256 |
415c9683cb4f2a0488b1444b8c29f98b0c00f0724012942bc75862e7bc1e95bc
|
File details
Details for the file llm_nano_vm-0.8.5-py3-none-any.whl.
File metadata
- Download URL: llm_nano_vm-0.8.5-py3-none-any.whl
- Upload date:
- Size: 48.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
27271f6dc463fcb2b3f160ce14988fab577edb0bbd2fdfffe7ccb9e0ec050120
|
|
| MD5 |
1e6163765e8281ec8b14221e2eb26456
|
|
| BLAKE2b-256 |
acd83a8e2c7f3e5dc860c22ffbebd2a949da5b7dcf538f42e4a1e0c625c871d6
|