Okto Nexus - Local Agent Coordination Bus (MCP server + dashboard)
Okto Nexus
Local Agent Coordination Bus — an MCP server that lets multiple AI coding agents coordinate on a shared project: register identities, open sessions, exchange messages on channels, hand off work with atomic single-winner claiming, publish artifacts, and tail an append-only event log — entirely on one machine, backed by a single SQLite database.
One command — okto-nexus serve — starts the whole hub on a single port:
- /mcp: the full 35-tool surface over streamable HTTP, gated by per-agent API keys (nxs_…, hash-only at rest, plaintext shown once);
- /api/v1: a REST API for agent/key management, observability (live graph snapshot, message/handoff/session/event histories) and admin actions, plus an SSE stream with exact Last-Event-ID resume;
- / (root): a built-in web dashboard (React + Sigma.js, light/dark, Okto Pulse visual language) with a live graph of agents with derived presence, in-flight traffic badges, a handoff kanban, per-peer chat timelines, runtime settings and store maintenance.
The classic stdio mode (okto-nexus with no subcommand) remains intact and
byte-compatible for MCP hosts that launch local processes.
Okto Nexus is the coordination substrate for a team of agents operating in the
same repository. Instead of talking to a cloud service or a message broker, every
agent speaks to the same local hub; all state lives in one SQLite file in
WAL mode, which is the single source of truth. Every coordinated entity is scoped
to a deterministic workspace_id derived from the project path, so two agents
pointed at the same real directory automatically share one coordination space —
and never see another workspace's data.
Table of Contents
- Highlights
- Architecture
- Requirements & Installation
- Configuration
- The HTTP Hub & Dashboard (okto-nexus serve)
- Running It / MCP Client Setup
- Core Concepts
- Tool Reference
- Data Model
- Response Envelope & Error Catalog
- Operations
- Troubleshooting
- Example Flow
- Testing
- Project Layout
- Limitations (V1 Non-Goals)
- Roadmap
- License
Highlights
- Local-first, two transports. okto-nexus serve exposes the SAME 35-tool surface over streamable HTTP (parity enforced by a test) plus REST, SSE and the web dashboard, all on one loopback port. The classic stdio mode is unchanged for MCP hosts that launch local processes. Loopback REST and the dashboard need no key (same-machine trust); /mcp always requires a per-agent API key, and any bind beyond loopback re-enables the key gate everywhere (fail-closed on the network).
- SQLite WAL is the single source of truth. One file (~/.okto_nexus/nexus.db by default). No in-memory cache, broker, or external store. Every connection enforces WAL + foreign keys + a busy timeout.
- Deterministic workspace isolation. workspace_id = sha256(realpath(project_root)). The client passes project_root; the server computes the hash. Every read and write is scoped to one workspace; the deliberate cross-workspace surfaces are the global-admin workspace_list and the global agent_list/agent_get/capability_list (agents are global identities).
- Three ways to reach an agent, two delivery semantics. Direct message (1:1, preferred), handoff (one free worker claims), or broadcast (disseminate info / open-ended discovery, last resort): pub/sub fan-out for messages vs competing-consumers for handoffs. Peers and skills are discoverable (agent_list/agent_get/capability_list). See How agents communicate.
- Append-only, monotonic event log. A single global events table with a gapless INTEGER AUTOINCREMENT event_id. State mutations and their audit events commit in the same transaction (atomic).
- Cursor pagination + long-polling without threads. event_get/event_wait (and handoff_list_available) tail the log with cursors and an optional long-poll bounded by a configured ceiling, with no sockets, threads, or subscriptions. Waiting goes through a Waiter port whose V1 adapter sleep-polls gated by a cheap PRAGMA data_version probe: the log is re-scanned only when some process actually committed a write (a future SSE/notify transport swaps in at the same port).
- Durable per-agent inbox with at-least-once delivery (ADR 0001). Messages fan out to a global inbox at send time; pulled messages are leased (caller-tunable, renewable via inbox_extend) and redelivered if unacked; poison messages park in a dead-letter lane after 5 attempts; inbox_peek/inbox_count/inbox_history/message_status are strictly read-only.
- Cooperative trust + bounded growth (ADR 0002). trust_mode=strict requires a per-session session_secret on every sensitive verb (in open mode a supplied secret is still validated); okto-nexus admin prune (and the opt-in auto_prune_on_start) enforce retention windows over terminal lanes only; live state is never deleted.
- Atomic single-winner handoffs with leases. Claiming is one conditional UPDATE (no TOCTOU race). Abandoned work returns to the pool via opportunistic lease expiry evaluated on access; there is no background reaper.
- Routing / visibility / eligibility as pure functions. Six target strategies, three visibilities; seeing is orthogonal to claiming.
- Workspace-contained artifacts. Inline (text/json/markdown) or by-reference (file+path), with strict path-containment checks that reject any path escaping the workspace root.
- Deterministic human-readable view. shared_md_render writes a four-section shared.md snapshot via atomic overwrite: never a source of truth, never read back.
- Hexagonal architecture with an enforced import boundary. domain/ and application/ are stdlib-only and may never import sqlite3 or mcp (a test fails the build otherwise).
- Closed, normative error catalog. Exactly 18 error codes; no exception ever crosses the adapter boundary, and anything unexpected normalizes to INTERNAL_ERROR. Transient WAL contention carries details.retryable: true so agents can branch without parsing messages.
- Fail-closed bootstrap. Config → home dir → DB connections → migrations → repos/emitter → tool auto-discovery, strictly ordered; any failure aborts.
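The single-winner claim described above can be sketched as one conditional UPDATE against SQLite. This is a minimal illustration, not the project's code: the table and column names (handoffs, status, claimed_by, lease_expires_at) and the helper names are assumptions, and lease expiry is folded into the claim predicate for brevity.

```python
import sqlite3


def claim_handoff(conn: sqlite3.Connection, handoff_id: str, agent_id: str,
                  now: float, lease_ttl: float = 300.0) -> bool:
    """Try to claim a handoff; True means this caller is the single winner.

    Hypothetical schema. The WHERE clause is the whole race guard: the row is
    only updated if it is still open, or claimed with an expired lease.
    """
    cur = conn.execute(
        """
        UPDATE handoffs
           SET status = 'claimed', claimed_by = ?, lease_expires_at = ?
         WHERE handoff_id = ?
           AND (status = 'open'
                OR (status = 'claimed' AND lease_expires_at <= ?))
        """,
        (agent_id, now + lease_ttl, handoff_id, now),
    )
    conn.commit()
    # Exactly one row changed means we won; zero means someone else holds it.
    return cur.rowcount == 1


def make_demo_db() -> sqlite3.Connection:
    """In-memory database with one open handoff, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE handoffs (handoff_id TEXT PRIMARY KEY, status TEXT NOT NULL,"
        " claimed_by TEXT, lease_expires_at REAL)"
    )
    conn.execute("INSERT INTO handoffs VALUES ('h1', 'open', NULL, NULL)")
    conn.commit()
    return conn
```

Because the check and the write happen in one statement, there is no window between reading the status and updating it, which is what removes the TOCTOU race.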
Architecture
Okto Nexus is a hexagonal (ports & adapters) application. Dependencies point strictly inward: adapters depend on application ports; the application depends on the domain; the domain depends on nothing but the standard library.
INBOUND ADAPTER OUTBOUND ADAPTERS
┌──────────────────────────────┐ ┌────────────────────────────────┐
│ adapters/inbound/mcp │ │ adapters/outbound/sqlite │
│ server.py (FastMCP, stdio)│ │ connection.py (PRAGMAs/UoW) │
│ tools/* register(srv,deps)│ │ migrations.py (runner) │
└──────────────┬───────────────┘ │ *_repo.py (Sqlite*Repo) │
│ depends on │ adapters/outbound/file/store │
│ (Protocols) │ adapters/outbound/sharedmd │
▼ │ adapters/outbound/clock │
┌────────────────────────────────────┴──────────┐ │ implements ports
│ APPLICATION (ports.py) │◄─┘
│ Clock · UnitOfWork · ConnectionFactory │
│ WorkspaceRepo · AgentRepo · SessionRepo │ dependencies point
│ EventRepo · EventEmitter · ChannelRepo │ INWARD ↑↑
│ MessageRepo · TaskRepo · HandoffRepo │
│ ArtifactRepo · FileStore · Repos │
└────────────────────┬───────────────────────────┘
▼ depends on
┌────────────────────────────────────────────────┐
│ DOMAIN (models, ids, routing, events, …) │
│ pure, stdlib-only, no I/O │
└────────────────────────────────────────────────┘
Import boundary enforced by tests/test_import_boundary.py:
domain/ and application/ MUST NOT import sqlite3 or mcp.
Layers
- domain/: entity dataclasses and pure helpers (ids, routing, events, messages, handoff, artifacts, models, base). No I/O; never imports sqlite3 or mcp.
- application/: typing.Protocol (@runtime_checkable) ports that are the seams of the hexagon, plus the use-case services. Inbound depends on the ports; outbound implements them. The concrete SQLite connection type is referenced as Any to keep this layer free of sqlite3.
- adapters/inbound/mcp/: the FastMCP stdio server and the auto-discovered tool modules. The MCP SDK is imported lazily (only inside create_server/main), so importing the package never requires the SDK.
- adapters/outbound/: SQLite repos, the workspace file store, the shared.md renderer, and SystemClock. These are the only places that import sqlite3.
SQLite WAL as source of truth. Each connection opens with
isolation_level=None (driver autocommit; transactions are explicit
BEGIN/COMMIT/ROLLBACK), row_factory = sqlite3.Row, and the three
mandatory PRAGMAs: journal_mode=WAL, foreign_keys=ON, and
busy_timeout={busy_timeout_ms}. A SqliteUnitOfWork opens BEGIN IMMEDIATE on entry by default (write=True): the WAL write lock is acquired
up front, where busy_timeout applies, so a read-then-write sequence inside
the transaction can never die mid-way with SQLITE_BUSY_SNAPSHOT. Read-only
scopes pass write=False (deferred BEGIN; readers never queue behind
writers). The UoW commits on clean exit, rolls back on exception, always closes
the connection, and never suppresses exceptions. Failure to open/configure a
connection → DB_ERROR; lock/busy contention → DB_ERROR with
details.retryable: true (retry the same call).
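The connection and unit-of-work behaviour described above can be approximated with the standard sqlite3 module. The sketch below is an assumption-laden illustration: open_connection and UnitOfWork are hypothetical names (the project's class is SqliteUnitOfWork), and error mapping to DB_ERROR is left out.

```python
import sqlite3


def open_connection(db_path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a connection with driver autocommit and the three PRAGMAs."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # explicit BEGIN/COMMIT
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn


class UnitOfWork:
    """Hypothetical stand-in for the unit of work described in the text."""

    def __init__(self, db_path: str, write: bool = True) -> None:
        self._db_path = db_path
        self._write = write
        self.conn = None

    def __enter__(self) -> sqlite3.Connection:
        self.conn = open_connection(self._db_path)
        try:
            # BEGIN IMMEDIATE takes the write lock up front, where
            # busy_timeout applies; read-only scopes use a deferred BEGIN.
            self.conn.execute("BEGIN IMMEDIATE" if self._write else "BEGIN")
        except sqlite3.Error:
            self.conn.close()
            raise
        return self.conn

    def __exit__(self, exc_type, exc, tb) -> bool:
        try:
            self.conn.execute("COMMIT" if exc_type is None else "ROLLBACK")
        finally:
            self.conn.close()  # always close, whatever happened
        return False  # never suppress the caller's exception
```

A write scope is then `with UnitOfWork(path) as conn: ...`, and a read-only scope passes `write=False`.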
Fail-closed bootstrap (ordered). bootstrap() in server.py:
- load_config(env, argv): resolve config (CONFIG_ERROR on bad input).
- ConnectionFactory(config): ensure home_dir exists (mkdir(parents=True, exist_ok=True)).
- Configured SQLite connections via the factory.
- MigrationRunner(factory).apply(): apply pending migrations (idempotent; MIGRATION_ERROR on failure).
- build_repos(clock) + tool registration: only after the store is migrated and healthy.
main() catches OktoNexusError from bootstrap, prints
[okto-nexus] bootstrap failed: {code}: {message} to stderr, and returns exit
code 1. If the mcp SDK is missing in create_server, it catches the
ImportError, prints an install hint, and returns 1. On success it runs the
stdio server and returns 0. The first argv token selects the mode: tail
dispatches to the NDJSON event-log follower and admin to the operator
maintenance commands (see Operations); anything else runs the
MCP server. With auto_prune_on_start=true, startup additionally runs one
bounded, best-effort retention pass (a failure is reported to stderr and
startup proceeds).
Auto-discovery of tools. register_tools(server, deps) iterates every module
under okto_nexus.adapters.inbound.mcp.tools (via pkgutil.iter_modules +
importlib.import_module); each module participates by exposing
def register(server, deps) -> None. build_repos is the single composition
root for persistence — it instantiates one concrete per port plus the shared
SqliteEventEmitter before any tool registers, so every slice reuses one
coherent backing store (each slice's own wiring is idempotent).
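The discovery loop can be sketched with pkgutil and importlib as follows. This is an illustration under assumptions: the documented signature is register_tools(server, deps), while this version takes the package as an explicit third argument so it can be pointed at any package, and silently skipping modules without a register function is a guess about the real behaviour.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import Any


def register_tools(server: Any, deps: Any, package: ModuleType) -> list[str]:
    """Import every module under `package` and call its register(server, deps).

    Returns the names of the modules that participated. The `package`
    parameter is an assumption made for this sketch.
    """
    registered = []
    prefix = package.__name__ + "."
    for info in pkgutil.iter_modules(package.__path__, prefix):
        module = importlib.import_module(info.name)
        register = getattr(module, "register", None)
        if callable(register):
            register(server, deps)
            registered.append(info.name)
    return registered
```

Adding a tool then means dropping a new module with a register function into the package; no central list has to be edited.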
Enforced import boundary. tests/test_import_boundary.py AST-parses every
.py under domain/ and application/ and fails the suite if any import /
from names the sqlite3 or mcp root (relative imports are always allowed).
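A boundary check of this kind can be written in a few lines with the ast module. The function below is a sketch of the idea, not the contents of tests/test_import_boundary.py; forbidden_imports and FORBIDDEN_ROOTS are hypothetical names.

```python
import ast

FORBIDDEN_ROOTS = {"sqlite3", "mcp"}


def forbidden_imports(source: str) -> list[str]:
    """Return every absolute import in `source` whose root is forbidden."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            # Not an import, or a relative import (level > 0): always allowed.
            continue
        found += [name for name in names if name.split(".")[0] in FORBIDDEN_ROOTS]
    return found
```

A test would read each .py file under domain/ and application/ and assert that the returned list is empty. Because ast.walk visits nested nodes, imports hidden inside functions are caught as well.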
Requirements & Installation
- Python: >= 3.11.
- Runtime dependencies: mcp >= 1.0, pydantic >= 2.
- Dev extra: pytest >= 8.
- Build backend: setuptools >= 68 (src/ layout).
- Console script entry point: okto-nexus = okto_nexus.adapters.inbound.mcp.server:main.
Clone, create a virtual environment, and install editable with the dev extras.
PowerShell (Windows):
git clone <repo-url> okto_nexus
cd okto_nexus
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e ".[dev]"
Bash (macOS / Linux):
git clone <repo-url> okto_nexus
cd okto_nexus
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
This installs the okto-nexus console script. If the mcp SDK is missing at
runtime, main() exits 1 with the hint: pip install 'mcp>=1.0' (or
pip install okto-nexus).
Global install (recommended for daily use). The [serve] extra brings the
HTTP hub + dashboard (FastAPI/uvicorn); the dashboard build ships inside the
wheel, so Node is never required on the user's machine:
pipx install "okto-nexus[serve]" # or: pipx install ".[serve]" from a checkout
okto-nexus serve # → http://127.0.0.1:8202
Package: okto-nexus v0.0.1 — Okto Nexus - Local Agent Coordination Bus
(MCP server + dashboard) · author Okto Labs · license Elastic License 2.0 +
SaaS/Branding Addendum + Trademark Policy.
Configuration
All settings live under the OKTO_NEXUS_* namespace. Precedence is strict:
CLI flag > environment variable > default. The CLI parser accepts both
--flag value and --flag=value. Invalid input (unparseable integer, value
below the minimum, out-of-vocabulary enum/boolean, unknown flag, unexpected
positional argument, or a flag with no value) fails closed with CONFIG_ERROR —
never a silently-applied default. Path values have ~ expanded; if
OKTO_NEXUS_DB_PATH is unset it is derived from the home directory.
| Environment variable | CLI flag | Default | Min | Description |
|---|---|---|---|---|
| OKTO_NEXUS_HOME | --home | ~/.okto_nexus | n/a | Server home directory. Created idempotently at bootstrap. |
| OKTO_NEXUS_DB_PATH | --db-path | {home}/nexus.db | n/a | SQLite database file. Derived from home when unset. |
| OKTO_NEXUS_BUSY_TIMEOUT_MS | --busy-timeout-ms | 5000 | 0 | PRAGMA busy_timeout (ms) applied to every connection. |
| OKTO_NEXUS_POLL_INTERVAL_MS | --poll-interval-ms | 200 | 1 | Poll interval (ms) for waits / long-poll loops. |
| OKTO_NEXUS_MAX_WAIT_TIMEOUT_SECONDS | --max-wait-timeout-seconds | 30 | 0 | Ceiling (s) for blocking long-poll operations. |
| OKTO_NEXUS_HANDOFF_LEASE_TTL_SECONDS | --handoff-lease-ttl-seconds | 300 | 1 | TTL (s) of a claimed handoff's lease. |
| OKTO_NEXUS_MAX_INLINE_BYTES | --max-inline-bytes | 65536 | 1 | Inclusive ceiling (UTF-8 bytes) for inline content. |
| OKTO_NEXUS_INBOX_LEASE_TTL_SECONDS | --inbox-lease-ttl-seconds | 300 | 1 | Default in-flight lease (s) for pulled inbox deliveries (per-pull override: lease_seconds). |
| OKTO_NEXUS_SESSION_STALE_TTL_SECONDS | --session-stale-ttl-seconds | 60 | 1 | Read-time threshold (s) after which a session's heartbeat reads stale. |
| OKTO_NEXUS_PRESENCE_TTL_SECONDS | --presence-ttl-seconds | 1800 | 1 | Presence window (s): a session is PRESENT (in the broadcast audience) only while its heartbeat is this fresh. |
| OKTO_NEXUS_SESSION_REAP_SECONDS | --session-reap-seconds | 86400 | 1 | Sessions silent past this (s) are opportunistically closed by session_open/session_heartbeat. |
| OKTO_NEXUS_MAX_SHARED_MD_EVENTS | --max-shared-md-events | 1000 | 1 | Hard ceiling for shared_md_render's limit_events. |
| OKTO_NEXUS_MAX_EVENT_LIMIT | --max-event-limit | 1000 | 1 | Hard ceiling for the limit page size on event_get / event_wait. |
| OKTO_NEXUS_TRUST_MODE | --trust-mode | open | n/a | One of open, strict. strict requires session_id + session_secret on the sensitive verbs; open validates a supplied secret but does not require one. |
| OKTO_NEXUS_RETENTION_EVENTS_KEEP_DAYS | --retention-events-keep-days | 30 | 0 | Retention window (days) for the event log (prune-eligible past it). |
| OKTO_NEXUS_RETENTION_READ_DELIVERIES_KEEP_DAYS | --retention-read-deliveries-keep-days | 14 | 0 | Retention window (days) for acknowledged (read) deliveries. |
| OKTO_NEXUS_RETENTION_CLOSED_SESSIONS_KEEP_DAYS | --retention-closed-sessions-keep-days | 7 | 0 | Retention window (days) for closed sessions. |
| OKTO_NEXUS_AUTO_PRUNE_ON_START | --auto-prune-on-start | false | n/a | Boolean (true/1/yes/on / false/0/no/off). Run one bounded, best-effort retention pass at server startup. |
Retention only ever deletes terminal lanes (events past their window,
read deliveries, closed sessions) — see Operations.
load_config(env, argv=None) is the single entry point and depends only on the
stdlib + okto_nexus.errors (it imports neither mcp nor sqlite3).
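The precedence rule (CLI flag > environment variable > default) and the fail-closed validation can be illustrated for a single integer setting. This is a simplified sketch, not load_config itself: ConfigError, parse_flags and resolve_int are hypothetical names, and unknown-flag detection is omitted.

```python
class ConfigError(ValueError):
    """Stand-in for the document's CONFIG_ERROR code."""


def parse_flags(argv):
    """Accept both `--flag value` and `--flag=value`; anything else fails."""
    flags = {}
    tokens = iter(argv)
    for token in tokens:
        if not token.startswith("--"):
            raise ConfigError(f"unexpected positional argument: {token!r}")
        if "=" in token:
            name, value = token.split("=", 1)
        else:
            name, value = token, next(tokens, None)
            if value is None:
                raise ConfigError(f"flag {name} has no value")
        flags[name] = value
    return flags


def resolve_int(flag, env_name, default, minimum, env, argv):
    """Resolve one integer setting: CLI flag > environment variable > default."""
    raw = parse_flags(argv).get(flag, env.get(env_name))
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ConfigError(f"{flag}: not an integer: {raw!r}") from None
    if value < minimum:
        raise ConfigError(f"{flag}: {value} is below the minimum {minimum}")
    return value
```

For example, with OKTO_NEXUS_BUSY_TIMEOUT_MS=7000 in the environment and `--busy-timeout-ms 9000` on the command line, the resolved value is 9000; a bad value raises instead of falling back to the default.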
The HTTP Hub & Dashboard (okto-nexus serve)
okto-nexus serve # zero-config: data in ~/.okto_nexus, port 8202
okto-nexus serve --port 9000 --host 0.0.0.0 --project-root .
okto-nexus serve --help
Boot prints the Okto Nexus ASCII banner (suppress with
OKTO_NEXUS_NO_BANNER=1) and announces readiness only after the MCP
session manager is running and the socket is accepting connections:
[okto-nexus] MCP Server initialized successfully - http://127.0.0.1:8202/mcp
[okto-nexus] API Server initialized successfully - http://127.0.0.1:8202/api/v1
[okto-nexus] Dashboard initialized successfully - http://127.0.0.1:8202
[okto-nexus] Startup complete - Okto Nexus is ready
A single-server lock (nexus.serve.lock, PID + mtime heartbeat) prevents a
second serve over the same home; a crashed serve is taken over automatically
after 60 s of heartbeat silence.
Authentication model
| Surface | Loopback bind (default) | Non-loopback bind |
|---|---|---|
| Dashboard / + REST /api/v1 | open (same-machine trust) | API key required |
| MCP /mcp | API key always required | API key always required |
Every agent owns an nxs_ + 48-hex API key: only the SHA-256 hash is
persisted; the plaintext is shown once at creation/regeneration. Keys are
accepted via ?api_key=, the x-api-key header, or Authorization: Bearer.
Rotation kills the old key immediately; deactivation is the reversible
revocation path. Binding beyond loopback with zero issued keys triggers a
one-time anti-lockout bootstrap (an operator agent + key, printed once).
Migrate stdio-era agents in batch with okto-nexus admin issue-keys (additive,
idempotent). On stdio, an optional OKTO_NEXUS_API_KEY binds the session to a
key-derived identity (fail-closed when set; absent = V1 behaviour).
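The key format and hash-only storage can be sketched with the standard library. The details below are assumptions for illustration: hashing the full UTF-8 key string (prefix included), storing the hex digest, and comparing in constant time are plausible choices, not confirmed implementation details, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def issue_api_key() -> tuple[str, str]:
    """Return (plaintext, stored_hash). Only the hash would be persisted."""
    plaintext = "nxs_" + secrets.token_hex(24)  # 24 random bytes -> 48 hex chars
    stored_hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, stored_hash


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it with the stored digest."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Since only the digest is stored, a lost key cannot be recovered; it can only be regenerated, which matches the "shown once" behaviour described above.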
The dashboard
Built with React + Tailwind + Sigma.js in the Okto Pulse visual language (same tokens, light/dark themes persisted per browser, Pulse-style header and collapsible navigation sidebar):
- Graph: a live mesh of agents. Node colour = presence derived from session heartbeats (present < 60 s, stale < 30 min, offline), node size = activity, edge thickness = 24h message volume, cyan edges with N ✉ badges = in-flight (unread) traffic, magenta squares/edges = open/claimed handoffs. Click any element to inspect; the right panel is drag-resizable.
- Per-peer chat timelines: an agent's inbox/outbox rendered as chat bubbles with delivery ticks per lane, one tab per peer, plus a "no recipient" tab for sends whose target resolved to nobody.
- Handoffs: a Pulse-style kanban (Open / Claimed / Completed / Rejected / Cancelled) with confirm-guarded cancellation.
- Messages / Events / Workspaces: filtered histories and a live SSE tail.
- Agents & API keys: create agents, reveal the key once with MCP snippets for six client formats (Claude Code, Claude Desktop, Cursor, VS Code, Windsurf, Okto CLI), regenerate/deactivate/delete.
- Settings: every CLI knob editable at runtime with per-field descriptions/tooltips (precedence CLI > env > stored > default; CLI-pinned values are read-only), retention prune (dry-run/run) and the confirm-guarded database reset (schema-driven full wipe + VACUUM, optionally preserving agents/keys).
- Help & About: built-in guide and the full licence text (served from the packaged LICENSE via GET /api/v1/license).
Live updates ride GET /api/v1/stream (SSE keyed on the append-only
event_id): reconnections resume exactly after the last seen cursor — no loss,
no duplication.
Running It / MCP Client Setup
HTTP (recommended): start okto-nexus serve, create an agent in the
dashboard's Agents tab and paste the generated snippet — e.g. for Claude
Code:
claude mcp add -t http okto-nexus "http://127.0.0.1:8202/mcp?api_key=nxs_…"
stdio (classic): Okto Nexus also speaks MCP over stdio. Register it
with any MCP host that launches stdio servers (e.g. Claude Desktop's
claude_desktop_config.json) by pointing the host at the okto-nexus command:
{
"mcpServers": {
"okto-nexus": {
"command": "okto-nexus",
"env": {
"OKTO_NEXUS_HOME": "~/.okto_nexus",
"OKTO_NEXUS_HANDOFF_LEASE_TTL_SECONDS": "300",
"OKTO_NEXUS_MAX_WAIT_TIMEOUT_SECONDS": "30"
}
}
}
}
If okto-nexus is not on the host's PATH, point command at the venv entry
point instead — for example /path/to/.venv/bin/okto-nexus, or on Windows
C:\\path\\to\\.venv\\Scripts\\okto-nexus.exe. Any OKTO_NEXUS_* variable from
the Configuration tables can be supplied via env; the
NexusConfig settings may alternatively be passed as CLI flags via an args
array (e.g. "args": ["--max-inline-bytes", "131072"]).
A successful list_tools from the client proves the store was migrated and all
tools registered (the bootstrap is fail-closed and ordered).
Core Concepts
How agents communicate
Everything below is the conceptual model; each idea maps to concrete tools.
Agents are global identities that coordinate inside a workspace (the
client passes project_root, the server derives the workspace_id). Three ways
to reach one another sit on top of two delivery semantics plus a discovery
layer.
Two delivery semantics — the core distinction.
| Semantics | Tool | Who receives | Use when |
|---|---|---|---|
| Pub/sub fan-out | message_create | every eligible agent can read it (a copy for all) | zero-or-many agents may legitimately read/act |
| Competing consumers | handoff_create | every eligible agent sees it, but only the first handoff_claim wins (others get HANDOFF_ALREADY_CLAIMED); an abandoned claim's lease expires and it returns to the pool | exactly one free agent should do the work |
A message is information broadcast to those allowed to see it; a handoff is a unit of work exactly one worker takes. (Seeing is not acting — see Routing / visibility / eligibility: an item can be visible to the whole workspace while only its target may claim it.)
Discovery — who is out there?
- Structured registry: agent_list (every agent + role/capabilities/last_seen_at), agent_get (one agent's details + last interaction), capability_list (which capabilities exist and who advertises them). Use these to find an agent_id, a role, or a capability before addressing.
- Semantic discovery (broadcast a question): when the answer is not in the registry ("who owns application X?", "who is impacted by change Y in XYZ?"), broadcast an open question; only the relevant agents answer, and they reply directly to you.
The three modes, in preference order (most targeted and least noisy first):
- Direct message (the default). message_create with target={"strategy":"direct","agent_id":"<recipient>"}. Most efficient, no spurious work. Always reply directly to whoever messaged you (target their from_agent_id) unless you deliberately mean to broadcast or hand off. Examples: answering a question, acknowledging, a 1:1 follow-up, returning a result to the requester.
- Handoff (when exactly one free agent should take the work). handoff_create with a capability / role / broadcast target. Every eligible agent sees it; the first to handoff_claim owns it; the lease (TTL) returns abandoned work to the pool. Example: "OCR this scan" → target={"strategy":"capability","capability":"ocr"} (find the capability with capability_list first); one free OCR worker claims and completes. See Handoffs & leases.
- Broadcast (last resort). message_create with no target. Two valid uses: (a) disseminate instructive/contextual info to everyone (announcements, conventions, status, shared decisions); (b) discovery (the open questions above). Never broadcast actionable "do X" requests: an undirected ask can trigger unwanted parallel work (every eligible agent may act on it). Once discovery finds the owner, switch to a direct message (or a handoff for dispatchable work). A broadcast reaches the workspace's present participants, meaning sessions with a heartbeat within presence_ttl_seconds (default 30 min); agents excluded for staleness are reported back to the sender in excluded_stale (never dropped silently).
Addressing — the target strategies (shared by messages and handoffs). The
target object decides who is eligible. Omitting it means broadcast.
| You want to reach… | strategy | target shape |
|---|---|---|
| one named agent | direct | {"strategy":"direct","agent_id":"…"} |
| anyone with a skill | capability | {"strategy":"capability","capability":"ocr"} (string, or a list = any-of) |
| anyone in a role | role | {"strategy":"role","role":"reviewer"} |
| everyone in the workspace | broadcast | {"strategy":"broadcast"} (or omit target) |
| several rules at once (OR) | mixed | {"strategy":"mixed","rules":[ … ]} |
| one agent now, a pool later | direct_with_fallback | {"strategy":"direct_with_fallback","agent_id":"…","fallback_after_seconds":N,"fallback":<sub-target>} |
Exact match rules (case-sensitivity, the inclusive fallback boundary) are in Routing / visibility / eligibility.
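To make the six strategies concrete, here is one way eligibility could be written as a pure function over the target shapes in the table. It is a sketch under assumptions: matching is taken to be exact and case-sensitive, the fallback opens at elapsed >= fallback_after_seconds (the inclusive boundary), the direct agent is assumed to stay eligible after the fallback opens, and the agent dict shape and is_eligible name are hypothetical.

```python
def is_eligible(target, agent, elapsed_seconds=0.0):
    """Decide whether `agent` is eligible for `target`.

    agent: {"agent_id": str, "role": str, "capabilities": [str]} (assumed shape)
    elapsed_seconds: time since the item was created (used by the fallback).
    """
    if target is None:  # omitting the target means broadcast
        return True
    strategy = target["strategy"]
    if strategy == "broadcast":
        return True
    if strategy == "direct":
        return agent["agent_id"] == target["agent_id"]
    if strategy == "role":
        return agent.get("role") == target["role"]
    if strategy == "capability":
        wanted = target["capability"]
        wanted = [wanted] if isinstance(wanted, str) else list(wanted)
        # A list means any-of.
        return any(cap in agent.get("capabilities", []) for cap in wanted)
    if strategy == "mixed":
        # Several rules combined with OR.
        return any(is_eligible(rule, agent, elapsed_seconds)
                   for rule in target["rules"])
    if strategy == "direct_with_fallback":
        if agent["agent_id"] == target["agent_id"]:
            return True
        if elapsed_seconds >= target["fallback_after_seconds"]:
            return is_eligible(target["fallback"], agent, elapsed_seconds)
        return False
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Keeping this free of I/O means the same function can serve message routing and handoff claiming alike, with the caller supplying the agent record and the elapsed time.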
Three independent axes — do not conflate them.
| Axis | Set by | Decides |
|---|---|---|
| Target (routing) | target strategy | who is eligible to receive / claim |
| Visibility | visibility (public / eligible / private) | who may see it (messages default: eligible if targeted, else public; handoffs require it explicitly) |
| Channel | channel_id | where it is filed (an organizational label only) |
Channels are labels, not access boundaries. There is no membership or ACL:
any agent in the workspace can read and post to any channel, and the channel
never decides who receives a message (the target does). Only general is
seeded; create the rest with channel_create (idempotent by name), discover them
with channel_list. When listening, omit channel_id to cover the whole
workspace across all channels — narrowing to one channel can make you miss
messages directed to you elsewhere. See Channels & messages.
Your inbox — how you receive messages (ADR 0001). Messages addressed to you
are delivered to your global inbox and stay there until you read them — no
cursor, no index, regardless of which workspace they were sent in. Between turns,
call inbox_count (a cheap check); if unread > 0, inbox_pull returns them with their
body (leased); call inbox_ack once each is handled. Unacked messages are redelivered
(at-least-once). inbox_peek is a non-destructive look; inbox_history is the
read archive. After sending something that expects a reply, just check your inbox
on a later turn. event_get/event_wait/okto-nexus tail are for observing
the bus, not for receiving your own messages. See
Message delivery & the inbox.
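The lease / ack / redelivery cycle is easier to follow as a small in-memory model. The sketch below is not the server's implementation: the Inbox and Delivery classes, the state names, and the exact moment a poison message is parked (here, on the pull that follows the fifth attempt) are assumptions chosen to illustrate at-least-once delivery.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 5  # dead-letter threshold stated in the document


@dataclass
class Delivery:
    message_id: str
    body: str
    state: str = "unread"  # unread | leased | read | dead (assumed names)
    attempts: int = 0
    lease_expires_at: float = 0.0


class Inbox:
    """Toy model of one agent's inbox."""

    def __init__(self) -> None:
        self.deliveries: dict[str, Delivery] = {}

    def deliver(self, message_id: str, body: str) -> None:
        self.deliveries[message_id] = Delivery(message_id, body)

    def _pullable(self, d: Delivery, now: float) -> bool:
        lease_expired = d.state == "leased" and d.lease_expires_at <= now
        return d.state == "unread" or lease_expired

    def count(self, now: float) -> int:
        # Read-only: never changes any delivery.
        return sum(1 for d in self.deliveries.values()
                   if self._pullable(d, now) and d.attempts < MAX_ATTEMPTS)

    def pull(self, now: float, lease_seconds: float = 300.0) -> list[Delivery]:
        pulled = []
        for d in self.deliveries.values():
            if not self._pullable(d, now):
                continue
            if d.attempts >= MAX_ATTEMPTS:
                d.state = "dead"  # park the poison message
                continue
            d.attempts += 1
            d.state = "leased"
            d.lease_expires_at = now + lease_seconds
            pulled.append(d)
        return pulled

    def ack(self, message_id: str) -> None:
        d = self.deliveries[message_id]
        if d.state == "leased":
            d.state = "read"
```

A consumer that crashes between pull and ack simply sees the same message again once the lease runs out, which is why handlers should be safe to run twice.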
Concept → tool quick map.
| Goal | Tools |
|---|---|
| Discover agents / roles / capabilities | agent_list, agent_get, capability_list |
| Send a 1:1 message | message_create (direct target) |
| Broadcast info / ask an open question | message_create (no target) |
| Receive / read your messages | inbox_count, inbox_pull, inbox_ack, inbox_peek, inbox_history |
| Organize messages by topic | channel_create, channel_list |
| Dispatch work to one free worker | handoff_create, handoff_list_available, handoff_claim, handoff_complete, handoff_reject, handoff_cancel, handoff_get |
| Observe the whole bus | event_get, event_wait, okto-nexus tail |
| Identity & presence | agent_register, session_open/session_heartbeat/session_close |
Workspace isolation
Each project is a coordinated workspace identified by a deterministic hash of its real on-disk path:
workspace_id = sha256( realpath(project_root) ).hexdigest().lower() # 64 hex chars
The client passes project_root; the server computes the hash — clients
never send a raw workspace_id (except shared_md_render, which takes the
already-resolved id). resolve_realpath uses os.path.realpath(project_root, strict=True),
so symlinks are resolved and the path must exist; an empty, broken, or
nonexistent path → WORKSPACE_UNRESOLVED. There is never a fallback to a
shared/default workspace. workspace_resolve additionally requires the path to
be absolute (VALIDATION_ERROR otherwise) before attempting the realpath.
Every use case resolves the workspace_id up front and scopes all reads/writes
to it. Operations that take an entity id distinguish three cases:
- id exists nowhere → NOT_FOUND;
- id exists in another workspace → WORKSPACE_MISMATCH (never leaks the other workspace's row);
- id in the correct workspace → ok.
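The three cases map onto a short lookup helper. This is an illustrative sketch over a plain dict; NexusError, resolve_entity and the row shape are hypothetical names, and the real repositories would query SQLite instead.

```python
class NexusError(Exception):
    """Carries one of the catalog's error codes (sketch only)."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def resolve_entity(rows_by_id: dict, entity_id: str, workspace_id: str) -> dict:
    """Return the row only if it belongs to the caller's workspace."""
    row = rows_by_id.get(entity_id)
    if row is None:
        raise NexusError("NOT_FOUND")
    if row["workspace_id"] != workspace_id:
        # The error carries no data from the foreign row.
        raise NexusError("WORKSPACE_MISMATCH")
    return row
```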
The cross-workspace reads are workspace_list (global-admin) and the global
agent_list/agent_get/capability_list (agents are global identities); every
workspace/session read stays scoped. Because workspace_id is a pure
function of realpath, identity is reproducible and aliasing-proof (two paths
with the same real target collide on purpose), with no client-side coordination.
Identity: agents vs sessions
There are two distinct identity layers:
| Concept | What it is | Scope |
|---|---|---|
| agent_id | A logical, global agent identity (role + capabilities + metadata) | Global (not workspace-scoped) |
| session_id | A live instance of an agent operating inside a workspace | Bound immutably to (agent_id, workspace_id) |

- agent_register upserts the logical identity by agent_id; re-registering updates role/capabilities/metadata without changing the id. It is independent of sessions and workspaces.
- session_open creates a session whose session_id is assigned by the server (ses…), bound to (agent_id, workspace_id); both the workspace and the agent must already exist (NOT_FOUND otherwise), and a workspace is required (WORKSPACE_REQUIRED). The response includes a per-session session_secret (uuid4, returned only here, so keep it): in trust_mode=strict the sensitive verbs (message_create, handoff_claim/complete/reject, inbox_pull/ack/extend) require session_id + session_secret, and in open mode a supplied secret is always validated (a wrong credential is never ignored).
Session status is derived, not persisted as truth. session_heartbeat
advances last_heartbeat_at and reports a derived status. Stale is computed at
read time: an active session whose last heartbeat is older than the stale TTL
(default 60 s, via OKTO_NEXUS_SESSION_STALE_TTL_SECONDS) is reported as
stale, while its row stays active. session_close is idempotent: the
first call sets status='closed' + closed_at and emits session.closed; a
second call is a no-op (keeps the original closed_at, does not re-emit).
Presence is one predicate. A session is present iff its persisted status
is active and its last heartbeat is within presence_ttl_seconds
(default 30 min). IdentityService.list_present is the single presence
read, and the message broadcast audience uses the exact same predicate —
a session listed as present is exactly a session that would receive a
broadcast, so "who is present" has one answer bus-wide. Heartbeat regularly:
agents whose every active session has gone heartbeat-stale are excluded from
broadcasts (surfaced to the sender in excluded_stale). There is still no
background thread, but session_open / session_heartbeat opportunistically
reap sessions silent past session_reap_seconds (default 24 h, closed
with reason stale) — dead sessions that never called session_close stop
accumulating state forever.
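Both read-time derivations are small pure functions. The sketch below uses hypothetical function names and plain float timestamps; the boundary handling (strictly older than the stale TTL, within the presence TTL inclusive) is an assumption where the text is not explicit.

```python
def derived_status(persisted_status: str, last_heartbeat_at: float,
                   now: float, stale_ttl: float = 60.0) -> str:
    """Report 'stale' for an active session with an old heartbeat.

    The persisted row is untouched; only the reported value changes.
    """
    if persisted_status == "active" and now - last_heartbeat_at > stale_ttl:
        return "stale"
    return persisted_status


def is_present(persisted_status: str, last_heartbeat_at: float,
               now: float, presence_ttl: float = 1800.0) -> bool:
    """The single presence predicate: active and heartbeat within the window."""
    return persisted_status == "active" and now - last_heartbeat_at <= presence_ttl
```

With the defaults, a session that last heartbeated two minutes ago reads as stale yet is still present, so it stays in the broadcast audience until the 30-minute window closes.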
Session events (session.opened/session.closed) are emitted on the canonical workspace stream with visibility="public" and no routing target, so they ARE observable via event_get/event_wait like message/artifact events. (Historically they used an internal "session" stream with visibility="workspace"; neither value is in the canonical vocabularies, so those events were invisible to every consumer. The emit path now rejects out-of-vocabulary values fail-closed.)
Event log + streams + long-polling
events is a single, global, append-only log. event_id is
INTEGER PRIMARY KEY AUTOINCREMENT; under the WAL single-writer it is globally
monotonic, never reused, never altered (no UPDATE/DELETE exists for the
log). Each event is assigned inside the transaction of the slice that emits
it (atomic coupling via EventEmitter.emit).
Streams are semantic filters, not physical partitions:
VALID_STREAMS = { workspace, agent, handoff }
validate_stream rejects anything else with INVALID_STREAM before any scan.
Note that message.created and artifact.created are published on the
workspace stream (the base observable stream) so consumers of event_get /
event_wait can actually see them; handoff events go on the handoff stream.
event_get is a non-blocking, cursor-paginated read:
- cursor is the last event_id consumed; the scan selects event_id > cursor (normalize_cursor: integer ≥ 0, bool rejected).
- filters keys are enumerated {type, agent_id, task_id, handoff_id} (equality, combined with AND); task_id / handoff_id come from the payload (not columns in V1).
- Visibility (can_agent_see_event) is applied in the application layer, so next_cursor advances past every examined event (filtered or not-visible) — nothing already scanned is re-returned.
- Returns {events, next_cursor, has_more, timed_out: false}. has_more is true iff a matched + visible event exists strictly beyond the page (via a batch_size = limit + 1 probe). A "poisoned" event (malformed visibility/target) is treated as not-visible, so one bad event never wedges the stream.
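The paging rules above can be sketched over an in-memory log. This is a minimal illustration, not the project's code: page_events, matches and visible are hypothetical names, and the log is assumed to be a list of dicts in ascending event_id order.

```python
def page_events(log, cursor, limit, matches, visible):
    """Return one page of events with event_id > cursor (illustrative sketch).

    `matches` stands in for the equality filters and `visible` for
    can_agent_see_event; both are hypothetical callables taking one event.
    """
    events = []
    next_cursor = cursor
    has_more = False
    for event in log:
        if event["event_id"] <= cursor:
            continue
        if matches(event) and visible(event):
            if len(events) == limit:
                # This is the limit + 1 probe: one more hit exists beyond the page.
                has_more = True
                break  # the cursor stays before the probe, so the next page returns it
            events.append(event)
        # Returned, filtered out, or not visible: the event was examined,
        # so the cursor moves past it and it is never scanned again.
        next_cursor = event["event_id"]
    return {"events": events, "next_cursor": next_cursor,
            "has_more": has_more, "timed_out": False}
```

Because the cursor also advances over events the caller cannot see, a long run of filtered events costs one scan rather than being revisited on every call.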
event_wait is event_get plus a bounded wait on the Waiter port (no
socket/thread/subscription): it scans before waiting, so a non-empty first
page returns immediately (timed_out=false); on timeout it returns
events:[], the entry cursor unchanged, and timed_out=true. The V1
waiter (SleepPollWaiter) sleeps in poll_interval_ms steps and, between
sleeps, probes SQLite's PRAGMA data_version (a cheap cross-process change
counter) — the log is re-scanned only when some process committed a write,
never once per interval; a degraded probe fails open (re-scan after one
interval, so the read path surfaces any real error). Clamps: limit
None→default, <1 or non-int → VALIDATION_ERROR, above max → pinned to
max; timeout_seconds None→ceiling at the service layer (the MCP tool
defaults it to 0 = snapshot), <= 0 → a single event_get with no sleep,
else min(timeout, ceiling).
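The clamping rules can be written as two small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, the defaults 100 and 1000 are the documented event limits, the ceiling is passed in because its configured value is not given here, and a bool limit is treated as non-int by analogy with the cursor and timeout rules.

```python
def clamp_limit(limit, default=100, maximum=1000):
    """None -> default; < 1 or non-int -> error; above maximum -> pinned."""
    if limit is None:
        return default
    # Assumption: bool is rejected like any other non-int value.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("VALIDATION_ERROR: limit must be an integer >= 1")
    return min(limit, maximum)


def clamp_timeout(timeout_seconds, ceiling):
    """Service-layer rule: None -> ceiling; <= 0 -> snapshot; else capped."""
    if timeout_seconds is None:
        return ceiling
    if timeout_seconds <= 0:
        return 0  # a single event_get with no sleep
    return min(timeout_seconds, ceiling)
```

The MCP tool passes 0 when timeout_seconds is omitted, so the None branch is only reachable from callers that go to the service layer directly.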
Monitoring patterns (background follower)
These patterns are for observing the bus (audit/monitoring) — to receive
messages addressed to you, use your inbox, not
these. event_wait is a blocking long-poll: with timeout_seconds > 0 the
call parks the caller's turn until an event arrives or the timeout expires. Pick
the mode that fits your harness so it is never forced to block:
- Background follower (recommended). If your harness can spawn a detached process, run the CLI follower and treat each NDJSON line as a notification — the agent loop stays free, idle cost ~0. It is the layer-clean replacement for reading nexus.db directly (visibility/routing stay enforced):

      okto-nexus tail --project-root <path> --agent-id <you> --from latest \
        --exclude-agent <you>   # drop your own echo

- In-loop, no background. Call with timeout_seconds=0 for a non-blocking snapshot (single scan, no sleep) and poll between turns, advancing cursor → next_cursor. Don't use a long timeout if you can't park the turn.
- Targeted wait. A short timeout_seconds (e.g. 30) is fine to await the reply to a message you just sent, accepting the block.
Four things a real monitor must handle — the tail follower does all of them:
- Own echo. The follower emits every visible event, including the caller's own, so a naive monitor reacts to itself. --exclude-agent <you> drops them client-side (the event_wait filter is equality-only, so exclusion can't be server-side); --from-agent <x> includes a single author server-side.
- Transient locks. A momentary WAL lock surfaces as DB_ERROR; the follow loop retries those with bounded backoff (counter reset on each successful poll) while failing fast on terminal errors and surfacing a transient that won't clear. The cursor is not advanced across a failed poll, so no event is skipped.
- Resume across restarts. --cursor-file PATH (opt-in; the consumer owns the path) checkpoints the cursor atomically after every non-empty window; on restart an existing file takes precedence over --from, and a corrupted file fails closed (CONFIG_ERROR). A crash between flush and persist re-emits at most one window — at-least-once, coherent with the inbox lease/ack model.
- Passive presence. The follower never stamps the agent's last_seen_at (unlike the MCP event_get / event_wait tools), so a detached monitor cannot keep an otherwise-idle agent's liveness signal eternally fresh.
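For a harness that writes its own monitor instead of running the tail follower, the first two concerns can be sketched as below. Everything here is an assumption made for illustration: poll is a hypothetical callable that takes a cursor and returns the canonical envelope, the error object is assumed to carry a code field, and the retry budget and backoff values are arbitrary.

```python
import time

TRANSIENT_CODES = {"DB_ERROR"}  # assumption: only DB_ERROR is worth retrying


def follow(poll, cursor, exclude_agent=None, max_retries=5,
           base_delay=0.1, sleep=time.sleep, max_windows=None):
    """Yield visible events, retrying transient failures with bounded backoff."""
    failures = 0
    windows = 0
    while max_windows is None or windows < max_windows:
        reply = poll(cursor)
        if not reply["ok"]:
            code = reply["error"]["code"]
            if code not in TRANSIENT_CODES or failures >= max_retries:
                raise RuntimeError(code)  # terminal, or a transient that won't clear
            failures += 1
            sleep(min(base_delay * 2 ** failures, 5.0))
            continue  # the cursor is untouched, so no event is skipped
        failures = 0  # reset the counter on every successful poll
        windows += 1
        data = reply["data"]
        for event in data["events"]:
            if exclude_agent is not None and event.get("actor_agent_id") == exclude_agent:
                continue  # drop the caller's own echo client-side
            yield event
        cursor = data["next_cursor"]
```

Passing sleep and max_windows as parameters keeps the loop testable without real delays; a long-running monitor would leave both at their defaults.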
The block is a property of the current stdio long-poll. The roadmap's SSE/HTTP transport would replace polling with server push (no blocking, no busy-wait), at which point the background-follower workaround becomes optional.
Channels & messages
Channels are lightweight organizational labels, not access boundaries: there
is no membership or ACL, so every agent in the workspace can read and post to any
channel, and a channel never decides who receives a message — that is the
message target (see Routing). Only general is seeded per workspace
(idempotently, the first time channel_list touches it — no write path seeds it);
agents create any other channel they need with channel_create (idempotent
by name), so the bus is not pinned to a single purpose. A channel only tags a
message; it does not gate delivery (you receive messages through your
inbox, not by channel).
message_create requires non-empty from_agent_id, subject, and body
(VALIDATION_ERROR otherwise), enforces a 64 KB inclusive inline limit on
subject/body (CONTENT_TOO_LARGE beyond), accepts artifacts only as a
list of artifact_id references (never inline blobs), validates target
against the shared routing schema, and supports threading via
parent_message_id (the parent must exist in the same workspace — and, when a
channel is set, the same channel — else NOT_FOUND). It then resolves the
recipient set and fans out one inbox delivery per recipient (see
Message delivery & the inbox). The message row,
its deliveries, and the message.created event commit in the same unit of
work; the event goes on the workspace stream with visibility = "eligible" if target else "public" for observability.
Message delivery & the inbox
Receiving a message is not a log scan (V1's cursor model is gone, ADR 0001).
At send time message_create resolves the recipient set and writes one
delivery row per recipient into that recipient's global inbox — keyed by
agent_id, so a direct message reaches you in any workspace (the message keeps a
workspace_id only as context). The append-only event log still records
message.created for observability/tail; it is no longer the delivery channel.
Recipient resolution (domain/inbox.py, reusing is_agent_eligible):
direct/capability/role/mixed resolve against the global agent
registry; a bare top-level broadcast (and a no-target message) is bounded to
the sender's workspace — its present participants (sessions whose
heartbeat is within presence_ttl_seconds; the same single predicate
list_present uses) — and excludes the sender. Agents excluded because every
active session of theirs is heartbeat-stale are reported explicitly in
excluded_stale + a warning — reach them with a direct message (delivered
regardless of presence). A direct target naming an unregistered agent is
NOT_FOUND (the whole send rolls back); a group target matching nobody
succeeds with recipients: [] + a warning (never a silent drop).
direct_with_fallback and a broadcast nested in a mixed are rejected
(VALIDATION_ERROR) — they would broadcast globally; model timed escalation
as a handoff.
Index-free lanes, at-least-once, with a dead letter. Each delivery moves
unread → delivered (in-flight, leased) → read (history). inbox_pull
atomically claims your claimable deliveries (unread + your own lease-expired
redeliveries) into in-flight and returns them with their body — no cursor;
inbox_extend renews the lease mid-turn; inbox_ack settles them into history;
an unacked in-flight delivery whose lease elapses is redelivered (exactly
the handoff lease mechanic), and a delivery that exhausts its claim attempts is
parked (dead-letter — visible via inbox_peek(include_parked=true) for the
recipient and message_status for the sender). inbox_count/inbox_peek are
READ-ONLY (no sweep — expired leases are projected as unread at read
time) and are the cheap between-turns check; inbox_history is the read
archive (keyset-paginated). This is the durable, lossless replacement for the
V1 cursor/message_wait model.
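The read-only projection used by inbox_count and inbox_peek can be illustrated with a pure function. This is a sketch, not the project's code: the dict layout and numeric timestamps are hypothetical, and the strict comparison is an assumption borrowed from the handoff lease rule.

```python
from collections import Counter


def effective_lane(status, lease_expires_at, now):
    """Project a stored lane to the lane a reader should see, without writing."""
    # Assumption: strict threshold, mirroring the handoff lease rule.
    if status == "delivered" and lease_expires_at is not None and lease_expires_at < now:
        return "unread"  # the lease elapsed, so the delivery is claimable again
    return status


def inbox_count(deliveries, now):
    """Lane sizes computed at read time; parked deliveries are not counted."""
    lanes = Counter(
        effective_lane(d["status"], d["lease_expires_at"], now) for d in deliveries
    )
    return {"unread": lanes["unread"],
            "in_flight": lanes["delivered"],
            "read": lanes["read"]}
```

Nothing is swept here: the stored row keeps its status until the next inbox_pull claims it, which is why the count can run without taking the writer lock.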
Routing / visibility / eligibility
Eligibility/visibility live in domain/routing.py; the target grammar
(shape validation) lives in domain/targets.py — the single source of
truth every slice parses through (message send validation, handoff
descriptors, routing), so a target accepted on one write path can never be
rejected or silently reinterpreted on another. All are total, deterministic,
I/O-free functions. Malformed input → VALIDATION_ERROR; a well-formed rule
that simply does not match → False (not an error). Grammar constraints worth
knowing: mixed requires a non-empty rules list, a null/blank sub-rule
is rejected (it would silently resolve to a covert broadcast), and a
broadcast nested in a mixed is rejected everywhere — use a bare top-level
{"strategy": "broadcast"} instead.
Six target strategies (is_agent_eligible) — discriminator strategy/kind,
case-insensitive, -/space normalized to _:
| Strategy | Rule |
|---|---|
| direct | agent_id == target.agent_id (exact, opaque ids) |
| capability | capability held by the agent — exact, case-sensitive (string = membership; list = any-of); preferred is advisory only |
| role | agent.role == target.role (exact, case-sensitive) |
| broadcast | every agent in the workspace is eligible |
| mixed | union/OR of target.rules (each a sub-target); short-circuits on first match |
| direct_with_fallback | the direct agent is always eligible; after now >= created_at + fallback_after_seconds (inclusive, monotonic) the fallback (default broadcast) also becomes eligible |
An absent/blank target means no restriction → broadcast (everyone
eligible). direct_with_fallback requires valid created_at and now.
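A simplified sketch of the six strategies follows. It is an illustration under assumptions, not the code in domain/routing.py: the agent is a plain dict whose capabilities are already a set, the target keys capability and fallback are guessed names, timestamps are plain numbers, and shape validation is reduced to the few checks shown.

```python
def normalize_strategy(value):
    """Case-insensitive discriminator with '-' and spaces folded to '_'."""
    return str(value).strip().lower().replace("-", "_").replace(" ", "_")


def is_agent_eligible(agent, target, created_at=None, now=None):
    if not target:
        return True  # absent/blank target: no restriction, everyone is eligible
    strategy = normalize_strategy(target.get("strategy") or target.get("kind"))
    if strategy == "direct":
        return agent["agent_id"] == target["agent_id"]
    if strategy == "capability":
        wanted = target["capability"]  # hypothetical key name
        wanted = [wanted] if isinstance(wanted, str) else list(wanted)
        return any(cap in agent["capabilities"] for cap in wanted)
    if strategy == "role":
        return agent.get("role") == target["role"]
    if strategy == "broadcast":
        return True
    if strategy == "mixed":
        rules = target.get("rules")
        if not rules or any(not rule for rule in rules):
            raise ValueError("VALIDATION_ERROR: mixed needs non-empty rules")
        # any() stops at the first matching sub-target.
        return any(is_agent_eligible(agent, rule, created_at, now) for rule in rules)
    if strategy == "direct_with_fallback":
        if created_at is None or now is None:
            raise ValueError("VALIDATION_ERROR: created_at and now are required")
        if agent["agent_id"] == target["agent_id"]:
            return True
        if now >= created_at + target["fallback_after_seconds"]:
            fallback = target.get("fallback") or {"strategy": "broadcast"}
            return is_agent_eligible(agent, fallback, created_at, now)
        return False
    raise ValueError("VALIDATION_ERROR: unknown strategy")
```

A well-formed rule that does not match returns False, while a malformed one raises, which keeps "nobody matched" distinguishable from "the target was wrong".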
Three visibilities (can_agent_see_event), layered over eligibility and
strictly workspace-isolated (agent and item must share the same
workspace_id, both present, else False):
| Visibility | Who sees it |
|---|---|
| public | any agent in the same workspace |
| eligible | only agents is_agent_eligible approves |
| private | eligible-only; for non-direct targets, private == eligible (never widens beyond the eligible set) |
Default when the log's visibility is null: public. Sender
carve-out: the item's actor (actor_agent_id) always sees its own event —
eligibility never excludes a sender from its own audit trail (ADR 0001:
"you have a delivery row or you are the sender"). The emit path is
fail-closed (validate_emit_vocabulary): an out-of-vocabulary
stream/visibility (or a blank type) is rejected with INTERNAL_ERROR
before anything persists, so no event can ever be written that no consumer
could read.
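The visibility layering can be sketched as follows. This is an illustrative reading of the rules above rather than the actual can_agent_see_event: the eligibility check is injected as a callable, the dict shapes are hypothetical, and private is treated exactly like eligible.

```python
def can_agent_see_event(agent, event, is_eligible):
    """Workspace isolation first, then sender carve-out, then visibility."""
    agent_ws = agent.get("workspace_id")
    event_ws = event.get("workspace_id")
    if not agent_ws or not event_ws or agent_ws != event_ws:
        return False  # both ids must be present and equal
    if event.get("actor_agent_id") == agent["agent_id"]:
        return True  # a sender always sees its own event
    visibility = event.get("visibility") or "public"  # null defaults to public
    if visibility == "public":
        return True
    if visibility in ("eligible", "private"):
        return bool(is_eligible(agent, event.get("target")))
    return False  # out-of-vocabulary visibility: treated as not visible
```

Returning False for an unknown visibility is what lets a single malformed row be skipped instead of blocking every reader behind it.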
Claimability ≠ visibility. Seeing is not acting. An item can be public
(visible to the whole workspace) while only its target may claim it. So even a
universally visible handoff still checks is_agent_eligible on claim and, if it
fails, returns NOT_ELIGIBLE_TO_CLAIM.
Handoffs & leases
Status set: OPEN, CLAIMED, COMPLETED, REJECTED, CANCELLED.
(IN_PROGRESS, BLOCKED, EXPIRED are reserved for forward-compat, with no
producer.)
(none) --create--> OPEN
OPEN --claim--> CLAIMED
OPEN --reject (direct)--> REJECTED
OPEN --cancel (creator)--> CANCELLED
CLAIMED --complete (owner)--> COMPLETED
CLAIMED --reject (owner)--> REJECTED
CLAIMED --expire_old_leases--> OPEN
COMPLETED / REJECTED / CANCELLED are terminal; any transition off
this table → INVALID_TRANSITION. On handoff_create, visibility is
mandatory and explicit (missing/invalid → VALIDATION_ERROR). Owner rules:
complete only by claimed_by (NOT_OWNER) and only from CLAIMED; reject
by the owner of a CLAIMED, or by the direct target of an as-yet-unclaimed
OPEN; cancel only by the creator (from_agent_id) and only while OPEN
(a CLAIMED handoff is resolved by its claimant; the creator can cancel it only
after the lease expires and it reopens).
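The transition table maps directly onto a lookup. This sketch covers only the status changes; the owner rules (who may complete, reject or cancel) are deliberately left out, and the verb strings are shorthand chosen for the example.

```python
# (current status, verb) -> next status, copied from the table above.
TRANSITIONS = {
    ("OPEN", "claim"): "CLAIMED",
    ("OPEN", "reject"): "REJECTED",
    ("OPEN", "cancel"): "CANCELLED",
    ("CLAIMED", "complete"): "COMPLETED",
    ("CLAIMED", "reject"): "REJECTED",
    ("CLAIMED", "expire"): "OPEN",
}


def next_status(status, verb):
    """Return the status after `verb`, or raise for anything off the table."""
    try:
        return TRANSITIONS[(status, verb)]
    except KeyError:
        raise ValueError("INVALID_TRANSITION") from None
```

Terminal statuses need no special case: they simply have no entries, so every verb applied to them falls through to INVALID_TRANSITION.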
Create-time recipient policy (D1b, mirrors message_create). A direct
target naming an unregistered agent is a hard NOT_FOUND (full rollback —
a typo'd handoff must not sit OPEN forever; use direct_with_fallback for an
agent that will register later). A pool target
(capability/role/mixed/broadcast/direct_with_fallback) matching zero
currently-registered agents succeeds with eligible_count: 0 + an explicit
warning — never silent, never fatal (eligibility is lazily re-evaluated at
claim time, so a later registrant can still claim; handoff_cancel retracts a
mistake).
Outcome persistence + inbox notifications. A directed handoff
(direct/direct_with_fallback naming a registered agent) lands one synthetic
notification in the named agent's inbox in the same transaction
(notified in the response) — the wakeup needs no polling; claiming still
happens via handoff_claim. handoff_complete persists result (and
handoff_reject persists rejected_reason) on the handoff row and
notifies the creator's inbox with the outcome. handoff_get is the
creator's path to a finished handoff's status/result by id — terminal handoffs
leave handoff_list_available, and lifecycle events are observability, not
delivery.
Atomic claim — single winner, no TOCTOU. Claiming is one conditional UPDATE:
UPDATE handoffs
SET status='CLAIMED', claimed_by=?, lease_expires_at=?, updated_at=?
WHERE handoff_id=? AND status='OPEN' AND workspace_id=?
rowcount == 1 → winner (no select-then-update, no race). rowcount == 0 →
reload to classify: present in workspace → HANDOFF_ALREADY_CLAIMED; present in
another → WORKSPACE_MISMATCH; else NOT_FOUND. No event is emitted on the
failure path. Before the UPDATE, the service still gates eligibility
(NOT_ELIGIBLE_TO_CLAIM).
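The single-winner claim can be reproduced with the standard sqlite3 module. This is a sketch under assumptions: the table is cut down to the columns the UPDATE touches, timestamps are integer seconds, the function name is made up, and the eligibility gate and event emission are omitted.

```python
import sqlite3

# Hypothetical minimal schema; the real handoffs table has more columns.
SCHEMA = """
CREATE TABLE handoffs (
    handoff_id TEXT PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    status TEXT NOT NULL,
    claimed_by TEXT,
    lease_expires_at INTEGER,
    updated_at INTEGER
)
"""


def claim_handoff(conn, handoff_id, workspace_id, agent_id, now, ttl):
    """Try to claim; return 'CLAIMED' or the error code that classifies the miss."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE handoffs "
            "SET status='CLAIMED', claimed_by=?, lease_expires_at=?, updated_at=? "
            "WHERE handoff_id=? AND status='OPEN' AND workspace_id=?",
            (agent_id, now + ttl, now, handoff_id, workspace_id),
        )
        if cur.rowcount == 1:
            return "CLAIMED"  # exactly one caller can ever see rowcount == 1
        # Nothing changed: reload only to explain why.
        row = conn.execute(
            "SELECT workspace_id FROM handoffs WHERE handoff_id=?", (handoff_id,)
        ).fetchone()
    if row is None:
        return "NOT_FOUND"
    if row[0] != workspace_id:
        return "WORKSPACE_MISMATCH"
    return "HANDOFF_ALREADY_CLAIMED"
```

The status check lives in the WHERE clause, so there is no window between reading OPEN and writing CLAIMED for a second claimant to slip through.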
Leases (TTL) and opportunistic expiry. On claim,
lease_expires_at = now + handoff_lease_ttl_seconds. expire_old_leases runs
opportunistically before handoff_list_available and handoff_claim (no
background job). Threshold is strict: lease_expires_at < now expires;
== now does not. Expiring reopens CLAIMED → OPEN (clearing
claimed_by/lease_expires_at, again via a conditional UPDATE) and emits
handoff.expired in the same transaction — returning abandoned work to the pool
without a dedicated reaper. handoff_list_available returns the OPEN handoffs
that are visible and eligible to the caller, paginated, with optional
long-poll (same waiter-gated mechanics and snapshot-by-default as event_wait).
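Opportunistic expiry follows the same conditional-UPDATE pattern. The sketch below assumes a cut-down table and integer timestamps, uses a made-up function signature, and leaves out the handoff.expired event that the real path emits in the same transaction.

```python
import sqlite3

# Hypothetical minimal schema, limited to the columns the expiry touches.
SCHEMA = """
CREATE TABLE handoffs (
    handoff_id TEXT PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    status TEXT NOT NULL,
    claimed_by TEXT,
    lease_expires_at INTEGER,
    updated_at INTEGER
)
"""


def expire_old_leases(conn, workspace_id, now):
    """Reopen CLAIMED handoffs whose lease is strictly in the past."""
    with conn:
        cur = conn.execute(
            "UPDATE handoffs "
            "SET status='OPEN', claimed_by=NULL, lease_expires_at=NULL, updated_at=? "
            "WHERE workspace_id=? AND status='CLAIMED' AND lease_expires_at < ?",
            (now, workspace_id, now),
        )
        return cur.rowcount  # number of handoffs returned to the pool
```

A lease that expires exactly at now is left alone because of the strict `<`, so a claimant that finishes on the boundary still owns the handoff.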
Trust. handoff_claim / handoff_complete / handoff_reject are
sensitive verbs: in trust_mode=strict they require session_id +
session_secret (from session_open); in open mode the credentials are
optional but a supplied secret is always validated.
Artifacts & path safety
Type whitelist (closed, case-insensitive): {file, text, json, markdown}.
Inline limit is 64 KB inclusive (ensure_within_inline_limit measures UTF-8
byte length; <= 65536 accepted, one byte over → CONTENT_TOO_LARGE with a hint
to store as artifact_type='file' + path). For json, well-formedness is
validated after the size ceiling (so a giant JSON is rejected by size, not an
expensive parse).
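The size-before-parse ordering is easy to get wrong, so here is a minimal sketch. ensure_within_inline_limit is named in the text, but its signature here is a guess; validate_json_artifact is a hypothetical helper, and the error raised for malformed JSON is a placeholder rather than a documented code.

```python
import json

MAX_INLINE_BYTES = 65536  # 64 KB, inclusive


def ensure_within_inline_limit(content, limit=MAX_INLINE_BYTES):
    """Measure UTF-8 bytes, not characters; exactly `limit` bytes is accepted."""
    size = len(content.encode("utf-8"))
    if size > limit:
        raise ValueError("CONTENT_TOO_LARGE")
    return size


def validate_json_artifact(content):
    """Check the size ceiling first so an oversized document is never parsed."""
    size = ensure_within_inline_limit(content)
    try:
        json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed JSON") from exc  # placeholder error
    return size
```

Counting bytes matters for non-ASCII text: 32768 two-byte characters already fill the whole 64 KB budget.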
Path containment (adapters/outbound/file/store.py): every path reference
is checked with
resolved = realpath( join(root_real, relative_path) )
contained = commonpath([normcase(root), normcase(resolved)]) == normcase(root)
commonpath + realpath catch .. escapes, external absolute paths, and
symlinks whose real target leaves the root → PATH_OUTSIDE_WORKSPACE; normcase
covers case-insensitive filesystems (Windows), and a ValueError (e.g. different
drives) counts as not-contained. A path supplied alongside content must still
be contained — escaping fails the whole put.
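The containment check translates almost line for line into the standard library. This is a sketch of the technique rather than the code in adapters/outbound/file/store.py; the function name and boolean return are choices made for the example.

```python
import os


def is_contained(root, relative_path):
    """True when relative_path, fully resolved, stays inside root."""
    root_real = os.path.realpath(root)
    # realpath collapses '..' segments and follows symlinks to their real target;
    # joining an absolute path discards root_real, so it is caught below too.
    resolved = os.path.realpath(os.path.join(root_real, relative_path))
    root_norm = os.path.normcase(root_real)
    try:
        common = os.path.commonpath([root_norm, os.path.normcase(resolved)])
    except ValueError:
        return False  # e.g. paths on different Windows drives
    return common == root_norm
```

A caller would map a False result to PATH_OUTSIDE_WORKSPACE and abort the whole put.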
Other invariants: artifact_id is server-assigned (art…) and immutable;
artifact.created is emitted in the same UoW on the workspace stream with
visibility="public"; size is the content byte length (inline) or os.path.getsize
(path — a stat, never a read). artifact_get returns NOT_FOUND for an
unknown or other-workspace id (no leak); a stored=path artifact returns only
path + metadata, never the referenced file's bytes.
shared.md — a derived view
shared_md_render writes a per-workspace, human-readable view at
{home}/workspaces/{workspace_id}/shared.md. The workspace_id must be 64-hex
lowercase (VALIDATION_ERROR otherwise; WORKSPACE_REQUIRED if absent;
NOT_FOUND if well-formed but nonexistent). It is never a source of truth —
SQLite (WAL) is the only truth, and this module only produces the file (the
system never reads it back). The renderer writes a temp file in the same
directory and os.replaces it over shared.md (atomic overwrite — no
observer sees a partial file; an FS failure → RENDER_ERROR with no partial
left behind).
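The temp-file-then-replace pattern looks like this in Python. It is a minimal sketch: write_atomically is a hypothetical name, the temp-file prefix is arbitrary, and the mapping of a failure to RENDER_ERROR is left to the caller.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must share the target's directory so os.replace stays a
    # same-filesystem rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".shared-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # atomic overwrite of any existing file
    except OSError:
        # Leave no partial file behind, then let the caller report the failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Writing with a fixed newline and no timestamp in the content is also what makes two renders of the same state byte-identical.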
Four fixed-order sections: (1) relevant agents / active sessions, (2) open tasks,
(3) open handoffs, (4) recent events (newest-first by descending event_id,
capped by limit_events, default 50, ceiling 1000). The read is a single
read-only transaction over committed state and embeds no wall-clock, so two
renders over the same state are byte-identical. shared.md is the
workspace-public view: the routing target of a non-public handoff
(eligible/private) is redacted as [private] instead of rendered raw, and
handoff payloads are never rendered.
Tool Reference
35 MCP tools — 34 across seven slices, auto-discovered from
adapters/inbound/mcp/tools/, plus the server-level nexus_info. Every tool
returns the canonical envelope (success {ok:true,data} / failure
{ok:false,error}); the @tool_envelope decorator guarantees no exception
crosses the boundary. Consequently every tool may also return
INTERNAL_ERROR (and DB_ERROR when a SQLite repo fails) in addition to the
codes listed below.
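A decorator with that guarantee might look like the sketch below. It is not the project's @tool_envelope: ToolError is a hypothetical exception type, and the error object's code and message fields are assumptions about the envelope's shape.

```python
import functools


class ToolError(Exception):
    """Hypothetical exception carrying a catalog error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def tool_envelope(func):
    """Wrap a tool so it always returns an envelope and never raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "data": func(*args, **kwargs)}
        except ToolError as exc:
            return {"ok": False, "error": {"code": exc.code, "message": exc.message}}
        except Exception as exc:  # anything unexpected becomes INTERNAL_ERROR
            return {"ok": False,
                    "error": {"code": "INTERNAL_ERROR", "message": str(exc)}}

    return wrapper
```

The catch-all branch is why INTERNAL_ERROR appears as a possible result for every tool, even those that list no specific errors.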
nexus_info
Report the server's versions — call it when behaviour seems to disagree with your cached tool schemas.
- Request: none.
- Data: {package_version, schema_version, surface_revision} — surface_revision increments on every change to the tool surface (names/parameters/defaults/semantics); schema_version is the highest applied DB migration; package_version is the installed distribution version ("dev" for a source checkout).
- Errors: none specific (boundary INTERNAL_ERROR / DB_ERROR only).
Identity & Workspace
workspace_resolve
Resolve project_root to its deterministic workspace_id and upsert the
workspace row.
- Request: project_root: str (required); display_name: str | None (default None).
- Data: {workspace_id, display_name, root_realpath, created_at, last_seen_at}.
- Errors: VALIDATION_ERROR (missing/empty or non-absolute project_root), WORKSPACE_UNRESOLVED (unresolvable realpath).
workspace_list
GLOBAL-ADMIN — enumerate all workspaces (a deliberately cross-workspace
surface, alongside the global agent_list/agent_get).
- Request: none.
- Data: {workspaces: [{workspace_id, display_name, root_realpath, created_at, last_seen_at}, …]}.
- Errors: none specific (boundary INTERNAL_ERROR / DB_ERROR only).
agent_register
Upsert a logical, global agent identity; re-registering updates role/capabilities/metadata.
- Request: agent_id: str (required); role: str | None; capabilities: Any; metadata: Any (all default None).
- Data: {agent_id, role, capabilities, metadata, created_at, updated_at, last_seen_at}.
- Errors: VALIDATION_ERROR (missing agent_id; or capabilities/metadata not JSON-serializable), CONTENT_TOO_LARGE (capabilities/metadata inline > max_inline_bytes).

last_seen_at (presence). Every agent-attributed operation stamps the agent's last_seen_at (best-effort; a no-op for an unregistered actor): the identity ops (agent_register, session_*), message_create, the inbox ops (inbox_pull / inbox_ack), the handoff mutations (create/claim/complete/reject), and event_get / event_wait. Surfaced by agent_list and agent_get.
agent_list
Enumerate all registered agents (global; the agent-discovery surface for
addressing — find an agent_id before a direct message / directed handoff).
- Request: none.
- Data: {agents: [{agent_id, role, capabilities, metadata, created_at, last_seen_at}, …]}, ordered by created_at. last_seen_at is the agent's most recent action (null if it never acted).
- Errors: none specific (boundary INTERNAL_ERROR / DB_ERROR only).
agent_get
Return one agent's full details, including last_seen_at (its latest interaction).
- Request: agent_id: str (required).
- Data: {agent_id, role, capabilities, metadata, created_at, last_seen_at}.
- Errors: VALIDATION_ERROR (missing agent_id), NOT_FOUND (no such agent).
capability_list
GLOBAL — enumerate the distinct capabilities advertised across all registered
agents (discovery for capability-targeted addressing: know a capability exists
and who would match it before a target: {strategy:"capability"}).
- Request: none.
- Data: {capabilities: [{capability, agent_count, agents:[…]}, …]}, sorted by capability (and agents sorted per capability). Normalised exactly as capability routing matches: a flag-mapping keeps truthy keys, a list/string is its set; a falsey flag or blank name is excluded (so the advertised set equals the addressable set).
- Errors: none specific (boundary INTERNAL_ERROR / DB_ERROR). Malformed capabilities are rejected at agent_register, so a stored value always normalises cleanly here.
session_open
Open a session bound immutably to (agent_id, workspace_id); server assigns
session_id (ses…) and a per-session session_secret (uuid4, returned
ONLY here — keep it for the sensitive verbs). Emits session.opened and
opportunistically reaps the workspace's sessions silent past
session_reap_seconds.
- Request: agent_id: str (required); workspace_id: str | None (default None, but required in practice); metadata: Any (default None).
- Data: {session_id, agent_id, workspace_id, status:"active", started_at, last_heartbeat_at, session_secret}.
- Errors: WORKSPACE_REQUIRED, VALIDATION_ERROR (missing agent_id; non-serializable metadata), CONTENT_TOO_LARGE (metadata inline), NOT_FOUND (workspace or agent nonexistent). No row created on failure.
session_heartbeat
Advance last_heartbeat_at and report derived status (active/stale). Emits
session.heartbeat. Heartbeating keeps you present (in the broadcast
audience, window presence_ttl_seconds) and clear of the opportunistic
stale-session reaper (which this call also runs for the workspace's long-dead
sessions).
- Request: session_id: str (required); workspace_id: str | None (default None; if given, membership is validated).
- Data: {session_id, status, last_heartbeat_at}.
- Errors: VALIDATION_ERROR (missing session_id), NOT_FOUND, WORKSPACE_MISMATCH. No mutation on failure.
session_close
Close a session (idempotent); repeating returns ok and stays closed. Only the
transition to closed emits session.closed.
- Request: session_id: str (required); workspace_id: str | None (default None).
- Data: {session_id, agent_id, workspace_id, status:"closed", started_at, last_heartbeat_at, closed_at}.
- Errors: VALIDATION_ERROR (missing session_id), NOT_FOUND, WORKSPACE_MISMATCH.
Events & Polling
Valid streams: {workspace, agent, handoff}. Valid filters keys (equality, AND-combined): {type, agent_id, task_id, handoff_id}. Each event: {event_id, workspace_id, stream, type, payload, actor_agent_id, task_id, handoff_id, created_at}. limit defaults to 100, max 1000 (override OKTO_NEXUS_MAX_EVENT_LIMIT).
event_get
Non-blocking, cursor-paginated read of the workspace event log (filters + visibility applied before the envelope).
- Request: project_root: str (required); agent_id: str (required); stream: str (required); cursor: int | None (None→0); limit: int | None (None→100); filters: dict | None (default None).
- Data: {events: [...], next_cursor: int, has_more: bool, timed_out: false}.
- Errors: WORKSPACE_REQUIRED (missing project_root), INVALID_STREAM, VALIDATION_ERROR (missing agent_id; non-int/negative cursor; non-int/<1 limit; non-mapping filters or unknown key), WORKSPACE_UNRESOLVED. Precedence: WORKSPACE_REQUIRED → INVALID_STREAM → VALIDATION_ERROR → WORKSPACE_UNRESOLVED.
event_wait
Read the event log; optionally long-poll until the first non-empty page or the timeout (clamped to the configured ceiling), without socket/thread.
- SAFE BY DEFAULT: timeout_seconds omitted, 0 or null is an immediate non-blocking snapshot (single scan, no sleep). Blocking is an explicit opt-in: timeout_seconds > 0 parks the caller's turn until an event arrives or the timeout expires (clamped to max_wait_timeout_seconds).
- Request: same as event_get + timeout_seconds: int | None (default 0).
- Data: {events: [...], next_cursor: int, has_more: bool, timed_out: bool}. On timeout: events:[], next_cursor = entry cursor, timed_out: true.
- Errors: same as event_get + VALIDATION_ERROR (non-int timeout_seconds; bool rejected).
Channels & Messages
Seeded channels per workspace: only general (agents create the rest with channel_create). Channels are organizational labels, not access boundaries — they never decide who receives a message (the target does). Reading a message is the Inbox slice's job — message_create only sends; the per-recipient inbox_* tools receive.
message_create
Persist the message, resolve its recipients and fan one inbox delivery per
recipient, and emit message.created — all in the same (atomic) transaction.
- Request: project_root: str (required); from_agent_id: str (required); subject: str (required); body: str (required); channel_id: str | None; from_session_id: str | None (REQUIRED with session_secret in trust_mode=strict); target: dict | None; artifacts: list[str] | None; parent_message_id: str | None; session_secret: str | None (validated whenever supplied; required in strict) (all default None).
- Data: message shape + event_id, recipients: [...], delivered_count, an optional warning (group target matched nobody, or agents were excluded for staleness), an optional excluded_stale: [...] (agents whose every active session has a heartbeat older than presence_ttl_seconds — they will NOT receive this broadcast; reach them with a direct message), and an optional workspace_created: true (present only when the send materialised a brand-new workspaces row; absent in the steady state — additive key, existing clients unaffected). Recipient resolution: direct / capability / role / mixed → the global registry; bare broadcast / no-target → this workspace's present participants (active sessions with a fresh heartbeat; the sender is excluded from group targets).
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing/empty from_agent_id/subject/body; malformed target; direct_with_fallback or a broadcast nested in mixed; artifacts not a list; missing credentials in trust_mode=strict), CONTENT_TOO_LARGE (subject/body > max_inline_bytes), NOT_FOUND (channel, parent message, unknown session_id, or a direct target that is not a registered agent), NOT_OWNER (session_secret mismatch, or session belongs to another agent), INTERNAL_ERROR.
Inbox — message delivery (ADR 0001)
The inbox is global (keyed by agent_id): a direct message reaches you in any workspace. Lanes per recipient: unread → delivered (in-flight, leased) → read (history), plus parked (dead-letter). At-least-once: an unacked pull is redelivered after the lease (OKTO_NEXUS_INBOX_LEASE_TTL_SECONDS, default 300); a delivery claimed 5 times (DEFAULT_MAX_DELIVERY_ATTEMPTS) without an ack is parked instead of redelivered. limit defaults to 50, max 200. inbox_peek / inbox_count / inbox_history / message_status are READ-ONLY (unit_of_work(write=False) — polling never takes the WAL writer lock); only pull / ack / extend write. The three writers are sensitive verbs: in trust_mode=strict they require session_id + session_secret (from session_open); in open mode a supplied secret is still validated (NOT_OWNER on mismatch).
inbox_pull
Atomically claim your claimable deliveries (unread + your own lease-expired
redeliveries) into in-flight and return them with their body; index-free
(no cursor).
- Request: agent_id: str (required); limit: int | None; lease_seconds: int | None (size the lease for long turns; clamped to 10–3600, default = lease TTL); session_id: str | None + session_secret: str | None (trust credentials — required in strict).
- Data: {messages: [{delivery_id, message_id, status, delivered_at, lease_expires_at, read_at, from_agent_id, workspace_id, channel_id, subject, body, target, artifacts, created_at}, …], count}.
- Note: a delivery on its 5th claim is parked (dead-letter) instead of returned; parked rows consume slots of THAT claim's limit, so a pull that parks poison messages may return fewer items — the next pull brings the rest.
- Errors: VALIDATION_ERROR (missing agent_id; non-positive limit; non-int/bool lease_seconds; missing credentials in strict), NOT_FOUND / NOT_OWNER (bad trust credentials).
inbox_ack
Move messages to history (read), freeing the queue. Idempotent by message_id.
- Request: agent_id: str (required); message_ids: str | list[str] (required); session_id: str | None + session_secret: str | None (trust credentials — required in strict).
- Data: {acknowledged: int} (rows transitioned).
- Errors: VALIDATION_ERROR, NOT_FOUND / NOT_OWNER (bad trust credentials).
inbox_extend
Renew the lease on messages you pulled but have not finished handling (long turns), avoiding duplicate redelivery.
- Request: agent_id: str (required); message_ids: str | list[str] (required); extend_seconds: int (required; clamped to 10–3600); session_id + session_secret (trust credentials — required in strict). New lease = now + extend_seconds.
- Data: {extended: int, lease_expires_at: str}.
- All-or-nothing: if ANY id is not currently in-flight for you (never pulled / lease already expired / already acknowledged / parked) the call fails with a per-message reason and nothing is extended.
- Errors: VALIDATION_ERROR.
inbox_peek
READ-ONLY view of pending (unread + in-flight): nothing is leased, moved, or
swept — an in-flight delivery whose lease elapsed is shown as unread (its
effective lane, computed at read time).
- Request: agent_id: str (required); limit: int | None; include_parked: bool (default false — true also surfaces the dead-letter lane).
- Data: {messages: [...], count}.
- Errors: VALIDATION_ERROR.
inbox_count
READ-ONLY lane sizes {unread, in_flight, read}; an elapsed in-flight lease is
counted as unread (effective lane) without sweeping it. Parked deliveries
are excluded — inspect them with inbox_peek(include_parked=true).
- Request: agent_id: str (required).
- Errors: VALIDATION_ERROR.
inbox_history
The read archive, newest-first, keyset-paginated (READ-ONLY): the opaque
cursor pins the page boundary to the last row seen, so acks between pages never
duplicate/hide items.
- Request: agent_id: str (required); cursor: str | None (opaque — pass back next_cursor verbatim; legacy numeric offset cursors are rejected with a prescriptive VALIDATION_ERROR); limit: int | None.
- Data: {messages: [...], next_cursor: str | None, has_more: bool}.
- Errors: VALIDATION_ERROR.
message_status
Sender-side observability (READ-ONLY): track a message you SENT instead of re-sending when a recipient seems silent.
- Request: message_id: str (required — as returned by message_create).
- Data: {message_id, deliveries: [{recipient, status, attempts, read_at}, …], count} where status is the per-recipient effective lane: unread (queued, or lease expired awaiting redelivery), delivered (in flight), read (acknowledged), parked (dead-lettered after exhausting attempts).
- Errors: VALIDATION_ERROR (missing message_id), NOT_FOUND.
channel_create
Create a channel by name — idempotent by name (creating an existing name returns it). Channels are organizational labels, not ACLs.
- Request: project_root: str (required); name: str (required — trimmed, ≤ 64 chars, no control characters, unique per workspace).
- Data: {channel: {channel_id, workspace_id, name, created_at}, created: bool} (created: false when the name already existed).
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (blank/overlong name or control characters).
channel_list
Return the workspace's channels (idempotently seeds the general default first).
- Request: project_root: str (required).
- Data: {channels: [{channel_id, workspace_id, name, created_at}, …]}.
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED.
message_get / message_list / message_wait (migration shims)
Removed in S3 and kept as permanent shims: each always answers
ok:false with code=MIGRATED and a prescriptive message naming the exact
replacement call (details.replacements lists the new tools;
details.removed_in: "S3"). The shims declare no parameters, so every legacy
call shape lands on the guidance instead of failing schema validation.
Replacements: message_get → inbox_pull/inbox_peek/inbox_history (or
event_get for bus traffic); message_list → inbox_peek/inbox_history/
event_get; message_wait → inbox_count polling + inbox_pull (or
event_wait for explicit blocking).
Handoffs
Status: OPEN / CLAIMED / COMPLETED / REJECTED / CANCELLED. Visibilities: {private, eligible, public}. Target strategies: {direct, capability, role, broadcast, mixed, direct_with_fallback}. handoff_list_available limit defaults to 100, max 500. The mutating verbs claim / complete / reject / cancel are trust-gated (session_id + session_secret required in trust_mode=strict).
handoff_create
Create an OPEN handoff (validating target/visibility/limit) and emit
handoff.created. The payload (inline request body / work content) is stored
with the row and returned by handoff_list_available/handoff_claim, so a
worker reads the work without correlating the handoff.created event — pass an
artifact_id reference for large content. A direct target must name a
registered agent (else NOT_FOUND, full rollback) and lands an inbox
notification for it; a pool target matching 0 agents succeeds with an explicit
warning. After creating, poll handoff_get for the outcome.
- Request: project_root: str (required); from_agent_id: str (required); target: Any (required; descriptor with strategy + per-strategy fields); visibility: str (required; one of {private, eligible, public}); payload: Any (any JSON value; a string round-trips byte-for-byte, a non-string is stored/returned as opaque JSON text); session_id: str | None (default None).
- Data: {handoff_id, workspace_id, status:"OPEN", created_at} + for a directed target an optional notified: [agent_id] (the inbox notification landed); for pool targets eligible_count and, when it is 0, an explicit warning (the handoff stays OPEN for later registrants; retract a mistake with handoff_cancel).
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing from_agent_id; missing/malformed target or unknown strategy/missing field; missing/unknown visibility), NOT_FOUND (direct target naming an unregistered agent), CONTENT_TOO_LARGE (payload inline).
handoff_list_available
Expire stale leases, then list OPEN handoffs visible and eligible to the
caller (paginated, with optional long-poll).
- Request: project_root: str (required); agent_id: str (required); cursor: str | None (None→0); limit: int | None (None→100); timeout_seconds: int | None (None→0 = no long-poll; clamped to max_wait_timeout_seconds).
- Data: {handoffs: [{handoff_id, status, target, visibility, from_agent_id, payload, created_at}, …], next_cursor: str | None, has_more: bool, timed_out: bool}. Each entry carries the payload so a worker can triage before claiming.
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing agent_id; non-int/negative cursor; non-int/<=0/bool limit; non-numeric/negative/bool timeout_seconds).
handoff_claim
Atomically claim an OPEN handoff (single winner); expires leases first and gates
on eligibility. Emits handoff.claimed.
- Request: project_root: str (required); handoff_id: str (required); agent_id: str (required); session_id: str | None + session_secret: str | None (trust credentials; required in trust_mode=strict; a supplied secret is validated even in open).
- Data: {handoff_id, workspace_id, status:"CLAIMED", claimed_by, lease_expires_at, payload}.
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing handoff_id/agent_id; missing credentials in strict), NOT_FOUND, WORKSPACE_MISMATCH, NOT_ELIGIBLE_TO_CLAIM, HANDOFF_ALREADY_CLAIMED, NOT_OWNER (bad trust credentials).
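One common way to get single-winner semantics on SQLite is a conditional UPDATE whose WHERE clause checks the current status, followed by a row-count check. The sketch below illustrates that general technique only; the three-column table and the claim helper are simplified assumptions, and it leaves out the lease, eligibility and trust checks the real tool performs.

```python
import sqlite3


def claim(conn: sqlite3.Connection, handoff_id: str, agent_id: str) -> bool:
    """Try to claim an OPEN handoff; return True only for the single winner."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE handoffs SET status = 'CLAIMED', claimed_by = ? "
            "WHERE handoff_id = ? AND status = 'OPEN'",
            (agent_id, handoff_id),
        )
    # Exactly one row changes for the winner; a later caller matches no row.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE handoffs (handoff_id TEXT PRIMARY KEY, status TEXT, claimed_by TEXT)"
)
conn.execute("INSERT INTO handoffs VALUES ('h1', 'OPEN', NULL)")
conn.commit()

first = claim(conn, "h1", "reviewer")
second = claim(conn, "h1", "builder")
owner = conn.execute("SELECT claimed_by FROM handoffs WHERE handoff_id = 'h1'").fetchone()[0]
print(first, second, owner)
```

In this sketch the losing caller would map the False result to HANDOFF_ALREADY_CLAIMED.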
handoff_complete
Owner-only transition CLAIMED → COMPLETED. Emits handoff.completed,
persists result on the row (read it back with handoff_get), and
notifies the creator's inbox with the outcome.
- Request: project_root: str (required); handoff_id: str (required); agent_id: str (required); result: Any (default None; string verbatim, non-string as opaque JSON text); session_id + session_secret (trust credentials; required in strict).
- Data: {handoff_id, status:"COMPLETED"} + optional notified: [creator] (absent when the creator is unregistered or is the completing agent itself).
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing ids; non-serializable result; missing credentials in strict), CONTENT_TOO_LARGE (result inline), NOT_FOUND, WORKSPACE_MISMATCH, INVALID_TRANSITION (status ≠ CLAIMED), NOT_OWNER (agent_id ≠ claimed_by; or bad trust credentials).
handoff_reject
Reject: owner CLAIMED → REJECTED or direct-target OPEN → REJECTED. Emits
handoff.rejected, persists rejected_reason on the row (read it back
with handoff_get), and notifies the creator's inbox.
- Request: project_root: str (required); handoff_id: str (required); agent_id: str (required); reason: str | None (default None); session_id + session_secret (trust credentials; required in strict).
- Data: {handoff_id, status:"REJECTED"} + optional notified: [creator].
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing ids; non-serializable reason; missing credentials in strict), CONTENT_TOO_LARGE (reason inline), NOT_FOUND, WORKSPACE_MISMATCH, INVALID_TRANSITION (terminal/invalid state), NOT_OWNER (non-owner of CLAIMED; non-direct-target of OPEN; or bad trust credentials).
handoff_cancel
Creator-only transition OPEN → CANCELLED — retract a handoff nobody should
take anymore (e.g. created with a pool target that matched zero agents). Emits
handoff.cancelled (the optional reason rides the event payload).
- Request: project_root: str (required); handoff_id: str (required); agent_id: str (required; must be the creator); reason: str | None (default None); session_id + session_secret (trust credentials; required in strict).
- Data: {handoff_id, status:"CANCELLED"}.
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing ids; missing credentials in strict), CONTENT_TOO_LARGE (reason inline), NOT_FOUND, WORKSPACE_MISMATCH, NOT_OWNER (caller is not the creator; or bad trust credentials), INVALID_TRANSITION (CLAIMED: resolved by its claimant, or cancel after the lease expires; or already terminal).
handoff_get
Read ONE handoff by id — the creator's path to a finished handoff's outcome
(status, claimant, payload, result/rejected_reason). Runs opportunistic
lease expiry first, so an elapsed claim reads as OPEN again. READ access: the
creator and the claimant always may; any other agent is gated by the handoff's
visibility (the same predicate as handoff_list_available).
- Request: project_root: str (required); handoff_id: str (required); agent_id: str (required).
- Data: {handoff_id, workspace_id, status, from_agent_id, target, visibility, claimed_by, lease_expires_at, payload, result, rejected_reason, created_at, updated_at}.
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing ids), NOT_FOUND, WORKSPACE_MISMATCH, NOT_OWNER (not creator/claimant and not admitted by visibility).
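The handoff verbs above imply a small state machine. The sketch below collects the transitions described in this section into one lookup; the names ALLOWED, TERMINAL and can_transition are illustrative, not the package's API, and the actor checks (owner, creator, direct target) are deliberately left out.

```python
# (current status, new status) pairs taken from the verb descriptions above.
ALLOWED = {
    ("OPEN", "CLAIMED"),       # handoff_claim
    ("CLAIMED", "COMPLETED"),  # handoff_complete (owner only)
    ("CLAIMED", "REJECTED"),   # handoff_reject by the owner
    ("OPEN", "REJECTED"),      # handoff_reject by the direct target
    ("OPEN", "CANCELLED"),     # handoff_cancel by the creator
    ("CLAIMED", "OPEN"),       # lease expiry: an elapsed claim reads as OPEN again
}
TERMINAL = {"COMPLETED", "REJECTED", "CANCELLED"}


def can_transition(current: str, new: str) -> bool:
    """True when the status change is one of the documented transitions."""
    return (current, new) in ALLOWED


print(can_transition("OPEN", "CLAIMED"))       # allowed
print(can_transition("CLAIMED", "CANCELLED"))  # rejected: cancel only applies to OPEN
```

A disallowed pair corresponds to INVALID_TRANSITION in the error lists above.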
Artifacts
Types: {file, text, json, markdown} (case-insensitive). stored ∈ {inline, path}.
artifact_put
Register a file/text/json/markdown artifact in the resolved workspace and
emit artifact.created in the same transaction.
- Request: project_root: str (required); artifact_type: str (required); name: str | None; path: str | None; content: str | None; metadata: Any (all default None). Requires at least one of path or content.
- Data: {artifact_id, workspace_id, artifact_type, stored, size_bytes, created_at} (+ path when stored="path").
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing/non-whitelisted artifact_type; neither path nor content; non-string content; malformed JSON when type=json; non-serializable metadata), CONTENT_TOO_LARGE (content inline > max_inline_bytes, inclusive), PATH_OUTSIDE_WORKSPACE (path escapes the workspace root).
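The two guards behind CONTENT_TOO_LARGE and PATH_OUTSIDE_WORKSPACE can be sketched as follows. This is an illustration under assumptions, not the package's file store: the helper names are hypothetical, 65536 is the documented default cap, and the containment test uses a plain resolved-path comparison.

```python
from pathlib import Path

MAX_INLINE_BYTES = 65536  # documented default; configurable in the real server


def inline_fits(content: str, max_inline_bytes: int = MAX_INLINE_BYTES) -> bool:
    """True when the UTF-8 size is within the cap (a size equal to the cap passes)."""
    return len(content.encode("utf-8")) <= max_inline_bytes


def is_inside_workspace(root: str, candidate: str) -> bool:
    """True when candidate, resolved against root, stays under the resolved root."""
    root_path = Path(root).resolve()
    target = (root_path / candidate).resolve()
    return target == root_path or root_path in target.parents


print(inline_fits("a" * 65536), inline_fits("a" * 65537))
print(is_inside_workspace("/tmp/ws", "notes.md"))
print(is_inside_workspace("/tmp/ws", "../etc/passwd"))
```

The size is measured in encoded bytes, not characters, so multi-byte text reaches the cap sooner than its length suggests.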
artifact_get
Retrieve an artifact by id within the resolved workspace; a stored=path artifact
returns only path + metadata (never the file's bytes).
- Request: project_root: str (required); artifact_id: str (required).
- Data: {artifact_id, workspace_id, artifact_type, stored, size_bytes, metadata, created_at} + content (if inline) or path (if path).
- Errors: WORKSPACE_REQUIRED, WORKSPACE_UNRESOLVED, VALIDATION_ERROR (missing artifact_id), NOT_FOUND (unknown or other-workspace id; no leak).
shared.md
shared_md_render
Render the workspace's human-readable shared.md (atomic overwrite at
{home}/workspaces/{workspace_id}/shared.md), four deterministic sections.
- Request: workspace_id: str (required; 64-char lowercase hex); limit_events: int (default 50; must be 1..max, max default 1000 via OKTO_NEXUS_MAX_SHARED_MD_EVENTS).
- Data: {path, workspace_id, bytes_written, sections_rendered, limit_events, generated_at}.
- Errors: WORKSPACE_REQUIRED (missing/empty workspace_id), VALIDATION_ERROR (non-hex-64 workspace_id; non-positive-int/bool/out-of-range limit_events), NOT_FOUND (well-formed but nonexistent), DB_ERROR (SQLite read failure), RENDER_ERROR (atomic-write failure).
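An atomic overwrite is commonly done by writing a temporary file in the destination directory and then renaming it over the target. The sketch below shows that general pattern; it is not taken from the package's renderer, and atomic_write_text and the demo path are hypothetical.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> int:
    """Replace path with text in one rename step; return the bytes written."""
    data = text.encode("utf-8")
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a mix
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return len(data)


with tempfile.TemporaryDirectory() as home:
    target = Path(home) / "workspaces" / "demo" / "shared.md"
    atomic_write_text(target, "# first\n")
    written = atomic_write_text(target, "# second\n")
    final_text = target.read_text(encoding="utf-8")
print(written, repr(final_text))
```

A failure anywhere before the rename leaves the previous shared.md untouched, which is the property a RENDER_ERROR caller would rely on.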
Data Model
SQLite is the single source of truth. Workspace invariant: every coordinated
entity carries workspace_id TEXT NOT NULL REFERENCES workspaces(workspace_id) —
except schema_migrations, workspaces (the root itself), and agents (global,
unscoped identities). Timestamps are UTC ISO-8601 TEXT; JSON-ish columns
(payload, capabilities, metadata, artifacts, target) are TEXT.
workspace_id = sha256(realpath(root)).
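The derivation can be sketched in a few lines. The formula comes from the line above; hashing the UTF-8 bytes of the resolved path, with no further normalization, is an assumption of this sketch, and workspace_id_for is a hypothetical helper name.

```python
import hashlib
import os


def workspace_id_for(project_root: str) -> str:
    """sha256 over the real (symlink-resolved, absolute) project path, as lowercase hex."""
    real = os.path.realpath(project_root)
    return hashlib.sha256(real.encode("utf-8")).hexdigest()


relative_id = workspace_id_for(".")
absolute_id = workspace_id_for(os.getcwd())
print(relative_id == absolute_id, len(relative_id))
```

Two spellings of the same real directory hash to the same 64-character id, which is why agents pointed at one checkout share a workspace.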
| Table | PK | workspace_id? | Key columns | FKs / UNIQUE / indexes |
|---|---|---|---|---|
| schema_migrations | version INTEGER | no | applied_at | ledger of applied migrations |
| workspaces | workspace_id TEXT | is the root | display_name, root_realpath, created_at, last_seen_at | none |
| agents | agent_id TEXT | no (global) | role, capabilities, metadata, created_at, last_seen_at (migr. 004) | none |
| sessions | session_id TEXT | yes | agent_id, status, started_at, last_heartbeat_at, closed_at (migr. 002), session_secret (migr. 007) | FK→agents, FK→workspaces; idx (workspace_id,status), (workspace_id,agent_id) |
| events | event_id INTEGER AUTOINCREMENT | yes | stream, type, actor_agent_id, payload, visibility, target, created_at | append-only/immutable; FK→workspaces; idx (workspace_id,event_id), (workspace_id,stream,event_id) |
| channels | channel_id TEXT | yes | name, created_at | FK→workspaces; UNIQUE(workspace_id,name); idx (workspace_id,name) |
| messages | message_id TEXT | yes | from_agent_id, channel_id, from_session_id, target, subject, body, artifacts, parent_message_id, created_at | FK→workspaces, FK→channels, self-FK→messages; idx (workspace_id,created_at), (workspace_id,channel_id,created_at) |
| message_deliveries (migr. 005/006) | delivery_id TEXT | no (global inbox) | message_id, recipient_agent_id, status (unread/delivered/read/parked), delivered_at, lease_expires_at, read_at, attempts (migr. 006), created_at | FK→messages, FK→agents; UNIQUE(message_id,recipient_agent_id); idx (recipient_agent_id,status), (status,lease_expires_at), keyset idx (recipient_agent_id,status,read_at DESC,delivery_id DESC) |
| tasks | task_id TEXT | yes | title, description, status, created_by, created_at | FK→workspaces; idx (workspace_id,status) |
| handoffs | handoff_id TEXT | yes | task_id, from_agent_id, target, visibility, status, claimed_by, lease_expires_at, payload (migr. 003), result + rejected_reason (migr. 008), created_at, updated_at | FK→workspaces, FK→tasks; idx (workspace_id,status), (workspace_id,target,status) |
| artifacts | artifact_id TEXT | yes | artifact_type, name, path, content, size_bytes, content_type, created_at | FK→workspaces; idx (workspace_id,artifact_type) |
Migrations (shipped inside the package at src/okto_nexus/migrations/).
001_core.sql defines the core schema; 002_session_close.sql,
003_handoff_payload.sql, and 004_agent_last_seen.sql
are forward-only ALTER TABLE … ADD COLUMN migrations (adding sessions.closed_at,
handoffs.payload, and agents.last_seen_at); 005_message_deliveries.sql adds
the global message_deliveries inbox table (ADR 0001);
006_inbox_delivery_hardening.sql adds message_deliveries.attempts (backs the
parked dead-letter lane) and the keyset history index;
007_session_secret.sql adds sessions.session_secret (the trust credential,
ADR 0002); 008_handoff_result.sql adds handoffs.result +
handoffs.rejected_reason (terminal-outcome persistence behind handoff_get).
The runner discovers migrations/NNN_*.sql (regex ^(\d+)_.*\.sql$), orders by
numeric version, and applies each pending one in its own BEGIN IMMEDIATE
transaction, recording (version, applied_at) in schema_migrations —
durable per migration: a failure in migration N keeps 1..N-1 committed
(forward-only by design; there is no all-or-nothing bootstrap rollback). The
ledger is re-checked under the write lock, so concurrent bootstraps are safe,
and a ledger version newer than the package knows fails closed. The statement
splitter is line-based (blank and -- lines dropped; a statement boundary is a
line whose content ends in ;) — so ; may not be embedded in literals.
apply() is idempotent (returns [] when already current); any failure →
best-effort ROLLBACK of the failing migration + MIGRATION_ERROR.
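The discovery rule and the line-based splitter described above can be sketched as two small functions. The regex is the one quoted in the text; the function names are hypothetical, and keeping a trailing statement that lacks a terminator is an assumption of this sketch.

```python
import re

MIGRATION_RE = re.compile(r"^(\d+)_.*\.sql$")


def order_migrations(filenames):
    """Return (version, filename) pairs sorted by numeric version; skip other files."""
    found = []
    for name in filenames:
        match = MIGRATION_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return sorted(found)


def split_statements(sql):
    """Drop blank and '--' lines; a line whose content ends in ';' closes a statement."""
    statements, current = [], []
    for raw in sql.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue
        current.append(line)
        if line.endswith(";"):
            statements.append("\n".join(current))
            current = []
    if current:  # assumption: keep a trailing statement without a terminator
        statements.append("\n".join(current))
    return statements


ordered = order_migrations(["002_session_close.sql", "README.md", "001_core.sql", "010_x.sql"])
clean = split_statements("-- add column\nALTER TABLE sessions\n  ADD COLUMN closed_at TEXT;\n\nCREATE INDEX i ON t(a);\n")
broken = split_statements("INSERT INTO t VALUES ('a;\nb');")
print(ordered, clean, broken, sep="\n")
```

The last call shows why a semicolon at the end of a line inside a literal is unsafe: the splitter cuts the INSERT into two fragments.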
Response Envelope & Error Catalog
Every tool returns a canonical envelope; data and error are mutually
exclusive.
Success — ok(data):
{ "ok": true, "data": { "session_id": "ses-123", "status": "active" } }
data=None becomes {} — the data key is always a present mapping.
Failure — err(code, message, details?):
{
"ok": false,
"error": {
"code": "WORKSPACE_REQUIRED",
"message": "No workspace was provided.",
"details": { "field": "workspace_id" }
}
}
details appears only when truthy.
The tool_envelope decorator is the single choke point of the inbound
adapter — it guarantees no exception crosses the boundary:
- a handler returning a dict that already has an ok key passes through unchanged;
- any other mapping is wrapped by ok(...); a non-mapping value becomes ok({"result": value});
- an OktoNexusError becomes its to_error_dict() failure envelope;
- any other exception becomes INTERNAL_ERROR with details.exception = type(exc).__name__.
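The four rules can be sketched as a decorator. This is a minimal illustration of the described behaviour, not the package's envelope.py: the OktoNexusError constructor shape is an assumption, and the four demo handlers are hypothetical.

```python
import functools
from collections.abc import Mapping


def ok(data=None):
    """Success envelope; data=None becomes an empty mapping."""
    return {"ok": True, "data": dict(data) if data is not None else {}}


def err(code, message, details=None):
    """Failure envelope; details is included only when truthy."""
    error = {"code": code, "message": message}
    if details:
        error["details"] = details
    return {"ok": False, "error": error}


class OktoNexusError(Exception):
    """Stand-in for the package's error type (constructor shape is assumed)."""

    def __init__(self, code, message, details=None):
        super().__init__(message)
        self.code, self.message, self.details = code, message, details

    def to_error_dict(self):
        return err(self.code, self.message, self.details)


def tool_envelope(handler):
    """Wrap a handler so that every outcome leaves as a canonical envelope."""

    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            result = handler(*args, **kwargs)
        except OktoNexusError as exc:
            return exc.to_error_dict()
        except Exception as exc:  # nothing may cross the boundary
            return err(
                "INTERNAL_ERROR",
                "An unexpected internal error occurred.",
                {"exception": type(exc).__name__},
            )
        if isinstance(result, dict) and "ok" in result:
            return result  # already an envelope
        if isinstance(result, Mapping):
            return ok(result)
        return ok({"result": result})

    return wrapper


@tool_envelope
def open_session():
    return {"session_id": "ses-123", "status": "active"}


@tool_envelope
def count():
    return 42


@tool_envelope
def missing():
    raise OktoNexusError("NOT_FOUND", "No such handoff.")


@tool_envelope
def broken():
    raise KeyError("boom")


for tool in (open_session, count, missing, broken):
    print(tool())
```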
Closed catalog of 18 error codes (errors.py, a frozen, normative
frozenset; each serializes as its own string value):
| Code | Meaning |
|---|---|
| WORKSPACE_REQUIRED | Operation needs a workspace and none was provided. |
| WORKSPACE_UNRESOLVED | Workspace could not be resolved from context (bad/broken path). |
| WORKSPACE_MISMATCH | Entity belongs to a different workspace than requested. |
| VALIDATION_ERROR | Invalid input / payload validation failure. |
| NOT_FOUND | Target entity (session, task, handoff, …) does not exist. |
| NOT_OWNER | Actor is not the owner of the entity it tried to mutate. |
| INVALID_TRANSITION | State transition not allowed by the state machine. |
| INVALID_STREAM | Invalid/unknown event stream. |
| HANDOFF_ALREADY_CLAIMED | Handoff already claimed (lease still valid). |
| NOT_ELIGIBLE_TO_CLAIM | Caller is not eligible to claim the handoff. |
| CONTENT_TOO_LARGE | Inline content exceeds max_inline_bytes. |
| PATH_OUTSIDE_WORKSPACE | Resolved path escapes the workspace root. |
| CONFIG_ERROR | Invalid configuration (flag/env/value) at bootstrap. |
| MIGRATION_ERROR | Failure locating/applying migrations. |
| DB_ERROR | Failure opening/configuring the SQLite connection. |
| RENDER_ERROR | Failure rendering an output/representation. |
| MIGRATED | Tool was removed/renamed; the shim's message points at the replacement tool. |
| INTERNAL_ERROR | Catch-all for unexpected (unmapped) failures. |
to_envelope_error(exc) normalizes any exception: OktoNexusError verbatim,
everything else → INTERNAL_ERROR with "An unexpected internal error occurred.".
Transient failures carry details.retryable: true. SQLite lock/busy
contention (another process briefly holds the write lock) surfaces as
DB_ERROR with details.retryable = true and a message telling the caller to
simply retry the same call; the flag is absent on non-transient failures, so
agents can branch on it without parsing messages.
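A client-side retry that branches only on the flag could look like the sketch below. The helper name, the attempt count and the exponential delays are assumptions chosen to stay within the short backoff the README recommends; call stands for any function that performs one tool call and returns its envelope.

```python
import time


def call_with_retry(call, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Invoke call() and retry only envelopes flagged details.retryable == True."""
    envelope = call()
    for attempt in range(1, attempts):
        details = envelope.get("error", {}).get("details", {})
        if envelope.get("ok") or details.get("retryable") is not True:
            break  # success, or a failure that must not be retried blindly
        sleep(base_delay * (2 ** (attempt - 1)))  # 0.5 s, 1 s, 2 s
        envelope = call()
    return envelope


# Simulated tool: one transient DB_ERROR, then success.
responses = iter([
    {"ok": False, "error": {"code": "DB_ERROR", "message": "busy", "details": {"retryable": True}}},
    {"ok": True, "data": {"count": 0}},
])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Injecting sleep keeps the helper testable without real waiting.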
Operations
Operator-facing concerns of a long-running bus: the CLI subcommands, retention,
trust, and the presence/staleness windows. (All knobs are in
Configuration; design rationale in
docs/design/0002-design-review-hardening.md.)
CLI subcommands
The okto-nexus entry point dispatches on its first argument:
| Invocation | What it does |
|---|---|
| okto-nexus [flags] | Run the MCP stdio server (the default). |
| okto-nexus tail … | Follow the workspace event log as NDJSON (the background-monitor pattern; see Monitoring patterns). |
| okto-nexus admin prune … | Enforce the retention windows and print a JSON report. |
Unrecognised flags on tail/admin (e.g. --home, --db-path) are forwarded
verbatim to the same fail-closed bootstrap the server uses, so config
precedence (CLI > env > default) is identical everywhere.
Retention (admin prune / auto_prune_on_start)
The store is append-mostly by design — without pruning, nexus.db grows
without bound. RetentionService.prune deletes rows older than their window
under hard safety invariants:
- Only terminal lanes are eligible: events past retention_events_keep_days (default 30 d), read deliveries past retention_read_deliveries_keep_days (14 d), closed sessions past retention_closed_sessions_keep_days (7 d).
- Never deleted, regardless of age: unread/delivered/parked deliveries, active/stale sessions, and ALL handoffs, tasks, messages, channels, agents, workspaces and artifacts.
- Bounded batches (500 rows each, one write transaction per batch) keep the WAL writer lock brief; concurrent agents keep flowing.
# Count first (deletes nothing, skips --vacuum):
okto-nexus admin prune --project-root <path> --dry-run
# Enforce the configured windows and compact the file:
okto-nexus admin prune --project-root <path> --vacuum
# Override a window for this run only:
okto-nexus admin prune --project-root <path> --events-keep-days 7
Scope is the whole store — every workspace shares one nexus.db and the
inbox is global; --project-root only anchors/validates the call. Deletes
alone leave free pages inside the file; only --vacuum shrinks it on disk.
auto_prune_on_start=true additionally runs one bounded, incremental pass at
server startup (at most 4 batches per table; best-effort — a failure is
reported to stderr and startup proceeds; backlog converges across restarts or
via a full admin prune).
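Bounded-batch deletion can be sketched with the standard sqlite3 module. The two-column events table, the prune_in_batches helper and the sample timestamps are assumptions for illustration; the real RetentionService also covers deliveries and sessions and enforces the safety invariants listed above.

```python
import sqlite3


def prune_in_batches(conn, cutoff_iso, batch_size=500, max_batches=None):
    """Delete events older than cutoff_iso in bounded batches; return rows deleted."""
    deleted, batches = 0, 0
    while max_batches is None or batches < max_batches:
        with conn:  # one short write transaction per batch
            cur = conn.execute(
                "DELETE FROM events WHERE event_id IN ("
                " SELECT event_id FROM events WHERE created_at < ?"
                " ORDER BY event_id LIMIT ?)",
                (cutoff_iso, batch_size),
            )
        if cur.rowcount == 0:
            break  # nothing left past the window
        deleted += cur.rowcount
        batches += 1
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER PRIMARY KEY AUTOINCREMENT, created_at TEXT NOT NULL)"
)
old = [("2025-01-01T00:00:00Z",)] * 1200
recent = [("2026-01-01T00:00:00Z",)] * 3
conn.executemany("INSERT INTO events (created_at) VALUES (?)", old + recent)
conn.commit()

cutoff = "2025-12-01T00:00:00Z"
first_pass = prune_in_batches(conn, cutoff, max_batches=2)  # bounded, startup-style pass
second_pass = prune_in_batches(conn, cutoff)                # run to completion
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(first_pass, second_pass, remaining)
```

UTC ISO-8601 strings in one fixed format sort lexicographically, so a plain text comparison against the cutoff is enough here.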
Trust mode
trust_mode=open (default) keeps the cooperative posture: the sensitive verbs
(message_create, handoff_claim/complete/reject/cancel,
inbox_pull/ack/extend) accept optional credentials, but a supplied
session_secret is always validated; a wrong credential is never ignored.
trust_mode=strict requires session_id + session_secret (returned once by
session_open) on every sensitive verb. This is cooperative
authentication: it stops accidental/sloppy impersonation between local
agents, not a hostile process that can read nexus.db directly. Sessions
opened before the trust upgrade have no stored secret — close and reopen them.
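The open/strict contract reduces to a small decision function. In the sketch below the error codes follow the tool reference (missing credentials in strict give VALIDATION_ERROR, bad credentials give NOT_OWNER); the helper name, the constant-time comparison and the treatment of a session with no stored secret as a mismatch are assumptions.

```python
import hmac


def check_secret(trust_mode, stored_secret, supplied_secret=None):
    """Return None when the call may proceed, else the error code to report."""
    if supplied_secret is None:
        # open mode tolerates missing credentials; strict mode does not
        return "VALIDATION_ERROR" if trust_mode == "strict" else None
    if stored_secret is None:
        return "NOT_OWNER"  # assumption: a pre-upgrade session can never match
    matches = hmac.compare_digest(stored_secret.encode(), supplied_secret.encode())
    return None if matches else "NOT_OWNER"


print(check_secret("open", "s3cret"))             # no credentials, open mode
print(check_secret("open", "s3cret", "wrong"))    # a supplied secret is always checked
print(check_secret("strict", "s3cret"))           # strict mode requires credentials
print(check_secret("strict", "s3cret", "s3cret")) # matching secret
```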
Presence & staleness windows
Three read-time windows govern session liveness (no background threads):
| Window | Default | Effect when exceeded |
|---|---|---|
| session_stale_ttl_seconds | 60 s | Session reports stale (row stays active). |
| presence_ttl_seconds | 30 min | Session leaves the broadcast audience (sender sees excluded_stale). |
| session_reap_seconds | 24 h | Session is opportunistically closed by the next session_open/session_heartbeat in the workspace. |
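Because the windows are evaluated at read time, liveness is a pure function of the last heartbeat and the current clock. The sketch below uses the default values from the table; the function name, the returned keys and the strict-versus-inclusive boundary handling are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_TTL = timedelta(seconds=60)     # session_stale_ttl_seconds default
PRESENCE_TTL = timedelta(minutes=30)  # presence_ttl_seconds default
REAP_AFTER = timedelta(hours=24)      # session_reap_seconds default


def liveness(last_heartbeat_at: datetime, now: datetime) -> dict:
    """Derive the three read-time flags from the heartbeat age."""
    age = now - last_heartbeat_at
    return {
        "stale": age > STALE_TTL,
        "in_broadcast_audience": age <= PRESENCE_TTL,
        "reapable": age > REAP_AFTER,
    }


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
for label, age in [("10 s", timedelta(seconds=10)), ("5 min", timedelta(minutes=5)),
                   ("2 h", timedelta(hours=2)), ("25 h", timedelta(hours=25))]:
    print(label, liveness(now - age, now))
```

A session five minutes past its last heartbeat is stale yet still receives broadcasts, which is the gap between the first two windows.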
Troubleshooting
Symptoms an agent (or its operator) actually meets, and the prescriptive fix:
- DB_ERROR with details.retryable: true: transient WAL contention (another process briefly held the write lock). Retry the same call after a short backoff (~0.5–2 s). The flag is absent on real failures; never retry those blindly. The tail follower retries transients itself with bounded backoff and never advances the cursor across a failed poll.
- MIGRATED error: you called a tool removed in S3 (message_get, message_list, message_wait). The error's details.replacements names the exact replacement calls (inbox_*, event_get/event_wait); switch to them and do not retry the old call.
- Behaviour disagrees with your cached tool schemas: call nexus_info and compare surface_revision with what you cached; it increments on every change to tool names/parameters/defaults/semantics. (MCP hosts typically don't re-discover tools mid-session; reconnect to refresh.)
- A recipient seems silent: don't re-send. message_status(message_id) shows the per-recipient lane: unread (not pulled yet, or lease expired and awaiting redelivery), delivered (in flight), read (acked), parked (dead-lettered after 5 unacked claims; inspect via inbox_peek(include_parked=true) on the recipient side).
- Your broadcast reached fewer agents than expected: check the response's excluded_stale: those agents' sessions are heartbeat-stale (older than presence_ttl_seconds). Reach them with a direct message (delivered regardless of presence) and remind them to session_heartbeat.
- You keep receiving the same message: you pulled it but never acknowledged it with inbox_ack before the lease elapsed (at-least-once redelivery). Ack what you finish; for long turns size the lease (inbox_pull(lease_seconds=…)) or renew it (inbox_extend).
- VALIDATION_ERROR / NOT_OWNER on a sensitive verb: the server runs trust_mode=strict (or you supplied a wrong secret in open mode). Pass the session_id + session_secret returned by your own session_open; a pre-upgrade session has no secret, so open a new one.
- WORKSPACE_MISMATCH: the entity exists but belongs to another workspace. Check the project_root you are passing; workspace identity is sha256(realpath(project_root)), so two different paths to the same real directory are the same workspace, and different directories never share entities.
Example Flow
A realistic two-agent coordination, all over the canonical
{"ok": true, "data": …} envelope. This mirrors the live MCP client in
scripts/live_client.py, which spawns the real
server over the real stdio transport (a fresh OKTO_NEXUS_HOME temp dir) and
drives the full flow as any third-party MCP host would. The end-to-end smoke test
tests/test_e2e_smoke.py exercises the same path in-process.
1. workspace_resolve(project_root) → deterministic workspace_id (+ upserts the workspaces row).
2. agent_register(agent_id="builder", role="builder", capabilities=["py"]) and a second reviewer agent.
3. session_open(agent_id="builder", workspace_id=…) → emits session.opened and returns a session_secret (keep it: it is required on the sensitive verbs in trust_mode=strict); session_heartbeat(session_id=…) keeps it active and present (in the broadcast audience).
4. message_create(project_root, from_agent_id="builder", subject, body, target={"strategy":"direct","agent_id":"reviewer"}) → persists the message, fans an inbox delivery to reviewer, and emits message.created in the same transaction (the response carries event_id, recipients, delivered_count). The reviewer receives it with inbox_pull(agent_id="reviewer") → inbox_ack(...) (no cursor, any workspace). List channels with channel_list (only general is seeded; create others with channel_create).
5. event_wait(project_root, agent_id="reviewer", stream="workspace", cursor=0) → the reviewer observes message.created (cursor-paginated, visibility-filtered, long-poll bounded by max_wait_timeout_seconds).
6. handoff_create(project_root, from_agent_id="builder", target={"strategy":"direct","agent_id":"reviewer"}, visibility="public") → an OPEN handoff + handoff.created on the handoff stream.
7. handoff_list_available(project_root, agent_id="reviewer") → handoff_claim(handoff_id, agent_id="reviewer") (atomic single-winner, lease TTL applied) → handoff_complete(handoff_id, agent_id="reviewer") → handoff.claimed then handoff.completed.
8. artifact_put(project_root, artifact_type="text", content=…) (inline, <= 64 KB) and artifact_put(…, artifact_type="markdown", path="notes.md") (a workspace-contained path reference) → each emits artifact.created; artifact_get(artifact_id) round-trips the inline content + metadata.
9. shared_md_render(workspace_id) → atomically (over)writes the derived, four-section view at {home}/workspaces/{workspace_id}/shared.md.
Throughout, the append-only events table accumulates session.opened,
message.created, handoff.created, handoff.claimed, handoff.completed, and
artifact.created with global, gapless, monotonic event_ids.
Run the live client with the venv interpreter:
.\.venv\Scripts\python.exe scripts\live_client.py
Testing
The suite has 650+ tests (pytest, testpaths=["tests"], pythonpath=["src"]).
.\.venv\Scripts\python.exe -m pytest -q
.venv/bin/python -m pytest -q
Notable coverage:
- tests/test_import_boundary.py: AST-parses every module under domain/ and application/ and fails if any imports the sqlite3 or mcp root, keeping the hexagonal layering intact.
- tests/test_e2e_smoke.py: an in-process end-to-end smoke of the full coordination flow.
- tests/test_ports_contract.py: runs the same contract suite against the SQLite EventRepo and the unit-test fake, so fake-drift (the M2 root cause) fails the build.
- tests/test_connection.py / tests/test_migrations.py: the BEGIN IMMEDIATE concurrency contract, retryable DB_ERROR, and the per-migration transaction/ledger discipline.
- tests/test_tools_surface.py / tests/test_tool_schemas.py: the safe-by-default MCP surface (snapshot defaults, MIGRATED shims, nexus_info, one error grammar) asserted at behaviour and schema level.
- tests/test_presence_trust.py: the single presence predicate, excluded_stale, session reap, and the open/strict trust contract.
- tests/test_retention.py: terminal-lane-only pruning invariants, bounded batches, dry-run, and the admin prune CLI.
- scripts/live_client.py: a real out-of-process MCP stdio client (not a unit test) that initializes, lists tools, and drives the whole flow against a freshly migrated temp store; a successful list_tools already proves the store was migrated before any tool became callable.
- Per-slice unit tests: test_identity, test_events, test_messages, test_message_delivery, test_inbox_service, test_handoff, test_artifacts, test_routing, test_targets, test_shared_md, test_tail, test_config.
Project Layout
okto_labs_okto_nexus/
├─ pyproject.toml # package metadata, deps, console script, pytest config
├─ docs/design/
│ ├─ 0001-message-inbox-delivery.md # ADR: global inbox, lease/ack at-least-once
│ └─ 0002-design-review-hardening.md # ADR: this hardening (M1–M11)
├─ scripts/
│ └─ live_client.py # real MCP stdio client exercising the full flow
├─ src/okto_nexus/
│ ├─ config.py # NexusConfig + load_config (CLI > env > default, fail-closed)
│ ├─ errors.py # ErrorCode (18), OktoNexusError, retryable DB_ERROR helpers
│ ├─ envelope.py # ok()/err() + @tool_envelope + require_json_object_param
│ ├─ migrations/ # NNN_*.sql shipped inside the package (001..008)
│ ├─ domain/ # pure, stdlib-only
│ │ ├─ ids.py · routing.py · targets.py (single target grammar) · events.py
│ │ ├─ messages.py · inbox.py · handoff.py · artifacts.py · models.py · base.py
│ ├─ application/ # ports (Protocols) + use-case services
│ │ ├─ ports.py # incl. the Waiter port (blocking seam)
│ │ ├─ identity.py · events.py · messages.py · inbox.py
│ │ ├─ handoff.py · artifacts.py · shared_md.py · retention.py
│ └─ adapters/
│ ├─ inbound/
│ │ ├─ mcp/
│ │ │ ├─ server.py # FastMCP stdio server + fail-closed bootstrap + nexus_info
│ │ │ └─ tools/ # auto-discovered register(server, deps) modules
│ │ │ ├─ identity.py · events.py · messages.py · inbox.py
│ │ │ ├─ handoff.py · artifacts.py · shared_md.py
│ │ └─ cli/
│ │ ├─ tail.py # NDJSON event-log follower (okto-nexus tail)
│ │ └─ admin.py # operator maintenance (okto-nexus admin prune)
│ └─ outbound/
│ ├─ sqlite/ # connection (IMMEDIATE UoW), migrations runner, *_repo
│ ├─ waiter.py # SleepPollWaiter (data_version-gated long-poll)
│ ├─ file/store.py # workspace-contained file store (path safety)
│ ├─ sharedmd/renderer.py # atomic shared.md writer
│ └─ clock.py # SystemClock
└─ tests/ # 650+ tests incl. import boundary, contracts, e2e smoke
Limitations (Non-Goals)
- Single machine: coordination is bounded by one SQLite file on one host. The HTTP hub can be bound to the LAN (--host), but there is no cloud sync, clustering or multi-host replication.
- Cooperative trust between agents: per-agent API keys gate every MCP/HTTP surface, and trust_mode=strict additionally requires a per-session session_secret on the sensitive verbs, but session secrets are stored in plaintext in nexus.db, so trust is still bounded by who can read the file. No multi-tenant security, no RBAC (every valid key has the same powers; roles and scopes are part of the SaaS evolution).
- No background threads: session staleness, presence, handoff/inbox lease expiry, session reaping, and (opt-in) startup pruning are all derived/opportunistic at access time; there is no scheduler. In serve mode the only periodic work is the lock heartbeat and the SSE poll (1 s, bounded). Retention defaults to off; run prune from the dashboard Settings screen or okto-nexus admin prune (or enable auto_prune_on_start).
- Not vendor-specific: any MCP host that speaks streamable HTTP or launches stdio servers works; there is no integration tied to a particular vendor.
- Inline content cap: 65536 UTF-8 bytes (inclusive, configurable); larger payloads must be stored by path as artifact_type='file'.
Roadmap
Delivered since V1: streamable-HTTP transport with per-agent API keys, the web dashboard (live graph, kanban, chat timelines), REST observability, the SSE stream with exact resume, runtime settings with UI parity, and store reset/maintenance from the dashboard.
Next (non-binding):
- Push waiter: swap the SleepPollWaiter for an in-process notify waiter at the existing Waiter port, giving sub-poll latency for event_wait and SSE.
- Audit events for admin actions: close-session / cancel-handoff / reset performed from the dashboard should land in the event log.
- Richer streams & consumers: more event categories, finer filtering.
- Tasks surface: the tasks table exists in the schema; expose task create/transition tools and wire them into handoffs.
- Reserved handoff states: activate IN_PROGRESS/BLOCKED/EXPIRED with explicit producers.
- Optional background reaper: proactive lease expiry and continuous retention as an opt-in daemon.
- SaaS evolution: multi-tenant auth (RBAC/scopes), hosted storage and transport behind the existing hexagonal ports (AuthProvider, ObservabilityQueries, storage, Waiter).
License
Elastic License 2.0 + SaaS/Branding Addendum + Trademark Policy —
Copyright 2026 Okto Labs. See the LICENSE file for the full text (also
served by the running hub at GET /api/v1/license and shown in the
dashboard's About dialog). In short: free to use, copy, modify and
redistribute, but you may not offer Okto Nexus as a hosted/managed service or
remove licensing/branding notices.