An identity-native multi-agent system that runs the full software development lifecycle.

Wonderland

An identity-native multi-agent system that runs the full software development lifecycle — discovery → planning → design → implementation → verification — as a cast of characters who carry the project across sessions.

Generic AI agents perform roles. Identity-native agents inhabit them.

▶ See it in action

Mvp-demo2 — first end-to-end Tier 2 autonomous pilot. A full-stack markdown notebook produced autonomously: discovery → milestone-plan → 3 design+implement+verify cycles. $83.78 total spend, 5000 lines of code (~3400 application + ~1600 test), 61 passing tests, SQL-injection-safe handlers, custom React hooks, modular components. The operator was a gate-approver — pressed buttons at milestone-close gates, skipped a few duplicate features, watched the team build. Zero killed runs, zero hand-edited artifacts, one mid-pilot substrate fix.

If you're trying to figure out what this project actually is, start there. (For the earlier single-directive demo of one-shot code generation, see the Geocities pilot.)

Wonderland is a cast of agents — each named after an Alice-in-Wonderland character — that collaborate across the entire software production pipeline. The Cheshire Cat is the architect. The White Rabbit is the project manager. The Mad Hatter is QA. Alice inhabits personas to write stories the team designs against. The Tweedles ship code; the Caterpillar reviews it. Every character has a stable self-model (a "constitution"), persistent per-agent memory, and a working relationship with the others.

The earlier framing was "multi-agent development system" — a fair description when the entry surface was directive in → features out. The project has since grown into something closer to an end-to-end emulation of how a small software team actually produces shipping product: a discovery phase that interviews the operator about personas and scope, a planning phase that organizes the captured requirements into sequenced milestones, then per-milestone design and implementation passes that close coverage loops the substrate verifies. The Wonderland-flavored part is unchanged — identity still does the work — but the surface the operator interacts with is now the flow, not the meeting.

The architectural claim is that identity does real work. An agent with a constitution it inhabits across many threads behaves differently from an agent reconstructed from a system prompt each turn. It accumulates judgment. It develops calibrated views of its colleagues. It refuses to cross domain boundaries because the boundary is part of who it is, not a policy applied from outside. Whether that compounds into measurably better outcomes than a generic-agents baseline is what the eval harness in P7 will measure; until then, the analyses/ directory tracks the qualitative observations as the system gets built out.

Five corollaries follow from this, each anchored in field analyses:

  1. Identity-based architecture lets smaller models outperform their expected capabilities. The constitution carries the discipline a generic agent would have to invent turn-by-turn, so a small model acting in character can hold its own against a large model with a generic prompt. Default target is Claude Haiku 4.5; the falsifier is P7's generic-baseline-vs-identity-native eval. (analysis 004)
  2. Failure modes are part of identity. Each constitution's §VIII names the specific shadow each virtue decays into — the Sephirah/Qlipha pairing, where the corruption is structural rather than additive. Agents course-correct from inside instead of waiting for an external guardrail. (analysis 004)
  3. Character-shaped agents degrade visibly rather than silently when parts of the system fail. When the bus dropped feature-composition artifacts mid-run, the Tweedles noticed the contradiction and reached for the disk-resident artifacts via their tools — recovery was emergent, not designed. (analysis 027)
  4. The team produces a small-team shape, including things the directive never asked for — ADRs with named tradeoffs, persona-grounded specs, accessibility coverage that wasn't requested. Production shape as a derived property of constitutional grounding. (analyses 034, 035)
  5. Friction is the substrate, not the inefficiency. Every meeting in the workflow is engineered friction with a specific shape; §VIII puts that friction inside each constitution. Generic stacks have nothing analogous because they have roles, not characters; goals, not voices; consensus, not constitutions.

Full argument with the per-corollary mechanics in THESIS.md.

The framing the project is building around: failures are how software gets built. The iterative cycle of ship-then-discover-then-fix depends on recognizing what went wrong; agents whose failure modes are part of their identity can participate in that cycle as colleagues, not as tools that need supervising out of their bad habits.

How it works end-to-end

The operator's flow through Wonderland mirrors the four-phase arc of an actual software project, with each phase grounded in the previous one's artifacts:

  1. Discovery: Alice, Cheshire Cat, and White Rabbit each run a short interview (personas, technical constraints, scope + success criteria). The substrate writes the answers to disk as structured requirement artifacts. The whole loop is ~12 minutes of operator attention; every later workflow seeds from this corpus rather than re-asking what the project is for.
  2. Planning: the milestone-plan workflow organizes the captured requirements into 3-7 ordered milestones, each declaring consumes_requirements + done_when criteria. A substrate-level coverage check runs at end of each rotation: any decomposable requirement not assigned to a milestone fires a synthetic Dodo observation nudging the team to revise.
  3. Design (per milestone): tdd-design --milestone <slug> composes stories from the milestone's requirements (M1), turns them into features (M2), decomposes features into tickets (M3), negotiates architecture (M4) and per-feature contracts (M5). M2 runs its own coverage check verifying every milestone requirement is realized by a feature before the meeting closes.
  4. Implementation (per ticket): tdd-implement opens per-ticket meetings where the Tweedles write red tests, ship code that turns them green, and the Caterpillar reviews the diff. The working tree IS the implementation artifact; review is against git diff, not a parallel metadata utterance.
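
The planning-phase coverage check in step 2 boils down to a set difference between captured requirements and the requirements milestones claim. A minimal sketch, assuming hypothetical `Requirement` and `Milestone` records; the class names, the `decomposable` flag, and the ids are illustrative and not Wonderland's actual schema (only `consumes_requirements` is named in the text above):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Requirement:
    # Hypothetical record; the real artifact shape lives under .wonderland/.
    req_id: str
    decomposable: bool = True


@dataclass
class Milestone:
    slug: str
    consumes_requirements: list[str] = field(default_factory=list)


def uncovered_requirements(requirements, milestones):
    """Return ids of decomposable requirements that no milestone consumes."""
    assigned = {rid for m in milestones for rid in m.consumes_requirements}
    return [
        r.req_id
        for r in requirements
        if r.decomposable and r.req_id not in assigned
    ]


reqs = [Requirement("R1"), Requirement("R2"), Requirement("R3", decomposable=False)]
plan = [Milestone("m1-notes", ["R1"])]
missing = uncovered_requirements(reqs, plan)
print(missing)  # ['R2']
```

In this sketch a non-empty result is the condition under which the substrate would nudge the team to revise the plan; how the real check emits its observation is not shown here.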

Cross-cutting through all four phases: agents can retract artifacts they shipped earlier when they realize they drifted off-scope. A workflow-level kill-list blocks speech_acts from leaking between phases (e.g., a stray milestone_plan utterance during tdd-design is silently stripped). The project dashboard derives the current phase from disk and surfaces the next-recommended workflow as a one-line CTA.
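
The kill-list behaviour can be pictured as a per-workflow filter over utterances. This is a sketch under stated assumptions: the `KILL_LIST` mapping, the dict-shaped utterances, and the `strip_leaked` helper are hypothetical; only the `milestone_plan`-during-`tdd-design` example comes from the text above.

```python
# Hypothetical kill-list: speech acts each workflow refuses to carry.
KILL_LIST = {
    "tdd-design": {"milestone_plan"},
}


def strip_leaked(workflow, utterances):
    """Drop utterances whose speech_act is on the workflow's kill-list."""
    blocked = KILL_LIST.get(workflow, set())
    return [u for u in utterances if u["speech_act"] not in blocked]


bus = [
    {"speaker": "white_rabbit", "speech_act": "milestone_plan"},  # stray, off-phase
    {"speaker": "alice", "speech_act": "story"},
]
kept = strip_leaked("tdd-design", bus)
print([u["speech_act"] for u in kept])  # ['story']
```

A workflow with no kill-list entry passes everything through unchanged, which keeps the filter opt-in per workflow.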

See projects/discovery2/.wonderland (or any of the discovery* projects) for the actual artifact shapes the team produces.

Status

In-progress, building in public. The vertical slice through discovery → planning → design lands in 0.6.0; implementation through a full milestone is the next demo target. P7 (evals) will close out the phase numbering once the lifecycle stabilizes.

  • P1 — Foundation (overview) Schema, Caucus event bus, episodic memory, agent base class.
  • P2 — First Voice (analysis 001) The Cheshire Cat comes online and produces ADRs in his own voice.
  • P3 — First Tension (analysis 002) White Rabbit joins the bus. Semantic + relational memory layers. Compaction-as-agent-behavior — agents reflect between threads.
  • P4 — First Race (analysis 003 · analysis 004) Alice agent. Dodo orchestrator with quiescence detection. Conflict resolution, composition, and human-in-the-loop escalation. Showcase 1: a /health endpoint directive runs end-to-end against live Haiku 4.5, reaching settlement because the team goes quiet — no human in the loop.
  • P5 — Full Cast The remaining six characters land: Mad Hatter (test scenarios), Caterpillar (code review), Queen of Hearts (security/compliance rulings), Dormouse (production observations), and the Tweedles (frontend + backend implementation, with a shared pair protocol). wonderland init CLI ships the .wonderland/ skeleton. Synthetic-consensus guard observes the bus for the §11 anti-pattern — three or more distinct constitutional domains converging on the same position is suspect, and the guard surfaces it for review.
  • P6 — Real Threads The hard showcases plus the substrate maturation that made them work. Showcase 2: translation chat MVP (analyses 015-018) — 1580 lines shipped from a vague directive for $0.93. Showcase 3: security recovery (analysis 019) — reactive response to a synthesized credential-stuffing incident; the framework's first non-greenfield run. Showcase 4: multi-session persistence (analyses 020-023) — memory compounding across sessions, plus the substrate fixes that closed the "Tweedles don't ship" bug class (turn-based quiescence replacing wall-clock, parse-retry recovering malformed LLM responses, working-tree-as-implementation-artifact). Workflow-as-data extraction (analysis 024): meeting chains live in closet/workflows/ as YAML; the TDD variant becomes the recommended default for feature work, with canonical retained for fast iteration when directive drift is unlikely. The pair protocol pays off in analysis 025: Alice and the Mad Hatter are paired in M4 (the tea party — they're already paired in the source material) so Alice writes user-journey scenarios and Hatter writes failure-mode scenarios, each producing both the artifact and a runnable pytest file. The Geocities showcase ships 1841 lines from a vague directive in 7m 38s for $2.05, with Caterpillar catching three production-blocking bugs in the diff — the framework's bug-discovery surface working as designed. See SHOWCASE.md for the friend-facing tour and analyses/ for the full build log of the iteration that got the framework here.
  • P8.1 — Observer API & P8.2 — TUI Inspector (0.0.1 release) HistoricalRunHandle reads any snapshot directory; the Textual TUI ships as a read-only run inspector with lazygit-style multi-pane drill-down (snapshot library → run summary → meeting detail → utterance modal → artifact browser, plus Cast view + theme cycling).
  • P8.3 — Streaming + Mock Turtle RunHandle.stream_events() async-iterator interface + MockTurtleHandle that replays a snapshot at compressed clock time. The testbed P8.4's live-watch screen iterates against without API spend.
  • P8.4 — Live-watch screen The streaming surface goes visual: meetings ribbon fills in as MeetingStarted events arrive, transcript scrolls live, body preview tracks the utterance cursor, artifacts pane filters per-meeting. All three panes focusable, Tab cycles, lazygit principle throughout. Iterated entirely against Mock Turtle replay.
  • P8.5 — Directive issuing + LiveRunHandle (this release) NewRunScreen is the directive composer (preset picker with bundled pomodoro / hello-endpoint / translation-chat / geocities / ping directives, plus per-project saves to .wonderland/directives/); selection drives composer + workflow pre-fill; Enter steps through the form like a paper form. LaunchConfirmationScreen guards the irreversible spend with directive preview + soft-cap budget. LiveRunHandle wraps a real Runner+Caucus and emits through the same streaming protocol Mock Turtle uses, so the live-watch screen consumes a real run interchangeably with a replay. Settings screen accepts API key + model from inside the TUI — fresh pip install users no longer drop to the shell to write a config file. First end-to-end TUI run shipped a story for $0.0119 against the smoke workflow.
  • P8.6 — New-project spinup Skeleton picker + stack-detection routine for adopting existing non-Wonderland projects. Closes the on-ramp from "I have a project idea" to "team's ready to start" without leaving the TUI.
  • P8.7 — First-use polish Welcome screen, error states, abort flow, quiescence indicator.
  • P11 — Projects: first-class project model + dashboard Project registered in a JSON registry at ~/.wonderland/projects.json; per-project dashboard becomes the operator's primary attention surface. project.yaml carries stack-as-runtime-fact the team consults at every meeting.
  • P12 — Feature lifecycle + workflow composability Split design / implement / verify into separate workflow atoms. Per-feature lifecycle states (proposed → in_design → designed → queued → in_progress → ready_review → verified). Operator batches: queue several features in designed state, run tdd-implement against the queue.
  • P13 — Cross-run stability Iterating tdd-design + tdd-implement against the same project becomes the expected mode of operation. Seed fallback reads prior runs' artifacts off disk; emission transitions fire per-utterance so dashboard state tracks reality without backfill races.
  • P14 — Discovery Requirements gathering as a first-class pre-design phase. Three-interview flow (Alice → personas / situations, Cheshire Cat → constraints / integrations, White Rabbit → scope / success criteria). Disk-mediated operator bridge so interviews don't burn rotation budget while operators think. Wall-clock unbounded.
  • P15 — Milestones milestone-plan workflow organizes captured requirements into a sequenced milestone trajectory. tdd-design --milestone <slug> scopes design to one milestone. Two closed-loop coverage checks: (a) every decomposable requirement lands in a milestone, (b) every milestone requirement is realized by a feature. Retract primitive lets agents walk back off-scope artifacts; workflow-level kill-list blocks cross-workflow speech_act leakage. Dashboard surfaces the derived lifecycle phase + the next-recommended workflow as a one-line CTA.
  • P7 — Evals Generic-baseline vs Wonderland comparison. The compounding curve. Reordered after the lifecycle stabilizes so the eval harness has a usable end-to-end surface.
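
The per-feature lifecycle listed under P12 reads naturally as an ordered state list. The sketch below assumes a strictly linear, one-step-at-a-time progression, which is a simplification: the real runtime may allow other moves (for example a rejection path), and the helper names `can_transition` and `advance` are hypothetical.

```python
# State names are taken from the P12 entry; the linear ordering rule is an assumption.
LIFECYCLE = [
    "proposed",
    "in_design",
    "designed",
    "queued",
    "in_progress",
    "ready_review",
    "verified",
]


def can_transition(current, target):
    """Allow only a single forward step along the lifecycle."""
    return LIFECYCLE.index(target) - LIFECYCLE.index(current) == 1


def advance(current):
    """Return the next state, or raise if the feature is already terminal."""
    position = LIFECYCLE.index(current)
    if position == len(LIFECYCLE) - 1:
        raise ValueError(f"{current!r} is the final state")
    return LIFECYCLE[position + 1]


print(advance("designed"))                     # queued
print(can_transition("proposed", "designed"))  # False: skips in_design
```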

WONDERLAND_SPEC.md is the design document. constitutions/ holds each character's identity in plain text — these are the actual identity artifacts the runtime loads.

Try it

Two demo scripts run live against the Anthropic API. You'll need an API key (see Configuration below).

# A single Cheshire Cat reflecting on a directive
uv run python scripts/cat_demo.py

# Cat + Rabbit on the same bus, with optional compaction afterward
uv run python scripts/two_agent_demo.py --compact

Both scripts publish a translation-chat directive by default; pass --directive "..." to use your own.

The TUI

wonderland-tui is the operator interface. Register a project, queue features for the team, watch them work in real time, verify or reject what they ship. The same screen that renders live runs also replays past ones at compressed clock time, so iterating on the UX never costs API tokens.

pip install wonderland-ai
wonderland-tui                       # opens the project library

First-run flow: the library opens empty. Open Settings, paste an Anthropic API key (saved to your platform's user-config dir), back out. Press n to create a project — pick a path, pick a skeleton (python-tui, python-cli, python-fastapi, react-vite, fullstack-fastapi-react), optionally seed the project's prime directive from a demo preset (squathero, pomodoro, geocities, …). On confirm, the TUI offers to launch the discovery workflow immediately — yes lands you on the live-watch screen where the first interview modal pops up; later just opens the project dashboard. project.yaml carrying the stack-as-runtime-fact gets written either way so the team has runtime context at every meeting.

After discovery completes, the dashboard's phase badge reads PLANNING — N requirements captured with a "Run milestone-plan" button right above the project context. After planning, the badge reads DESIGN — Milestone N (M of K designed) with a "Design milestone: " button at the bottom of the milestone detail pane. The substrate computes the phase from disk every refresh, so an operator returning to a project after a week sees what's next without re-orienting.
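
Deriving the phase from disk rather than from stored state can be sketched as a few directory checks. The version below is an assumption-laden illustration: it covers only the first three badges, treats "directory has any file" as "phase produced artifacts", and the `derive_phase` helper and `.json` file names are hypothetical rather than the substrate's real logic.

```python
import tempfile
from pathlib import Path


def has_artifacts(directory: Path) -> bool:
    """True when the directory exists and contains at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def derive_phase(wonderland_dir: Path) -> str:
    """Recompute the badge from what is on disk; nothing is cached."""
    if not has_artifacts(wonderland_dir / "requirements"):
        return "DISCOVERY"
    if not has_artifacts(wonderland_dir / "milestones"):
        return "PLANNING"
    return "DESIGN"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".wonderland"
    (root / "requirements").mkdir(parents=True)
    phases = [derive_phase(root)]                      # nothing captured yet
    (root / "requirements" / "r1.json").write_text("{}")
    phases.append(derive_phase(root))                  # requirements, no plan
    (root / "milestones").mkdir()
    (root / "milestones" / "m1.json").write_text("{}")
    phases.append(derive_phase(root))                  # plan exists

print(phases)  # ['DISCOVERY', 'PLANNING', 'DESIGN']
```

Because the function reads the filesystem on every call, a project untouched for a week yields the same answer as one refreshed a second ago.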

The screens, in the order an operator typically meets them:

  • Project library — your projects with metadata. n for new, Enter to open the dashboard, s for settings.
  • New project — name, path, prime directive (with a demo picker that populates the composer from bundled directives), skeleton picker. On submit the TUI offers to jump straight into the discovery workflow — the natural first move on a fresh project. Skeleton apply lays down a working scaffold AND writes project.yaml so M4 architecture and M5 contracts ground in the runtime fact, not just the directive's prose. Existing non-bare projects get a retrofit path that writes project.yaml without clobbering existing files.
  • Per-project dashboard — the operator's primary attention surface. Phase badge at the top derives the project's current state (DISCOVERY / PLANNING / DESIGN — Milestone N / IMPLEMENTATION / COMPLETE) from disk + names the next recommended workflow. Below: Milestones pane (left) with a collapsible tree of milestones + their consumed requirements + a Cross-cutting node for personas / situations / out-of-scope reqs that don't belong to any specific milestone. Selecting a milestone filters the Features pane (right) to features that realize that milestone's requirements via the story → feature chain. Feature pane has state filter chips (designed / queued / ready_review / in_progress / verified / rejected) that stack with the milestone filter. Runs row sits below as the always-visible history reference. State-aware action buttons (Design, Implement, Verify, Custom run) surface counts for what's actionable.
  • Lifecycle moves from the dashboard: q queues a designed feature for implementation; Verify opens a modal that captures the operator's verdict with optional notes (verified / rejected → recorded in .wonderland/feature-states.jsonl for next-run context); m/D mark and bulk-delete duplicate tickets when Rabbit's M3 ships revision-pass redundancy. The Milestone detail pane offers a "Design milestone: " CTA at the bottom when the highlighted milestone has no realizing features yet, giving the operator a direct path into tdd-design with the scope pre-filled.
  • New run composer — preset picker (left) + directive editor (right) + workflow / budget / project-root config + inline save-as-preset form. Bundled directives: pomodoro, hello-endpoint, translation-chat, geocities, ping. Per-project presets live at <project>/.wonderland/directives/. Empty directives push a confirmation modal so a launch doesn't silently ship without intent.
  • Live-watch screen — three focusable panes (lazygit-style): meetings ribbon (with per_item iteration discriminators for parallel and pipeline workflows), transcript table + body preview pane, artifacts table. Selection filters across panes; status bar shows current speaker, live cost ticker, watching elapsed time + source-time elapsed. Same screen consumes live runs (LiveRunHandle) and replays of captured runs (MockTurtleHandle) interchangeably.
  • Operator-question modal — when an agent emits a question_to_operator (architectural ambiguity contracts can't disambiguate, business priority calls, schema-vs-directive conflicts), the framework pauses the meeting and surfaces the question as a modal. Your reply lands on the bus as an observation from the operator identity; the meeting resumes with the team seeing the answer in their context.
  • Interview modal — surfaced during discovery workflow when Alice / Cheshire Cat / White Rabbit ship their question batches. One widget per question kind (free-text TextArea, single-choice RadioSet, multi-choice Checkbox group, numeric Input). Submit / Skip section / Cancel exits; required fields block submit until filled. The substrate writes answers to .wonderland/runs/<id>/pending_interview_answers.json and the interviewer's next turn synthesizes them into requirement artifacts. Wall-clock unbounded — interviews don't burn rotation budget while the operator's away.
  • Cast view — single-page lazygit shape: character list at top, bio + constitution side-by-side below. Bios cover both the literary character and how it shapes each agent's constitution. Useful for understanding why an agent made a particular call when reviewing a captured run.
  • Settings — Anthropic API key (password-masked, persists to the user-config dir) + optional model override. Reachable from the library, also auto-pushed when New run finds no API key set so fresh pip install users have a one-click recovery path.
  • Theme cycling: t rotates through four Wonderland-flavored palettes (Tea Party / Looking Glass / Trial / Caucus); built-in Textual themes (gruvbox, dracula, nord, …) remain available.
  • Vim navigation throughout — j/k to move, g/G and H/L for top/bottom, Enter to drill in / advance, Tab to cycle focus across panes, Escape to back out. Per-screen bindings show in the footer.
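
The verify/reject modal records verdicts in .wonderland/feature-states.jsonl. An append-only JSON Lines log of that kind might look like the sketch below; the field names (`feature`, `verdict`, `notes`) and both helper functions are assumptions for illustration, not the file's documented schema.

```python
import json
import tempfile
from pathlib import Path

VERDICTS = {"verified", "rejected"}


def record_verdict(log_path, feature, verdict, notes=""):
    """Append one operator verdict as a single JSON line."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"feature": feature, "verdict": verdict, "notes": notes}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def latest_verdicts(log_path):
    """Replay the log; later lines override earlier ones for the same feature."""
    latest = {}
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                entry = json.loads(line)
                latest[entry["feature"]] = entry["verdict"]
    return latest


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / ".wonderland" / "feature-states.jsonl"
    record_verdict(log, "note-search", "rejected", "no empty-state handling")
    record_verdict(log, "note-search", "verified")
    record_verdict(log, "export-md", "verified")
    states = latest_verdicts(log)
    line_count = len(log.read_text(encoding="utf-8").splitlines())

print(states)
```

Appending instead of rewriting keeps the full verdict history available as next-run context while the latest entry per feature still wins.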

The replay-first design carries forward: it drives the smoke tests, keeps UX iteration free of API spend, and means anyone curious about how the framework actually behaves can launch wonderland-tui → open a project → drill into Runs → press w on a snapshot to watch a captured run play back at 5× speed. Project context, the features-as-tree dashboard, the verify/reject modal, and the operator-question pipeline are the P11/P12 additions that pulled the framework from "watch a run happen" to "drive a project's feature lifecycle through several runs."
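
Replaying a captured run at compressed clock time can be sketched as an async generator that scales the recorded gaps between events. This is not the MockTurtleHandle implementation: the `replay` function, the `(timestamp, payload)` tuples, and the `MeetingClosed` event name are hypothetical, and the sleep function is injectable so the demo runs without real waiting.

```python
import asyncio


async def replay(events, speed=5.0, sleep=asyncio.sleep):
    """Yield recorded events in order, waiting 1/speed of each recorded gap."""
    previous = None
    for timestamp, payload in events:
        if previous is not None:
            await sleep((timestamp - previous) / speed)
        previous = timestamp
        yield timestamp, payload


async def collect(events, speed):
    waits = []

    async def fake_sleep(seconds):
        waits.append(seconds)  # record the delay instead of actually waiting

    seen = [payload async for _, payload in replay(events, speed, sleep=fake_sleep)]
    return seen, waits


captured = [(0.0, "MeetingStarted"), (10.0, "Utterance"), (25.0, "MeetingClosed")]
seen, waits = asyncio.run(collect(captured, speed=5.0))
print(seen)   # ['MeetingStarted', 'Utterance', 'MeetingClosed']
print(waits)  # [2.0, 3.0]
```

Because a consumer only iterates the generator, the same consuming code could in principle be pointed at a live event source, which is the interchangeability the text describes.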

Project layout

wonderland-ai/
├── WONDERLAND_SPEC.md      # The design document
├── THESIS.md               # Long-form thesis (architectural claim + corollaries)
├── constitutions/          # Each character's identity, version-controlled
├── src/wonderland/         # The runtime
│   ├── closet/             # Data the team reaches for at runtime
│   │   ├── skeletons/      # Project skeletons the team builds on top of
│   │   └── workflows/      # Meeting-chain templates (canonical, tdd, smoke)
│   └── ...                 # agent.py, runner.py, caucus.py, workflow.py, ...
├── scripts/                # Demo scripts; workflow_demo.py runs any bundled workflow
├── analyses/               # Field notes on the thesis as it gets stress-tested
├── tests/
└── .daedalus/              # Daedalus' working memory for this project

A target project that runs Wonderland gets a .wonderland/ directory of its own — per-agent episodic/semantic/relational memory plus the artifacts the team produces across the lifecycle: requirements (from discovery), milestones (from planning), stories, features, tickets, ADRs, contract notes, test scenarios, implementations, reviews. The runtime here is project-agnostic; per-project state lives with the project.

wonderland init [path]   # create the .wonderland/ skeleton; idempotent

init creates requirements/, milestones/, stories/, features/, tickets/, architecture/, escalations/, and memory/ plus a README documenting the layout. Re-running is safe — existing artifacts and a user-edited README are left alone.
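
The idempotency described above comes down to "create if missing, never overwrite". A minimal sketch of that pattern with pathlib, assuming a hypothetical `init_skeleton` helper and a `README.md` file name (the text only says "a README"); it is not the CLI's actual code:

```python
import tempfile
from pathlib import Path

# Directory names are the ones listed in the text above.
SUBDIRS = [
    "requirements", "milestones", "stories", "features",
    "tickets", "architecture", "escalations", "memory",
]


def init_skeleton(project_root: Path) -> Path:
    """Create the .wonderland/ layout; safe to call repeatedly."""
    base = project_root / ".wonderland"
    for name in SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    readme = base / "README.md"  # file name is an assumption
    if not readme.exists():      # never overwrite a user-edited README
        readme.write_text("Layout of the .wonderland/ directory.\n", encoding="utf-8")
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = init_skeleton(Path(tmp))
    (base / "README.md").write_text("my notes\n", encoding="utf-8")
    (base / "stories" / "s1.md").write_text("story\n", encoding="utf-8")
    init_skeleton(Path(tmp))  # second run must not clobber anything
    readme_after = (base / "README.md").read_text(encoding="utf-8")
    story_survived = (base / "stories" / "s1.md").exists()
    created = sorted(p.name for p in base.iterdir() if p.is_dir())

print(readme_after.strip(), story_survived)  # my notes True
```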

Install

Distribution name on PyPI is wonderland-ai; the import path stays import wonderland. Core install includes the TUI (the primary user-facing surface) and the in-process bus:

pip install wonderland-ai           # core + TUI
pip install 'wonderland-ai[redis]'  # adds RedisCaucus

RedisCaucus requires the redis extra; constructing one without it raises ImportError with an install hint.
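
Raising ImportError with an install hint is a common pattern for optional extras. The sketch below shows one way to do it; the `load_optional` helper and the exact message wording are hypothetical, and RedisCaucus may implement the check differently.

```python
import importlib


def load_optional(module_name, extra):
    """Import an optional dependency or fail with an actionable hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is not installed; "
            f"try: pip install 'wonderland-ai[{extra}]'"
        ) from exc


# A stdlib module stands in for an installed dependency.
json_module = load_optional("json", "redis")

try:
    load_optional("no_such_module_for_this_sketch", "redis")
except ImportError as exc:
    hint = str(exc)

print(hint)
```

Deferring the import to construction time is what lets the core package install and run without the extra present.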

Configuration

Wonderland reads user-level config (API keys, model overrides) from a JSON file at the platform-appropriate location:

OS        Path
Linux     ~/.config/wonderland/config.json (honors XDG_CONFIG_HOME)
macOS     ~/Library/Application Support/wonderland/config.json
Windows   %APPDATA%\wonderland\config.json

{
  "anthropic": {
    "api_key": "sk-ant-...",
    "model": "claude-haiku-4-5-20251001"
  }
}

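API-key resolution order: explicit constructor arg → ANTHROPIC_API_KEY env var → config file. The first source that provides a key wins, so the env var overrides the config file whenever it is set, and an explicit constructor arg overrides both.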
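
The resolution order can be expressed as a short fall-through. This is a sketch, not the library's loader: `resolve_api_key` and its parameters are hypothetical names, and the environment is passed in as a plain mapping so the example does not depend on the real process environment. The JSON shape matches the config example above.

```python
import json
import os
import tempfile
from pathlib import Path


def resolve_api_key(explicit=None, env=None, config_path=None):
    """Return the first key found: explicit argument, env var, then config file."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    if env.get("ANTHROPIC_API_KEY"):
        return env["ANTHROPIC_API_KEY"]
    if config_path is not None and Path(config_path).is_file():
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
        return config.get("anthropic", {}).get("api_key")
    return None


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"anthropic": {"api_key": "from-file"}}), encoding="utf-8")
    env_with_key = {"ANTHROPIC_API_KEY": "from-env"}
    from_file = resolve_api_key(env={}, config_path=cfg)
    from_env = resolve_api_key(env=env_with_key, config_path=cfg)
    from_arg = resolve_api_key("from-arg", env=env_with_key, config_path=cfg)

print(from_arg, from_env, from_file)  # from-arg from-env from-file
```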

Development

uv sync --extra dev   # includes redis for full test coverage
uv run pytest
uv run ruff check
uv run ruff format

Live LLM tests are gated behind WONDERLAND_LLM_SMOKE=1 and skipped otherwise; running them costs Anthropic API tokens. Redis-backed tests are gated behind WONDERLAND_REDIS_URL. To exercise both:

docker run -d --name wonderland-redis -p 6379:6379 redis:7-alpine
WONDERLAND_REDIS_URL=redis://localhost:6379 \
WONDERLAND_LLM_SMOKE=1 \
  uv run pytest

Sponsoring

Wonderland runs on a personal Anthropic budget — one person, one API key. The architecture is designed to be cheap (small models, heavy caching) but multi-agent runs at scale still add up. If any of my work has been useful to you — to read, build on, or argue with — GitHub Sponsors keeps the Cheshire Cat in tea and the Hatter in scenarios.

License

MIT.
