Skip to main content

Agent execution harness — wraps LLMs in structured, inspectable workflow specs

Project description

Armature

CI License: MIT Python 3.11+

A lightweight, declarative agent execution harness. Define multi-agent workflows as YAML specs. Run them with a single Python call or from the CLI.

No framework dependency. No prescribed team structure. Just a DAG executor, an LLM adapter, and your workflow spec.

Armature is the execution engine for Reasoning Automation — end-to-end business processes where multi-agent deliberation replaces brittle rule-based logic. The harness owns orchestration, retries, safety, telemetry, and human approval gates. You supply the domain logic as YAML workflow specs and Python tool modules. The same engine that runs a code-review pipeline can run a contract risk assessment, a social media creative chain, or a compliance audit — without any changes to Armature itself.

Part of a larger vision. Armature is one component of a broader autonomous-organization platform I'm building under the working name ElfTech — a stack of AI systems covering reasoning, deliberation, code generation, deployment, and coordination. More details to come. Stay tuned.


What it does

Armature reads a YAML spec that defines a workflow as a directed acyclic graph (DAG) of stages. Each stage is one of four things:

  • An LLM call — a role with a system prompt, model tier, and output format
  • A script/adapter — a Python function or shell command
  • A human gate — pauses execution for human approval
  • A direct tool call — invokes a registered tool deterministically, no LLM involved
  • A subagent — spawns a child workflow (with optional fan-out/fan-in for parallelism)

Stages declare depends_on relationships. The engine resolves execution order automatically, passes accumulated results downstream as context, and handles retries, safety hooks, and telemetry.


Installation

pip install armature-agents

With optional extras:

pip install "armature-agents[service]"   # FastAPI HTTP service
pip install "armature-agents[telemetry]" # OpenTelemetry export

Verify:

armature --version

Set your LLM provider key:

export ANTHROPIC_API_KEY=sk-...
# or OPENAI_API_KEY, or configure any litellm-supported provider

Quick start

1. Write a spec (my_workflow.yml):

name: summarize
version: "1.0"

model_tiers:
  small:
    provider: anthropic
    model: claude-haiku-4-5-20251001

# Optional: map role types to tiers so stages don't need explicit model_tier
role_type_defaults:
  worker: small
  judge: small

stages:
  - id: summarizer
    role:
      name: Summarizer
      type: worker        # picks up "small" from role_type_defaults
      description: |
        Summarize the provided text in 3 bullet points.
        Be concise and capture the key ideas.
    output_mode: text
    depends_on: []

2. Run it from Python:

import asyncio
from armature import Harness

async def main():
    harness = Harness.from_spec("my_workflow.yml")
    result = await harness.run({"text": "Your content here..."})
    print(result["summarizer"]["content"])

asyncio.run(main())

3. Or from the CLI:

armature run my_workflow.yml --input text="Your content here..."

CLI

armature run <spec>                           # execute a workflow
armature run <spec> --no-cache               # run without LLM response cache
armature run <spec> --auto-improve           # run then auto-apply spec improvements when IHR < 0.75
armature validate <spec>                      # validate spec + show KYA-inspired risk score (LOW/MEDIUM/HIGH/CRITICAL)
armature new [output]                         # interactive spec creation wizard
armature doctor                               # environment health check
armature serve                                # start HTTP service (requires armature[service])
armature serve --specs-dir ./specs/          # serve with named workflow registry (/workflows API)
armature optimize <spec>                      # single-shot meta-harness optimizer
armature improve <spec>                       # analyze traces, auto-apply spec improvements
armature improve <spec> --apply-pending       # promote a staged pending.yaml revision
armature report --run-id <id>                 # per-run text report with failure signatures
armature replay <run_id>                      # display a recorded run stage-by-stage
armature dashboard <spec>                     # Rich 4-panel aggregate health dashboard
armature dashboard <spec> --watch             # auto-refresh every 5 seconds
armature dashboard <spec> --format json       # machine-readable JSON output
armature export-traces                        # export traces as SFT/DPO training data
armature channels start                       # messaging channel connectors
armature watch <spec>                         # listen for cron/webhook triggers and fire runs

Built-in tools

Armature ships with a tool registry pre-loaded with the following tools. Any stage can invoke them via tool_call or by listing them in role.tools.

Tool name Permission Description
file_read READ_ONLY Read a file from disk
file_write WORKSPACE Write content to a file
shell WORKSPACE Run a shell command; returns stdout, stderr, exit_code
http_get NETWORK HTTP GET request; returns status and body
http_post NETWORK Authenticated HTTP POST with JSON body and custom headers; returns status and body

http_post is the general-purpose adapter for any external API — image generation, ad platforms, analytics services, webhooks, etc. Pass auth credentials in headers:

- id: generate_image
  tool_call:
    name: http_post
    args:
      url: "https://api.openai.com/v1/images/generations"
      headers:
        Authorization: "Bearer {{ env.OPENAI_API_KEY }}"
        Content-Type: "application/json"
      body:
        model: "dall-e-3"
        prompt: "{{ visual_prompt }}"
        size: "1024x1024"
        n: 1

Reasoning Automation

Armature's tools: spec section lets any workflow load external Python modules that register additional tools. This is the primary extension point for building Reasoning Automation applications — end-to-end processes that connect LLM reasoning to real external systems.

The pattern

Create a Python package alongside your workflows. Each module exposes a register(registry) function:

# myapp/tools/dalle.py
import openai
from armature.registry.registry import ToolRegistry, ToolDescriptor, PermissionLevel

_client = openai.AsyncOpenAI()

async def generate_image(args: dict) -> dict:
    response = await _client.images.generate(
        model="dall-e-3",
        prompt=args["prompt"],
        size=args.get("size", "1024x1024"),
        n=1,
    )
    return {"url": response.data[0].url, "revised_prompt": response.data[0].revised_prompt}

def register(registry: ToolRegistry) -> None:
    registry.register(ToolDescriptor(
        name="dalle.generate_image",
        description="Generate an image using DALL-E 3",
        permission=PermissionLevel.NETWORK,
        handler=generate_image,
        parameters={
            "prompt": {"type": "string"},
            "size":   {"type": "string", "optional": True},
        },
    ))

Declare it in your workflow spec:

tools:
  - module: myapp.tools.dalle
  - module: myapp.tools.meta_publisher
  - module: myapp.tools.analytics

stages:
  - id: generate_image
    tool_call:
      name: dalle.generate_image
      args:
        prompt: "{{ visual_director.prompt_a }}"

The tool modules live entirely in your application project. Armature imports them at startup. No changes to Armature are required.

What you can build

Use case Tool modules needed
Social ad campaign automation Image gen (DALL-E 3), platform publishers (Meta, TikTok), analytics collectors
Contract risk review Document extractor, clause classifier, risk scorer
Vendor assessment Web search, company lookup, scoring rubric
Compliance documentation Regulatory corpus retrieval, template filler, diff checker
Code review pipeline GitHub API, static analysis runner, security scanner

Each use case is a YAML workflow spec + a small set of Python tool modules. The Armature engine is the shared execution layer across all of them.


Research foundation

Armature is built from nine academic papers, one industry governance framework, and one open-source agent architecture project, all published between February and June 2026. Every major design decision traces to an experimentally validated finding: the harness matters more than the model.

The papers

[NLAH] Natural-Language Agent Harnesses — Tsinghua University, March 2026 (arXiv:2603.25723)

Establishes the architectural model. NLAH defines seven mandatory harness components (Contracts, Roles, Stages, Adapters, State, Failure Taxonomy, File-backed State) and shows that workflows defined in structured natural language outperform code-based equivalents on complex benchmark tasks (47.2% vs. 30.4% on OSWorld). It also defines IHR (Implicit Harness Rating), a composite quality metric for scoring run quality objectively, and specifies parallel fan-out as a core orchestration primitive.

[Meta-Harness] Automated Optimization End-to-End — Stanford University, March 2026 (arXiv:2603.28052)

The paper behind the optimizer. Meta-Harness introduces an outer optimization loop where a frontier model reads execution traces and proposes improvements to the harness spec itself. Key finding: giving the optimizer access to the history of prior proposals — what was tried, whether it was accepted, and what score it achieved — improves accuracy from 41% to 57% by enabling causal reasoning. Implemented in ProposalStore and run_loop().

[AutoHarness] LLM-Synthesized Harnesses — February 2026 (arXiv:2603.03329)

Demonstrates that LLMs can iteratively write their own harness code and produce systems that outperform larger models without harnesses. The concept most directly applied: the harness-as-verifier, where the harness validates outputs meet domain-specific legality constraints before accepting them — the ancestor of the judge role type and SpecDrafter.

[AgentSpec] Runtime Enforcement for Safe Agents — March 2025 (arXiv:2503.18666)

Introduces a declarative rule language for constraining agent behavior at runtime. Rules are composable, lightweight (sub-millisecond evaluation), and LLM-generatable. Armature implements the full enforcement architecture: pre/post-tool hooks wired into the engine and a declarative condition DSL (ToolSafetyRule + SafetyCondition) written directly in YAML.

[Continual Harness] Reset-Free Self-Improvement — May 2026 (arXiv:2605.09998)

Formalizes the two-loop self-improvement design: an inner loop (a post_run refiner stage that sees the full transcript after the DAG completes) and an outer loop (SelfImproveRunner — load traces → diagnose → propose YAML revision → auto-apply). Introduces the 4-code failure taxonomy (stage_failed, output_invalid, low_confidence, high_escalation) and the fine-tuning bridge: high-quality judge traces exported as SFT/DPO training data.

[AHE] Agentic Harness Engineering — April 2026 (arXiv:2604.25850)

The accountability paper. AHE introduces the prediction-verification loop: every proposed spec revision carries a falsifiable contract (predicted_fixes, predicted_regressions), and the next cycle verifies those predictions against observed diagnostic shift. Implements component-level improvement targeting — long-term memory evolution alone yielded +5.6pp; system prompt evolution alone caused -2.3pp regression, validating the "one component at a time" discipline.

[System Scaling] From Model Scaling to System Scaling — May 2026 (arXiv:2605.26112)

Identifies three system-level failure modes: stale memory reaching LLMs without warning, context values flowing between stages without provenance, and tool side effects going unverified. Adds drift score (regression detection across improvement cycles) and component governance (auto-apply vs. human-review classification for spec changes).

[AGT] Microsoft Agent Governance Toolkit — 2025

Five governance primitives borrowed directly: reversibility classification for every tool call (FULL / PARTIAL / NONE), tamper-evident SHA-256 hashing of trace inputs and the governing policy, a require_approval gate wired into the tool-call path, and safety_mode: strict (fail-closed — deny on no-match).

[ActiveGraph] — yoheinakajima, May 2026 (arXiv:2605.21997)

Graph-memory agent architecture introducing content-addressed caching of LLM responses and event-triggered reactive behaviors. Adopted concepts: SHA-256 cache keying by model + messages + kwargs (LLMCache), audit replay from the trace store (armature replay), and the BehaviorRule/BehaviorRegistry hook layer for pattern-triggered post-run behaviors.

[KYA] Know Your Agents — Veldt Labs, May 2026 (arXiv:2605.25376)

Governance layer operating at definition-time (static risk scoring), runtime-trust (anomaly counting), and composition (only-tighten). Adopted: five-factor static spec risk score surfaced by armature validate, RogueSignalCounter wired into safety hooks and the run summary, and CONFLICTING_SAFETY_RULES validation enforcing the only-tighten composition principle.


What's implemented

Source Concept Status
NLAH 7-component spec, four role types, IHR, fan-out/fan-in
Meta-Harness Single-shot + multi-iteration optimizer, proposal history, prompt bootstrapping
AutoHarness Harness-as-verifier, NL-to-spec synthesis (SpecDrafter), AutoHarness loop
AgentSpec Pre/post-tool hooks, declarative safety DSL (6 operators, 5 actions)
Continual Harness 4-code failure taxonomy, inner refiner loop, SelfImproveRunner, TraceExporter
Harness Benefit (arXiv:2605.30621v1) Cheap-evolver (medium-tier SpecRefiner), HFR as 5th IHR component, SLR low_skill_activation diagnostic
AHE Falsifiable improvement contract, prediction-verification, _verify_predictions()
System Scaling Memory staleness, context provenance, drift score, postcondition verification, consensus fan-in, component governance
AGT Reversibility classification, trace hashing, policy version, require_approval, strict mode
ActiveGraph LLM response caching, audit replay, trace-triggered behaviors (BehaviorRule), --auto-improve
KYA Static spec risk score, rogue signal counter, only-tighten safety rule validation

The self-improvement flywheel

Armature is the execution layer — the first component in a larger system designed to improve itself the more it runs. The chart below shows where the current implementation stands and where the flywheel leads aspirationally.

  TODAY                         NEAR-TERM                    ASPIRATIONAL
  ─────────────────────────────────────────────────────────────────────────

  ┌──────────────────┐
  │  Armature        │  ─── every run records ──►  ┌─────────────────────┐
  │  Harness         │                              │  TraceStore         │
  │                  │  ◄── optimizer proposes ───  │  (SQLite, per run)  │
  │  • DAG executor  │        spec improvements     └──────────┬──────────┘
  │  • Role routing  │                                         │
  │  • Safety hooks  │                              ┌──────────▼──────────┐
  │  • IHR scoring   │                              │  Loop 1:            │
  │  • Session log   │                              │  Harness Optimizer  │
  └──────────────────┘                              │                     │
                                                    │  Reads traces +     │
                                                    │  proposal history   │
                                                    │  → proposes YAML    │
                                                    │  spec improvements  │
                                                    │  → A/B tests by IHR │
                                                    └──────────┬──────────┘
                                                               │ accepted diffs
                                                    ┌──────────▼──────────┐
                                                    │  Loop 2:            │
                                                    │  SLM Fine-Tuning    │
                                                    │                     │
                                                    │  High-quality       │
                                                    │  traces → LoRA      │
                                                    │  fine-tune workers  │
                                                    │  → register as      │
                                                    │  new model tier     │
                                                    └──────────┬──────────┘
                                                               │ better workers
                                                    ┌──────────▼──────────┐
                                                    │  Loop 3:            │
                                                    │  RAG                │
                                                    │                     │
                                                    │  Trace failures     │
                                                    │  reveal knowledge   │
                                                    │  gaps → improve     │
                                                    │  retrieval index    │
                                                    └──────────┬──────────┘
                                                               │ richer context
                                                    ┌──────────▼──────────┐
                                                    │  Loop 4:            │
                                                    │  Consensus          │
                                                    │  deliberation       │
                                                    │                     │
                                                    │  Calibrate          │
                                                    │  deliberation       │
                                                    │  priors from        │
                                                    │  outcomes →         │
                                                    │  cleaner quality    │
                                                    │  signal back to     │
                                                    │  Loop 1             │
                                                    └─────────────────────┘

  ─────────────────────────────────────────────────────────────────────────
  All four loops are implemented. 1,388 tests passing.

The compounding property: Each loop feeds the next. Better traces → better optimizer proposals → better specs → better traces. Fine-tuned worker models produce better outputs → fewer judge rejections → cleaner quality signal. The harness measurably improves the more it runs, without engineering effort after initial deployment.


Key concepts

Concept Description
Spec YAML file defining the complete workflow — model tiers, stages, safety rules, memory
Stage One unit of work: an LLM call, script, gate, direct tool call, or subagent
DAG Stages declare depends_on; the engine resolves execution order
Context Shared dict that accumulates stage outputs; every stage sees all upstream results
Model tiers Named model slots (tiny, small, medium, large, frontier); the using app defines what each name maps to (provider, model, temperature, max_tokens)
Role type defaults Maps role types to tiers automatically (worker → small, judge → frontier, etc.); stages can omit model_tier and inherit from this mapping
Native tool calling Stages declare role.tools to scope which registry tools they can call; the engine runs a ReAct dispatch loop — tool calls returned by the model are executed and results fed back until a final response is produced
Direct tool call A tool_call stage invokes a registered tool without an LLM — deterministic, zero-latency, no JSON hallucination. Args are Jinja2-rendered against context.
Mission context A mission: field on the spec is automatically injected into every LLM stage's system prompt, anchoring agents to the stated goal across long-running workflows and including a compact prior-stage breadcrumb
Continuation A continuation: block carries selected stage outputs from a prior run into the next activation via carry_forward key references; the merged values arrive under an inject_as context key (default: prior_run). Enables long-horizon workflows that accumulate state across repeated executions without custom code.
Triggers A triggers: list declares cron (schedule expression) and webhook (HTTP path) trigger sources. armature watch <spec> runs a persistent dispatcher that fires Harness.run() on every matching event.
Response stage Mark one text-mode LLM stage as response_stage: true to enable token streaming; the HTTP service forwards each token to the SSE stream immediately and fires a response_stage_complete event so clients can render the answer before background stages finish
Context filtering A stage's signature.input declares which context keys appear in its prompt — keeps prompts focused, hides internal state from irrelevant stages
Cross-run memory The memory: spec section captures stage outputs across runs and injects them into subsequent runs — lets workflows accumulate knowledge without code changes
IHR Implicit Harness Rating — 5-component quality score: output validity (35%), success rate (25%), quorum score (20%), latency (10%), harness-following rate / HFR (10%). HFR = fraction of stages that succeed without escalation, per arXiv:2605.30621v1
Sandbox isolation sandbox.mode: docker routes shell, file_write, and file_read tool calls through ephemeral Docker containers — network-isolated, CPU/memory bounded, workspace-scoped. Per-stage image overrides with sandbox_image. Image content digest recorded on every trace for audit.
Templates Pre-built spec files for common patterns (Six Thinking Hats deliberation, etc.)

Examples

examples/ — annotated workflows you can copy and modify:

File What it demonstrates
01_hello_world.yml Minimal single-stage LLM workflow
02_research_pipeline.yml Multi-stage pipeline with dependencies
03_deliberation_standard.yml Judge/evaluator pattern with quality scoring
starter_template.yml Full-featured reference — every section documented inline, showing model tiers, context filtering, cross-run memory, safety rules, guided JSON, and a human gate

Templates

Ready-to-use deliberation patterns in armature/templates/:

Template Pattern
six_thinking_hats.yml Edward de Bono's Six Thinking Hats — structured multi-perspective deliberation

Project layout

armature/
├── nodes/          # Stage executors (LLMNode, ScriptNode, HumanGateNode, SubagentNode)
├── registry/       # Tool registry, built-in tools, ToolDescriptor, reversibility
├── runtime/        # DAG executor, engine, prompt assembler, context manager
├── spec/           # YAML loader, Pydantic models (HarnessSpec, Stage, SandboxConfig, ...)
├── hooks/          # Lifecycle hooks, safety rule evaluation, PostconditionFailed
├── permissions/    # PermissionLevel, PermissionChecker
├── optimizer/      # Meta-Harness: trace-driven spec optimization, ProposalStore
├── synthesis/      # SelfImproveRunner, SpecRefiner, DiagnosticAnalyzer, TraceExporter
├── state/          # TraceStore, MemoryStore, SessionLog, ArtifactStore (SQLite + JSONL)
├── report/         # Rich dashboard, sparkline, aggregator, panels
├── sandbox/        # DockerSandboxProvider — shell/file tool sandboxing
├── emitters/       # HermesEmitter — agent bundle generation
├── adapters/       # Observability adapters (LangFuse, LangSmith)
├── templates/      # Reusable workflow spec templates
├── service/        # FastAPI HTTP service — WorkflowRegistry, build_app(), /workflows API
└── cli.py          # CLI entry point

examples/           # Annotated workflow YAML specs (copy and modify)
docs/               # Full documentation (see index below)

Documentation

Getting started

Document Purpose
BUILD_FIRST_WORKFLOW Hands-on tutorial — build a working workflow from scratch
USER-GUIDE Full spec reference — every field, every option, worked examples
ARMATURE-SPEC-REF All spec fields and valid values on one page
FAQ Common questions — positioning, capabilities, comparisons

Design & philosophy

Document Purpose
ARCHITECTURE Design rationale, research foundation, implementation table
ARMATURE-PHILOSOPHY Why a harness — philosophy, research papers, architecture deep-dive
DECLARATIVE-CONTROL-FLOW YAML-first control flow — branching, loops, conditions
DAG-vs-LANGGRAPH How Armature's DAG model compares to LangGraph
MISSION-AS-CONTEXT Mission statements as persistent agent context
ROLE-TAXONOMY Agent role definitions and the role system
MODEL-TIERS Routing work across SLM workers and frontier orchestrators

Patterns & features

Document Purpose
JUDGE-PATTERN Output validation with judge agents
QUORUM-SCORING Deliberative quality scoring across agents
FAN-IN_FAN-OUT Parallel fan-out and aggregation patterns
SUBAGENT-COMPOSITION Composing workflows from subagent stages
CONTEXT-ISOLATION Isolating subagent context for focus and safety
MEMORY-AND-CONTEXT Memory persistence and context management
CHECKPOINT-AND-RESUME Execution state persistence and resumption
CHATBOT-AND-STREAMING Chat applications and streaming responses
HUMAN-IN-THE-LOOP Approval gates and human decision points
IHR-AND-SELF-IMPROVEMENT The IHR formula and self-improvement loop

Operations & safety

Document Purpose
ARMATURE-IN-PRODUCTION Running Armature in production — patterns and case studies
SAFETY-AND-GOVERNANCE Safety rules, governance, and guardrails
SANDBOX-AND-ISOLATION Sandboxed tool execution (Docker isolation)
INTEGRATION LangGraph sidecar pattern, HTTP endpoint reference

Project

Document Purpose
CONTRIBUTING How to run tests, PR conventions, adding tools and commands
CHANGELOG Release history
ROADMAP Where Armature is headed
SECURITY Reporting vulnerabilities

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

armature_agents-0.2.0.tar.gz (505.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

armature_agents-0.2.0-py3-none-any.whl (156.9 kB view details)

Uploaded Python 3

File details

Details for the file armature_agents-0.2.0.tar.gz.

File metadata

  • Download URL: armature_agents-0.2.0.tar.gz
  • Upload date:
  • Size: 505.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for armature_agents-0.2.0.tar.gz
Algorithm Hash digest
SHA256 5afcff499b3e6b2c65eafae8f2955e3a6de59437e01de7f95444646c14f35e49
MD5 94ea6910b615fd399cc8ff1322f994c5
BLAKE2b-256 1a633f8177f5d953b25d488c6b44fc181f49142f9888e067ebb3ddd9d71aabf5

See more details on using hashes here.

File details

Details for the file armature_agents-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for armature_agents-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6fd68effe486c02349d0092e942031f5433e25a73afe2657c04b775fc7506213
MD5 0d46baa354ce5ff5f5e929e8e44f008b
BLAKE2b-256 7fbe441bf3576c74d0c53d5f3429c4aa06e602c9d753d8c44bbed53b80c36f12

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page