
Model-agnostic, MCP-native agent harness


JeevesAgent

Production-ready async agent harness. Multi-tenant by default, typed outputs, retries on transient errors, model-agnostic, MCP-native.

📖 Docs — https://jeevesagent.readthedocs.io · Migrating? — from LangGraph · from raw OpenAI SDK · Changelog — CHANGELOG.md

import asyncio
from pydantic import BaseModel
from jeevesagent import Agent

class WeatherReport(BaseModel):
    city: str
    temp_c: float
    conditions: str

async def main():
    agent = Agent("Be precise.", model="gpt-4.1-mini")

    # Free-form run, scoped to a user (memory partitions automatically).
    r = await agent.run("Hi, my name is Alice.", user_id="alice")
    print(r.output)

    # Same agent, structured output, conversation continues.
    r = await agent.run(
        "Weather in Tokyo right now: sunny, 22°C, light wind. Extract.",
        user_id="alice",
        session_id="conv_42",
        output_schema=WeatherReport,
    )
    report: WeatherReport = r.parsed   # ← typed, validated
    print(f"{report.city}: {report.temp_c}°C — {report.conditions}")

asyncio.run(main())

Set OPENAI_API_KEY and run. Swap "gpt-4.1-mini" for "claude-opus-4-7", "mistral-large", "command-r-plus", "echo" (zero-key fake), or any of ~100 providers via LiteLLM.

What's actually different about this framework:

  • user_id is a first-class typed primitive. One shared Agent + one shared Memory partitions automatically across N tenants with no cross-contamination. No more "forgot to namespace" data leaks.
  • output_schema= accepts any Pydantic model. The framework augments the system prompt, parses the result, validates, and retries with feedback on validation failure. Typed outputs by default, free text by omission.
  • Network model adapters are auto-wrapped with a typed error taxonomy + retry policy. Rate limits, 5xx, network blips don't blow up your run. Resilient by default.
  • session_id is a real conversation handle. Reuse it across agent.run() calls and prior turns rehydrate as real chat history. No reducer protocol, no add_messages magic.
  • The agent loop is a strategy. Twelve architectures shipped (ReAct, Self-Refine, Reflexion, TreeOfThoughts, PlanAndExecute, ReWOO, Router, Supervisor, ActorCritic, MultiAgentDebate, Swarm, Blackboard) behind one Agent constructor. One kwarg flips the iteration pattern.
  • Async-only, anyio everywhere, structured concurrency cancellation works correctly. Fast path when production features (audit / OTel / permissions / hooks / journaling) aren't wired up.

โš ๏ธ model is required as of v0.2.0. Earlier 0.1.x releases silently defaulted to EchoModel which produced confusing output; now the harness fails fast with a helpful error if you forget.


Why pick this over LangGraph / CrewAI / AutoGen

Every agent framework forces a choice you shouldn't have to make:

  • LangChain / LangGraph lock you into a graph editor and a specific state model. user_id is a string in config["configurable"] — typo it once and you silently leak data across tenants. Structured outputs and retries are developer-side concerns.
  • Claude Agent SDK is excellent if you're committed to Anthropic forever. It's not model-agnostic.
  • OpenAI Assistants is a black box you don't run yourself.
  • CrewAI / AutoGen are abstractions over LangChain — same problems.

JeevesAgent is the harness for engineers shipping production agents without binding their stack to one model lab — and without wiring multi-tenancy / structured outputs / retries by hand.

Capabilities at a glance:

  • Model-agnostic — Anthropic, OpenAI, and ~100 more via LiteLLM behind one Model protocol. String-based resolver: model="claude-opus-4-7", "gpt-4.1-mini", "mistral-large", …
  • Pluggable architectures — twelve shipped, same Agent surface, one kwarg switches the iteration strategy.
  • MCP-native — MCP is the tool spine, not an integration. Jeeves Gateway / Composio / any MCP server plugs into a single MCPRegistry.
  • Memory done right — five backends (in-memory / vector / Chroma / Postgres+pgvector / Redis), pluggable embedders, and bi-temporal facts that track when claims were true in the world vs when you learned them. All five backends partition by user_id.
  • Durable runtime — SqliteRuntime gives crash-recovery replay with zero infrastructure. Postgres is also supported.
  • Observable — OpenTelemetry spans and metrics for every step. Drop in your exporter (Honeycomb / Datadog / LangSmith).
  • Safe — permission policies, sandbox layers, an append-only HMAC-signed audit log, freshness/lineage policies for certified values.
  • Async-only, structured concurrency — anyio everywhere, zero raw asyncio.create_task / gather. Parallel tool dispatch via task groups. Backpressure-aware streaming.

Three principles govern every line of code:

  1. The loop is deterministic; the world isn't. Every side effect goes through runtime.step(...) so it can be cached and replayed.
  2. Trust boundary stays outside the sandbox. The harness runs tools inside a sandbox; the harness doesn't run inside one.
  3. Validate state on write, not on read. Pydantic everywhere.

Install

pip install jeevesagent

# Pick the extras you need:
pip install 'jeevesagent[anthropic]'    # Claude
pip install 'jeevesagent[openai]'       # GPT
pip install 'jeevesagent[postgres]'     # PostgresMemory + facts
pip install 'jeevesagent[mcp]'          # real MCP client
pip install 'jeevesagent[otel]'         # OpenTelemetry exporters

# Or install everything for development:
pip install -e '.[dev,anthropic,openai,mcp,postgres,otel]'

Requires Python 3.11+.


30-second quickstart

import asyncio
from jeevesagent import Agent, tool

@tool
async def get_weather(city: str) -> str:
    """Look up the current weather."""
    return f"It's sunny and 72°F in {city}."

async def main():
    agent = Agent(
        "You are a travel assistant.",
        model="claude-opus-4-7",       # or "gpt-4o", or any Model instance
        tools=[get_weather],
    )
    result = await agent.run("What's the weather like in Tokyo?")
    print(result.output)
    print(f"Used {result.tokens_in + result.tokens_out} tokens, ${result.cost_usd:.4f}")

asyncio.run(main())

Set ANTHROPIC_API_KEY (or OPENAI_API_KEY) before running. That's it — no LangChain, no LangGraph, no chat_engine = AgentExecutor.from_llm_and_tools(...).

Want to see what's happening as the agent runs?

async for event in agent.stream("plan a 3-day Tokyo trip"):
    print(f"[{event.kind}] {event.payload}")

You'll see STARTED → MODEL_CHUNK × N → TOOL_CALL → TOOL_RESULT → MODEL_CHUNK × N → COMPLETED flow through.


Architectures: the agent loop is a strategy

The default loop is ReAct (observe / think / act). When that doesn't fit your problem, swap it with one kwarg — everything else (model, memory, tools, budget, telemetry, runtime) stays exactly the same.

Single-agent loops: pass architecture=

from jeevesagent import Agent

agent = Agent("...", model="claude-opus-4-7")                            # ReAct default
agent = Agent("...", model="...", architecture="self-refine")            # iterate until critic happy
agent = Agent("...", model="...", architecture="reflexion")              # verbal RL with lessons
agent = Agent("...", model="...", architecture="plan-and-execute")       # plan once, execute steps
agent = Agent("...", model="...", architecture="rewoo")                  # plan + parallel tools, 30-50% cheaper
agent = Agent("...", model="...", architecture="tree-of-thoughts")       # BFS beam over candidate thoughts

Multi-agent teams: use Team builders (the ergonomic facade)

Team mirrors the builder shape every other framework uses (create_supervisor / Crew / GroupChatManager), so migrating from LangGraph / CrewAI / AutoGen / OpenAI Agents SDK is muscle memory. Each builder returns a regular Agent — same .run() / .stream() interface, no special calling convention.

from jeevesagent import Agent, Team, RouterRoute

# Coordinator + workers; the manager calls delegate(...) or forward_message(...)
team = Team.supervisor(
    workers={"researcher": researcher, "writer": writer, "reviewer": reviewer},
    instructions="manage the pipeline",
    model="claude-opus-4-7",
)

# Classify-and-dispatch — cheaper than Supervisor when one specialist
# is enough (1 classifier call + 1 specialist run, no synthesis pass)
team = Team.router(
    routes=[
        RouterRoute(name="billing", agent=billing, description="..."),
        RouterRoute(name="tech",    agent=tech,    description="..."),
    ],
    instructions="customer support entry point",
    model="claude-haiku-4-5",
)

# Peer agents passing control via typed handoffs (input_type= for
# structured payloads, input_filter= for selective history pruning)
team = Team.swarm(
    agents={"triage": triage, "billing": billing, "tech": tech},
    entry_agent="triage",
    model="claude-opus-4-7",
)

# Actor + critic with different models for blind-spot diversity
team = Team.actor_critic(
    actor=Agent("...", model="claude-opus-4-7"),
    critic=Agent("...", model="gpt-4o"),       # different model
    max_rounds=3,
    approval_threshold=0.9,
    model="claude-opus-4-7",                    # coordinator
)

# N debaters + optional judge with similarity-based early termination
team = Team.debate(
    debaters=[optimist, skeptic, analyst],
    judge=cio,
    rounds=2,
    convergence_similarity=0.85,
    model="claude-opus-4-7",
)

# Coordinator + agents share a workspace; decider synthesizes
team = Team.blackboard(
    agents={"hypothesis": h_agent, "evidence": e_agent, "critic": c_agent},
    coordinator=coord_agent,
    decider=decider_agent,
    model="claude-opus-4-7",
)

Recursive composition (the differentiator)

Architectures wrap each other naturally — a property no sibling-only framework gives you. Wrap a Supervisor in Reflexion for cross-session learning of delegation patterns; nest Supervisors for hierarchical teams; wrap an entire pipeline in Reflexion to retry on low scores:

from jeevesagent import Agent, Reflexion, Supervisor

agent = Agent(
    "...",
    model="claude-opus-4-7",
    architecture=Reflexion(
        base=Supervisor(workers={"researcher": ..., "writer": ...}),
        max_attempts=3,
        threshold=0.85,
        lesson_store=InMemoryVectorStore(embedder=HashEmbedder()),  # selective recall
    ),
)

The explicit nested form (Agent(architecture=...)) and Team builders are interchangeable — Team.supervisor(workers={...}) is exactly Agent(architecture=Supervisor(workers={...})) under the hood. Use Team for single-level teams (it matches what you've seen in other frameworks); use the nested form for recursive composition.

Standalone testing of orchestrators

from jeevesagent import Supervisor, run_architecture

sup = Supervisor(workers={"a": agent_a})
result = await run_architecture(sup, "do the thing", model="claude-opus-4-7")

Architectures are pluggable via the Architecture protocol — three methods (name, run, declared_workers) and you have a custom strategy. See Subagent.md for the full design rationale.
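To get a feel for the shape, here is a self-contained sketch of a custom strategy. The protocol signatures and the ctx.model call are assumptions, not the package's exact interfaces:

```python
import asyncio
from typing import Any, Protocol

# Assumed shape of the three-method protocol named above; consult the
# real Architecture protocol for exact signatures.
class ArchitectureProtocol(Protocol):
    def name(self) -> str: ...
    def declared_workers(self) -> dict[str, Any]: ...
    async def run(self, prompt: str, ctx: Any) -> str: ...

class EchoTwice:
    """Toy custom strategy: answer once, then ask the model to double-check."""

    def name(self) -> str:
        return "echo-twice"

    def declared_workers(self) -> dict[str, Any]:
        return {}                      # single-agent loop: no workers

    async def run(self, prompt: str, ctx: Any) -> str:
        first = await ctx.model(prompt)                  # assumed ctx surface
        return await ctx.model(f"Double-check: {first}")

class _FakeCtx:
    """Stub model context so the sketch runs without any provider."""
    async def model(self, prompt: str) -> str:
        return f"M({prompt})"

out = asyncio.run(EchoTwice().run("hi", _FakeCtx()))
```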


Architecture cheat sheet

Visual reference for picking the right pattern. Each diagram shows the actual data flow + LLM-call structure for that architecture.

Single-agent loops

ReAct — observe / think / act loop. The default. One model call per turn; tools dispatch in parallel.

                 ┌────────── loop until no tool calls ──────────┐
                 │                                              │
   prompt ───► Model ───► tool calls? ──yes──► run tools ──► results
                 │                              (parallel)
                 └─────────► no calls ───► final output

SelfRefine — single-agent generate → critique → refine. Same model wears both hats.

   prompt ───► generate ───► critique ──┬── score ≥ threshold ──► output
                               ▲         │
                               │         └── below ──► refine ──┐
                               │                                │
                               └────────────────────────────────┘

Reflexion — wraps any base architecture with verbal-RL retry. Failed attempts produce a "lesson" stored in memory or a vector store; the next attempt sees the relevant lessons.

   ┌─────────── attempt loop (max_attempts) ────────────┐
   │                                                    │
   │   prompt ──► [recall lessons] ──► base.run() ──► evaluator
   │                                                    │
   │                                              score < threshold?
   │                                                    │
   │                                              yes ──┴── no ──► output
   │                                                    │
   │                                              reflector ──► lesson
   │                                                    │
   └────────────────────────────────── persist ─────────┘
                                          │
                          memory block  OR  vector store (selective recall)

TreeOfThoughts — BFS beam search over candidate thoughts. Proposer + evaluator at every depth; the beam keeps top-k; a min_score floor drops weak branches early.

              proposer (×branch_factor)         evaluator
   prompt ──► [t1, t2, t3]  ──score──►  [0.9, 0.4, 0.7]
                                               │
                                          keep top beam_width
                                          drop below min_score
                                               │
                                               ▼
                                          [t1, t3]   ←── frontier for depth 2
                                               │
                                          (repeat to max_depth)
                                               │
                                               ▼
                                        best leaf wins

PlanAndExecute — planner emits a step list once; executor walks each step; synthesizer composes the final answer.

   prompt ───► planner ───► [step1, step2, step3]
                                      │
                                      ▼
                               executor (per step) ───► [r1, r2, r3]
                                                             │
                                                             ▼
                                                       synthesizer ───► output

ReWOO — like PlanAndExecute, but the planner emits structured tool calls with {{En}} placeholders, and independent steps run in parallel. Two LLM calls + N tool calls — 30-50% cheaper than ReAct on tool-heavy workloads.

   prompt ───► planner ───► [search({{E1}}), fetch({{E2}}=search.url)]
                                           │
                                           ▼
                             parallel tool dispatch
                             (independent steps run concurrently;
                              dependent steps wait for {{En}})
                                           │
                                           ▼
                                     synthesizer ───► output

Multi-agent teams

Router — classify-and-dispatch. ONE classifier call decides which specialist runs; that one specialist owns the answer.

                        ┌── refund_agent
   prompt ──► classifier ──► technical_agent      ◄── only ONE
                        └── faq_agent ◄── chosen      runs

   1 classifier call + 1 specialist run. The cheapest multi-agent pattern.

Supervisor — coordinator + workers, glued by a delegate(worker, instructions) tool. Multiple delegations in one supervisor turn run in parallel. forward_message(worker) returns a worker's output verbatim with no synthesis.

   prompt ───► manager ───► delegate(...) ─┬─► worker A ─┐
                               │            ├─► worker B ─┤  parallel
                               │            └─► worker C ─┤
                               ▼                          │
                           [worker outputs] ◄─────────────┘
                               │
                               ├─► synthesize ──► output
                               │
                               └─► forward_message(worker) ──► verbatim output

ActorCritic — actor + critic pair (use different models for blind-spot diversity). The critic returns structured JSON {score, issues, summary}; the actor refines below threshold.

   prompt ───► actor ───► critic ──┬── score ≥ threshold ──► output
                  ▲                │
                  │                └── below ──► refine (apply rubric)
                  │                                  │
                  └──────────── max_rounds cap ──────┘

MultiAgentDebate — N debaters argue across rounds (in parallel each round). Jaccard convergence detects early agreement; an optional judge synthesizes the final answer.

   prompt ──► [debater1, debater2, debater3]   ◄── round 1 (parallel)
                               │
                        converged? (Jaccard ≥ 0.85)
                        yes ───► output
                        no  ───► [responses fed back]
                               │
               [debater1, debater2, debater3]    ◄── round 2 (sees prior)
                               │
                               ▼
                           judge ──► output     (or majority vote if no judge)
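The convergence check named above (Jaccard similarity over token sets, all debater pairs against the threshold) can be sketched in a few lines; the whitespace tokenisation is an assumption:

```python
# Sketch of Jaccard convergence: token-set overlap between responses.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def converged(responses: list[str], threshold: float = 0.85) -> bool:
    # Early termination when every pair of debaters clears the threshold.
    return all(
        jaccard(responses[i], responses[j]) >= threshold
        for i in range(len(responses))
        for j in range(i + 1, len(responses))
    )

same = converged(["buy the stock now", "buy the stock now"])
split = converged(["buy the stock", "sell everything today"])
```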

Swarm — peer agents handing off control via a handoff tool (or per-target transfer_to_<name> tools when peers are wrapped in Handoff with an input_type). No central coordinator.

   prompt ──► agent A
                 │
                 │ handoff(B, payload)
                 ▼
              agent B
                 │
                 │ transfer_to_C(typed_args)
                 ▼
              agent C ──► final output
                 ▲
                 │ cycle detection: A→B→A→B kills the loop
                 │ max_handoffs caps total depth

BlackboardArchitecture — agents collaborate via a shared mutable workspace. The coordinator picks who acts next; the decider says when work is done.

                ┌───────────── shared blackboard ─────────────┐
                │   facts · hypotheses · partial results      │
                └────▲──────▲──────▲──────▲────────────▲──────┘
                     │ r/w  │ r/w  │ r/w  │ r/w        │
                     │      │      │      │            │
   prompt ──► coordinator ──► picks who acts next      │
                     │      │      │      │            │
                  agent A  agent B  agent C            │
                     │      │      │      │            │
                     ▼      ▼      ▼      ▼            │
                               decider ◄───────────────┘
                                  │
                                  ├─ done? ──► output
                                  │
                                  └─ not done ──► next round

Recursive composition

Any architecture can wrap any other. The killer combination: Reflexion of Supervisor — the team learns across attempts which worker handles which intent best.

   ┌────── Reflexion attempt loop ───────┐
   │                                     │
   │   prompt ──► Supervisor ──► output ─┤── score ≥ threshold ──► done
   │              (manager + 3 workers)  │
   │                                     │
   │                                     └── below ──► lesson ──► retry
   │                                                                │
   └────────────────────────────────────────────────────────────────┘

agent = Agent(
    "...",
    model="claude-opus-4-7",
    architecture=Reflexion(
        base=Supervisor(workers={"researcher": ..., "writer": ..., "reviewer": ...}),
        lesson_store=InMemoryVectorStore(embedder=HashEmbedder()),  # selective recall
    ),
)

Skills: packaged playbooks the agent loads on demand

Tools tell the agent what it can do. Skills tell it how — domain-specific recipes the agent reads when relevant and ignores when not. Same shape as Anthropic Agent Skills (Oct 2025): a directory with SKILL.md (frontmatter + markdown body) and optional bundled files. Drop your existing Anthropic-format skills into our skills=[...] and they Just Work.

from jeevesagent import Agent

agent = Agent(
    "...",
    model="claude-opus-4-7",
    skills=[
        "~/.jeeves/skills/system/",          # base layer
        "~/.jeeves/skills/user/",            # user override
        ("./.jeeves-skills/", "Project"),    # project-local with label
    ],
)

Progressive disclosure: only name + description (~50 tokens per skill) load into the system prompt at startup. The model calls a load_skill(name) tool when a skill is relevant — only THEN does the full body enter context. A 50-skill agent costs ~2,500 tokens at rest; nothing more until the model actually loads one.

Three skill modes — coexist freely in any skill

skills/my-skill/
├── SKILL.md         ← required: frontmatter + markdown body
├── tools.py         ← OPTIONAL: @tool functions (Mode B, in-process Python)
└── scripts/         ← OPTIONAL: executable scripts (Mode A or Mode C)
    └── helper.py

Mode A — pure markdown. SKILL.md teaches the model how to use your existing tools (read, write, bash). The model issues those tool calls itself based on the body's instructions.

Mode C — frontmatter declares a script as a typed tool. Any language. The framework wraps the script in a subprocess-backed Tool with proper args; the model calls it like any built-in tool.

---
name: calc
description: Arithmetic helpers.
tools:
  add:
    description: Sum two integers.
    script: scripts/add.py
    args:
      a:
        type: string
        description: First int
      b:
        type: string
        description: Second int
---
# scripts/add.py — plain Python, no decorators
import sys
print(int(sys.argv[1]) + int(sys.argv[2]))

The model calls calc__add(a="2", b="3") → the framework execs the script → captures stdout → returns the result to the model.

Mode B — tools.py ships @tool functions. Auto-discovered by filename presence; imported at construction; registered into the agent's tool host when the skill is loaded.

# skills/greeter/tools.py
from jeevesagent import tool

@tool
async def say_hi(name: str) -> str:
    """Say hi."""
    return f"Hi {name}!"

The model calls greeter__say_hi(name="Anupam") directly. In-process, fast, can share the agent's state.

Auto-namespacing prevents collisions

Tool names get prefixed with the skill name automatically:

Skill ships                        Registered as
add (Mode C, calc skill)           calc__add
say_hi (Mode B, greeter skill)     greeter__say_hi
search (in two skills A and B)     a__search and b__search — no clash

Inline skills — one-off in code

For tiny one-off skills that don't justify a folder:

from jeevesagent import Skill

skill = Skill.from_text("""
---
name: standup
description: Format a daily standup from rough notes.
---
# Standup
Always 3 sections: Yesterday, Today, Blockers.
""")

agent = Agent("...", skills=[skill])

Layered sources with last-wins override

When two sources ship a skill with the same name, the later source wins. This lets you stack: system → user → project.

agent = Agent(
    skills=[
        "~/.jeeves/skills/system/",      # base
        "~/.jeeves/skills/user/",        # user customizes
        "./.jeeves-skills/",             # project-local override
    ],
)

See the examples/ directory for runnable end-to-end samples that exercise the loader, vector store, retriever-as-tool pattern, and multi-agent debate.


Fast path by default

JeevesAgent ships with the full production surface — audit log, OTel telemetry, permissions, hooks, durable runtime, budget — but you don't pay for what you don't wire up. Every layer has a no-op default, and the loop detects those defaults at construction time and skips the integration points entirely on the hot path.

A barebones Agent("hi", model="gpt-4.1-mini", tools=[...]) runs without going through the audit / telemetry / permissions / hook / journaling / budget layers at all. The moment you pass audit_log=, telemetry=, permissions=, runtime=, etc., the corresponding layer flips on and the integration becomes active — same Agent class, same API, no flags to set.

                  default Agent              production Agent
                  ─────────────────          ─────────────────────
audit_log         None        → SKIP       FileAuditLog(...)    → wired
telemetry         NoTelemetry → SKIP       OTelTelemetry(...)   → wired
permissions       AllowAll    → SKIP       StandardPermissions  → wired
hooks             empty       → SKIP       @before_tool/@after_tool → wired
runtime           InProc      → INLINE     SqliteRuntime(...)   → wired
budget            NoBudget    → SKIP       StandardBudget(...)  → wired

When a layer is detected as no-op, the loop:

  • skips the await audit_log.append(...) call (so even the function call dispatch is removed)
  • skips telemetry.trace(...) async-context-manager entry/exit and the kwargs-dict construction for emit_metric calls
  • skips permissions.check(call, context={}) (returns allow_() inline)
  • skips hooks.pre_tool / hooks.post_tool iteration
  • inlines await fn(*args) instead of routing through runtime.step(name, fn, ...) — saves the idempotency-key hash derivation per tool call
  • skips budget.allows_step() / budget.consume(...)
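The detection itself is a one-time check at construction. A self-contained sketch (class and attribute names here are illustrative, not the framework's internals):

```python
# Construction-time no-op detection, roughly as described above.
class NoTelemetry:
    pass

class Loop:
    def __init__(self, telemetry=None, audit_log=None):
        self.telemetry = telemetry
        self.audit_log = audit_log
        # Decide once at construction, not on every hot-path step:
        self._skip_telemetry = telemetry is None or isinstance(telemetry, NoTelemetry)
        self._skip_audit = audit_log is None

    def run_step(self, payload: str) -> str:
        if not self._skip_telemetry:
            self.telemetry.trace("step", payload)   # only when wired
        if not self._skip_audit:
            self.audit_log.append(payload)          # only when wired
        return payload.upper()                      # the actual work

fast = Loop()              # barebones agent: every layer skipped
out = fast.run_step("hi")
```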

The point: "framework is slow because it's full-featured" stops being the trade-off. You get the harness when you want it, the speed when you don't, with no code changes between modes.


Resilient by default

Real model APIs fail. Rate limits, 5xx blips, and transient connection drops happen on every production deployment. JeevesAgent ships retry on transient errors enabled by default for the in-tree network adapters (OpenAI, Anthropic, LiteLLM) — the moment you construct a real-world agent it's already covered:

agent = Agent("...", model="gpt-4.1-mini")
# Default policy: 3 attempts, 1 s → 2 s → 4 s backoff
# (capped at 30 s, ±10% jitter), respects provider Retry-After.

The framework normalises every model SDK's exceptions into a small typed taxonomy so callers + the retry layer reason about failures uniformly:

ModelError                       — base (catch-all model failure)
├── TransientModelError          — retry-able
│   └── RateLimitError           — 429 / quota; carries retry_after
└── PermanentModelError          — don't retry
    ├── AuthenticationError      — bad API key
    ├── InvalidRequestError      — malformed prompt / args
    └── ContentFilterError       — safety system rejection

classify_model_error(exc) does the SDK-specific mapping (lazy imports, no hard dependency on any provider package). The wrapper treats TransientModelError as retryable, PermanentModelError as fatal, and any unrecognised exception is propagated unchanged — the framework refuses to silently retry errors it doesn't understand.

Tune the policy per-Agent:

from jeevesagent import Agent, RetryPolicy

# Default (production-sensible)
agent = Agent("...", model="gpt-4.1-mini")

# Aggressive — tolerates long provider blips
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy.aggressive())

# Disabled — handle errors yourself
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy.disabled())

# Custom
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy(
                  max_attempts=4,
                  initial_delay_s=0.5,
                  max_delay_s=15.0,
              ))

Behaviour highlights:

  • Provider-supplied Retry-After is honoured — when a 429 response carries the header, the framework waits at least that long before the next attempt (even if it exceeds max_delay_s). Provider authority wins over local heuristics.
  • Streaming retries fire before the first chunk — once the consumer has received any tokens we cannot rewind, so mid-stream errors propagate. Pre-first-chunk failures are retried per policy.
  • Custom Models are not auto-wrapped. The framework only wraps its in-tree adapters by default because it knows their error classes. Custom Models opt in by passing retry_policy= explicitly to Agent(...).

Structured outputs

Production agents need to emit data, not free-form prose. Pass a Pydantic BaseModel as output_schema= to agent.run(...) and the framework gives you a typed, validated instance:

from pydantic import BaseModel
from jeevesagent import Agent

class CompanyInfo(BaseModel):
    name: str
    founded_year: int
    headquarters: str

agent = Agent("extract company info", model="gpt-4.1-mini")
result = await agent.run("Tell me about Acme.", output_schema=CompanyInfo)

info: CompanyInfo = result.parsed   # ← typed, validated
print(info.founded_year)            # 2008
print(result.output)                # raw JSON text, still available

What the framework does:

  1. Schema-aware system prompt — appends a STRUCTURED OUTPUT REQUIRED directive to the run's instructions, embedding the schema's JSON Schema. Your static Agent(...) instructions are not mutated; the augmentation is per-run.
  2. Tolerates real-world model quirks — strips ```json / ``` markdown fences before parsing.
  3. Retry-with-feedback — on a parse failure, the framework gives the model up to output_validation_retries (default 1) extra single-shot turns to fix it, feeding the validation error back as a USER message ("Your previous response failed schema validation: ...; return only a corrected JSON object"). After the retry budget is exhausted, it raises OutputValidationError with the underlying Pydantic ValidationError attached as .cause, the bad text on .raw, and the schema on .schema — so callers can build whatever recovery strategy they need.
  4. result.output keeps the (cleaned) raw JSON text so you can log or audit what the model produced; result.parsed holds the validated Pydantic instance.

Set output_validation_retries=0 to fail fast (no recovery turn).

End-to-end demo: examples/04_structured_outputs.py extracts a structured MeetingSummary (with nested ActionItem lists, ISO dates, and a sentiment enum) from a raw meeting transcript.


Multi-tenancy by default

JeevesAgent treats user_id and session_id as first-class typed primitives, not strings buried in a free-form config dict. The moment you pass them to agent.run(...), the framework partitions memory automatically and rehydrates conversation history without any extra plumbing.

result = await agent.run(
    "what is my favourite food?",
    user_id="alice",            # hard namespace partition for memory
    session_id="conv_42",       # conversation thread; reused = continued
    metadata={"locale": "en"},  # free-form bag for app-specific keys
)

What the framework does with these:

  • user_id is a hard partition on every memory primitive. Episodes and facts stored under one user_id are never visible to a recall scoped to a different one. None is its own bucket ("anonymous / single-tenant"). One shared Memory instance can back N concurrent users with zero risk of cross-contamination.
  • session_id is the conversation handle. Reuse the same id across calls and the loop rehydrates prior user/assistant turns as real Message history โ€” the model sees the chat thread, not just a recall summary.
  • metadata rides along on the per-run RunContext and is reachable from any tool / hook via get_run_context() without threading it through every function signature.

Inside a tool, you read scope from the live RunContext:

from jeevesagent import tool, get_run_context

@tool
async def fetch_user_orders() -> str:
    """Look up the current user's recent orders."""
    ctx = get_run_context()
    return await db.query("orders", user_id=ctx.user_id)

The model never sees user_id in the tool schema, can't pass the wrong one, and the framework guarantees the tool gets the right value (set by _loop, propagated through anyio task groups for parallel tool dispatch and sub-agent spawning).

Sub-agents inherit automatically. Every multi-agent architecture (Supervisor, Debate, Swarm, Router, ActorCritic, Blackboard, ReWOO) forwards the parent's RunContext to its workers, so user_id flows through deeply nested agent trees with no per-architecture plumbing. Workers get a fresh session_id so their conversation history stays separate from the parent's.

Footgun protection. When a memory store contains episodes for named users and a recall runs with user_id=None, the framework emits an IsolationWarning โ€” the partition is still safe, but the dev probably forgot to pass user_id= somewhere. Apps that want strict enforcement promote it to an exception:

import warnings
from jeevesagent import IsolationWarning
warnings.simplefilter("error", IsolationWarning)

End-to-end demo: examples/03_multi_user_sessions.py.
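The simplefilter("error", ...) promotion shown above is plain stdlib warnings behaviour. Here is a runnable sketch with a stand-in warning class, so it works without jeevesagent installed:

```python
import warnings

class IsolationWarningDemo(UserWarning):
    """Stand-in for jeevesagent.IsolationWarning."""

def recall_unscoped() -> None:
    warnings.warn("recall ran with user_id=None but named users exist",
                  IsolationWarningDemo)

# Default: the warning is reported and execution continues.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    recall_unscoped()
print(len(caught))  # 1

# Strict mode: promote this warning class to an exception.
warnings.simplefilter("error", IsolationWarningDemo)
try:
    recall_unscoped()
except IsolationWarningDemo as exc:
    print(f"blocked: {exc}")
```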


Capability matrix

| Capability | What you get | Where |
| --- | --- | --- |
| Multi-tenant memory | First-class user_id partition + session_id continuity. One shared Memory instance backs N users with no cross-contamination; sub-agents inherit context automatically | RunContext, get_run_context, set_run_context, IsolationWarning, Agent.run(user_id=, session_id=, metadata=) |
| Structured outputs | Pass output_schema= to get a typed, validated Pydantic instance back. Framework augments the system prompt with the schema, parses + validates, retries with feedback on failure | Agent.run(output_schema=), RunResult.parsed, OutputValidationError |
| Resilient model calls | Network adapters auto-wrapped with retry-on-transient (rate limit, 5xx, network blip). Typed error taxonomy. Provider Retry-After honoured. | RetryPolicy, RetryPolicy.disabled/aggressive, ModelError, TransientModelError, RateLimitError, PermanentModelError, AuthenticationError, InvalidRequestError, ContentFilterError, classify_model_error |
| Architecture protocol | Pluggable agent-loop strategy: 12 architectures shipped | Architecture, ReAct, SelfRefine, Reflexion, TreeOfThoughts, PlanAndExecute, ReWOO, Router, Supervisor, ActorCritic, MultiAgentDebate, Swarm, BlackboardArchitecture |
| Team facade | Sibling-style builders (Team.supervisor, Team.swarm, Team.router, Team.debate, Team.actor_critic, Team.blackboard) for the common multi-agent shapes | Team, Handoff, run_architecture |
| Vector store | add / search / delete with Mongo-style filters, MMR diversity, BM25 hybrid search, save/load | InMemoryVectorStore, ChromaVectorStore, PostgresVectorStore, FAISSVectorStore, SearchResult |
| Document loader | One-line load for PDF / DOCX / Excel / CSV / HTML / Markdown into chunks | jeevesagent.loader.load, MarkdownChunker, RecursiveChunker, SentenceChunker, TokenChunker |
| Built-in tools | read / write / edit / bash factories with sandbox-aware workdirs | read_tool, write_tool, edit_tool, bash_tool, default_workdir |
| Skills (Anthropic-compatible) | Packaged playbooks loaded on demand. Three modes coexist: pure markdown, frontmatter-declared subprocess tools (any language), and tools.py with @tool (Python, in-process). Layered sources with last-wins override. | Skill, SkillRegistry, SkillSource, SkillMetadata, SkillError, Agent(skills=...) |
| Model adapters | Anthropic, OpenAI, LiteLLM (~100 providers), Echo (zero-key), Scripted (tests) | jeevesagent.AnthropicModel, OpenAIModel, LiteLLMModel, EchoModel, ScriptedModel |
| String model resolver | model="claude-opus-4-7", "gpt-4o", "mistral-large", "command-r", "echo", "litellm/<any>" | Agent.__init__ |
| Tools | @tool decorator with auto-schema, sync + async; agent.with_tool decorator; add_tool / remove_tool / tools_list | jeevesagent.tool, Tool |
| MCP servers | stdio + Streamable HTTP, multi-server registry, name disambiguation | MCPRegistry, MCPServerSpec |
| Jeeves Gateway | One-line: tools=JeevesGateway.from_env() | jeevesagent.jeeves |
| Memory backends | In-memory dict, vector cosine, Chroma, Postgres+pgvector, Redis | InMemoryMemory, VectorMemory, ChromaMemory, PostgresMemory, RedisMemory |
| Embedders | HashEmbedder (deterministic, zero deps), OpenAIEmbedder, VoyageEmbedder, CohereEmbedder | HashEmbedder, OpenAIEmbedder, VoyageEmbedder, CohereEmbedder |
| Bi-temporal facts | All five memory backends. LLM-driven Consolidator. Auto-consolidate, plus ConsolidationWorker for long-lived agents. | Fact, Consolidator, *FactStore |
| Durable runtime | sqlite or postgres-backed replay across process restarts | SqliteRuntime, PostgresRuntime, JournaledRuntime |
| Streaming | agent.stream() → AsyncIterator[Event] with backpressure | Agent.stream |
| Permissions | Mode-based + allow/deny lists, mirrors Claude Agent SDK | StandardPermissions, Mode |
| Hooks | @agent.before_tool / @agent.after_tool decorators | HookRegistry |
| Sandbox | FilesystemSandbox blocks path-arg escapes; SubprocessSandbox for full isolation | FilesystemSandbox, SubprocessSandbox |
| Budget | Per-token / per-cost / per-wall-clock limits with soft warnings | StandardBudget, BudgetConfig |
| Telemetry | OpenTelemetry spans + metrics for every milestone | OTelTelemetry |
| Audit log | HMAC-signed JSONL or in-memory; tracks every tool call | FileAuditLog, InMemoryAuditLog |
| Certified values | Freshness + lineage policies | FreshnessPolicy, LineagePolicy |
| Declarative config | Build agents from TOML or dicts | Agent.from_config(path), Agent.from_dict(cfg) |

Documentation

The full Sphinx-built documentation site lives at https://jeevesagent.readthedocs.io โ€” every public symbol is auto-documented from its docstring, and the migration / quickstart guides are mounted alongside the API reference.

Build it locally with:

pip install -e ".[docs]"
sphinx-build -b html docs docs/_build/html
open docs/_build/html/index.html

In-tree starting points:

| Doc | What's there |
| --- | --- |
| docs/quickstart.md | Step-by-step examples for each backend combo |
| docs/recipes.md | Production patterns: persistent memory, MCP, durable replay, audit |
| docs/architecture.md | Module tour, lifecycle, extension points |
| docs/migrations/from-langgraph.md | LangGraph → JeevesAgent translation guide |
| docs/migrations/from-openai-sdk.md | Hand-rolled OpenAI loop → JeevesAgent translation guide |
| docs/migration_0.1_to_0.2.md | What changed in 0.2.0; how to migrate |
| CHANGELOG.md | Version-by-version release notes |
| Subagent.md | Architecture-protocol design rationale; full 14-architecture catalogue (the 5 shipped, the 9 candidates) |
| project.md | The full engineering plan (the design doc) |
| BUILD_LOG.md | Slice-by-slice changelog |
| examples/ | Four runnable end-to-end samples: 01_rag_pdf.py, 02_specialist_debate.py, 03_multi_user_sessions.py, 04_structured_outputs.py |

API stability

The framework is pre-1.0 โ€” major versions can introduce breaking changes โ€” but the surface area is split into stability tiers so adopters know what they can pin against today.

| Tier | API | What it covers |
| --- | --- | --- |
| Stable | Agent, Agent.run / stream / resume, RunResult, RunContext, get_run_context, set_run_context, Memory protocol, Episode, Fact, Message, Role, Event, Tool, @tool, Model protocol, the error hierarchy under JeevesAgentError, RetryPolicy, OutputValidationError, IsolationWarning | Will not break in 0.x without a migration note + deprecation cycle. Pin against these in production code. |
| Stable backends | InMemoryMemory, ChromaMemory, PostgresMemory, RedisMemory, VectorMemory, OpenAIModel, AnthropicModel, LiteLLMModel, EchoModel, ScriptedModel, InProcRuntime, SqliteRuntime, PostgresRuntime, StandardBudget, NoBudget, AllowAll, StandardPermissions, HookRegistry, OTelTelemetry, NoTelemetry, FileAuditLog, InMemoryAuditLog | Concrete implementations; constructor signatures stable, behaviour locked. |
| Experimental | MultiAgentDebate / Swarm / Blackboard / ReWOO / TreeOfThoughts (the newer architectures), Skills and SkillRegistry, JeevesGateway, agent.generate_graph(), the Team.* builders | Useful, tested, but newer — internal details may change as we collect production feedback. Wrap with your own thin layer if you depend on them. |
| Internal | _loop, _wrapped_model, Dependencies, AgentSession, the architecture protocol's exact shape, anything starting with _ | No stability promise. Subject to change without notice. |

If a symbol isn't listed, it's experimental by default. Open an issue if you depend on something not yet in the Stable tier and need it promoted.


Status

  • 866 tests pass in ~6 seconds (5 env-gated integrations skip without JEEVES_TEST_PG_DSN / JEEVES_TEST_REDIS_URL)
  • mypy --strict clean across 105 production source files
  • ruff clean including flake8-async lints
  • v0.10 ships multi-tenancy by default, structured outputs, retry-on-transient by default, and the fast path by default, all zero-config with no flags:
      ◦ Every layer (audit, telemetry, permissions, hooks, runtime, budget) is detected as no-op or production-wired at construction time, so a barebones Agent runs at LangChain-class latency with the integration skipped.
      ◦ user_id and session_id are first-class typed primitives: memory is hard-partitioned per user_id, conversations continue when a session_id is reused, and sub-agents inherit the parent's RunContext automatically via a contextvar (get_run_context()).
      ◦ Pass output_schema= (any Pydantic BaseModel) and agent.run returns a typed, validated instance on result.parsed, with retry-with-feedback on validation failure.
      ◦ Network model adapters are auto-wrapped with a typed error taxonomy (TransientModelError / RateLimitError / PermanentModelError / AuthenticationError / InvalidRequestError / ContentFilterError) and a configurable RetryPolicy, so transient 5xx / 429 / network blips don't blow up production runs.
  • v0.9 ships Skills (Anthropic Agent Skills format, with tools.py auto-discovery for in-process Python tools and frontmatter tools: manifest for any-language scripts wrapped as typed tools), agent-graph visualization (agent.generate_graph() โ†’ Mermaid / PNG), the Team facade for ergonomic multi-agent construction, the full vector-store stack (InMemoryVectorStore / Chroma / Postgres / FAISS โ€” Mongo-style filters, MMR diversity, BM25 hybrid search, persistence), the document loader with chunking strategies, and 12 architectures including selective lesson recall (Reflexion), typed handoffs (Swarm), forward_message (Supervisor), Jaccard convergence (Debate), and parallel proposer/evaluator with min_score floor (TreeOfThoughts).

Verify your install

git clone <repo>
cd jeevesagent
pip install -e '.[dev]'
ruff check jeevesagent
mypy --strict jeevesagent
pytest tests/ -v

You should see the full suite pass (866 tests at the time of writing). Five integration tests skip without JEEVES_TEST_PG_DSN / JEEVES_TEST_REDIS_URL / API-key env vars set.


Contributing

The harness has a strict CI gate: ruff + mypy --strict + pytest; all three must pass. The codebase is async-only: every public function that performs I/O or spawns work is async. Every fan-out uses anyio task groups; there are zero raw asyncio.create_task or asyncio.gather calls.

See project.md ยง2 for the non-negotiable engineering principles.


License

Apache 2.0.
