Skip to main content

Checkpoint, recovery, and replay infrastructure for AI agents.

Project description

Living AI

Crash recovery, checkpointing, and replay for AI agents — one runtime that works across LangGraph, CrewAI, and OpenAI Agents.

CI PyPI version Python versions Coverage mypy strict License Runtime deps


The problem

AI agents crash. A process dies mid-workflow — after the LLM reasoned, after the tool charged a card, three steps into a ten-step plan — and all of that work is gone. You restart from zero, pay for the tokens again, and hope the tool doesn't fire its side effects twice. And when something goes wrong, you can't replay what happened to understand why.

The solution

Living AI records every step of an agent execution to an append-only log, so any run can be:

  • Recovered — resume from the last durable checkpoint after a crash, replaying only the idempotent work and never re-running side-effecting tool calls (payments, emails, API writes).
  • Replayed — re-run a recorded execution for debugging, with MOCK_TOOLS mode returning recorded tool responses so you can iterate on reasoning without real API calls.
  • Audited — inspect cost, latency, and the full node graph of any run.

Why it's different

Most observability and checkpointing tools lock you into one framework. Living AI ships a framework-agnostic core with thin adapters for all three major agent frameworks — the same recovery guarantees whether you use LangGraph, CrewAI, or the OpenAI Agents SDK:

from livingai.adapters import LangGraphAdapter, CrewAIAdapter, OpenAIAgentsAdapter

And the core has zero runtime dependencies — it's pure standard library.

Install

pip install livingai

Crash recovery in 18 lines

import asyncio
from livingai import (
    CheckpointEngine, ExecutionNode, NodeType, RecoveryEngine, SQLiteStore, Status,
)


async def main():
    engine = CheckpointEngine(SQLiteStore("agent.db"))

    # Your agent checkpoints after an expensive step.
    step = ExecutionNode(execution_id="run-1", type=NodeType.PROMPT,
                         status=Status.SUCCESS, output="plan ready")
    await engine.save(step, state=b"...serialized agent state...")

    # A tool with real side effects runs (e.g. charging a card).
    charge = ExecutionNode(execution_id="run-1", type=NodeType.TOOL,
                           status=Status.SUCCESS, output={"receipt": "R-1"})
    await engine.save(charge)

    # 💥 The process crashes. On restart, recover from the durable log:
    recovery = RecoveryEngine(CheckpointEngine(SQLiteStore("agent.db")))
    plan = await recovery.plan("run-1")
    print("resume from :", plan.resume_node_id)        # last durable checkpoint
    print("replay safe :", len(plan.replay_nodes))     # idempotent work to redo
    print("skip effects:", len(plan.skipped_nodes))    # card is NOT re-charged


asyncio.run(main())
resume from : d482c31e-...
replay safe : 0
skip effects: 1          # the card is never charged twice

The examples/ directory has five runnable demos (crash recovery, MOCK_TOOLS debugging, cost tracking, and the LangGraph adapter) — none require an LLM or network.

Performance

Checkpointing is on the hot path of every agent step, so it has to be fast. It is.

Metric Result Notes
Checkpoint write (p50) ~0.3 ms 50 KB compressed state blob
Checkpoint write (p95) ~0.8 ms
Checkpoint write (p99) ~1 ms ~50× under the 50 ms budget
Hot recovery read ~4 µs vs ~190 µs cold — ~40× faster
Compression 60–99% typical agent state (histories, docs)

Measured on a dev laptop with the default 50 ms overhead budget, 50 KB blobs, 2000 writes — the same configuration you get out of the box. Reproduce with python benchmarks/benchmark.py.

The overhead budget is enforced in code: a checkpoint write that would exceed it is dropped and logged as missed rather than ever blocking your agent thread.

How it works

ExecutionNode ──► CheckpointStore (Tier 2: durable, append-only)
      ▲                  ▲
      │                  │
 Adapters          CheckpointEngine ──► HotCache (Tier 1: LRU + TTL)
 (LangGraph/             │
  CrewAI/           RecoveryEngine ──► RecoveryPlan (replay vs. skip)
  OpenAI)           ReplaySession  ──► FULL / FROM_NODE / MOCK_TOOLS / COUNTERFACTUAL

Every execution is a DAG of ExecutionNode records. The log is never mutated, only appended to — so any point in time can be reconstructed deterministically. TOOL nodes default to non-idempotent, which is how recovery knows never to re-run side effects. See docs/concepts.md for the full model.

CLI

livingai list   --db agent.db          # execution ids
livingai show   run-1 --db agent.db    # the node graph
livingai replay run-1 --db agent.db --mode MOCK_TOOLS

Documentation

Quickstart · Concepts · Checkpointing · Recovery · Replay · CLI · Adapters · Migrating from other checkpointers · API Reference

Quality

  • 108 tests, 100% line coverage — including crash-simulation and stress tests (10k-node graphs, concurrent writers, write contention).
  • mypy --strict clean across all source files; ships py.typed.
  • CI matrix on Python 3.9–3.12 with a 100%-coverage gate.
pip install -e ".[dev]"
python -m pytest -q                    # run the suite
mypy --strict livingai                 # type check
python benchmarks/benchmark.py         # reproduce the numbers above

Design principles

Principle How
Zero-dependency core Standard library only (sqlite3, asyncio, zlib, dataclasses, uuid).
Append-only log Every write inserts a new row; nothing is mutated or deleted.
Framework-agnostic No framework imports in the core; framework data lives in metadata.
Async-first I/O Storage is async; sync SQLite runs off the event loop.
Bounded overhead Cold writes run under asyncio.wait_for; overruns are dropped, never blocking the agent.

Roadmap

Shipped: core data model, checkpoint engine, recovery engine, replay engine, CLI, LangGraph / CrewAI / OpenAI adapters, benchmarks, docs, Redis store, PostgreSQL store.

Optional backends — swap the default SQLite store for Redis or PostgreSQL with a single import (no core changes required):

pip install "livingai[redis]"     # hot Redis store
pip install "livingai[postgres]"  # PostgreSQL cold store
from livingai.stores.redis import RedisStore
from livingai.stores.postgres import PostgresStore

# Redis
engine = CheckpointEngine(RedisStore(url="redis://localhost:6379"))

# PostgreSQL
store = PostgresStore(dsn="postgresql://user:pass@localhost/livingai")
await store.initialize()          # creates tables once
engine = CheckpointEngine(store)

A Docker Compose dev stack (Postgres + Redis) ships with the repo:

docker compose up -d    # starts postgres:5432 + redis:6379

Next: FastAPI cloud backend (5 endpoints), cloud client (CloudSync), web replay dashboard.

Contributing

See CONTRIBUTING.md — development setup, running tests, code style, and how to add a new framework adapter or storage backend.

License

Apache-2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

livingai-0.4.1.tar.gz (55.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

livingai-0.4.1-py3-none-any.whl (32.9 kB view details)

Uploaded Python 3

File details

Details for the file livingai-0.4.1.tar.gz.

File metadata

  • Download URL: livingai-0.4.1.tar.gz
  • Upload date:
  • Size: 55.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for livingai-0.4.1.tar.gz
Algorithm Hash digest
SHA256 74d114d127729abfdbe32a48c5d616e13c6db9a825941672ed4f125ba207aedc
MD5 5c06013868bdba85c913f09f9fd37946
BLAKE2b-256 b506472b91c30afedbf09cf1110b57e650f18e8f6e51fcc7647fefe89b652988

See more details on using hashes here.

File details

Details for the file livingai-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: livingai-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 32.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for livingai-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 148df4cb46d736dbd7263b2386768c8f75efe86478cb779c496ad47931e27b81
MD5 4b284ef504717eeb28a6b7ae04c3a85c
BLAKE2b-256 d55cbb5d2827a0754400ae116d5adb50b2f55bb6414d8f01dabd1a76ba7c5686

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page