Async context manager middleware for LLM agents — solves Lost in the Middle via local Ollama compression.

These details have not been verified by PyPI

Project description

Sawtooth Memory

Async hierarchical memory middleware for LLM agents.

Sawtooth Memory mitigates context-window degradation by continuously compressing older conversation state into structured long-term memory — without blocking the agent execution loop.

Instead of storing entire conversations indefinitely or relying purely on retrieval, Sawtooth maintains a layered memory model:

recent messages remain verbatim
important entities persist exactly
older context compresses into narrative state
compression runs asynchronously in the background

The result is bounded prompt growth with stable long-session behavior.

agent loop
    │
    ▼
┌─────────────────────┐
│   ContextManager    │
│  ┌───────────────┐  │
│  │ L0 System     │  immutable persona + tool schemas
│  │ L2 Archive    │  compressed narrative memory
│  │ L1.5 Entities │  exact IDs, paths, UUIDs
│  │ L1 Working    │  recent raw conversation
│  └───────────────┘  │
└──────────┬──────────┘
           │
           ▼
      build_prompt()
           │
           ▼
        LLM API

The name "Sawtooth" comes from the token usage pattern created by periodic compression cycles.

Why This Exists

Long-running agents eventually fail for predictable reasons:

context windows fill with stale history
important information gets buried
summarization loses exact values
retrieval systems lose conversational continuity
synchronous compression blocks the main loop

Most memory systems optimize for storage or retrieval.

Sawtooth optimizes for prompt survivability.

It continuously reshapes conversation history into a compact working context while preserving exact operational state separately.

Core Design

Sawtooth uses four memory tiers.

Layer	Purpose	Characteristics
L0	System memory	Immutable
L1	Working memory	Recent verbatim messages
L1.5	Entity ledger	Exact structured state
L2	Archive memory	Compressed narrative history

L0 — System Memory

Contains:

system prompts
tool schemas
agent rules
static instructions

L0 never changes.

L1 — Working Memory

Sliding window of recent raw conversation turns.

This is the active reasoning surface used by the model.

When token usage exceeds soft_limit_tokens, older messages are queued for asynchronous compression.

L1.5 — Entity Ledger

Structured exact-value persistence.

This layer exists because summarization is lossy.

Things that must remain exact:

UUIDs
database IDs
file paths
API keys references
table names
timestamps
active resources

Example:

{
  "active_connection": "conn_994a82",
  "workspace_id": "ws_7f31",
  "current_dataset": "sales_q3_2026"
}

L1.5 prevents critical operational state from disappearing into narrative summaries.

L2 — Archive Memory

Compressed long-horizon narrative memory.

Example:

User requested Q3 revenue analysis.
Agent connected to PostgreSQL.
Detected a 14% revenue decline in enterprise accounts.
Generated anomaly report and exported CSV.

L2 is append-only and optimized for semantic continuity rather than exact replay.

How Compression Works

Compression is asynchronous.

The main agent loop never waits for summarization.

When L1 exceeds the configured soft limit:

oldest messages are sliced into chunks
chunks are queued onto a background asyncio worker
noisy data is pruned
cleaned content is sent to a local Ollama model
extracted outputs are merged into:
- L2 narrative memory
- L1.5 entity state
original messages are removed from L1

This creates a repeating "sawtooth" token profile rather than monotonic prompt growth.

Design Goals

Bounded Prompt Growth

Prompt size remains stable during long-running sessions.

Non-Blocking Compression

Compression runs off the main execution path.

Failure Isolation

Compression failures never crash the agent loop.

Framework Agnostic

Works with any OpenAI-compatible SDK.

Local-First

All summarization can run entirely on local Ollama models.

Installation

pip install sawtooth-memory (coming soon)

From source:

git clone https://github.com/HtooTayZa/sawtooth-memory
cd sawtooth-memory
pip install -e ".[dev]"

Optional LangGraph support:

pip install -e ".[langgraph]"

Requirements:

Python 3.11+
Either a local Ollama instance running OR api keys for cloud backends (OpenAI, Anthropic, Gemini)

ollama serve
ollama pull phi4

Quick Start

import asyncio

from sawtooth_memory import (
    ContextManager,
    ContextManagerConfig,
)

config = ContextManagerConfig(
    soft_limit_tokens=3000,
    hard_limit_tokens=6000,
    chunk_size=10,
)

async def main():

    async with ContextManager(
        system_prompt="You are a data analysis agent.",
        config=config,
    ) as memory:

        await memory.add_message(
            "user",
            "Analyze Q3 revenue trends."
        )

        await memory.add_message(
            "assistant",
            "Connecting to PostgreSQL."
        )

        await memory.add_message(
            "tool",
            '{"connection_id":"conn_994a82"}'
        )

        prompt = memory.build_prompt()

        # response = await client.chat.completions.create(
        #     model="gpt-4o",
        #     messages=prompt,
        # )

        print(memory.get_stats())

asyncio.run(main())

Compiled Prompt Structure

build_prompt() returns standard OpenAI-format messages.

The system message is assembled dynamically:

[SYSTEM_L0]
You are a data analysis agent.

[ARCHIVE_L2]
User requested Q3 analysis.
Connected to PostgreSQL.
Detected revenue decline in enterprise segment.

[ENTITY_LEDGER_L1_5]
{
  "connection_id": "conn_994a82",
  "dataset": "sales_q3_2026"
}

Recent conversation turns remain verbatim beneath the system message.

Failure Handling

If compression fails:

the agent loop continues
the worker records a degradation event
old messages may be truncated depending on configuration

By default:

fallback_truncate=True

This favors agent continuity over strict preservation.

Set:

fallback_truncate=False

to raise CompressionError instead.

What Sawtooth Is Not

Sawtooth is not:

a vector database
a retrieval framework
a persistent knowledge graph
a semantic search engine
a replacement for RAG

It is prompt-state middleware.

Sawtooth manages conversational survivability inside bounded context windows.

It works alongside:

RAG pipelines
vector stores
MCP tools
LangGraph persistence
external memory systems

Comparison

System	Strategy	Compression	Exact State Layer	Async
ConversationSummaryMemory	Rolling summary	Yes	No	No
Mem0	Retrieval memory	Partial	No	Partial
MemPalace	Verbatim retrieval	No	No	No
Sawtooth	Hierarchical compression	Yes	Yes	Yes

When To Use Sawtooth

Good fit:

long-running autonomous agents
coding agents
research agents
multi-tool workflows
persistent orchestration loops
local-first agent stacks

Probably unnecessary:

short chats
single-shot tasks
stateless pipelines
retrieval-heavy systems with minimal dialogue state

To cleanly update your Configuration section in the README.md to reflect the newly added multi-provider cloud support, you can replace that entire section with the following fully updated documentation.

It now showcases both the local-first Ollama path and the new production-ready cloud path side by side, making it clear and complete for your users.

Configuration

Sawtooth Memory is configured using Pydantic models. You can back your context compression loop with either a local Ollama stack or cloud frontier models (OpenAI, Anthropic, or Gemini).

Local Backend (Ollama)

To run entirely on local hardware, pass an OllamaConfig block.

from sawtooth_memory import (
    ContextManagerConfig,
    OllamaConfig,
)

config = ContextManagerConfig(
    soft_limit_tokens=3000,
    hard_limit_tokens=6000,
    chunk_size=10,
    tokenizer_model="gpt-4o",
    fallback_truncate=True,

    ollama=OllamaConfig(
        base_url="http://localhost:11434",
        model="phi4",
        timeout_seconds=90,
    ),
)

Cloud Backend (OpenAI, Anthropic, Gemini)

To offload background compression tasks to a cloud API provider, configure a CloudConfig block instead. This mode utilizes native structured outputs and built-in exponential backoff for HTTP 429 rate limits.

from sawtooth_memory import ContextManagerConfig
from sawtooth_memory.config import CloudConfig, Provider

config = ContextManagerConfig(
    soft_limit_tokens=3000,
    hard_limit_tokens=6000,
    chunk_size=10,
    fallback_truncate=True,

    # Configure any supported provider: Provider.OPENAI, Provider.ANTHROPIC, or Provider.GEMINI
    cloud=CloudConfig(
        provider=Provider.ANTHROPIC,
        model="claude-3-5-haiku-latest",
        api_key="your-api-key-here",
        timeout_seconds=60,
        # base_url is optional: use to route via Helicone, LiteLLM, or Azure OpenAI
        base_url=None,
    ),
)

Configuration Parameters

Parameter	Type	Default	Description
`soft_limit_tokens`	`int`	`3000`	Token threshold that triggers background conversation compression.
`hard_limit_tokens`	`int`	`6000`	Maximum token window size allowed before strict enforcement occurs.
`chunk_size`	`int`	`10`	Number of older conversation messages sliced off into each compression worker chunk.
`tokenizer_model`	`str`	`"gpt-4o"`	Tokenizer encoding scheme utilized for active memory tracking calculation.
`fallback_truncate`	`bool`	`True`	If `True`, falls back to tracking-truncation strings when compression fails, ensuring continuity.
`ollama`	`OllamaConfig`	Factory	Active backend properties dedicated to your local Ollama runtime loop.
`cloud`	`CloudConfig`	`None`	Active properties dedicated to Cloud API orchestration rules.

Roadmap

LangGraph adapter
AutoGen adapter
Redis-backed worker transport
Adaptive salience scoring
Recursive archive compression
Hybrid retrieval integration
Prometheus metrics
TypeScript implementation

Repository Structure

sawtooth-memory/
├── .github/
│   └── workflows/
│       └── test.yml                # CI test pipeline
│
├── sawtooth_memory/
│   ├── integrations/
│   │   └── langgraph/
│   │       ├── adapter.py          # LangGraph adapter layer
│   │       └── graph.py            # Graph state definitions
│   │
│   ├── providers/
│   │   ├── __init__.py
│   │   ├── adapter.py
│   │   ├── compressor.py
│   │   └── factory.py
│   │
│   ├── compressor.py               # Compression + summarization pipeline
│   ├── config.py                   # Configuration models
│   ├── exceptions.py               # Custom exceptions
│   ├── middleware.py               # Context middleware entrypoint
│   ├── monitor.py                  # Telemetry and runtime monitoring
│   ├── state.py                    # Memory tier state management
│   └── worker.py                   # Background compression worker
│
├── tests/
│   ├── conftest.py
│   ├── test_adapter.py
│   ├── test_compressor.py
│   ├── test_graph.py
│   ├── test_middleware.py
│   ├── test_monitor.py
│   └── test_state.py
│
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── pyproject.toml
├── README.md
└── SECURITY.md

Development

pytest
ruff check .

License

MIT

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.0

Jun 6, 2026

0.1.1

May 31, 2026

This version

0.1.0

May 31, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sawtooth_memory-0.1.0.tar.gz (47.7 kB view details)

Uploaded May 31, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sawtooth_memory-0.1.0-py3-none-any.whl (34.3 kB view details)

Uploaded May 31, 2026 Python 3

File details

Details for the file sawtooth_memory-0.1.0.tar.gz.

File metadata

Download URL: sawtooth_memory-0.1.0.tar.gz
Upload date: May 31, 2026
Size: 47.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for sawtooth_memory-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`111a863252daed813eb81600066d8655e130a3caacaee525cec6840fa350c487`
MD5	`49fde2ec40a7cd393800374863076d8d`
BLAKE2b-256	`044d2412c62430967fa397a55237ff6702736d0323a868d5506806fea48024d3`

See more details on using hashes here.

File details

Details for the file sawtooth_memory-0.1.0-py3-none-any.whl.

File metadata

Download URL: sawtooth_memory-0.1.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 34.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for sawtooth_memory-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e0afe40ab287fefcfd6f668be616aa7b39330f685919edda85ceecda645186c`
MD5	`f5900f7caa920c9f0654ead5f8ee53d4`
BLAKE2b-256	`e6f03077657c5c237092bebdfab1ee0e1345ae1549ad20e9d40b275a0a8f3cba`

See more details on using hashes here.

sawtooth-memory 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Sawtooth Memory

Why This Exists

Core Design

L0 — System Memory

L1 — Working Memory

L1.5 — Entity Ledger

L2 — Archive Memory

How Compression Works

Design Goals

Bounded Prompt Growth

Non-Blocking Compression

Failure Isolation

Framework Agnostic

Local-First

Installation

Quick Start

Compiled Prompt Structure

Failure Handling

What Sawtooth Is Not

Comparison

When To Use Sawtooth

Configuration

Local Backend (Ollama)

Cloud Backend (OpenAI, Anthropic, Gemini)

Configuration Parameters

Roadmap

Repository Structure

Development

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes