Async context manager middleware for LLM agents — solves Lost in the Middle via local Ollama compression.
Project description
Sawtooth Memory
Async hierarchical memory middleware for LLM agents.
Sawtooth Memory mitigates context-window degradation by continuously compressing older conversation state into structured long-term memory — without blocking the agent execution loop.
Instead of storing entire conversations indefinitely or relying purely on retrieval, Sawtooth maintains a layered memory model:
- recent messages remain verbatim
- important entities persist exactly
- older context compresses into narrative state
- compression runs asynchronously in the background
The result is bounded prompt growth with stable long-session behavior.
agent loop
│
▼
┌─────────────────────┐
│ ContextManager │
│ ┌───────────────┐ │
│ │ L0 System │ immutable persona + tool schemas
│ │ L2 Archive │ compressed narrative memory
│ │ L1.5 Entities │ exact IDs, paths, UUIDs
│ │ L1 Working │ recent raw conversation
│ └───────────────┘ │
└──────────┬──────────┘
│
▼
build_prompt()
│
▼
LLM API
The name "Sawtooth" comes from the token usage pattern created by periodic compression cycles.
Why This Exists
Long-running agents eventually fail for predictable reasons:
- context windows fill with stale history
- important information gets buried
- summarization loses exact values
- retrieval systems lose conversational continuity
- synchronous compression blocks the main loop
Most memory systems optimize for storage or retrieval.
Sawtooth optimizes for prompt survivability.
It continuously reshapes conversation history into a compact working context while preserving exact operational state separately.
Core Design
Sawtooth uses four memory tiers.
| Layer | Purpose | Characteristics |
|---|---|---|
| L0 | System memory | Immutable |
| L1 | Working memory | Recent verbatim messages |
| L1.5 | Entity ledger | Exact structured state |
| L2 | Archive memory | Compressed narrative history |
L0 — System Memory
Contains:
- system prompts
- tool schemas
- agent rules
- static instructions
L0 never changes.
L1 — Working Memory
Sliding window of recent raw conversation turns.
This is the active reasoning surface used by the model.
When token usage exceeds soft_limit_tokens, older messages are queued for asynchronous compression.
L1.5 — Entity Ledger
Structured exact-value persistence.
This layer exists because summarization is lossy.
Things that must remain exact:
- UUIDs
- database IDs
- file paths
- API keys references
- table names
- timestamps
- active resources
Example:
{
"active_connection": "conn_994a82",
"workspace_id": "ws_7f31",
"current_dataset": "sales_q3_2026"
}
L1.5 prevents critical operational state from disappearing into narrative summaries.
L2 — Archive Memory
Compressed long-horizon narrative memory.
Example:
User requested Q3 revenue analysis.
Agent connected to PostgreSQL.
Detected a 14% revenue decline in enterprise accounts.
Generated anomaly report and exported CSV.
L2 is append-only and optimized for semantic continuity rather than exact replay.
How Compression Works
Compression is asynchronous.
The main agent loop never waits for summarization.
When L1 exceeds the configured soft limit:
-
oldest messages are sliced into chunks
-
chunks are queued onto a background asyncio worker
-
noisy data is pruned
-
cleaned content is sent to a local Ollama model
-
extracted outputs are merged into:
- L2 narrative memory
- L1.5 entity state
-
original messages are removed from L1
This creates a repeating "sawtooth" token profile rather than monotonic prompt growth.
Design Goals
Bounded Prompt Growth
Prompt size remains stable during long-running sessions.
Non-Blocking Compression
Compression runs off the main execution path.
Failure Isolation
Compression failures never crash the agent loop.
Framework Agnostic
Works with any OpenAI-compatible SDK.
Local-First
All summarization can run entirely on local Ollama models.
Installation
pip install sawtooth-memory (coming soon)
From source:
git clone https://github.com/HtooTayZa/sawtooth-memory
cd sawtooth-memory
pip install -e ".[dev]"
Optional LangGraph support:
pip install -e ".[langgraph]"
Requirements:
- Python 3.11+
- Either a local Ollama instance running OR api keys for cloud backends (OpenAI, Anthropic, Gemini)
ollama serve
ollama pull phi4
Quick Start
import asyncio
from sawtooth_memory import (
ContextManager,
ContextManagerConfig,
)
config = ContextManagerConfig(
soft_limit_tokens=3000,
hard_limit_tokens=6000,
chunk_size=10,
)
async def main():
async with ContextManager(
system_prompt="You are a data analysis agent.",
config=config,
) as memory:
await memory.add_message(
"user",
"Analyze Q3 revenue trends."
)
await memory.add_message(
"assistant",
"Connecting to PostgreSQL."
)
await memory.add_message(
"tool",
'{"connection_id":"conn_994a82"}'
)
prompt = memory.build_prompt()
# response = await client.chat.completions.create(
# model="gpt-4o",
# messages=prompt,
# )
print(memory.get_stats())
asyncio.run(main())
Compiled Prompt Structure
build_prompt() returns standard OpenAI-format messages.
The system message is assembled dynamically:
[SYSTEM_L0]
You are a data analysis agent.
[ARCHIVE_L2]
User requested Q3 analysis.
Connected to PostgreSQL.
Detected revenue decline in enterprise segment.
[ENTITY_LEDGER_L1_5]
{
"connection_id": "conn_994a82",
"dataset": "sales_q3_2026"
}
Recent conversation turns remain verbatim beneath the system message.
Failure Handling
If compression fails:
- the agent loop continues
- the worker records a degradation event
- old messages may be truncated depending on configuration
By default:
fallback_truncate=True
This favors agent continuity over strict preservation.
Set:
fallback_truncate=False
to raise CompressionError instead.
What Sawtooth Is Not
Sawtooth is not:
- a vector database
- a retrieval framework
- a persistent knowledge graph
- a semantic search engine
- a replacement for RAG
It is prompt-state middleware.
Sawtooth manages conversational survivability inside bounded context windows.
It works alongside:
- RAG pipelines
- vector stores
- MCP tools
- LangGraph persistence
- external memory systems
Comparison
| System | Strategy | Compression | Exact State Layer | Async |
|---|---|---|---|---|
| ConversationSummaryMemory | Rolling summary | Yes | No | No |
| Mem0 | Retrieval memory | Partial | No | Partial |
| MemPalace | Verbatim retrieval | No | No | No |
| Sawtooth | Hierarchical compression | Yes | Yes | Yes |
When To Use Sawtooth
Good fit:
- long-running autonomous agents
- coding agents
- research agents
- multi-tool workflows
- persistent orchestration loops
- local-first agent stacks
Probably unnecessary:
- short chats
- single-shot tasks
- stateless pipelines
- retrieval-heavy systems with minimal dialogue state
To cleanly update your Configuration section in the README.md to reflect the newly added multi-provider cloud support, you can replace that entire section with the following fully updated documentation.
It now showcases both the local-first Ollama path and the new production-ready cloud path side by side, making it clear and complete for your users.
Configuration
Sawtooth Memory is configured using Pydantic models. You can back your context compression loop with either a local Ollama stack or cloud frontier models (OpenAI, Anthropic, or Gemini).
Local Backend (Ollama)
To run entirely on local hardware, pass an OllamaConfig block.
from sawtooth_memory import (
ContextManagerConfig,
OllamaConfig,
)
config = ContextManagerConfig(
soft_limit_tokens=3000,
hard_limit_tokens=6000,
chunk_size=10,
tokenizer_model="gpt-4o",
fallback_truncate=True,
ollama=OllamaConfig(
base_url="http://localhost:11434",
model="phi4",
timeout_seconds=90,
),
)
Cloud Backend (OpenAI, Anthropic, Gemini)
To offload background compression tasks to a cloud API provider, configure a CloudConfig block instead. This mode utilizes native structured outputs and built-in exponential backoff for HTTP 429 rate limits.
from sawtooth_memory import ContextManagerConfig
from sawtooth_memory.config import CloudConfig, Provider
config = ContextManagerConfig(
soft_limit_tokens=3000,
hard_limit_tokens=6000,
chunk_size=10,
fallback_truncate=True,
# Configure any supported provider: Provider.OPENAI, Provider.ANTHROPIC, or Provider.GEMINI
cloud=CloudConfig(
provider=Provider.ANTHROPIC,
model="claude-3-5-haiku-latest",
api_key="your-api-key-here",
timeout_seconds=60,
# base_url is optional: use to route via Helicone, LiteLLM, or Azure OpenAI
base_url=None,
),
)
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
soft_limit_tokens |
int |
3000 |
Token threshold that triggers background conversation compression. |
hard_limit_tokens |
int |
6000 |
Maximum token window size allowed before strict enforcement occurs. |
chunk_size |
int |
10 |
Number of older conversation messages sliced off into each compression worker chunk. |
tokenizer_model |
str |
"gpt-4o" |
Tokenizer encoding scheme utilized for active memory tracking calculation. |
fallback_truncate |
bool |
True |
If True, falls back to tracking-truncation strings when compression fails, ensuring continuity. |
ollama |
OllamaConfig |
Factory | Active backend properties dedicated to your local Ollama runtime loop. |
cloud |
CloudConfig |
None |
Active properties dedicated to Cloud API orchestration rules. |
Roadmap
- LangGraph adapter
- AutoGen adapter
- Redis-backed worker transport
- Adaptive salience scoring
- Recursive archive compression
- Hybrid retrieval integration
- Prometheus metrics
- TypeScript implementation
Repository Structure
sawtooth-memory/
├── .github/
│ └── workflows/
│ └── test.yml # CI test pipeline
│
├── sawtooth_memory/
│ ├── integrations/
│ │ └── langgraph/
│ │ ├── adapter.py # LangGraph adapter layer
│ │ └── graph.py # Graph state definitions
│ │
│ ├── providers/
│ │ ├── __init__.py
│ │ ├── adapter.py
│ │ ├── compressor.py
│ │ └── factory.py
│ │
│ ├── compressor.py # Compression + summarization pipeline
│ ├── config.py # Configuration models
│ ├── exceptions.py # Custom exceptions
│ ├── middleware.py # Context middleware entrypoint
│ ├── monitor.py # Telemetry and runtime monitoring
│ ├── state.py # Memory tier state management
│ └── worker.py # Background compression worker
│
├── tests/
│ ├── conftest.py
│ ├── test_adapter.py
│ ├── test_compressor.py
│ ├── test_graph.py
│ ├── test_middleware.py
│ ├── test_monitor.py
│ └── test_state.py
│
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── pyproject.toml
├── README.md
└── SECURITY.md
Development
pytest
ruff check .
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sawtooth_memory-0.1.0.tar.gz.
File metadata
- Download URL: sawtooth_memory-0.1.0.tar.gz
- Upload date:
- Size: 47.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
111a863252daed813eb81600066d8655e130a3caacaee525cec6840fa350c487
|
|
| MD5 |
49fde2ec40a7cd393800374863076d8d
|
|
| BLAKE2b-256 |
044d2412c62430967fa397a55237ff6702736d0323a868d5506806fea48024d3
|
File details
Details for the file sawtooth_memory-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sawtooth_memory-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7e0afe40ab287fefcfd6f668be616aa7b39330f685919edda85ceecda645186c
|
|
| MD5 |
f5900f7caa920c9f0654ead5f8ee53d4
|
|
| BLAKE2b-256 |
e6f03077657c5c237092bebdfab1ee0e1345ae1549ad20e9d40b275a0a8f3cba
|