AI-first memory OS: semantic recall that returns reusable KV. One lifecycle memory, three front doors (self-host serve, closed-LLM proxy, MCP server).
Atelya OS — KV-native memory for self-hosted agents
Agent memory that lives in the model's KV cache, not just in text. Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.
pip install atelya
Note: the PyPI package is atelya; the Python import and CLI are amem.
Most memory systems store text and re-prefill it into the prompt every turn — so the cost of
serving memory grows linearly with turns. Atelya OS (amem) keeps the relevant working set as
reused KV: compute it once, reuse it, don't recompute memory each turn.
What's measured (not estimated)
All numbers below were measured on a single RTX 4070 with bench_real.py / amem_headtohead.py in
this repo. Full methodology, per-query data, and honest limits: BENCHMARK.md.
- Fidelity — 97.5% answer-for-answer agreement with a cold full re-prefill of the same chunks (n = 200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
- Cost — ~6x to ~54x less prefill. Reusing KV vs re-prefilling the same retrieved set is ~6x cheaper per query (n = 200, the conservative, default behavior); for a stable session the one-time working set amortizes to ~54x (measured, 30 queries). Your real number lands in that range depending on how much the relevant memory changes per query. Break-even ~1 query.
- Quality — at parity, not a win. Head-to-head vs Mem0 at a matched answerer + injection budget, answer correctness was 60% (amem) vs 55% (Mem0) — within noise at n = 20. We do not claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and this comparison deliberately matches retrieval to isolate cost, so it does not reflect Mem0's stronger production recall pipeline.
Honest framing for the cost number: lead with ~6x (rigorous, n = 200). The ~54x is the best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.
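To make the shape of the cost curve concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the benchmark's method: `memory_tokens`, `turns`, and `recompute_fraction` are hypothetical parameters, and the 0.15 value only borrows the ~15% selective-recompute figure this README quotes for CacheBlend. Measured numbers are in BENCHMARK.md.

```python
def reprefill_tokens(memory_tokens: int, turns: int) -> int:
    """Text memory: the full memory is prefilled again on every turn."""
    return memory_tokens * turns


def reuse_tokens(memory_tokens: int, turns: int, recompute_fraction: float) -> float:
    """KV reuse: one full prefill, then only a fraction is recomputed per later turn."""
    return memory_tokens + memory_tokens * recompute_fraction * (turns - 1)


if __name__ == "__main__":
    memory_tokens = 2_000  # hypothetical working-set size, not a measured value
    for turns in (1, 10, 30):
        text = reprefill_tokens(memory_tokens, turns)
        kv = reuse_tokens(memory_tokens, turns, recompute_fraction=0.15)
        print(f"turns={turns:>2}  re-prefill={text:>6}  reuse={kv:>8.0f}  ratio={text / kv:.1f}x")
```

In this toy model both strategies cost the same on the first turn and the gap widens on every later turn; with a recompute fraction of zero the reuse cost stays flat after the initial prefill.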
Who this is for
- Full fit — you self-host vLLM / SGLang on a CUDA GPU, with memory-heavy or long-horizon agents. You own inference, so you can inject KV -> you get the flat cost curve and the KV moat.
- Partial fit — closed APIs (OpenAI) or Ollama. You can't inject KV into a model you don't control, so amem reduces the per-turn memory bill but can't flatten it. The memory SDK still helps; the KV-reuse cost curve does not apply.
Install
pip install atelya # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]' # + vLLM / LMCache CacheBlend engine (CUDA GPU) — unlocks KV reuse
The ~6x–54x cost win needs the self-host engine ([selfhost]). pip install atelya alone is the
memory-layer SDK; the KV moat requires inference you control. The Python package runs without the
optional Rust engine (that engine is a commercial performance deepener, not required).
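A quick way to see which of the two install modes you ended up with is to check which modules are importable. This is a minimal sketch, not part of the amem CLI: `amem` is the import name stated above, while `vllm` and `lmcache` are assumed to be the import names of the engines the [selfhost] extra pulls in. It does not check for a CUDA GPU.

```python
import importlib.util


def installed(module_name: str) -> bool:
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def selfhost_report() -> dict[str, bool]:
    # "amem" comes from this README; "vllm" and "lmcache" are assumed import names.
    return {name: installed(name) for name in ("amem", "vllm", "lmcache")}


if __name__ == "__main__":
    for name, ok in selfhost_report().items():
        print(f"{name:<8} {'found' if ok else 'missing'}")
```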
Quickstart
Verified against amem 0.1.19.
kv-serve (the KV-reuse moat) needs atelya[selfhost] + a CUDA GPU. No GPU? Swap in amem proxy (drop-in for Claude / GPT): same SDK; the cost curve just doesn't apply.
Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):
amem kv-serve # serves on http://localhost:8000 (needs atelya[selfhost] + a CUDA GPU)
Use it from the tiny SDK — the second question reuses the first one's KV instead of re-prefilling the memory:
from amem import Amem
mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice") # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid)) # different q -> reuses KV
Or drop it in behind the standard OpenAI SDK — point base_url at amem and pass user as the
memory namespace (per-user residency is on by default):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)
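If you would rather not depend on the OpenAI SDK, roughly the same request can be built with the standard library. This is a sketch under one assumption the README does not state explicitly: that the server accepts a plain OpenAI-style POST to `/v1/chat/completions`, as its SDK compatibility suggests. `build_chat_request` is a hypothetical helper, not an amem function; it builds the request without sending it.

```python
import json
import urllib.request


def build_chat_request(base_url: str, user: str, question: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "amem",
        "user": user,  # memory namespace, as described above
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer not-needed"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("http://localhost:8000/v1", "alice", "Where does Alice's sister live?")
    print(req.get_method(), req.full_url)
    # Sending requires a running `amem kv-serve`:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```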
CLI
amem proxy # closed-LLM drop-in (Claude / GPT) + cache orchestration (light, no GPU)
amem mcp # MCP server for Claude Desktop / Cursor / Claude Code (stdio)
amem serve # self-host serve (local open model + KV residency) [atelya[selfhost]]
amem kv-serve # KV-native serve: CacheBlend reuse + KvPolicy eviction (moat) [atelya[selfhost]]
amem bench # reproducible cost-curve benchmark (KV-residency vs re-prefill) [atelya[selfhost]]
amem version # print version
Run amem --help for the authoritative list and flags.
How it works — the bridge
text memory                      amem (the bridge)                   KV memory
re-prefilled every turn    ->    recall the relevant set, then  ->   its precomputed KV is REUSED
(linear cost)                    reuse its KV (not the text)         (one-time prefill, then ~flat)
- CacheBlend (via LMCache) reuses each chunk's attention KV position-independently, with ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV instead of a full re-prefill.
- A value-model residency policy (relevance x recency x reuse - size) keeps the hottest memory resident and tiers the rest to CPU/disk.
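The residency formula above (relevance x recency x reuse - size) can be sketched as a scoring function plus a greedy fill of a memory budget. This is an illustration only, not amem's KvPolicy: the `Chunk` fields, the `size_penalty` weight, the 0..1 normalisation, and the greedy strategy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    relevance: float  # 0..1, similarity to recent queries (assumed scale)
    recency: float    # 0..1, 1.0 = just used (assumed scale)
    reuse: float      # 0..1, normalised hit count (assumed scale)
    size_mb: float    # KV footprint of the chunk


def value(chunk: Chunk, size_penalty: float = 0.001) -> float:
    """relevance x recency x reuse - size, with a hypothetical size weight."""
    return chunk.relevance * chunk.recency * chunk.reuse - size_penalty * chunk.size_mb


def plan_residency(chunks: list[Chunk], budget_mb: float) -> tuple[list[str], list[str]]:
    """Greedily keep the highest-value chunks within budget; tier the rest."""
    resident, tiered, used = [], [], 0.0
    for chunk in sorted(chunks, key=value, reverse=True):
        if used + chunk.size_mb <= budget_mb:
            resident.append(chunk.name)
            used += chunk.size_mb
        else:
            tiered.append(chunk.name)  # would move to CPU/disk
    return resident, tiered


if __name__ == "__main__":
    chunks = [
        Chunk("alice-home", 0.9, 0.9, 0.8, 100),
        Chunk("alice-sister", 0.8, 1.0, 0.5, 100),
        Chunk("old-notes", 0.2, 0.1, 0.1, 150),
    ]
    print(plan_residency(chunks, budget_mb=200))
```

The product form means a chunk scoring near zero on any one of relevance, recency, or reuse drops out quickly, while the subtracted size term makes large chunks pay for their footprint.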
What this is not
- Not a recall-accuracy leaderboard claim. Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall; amem composes with a recall layer. Its edge is the cost of serving memory at parity fidelity.
- Transformer-only. CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not apply to them.
- A storage trade. KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk. You buy lower compute with more storage.
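To see why KV is so much larger than the text it represents, the standard per-token KV size for a transformer can be worked out directly: keys and values, for every layer and every KV head. The model shape below is a hypothetical small grouped-query-attention configuration in fp16, and the 4 bytes of text per token is a rough assumption; neither is taken from amem's benchmarks, so the resulting ratio is only an order-of-magnitude illustration and varies a lot by model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Keys and values (factor 2) for every layer and KV head, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


if __name__ == "__main__":
    # Hypothetical model shape, fp16 cache; not the model used in BENCHMARK.md.
    per_token = kv_bytes_per_token(layers=24, kv_heads=2, head_dim=64)
    text_bytes_per_token = 4  # rough assumption for English text
    print(f"KV per token: {per_token} bytes")
    print(f"KV / text ratio: ~{per_token // text_bytes_per_token}x")
```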
Status & license
Experimental, in active development — expect rough edges. Apache-2.0.
File details
Details for the file atelya-0.1.20.tar.gz.
File metadata
- Download URL: atelya-0.1.20.tar.gz
- Upload date:
- Size: 73.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 62c87784451dfa3841139275341877e8b0ac4e743d0a8a416f4d3585add3e147 |
| MD5 | 683a196cb1b21027cbff8e7525eff2b3 |
| BLAKE2b-256 | 2636a9cd362ec1f39c1f88dd5030aacc979fde0e1dcff5f83748948b900b7322 |
File details
Details for the file atelya-0.1.20-py3-none-any.whl.
File metadata
- Download URL: atelya-0.1.20-py3-none-any.whl
- Upload date:
- Size: 69.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3bf4bed31f69146b2832d6e1f590804cfea51b0a474a597fddda3665b082c52 |
| MD5 | 4f1f8e03757aa51f4ecca15cc94813fe |
| BLAKE2b-256 | f394010db7e80ab5eced7a89de4a56d1fcfb7d61955021e93d353486f687b872 |