
AI-first memory OS: semantic recall that returns reusable KV. One lifecycle memory, three front doors (self-host serve, closed-LLM proxy, MCP server).


Atelya OS — KV-native memory for self-hosted agents

Agent memory that lives in the model's KV cache, not just in text. Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.

pip install atelya

Most memory systems store text and re-prefill it into the prompt on every turn, so the cost of serving memory grows linearly with the number of turns. Atelya OS (CLI and Python module: amem) keeps the relevant working set resident as KV: it is computed once and then reused, instead of being recomputed on each turn.
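The difference between the two cost curves can be sketched with a toy model. This is an illustration only, not how the numbers in BENCHMARK.md were measured: the function names, the token counts, and the `recompute_fraction` parameter are assumptions made for the example, and cost is counted in prefilled tokens rather than wall-clock time.

```python
def reprefill_tokens(memory_tokens: int, turns: int) -> float:
    """Text memory: the whole memory is prefilled again on every turn."""
    return float(memory_tokens * turns)


def kv_reuse_tokens(memory_tokens: int, turns: int,
                    recompute_fraction: float = 0.0) -> float:
    """KV memory: one full prefill, then only a fraction is recomputed per later turn.

    recompute_fraction is a hypothetical knob: 0.0 models a fully stable
    working set, larger values model memory that changes between turns.
    """
    if turns <= 0:
        return 0.0
    return memory_tokens * (1.0 + recompute_fraction * (turns - 1))


if __name__ == "__main__":
    memory = 2_000  # assumed size of the recalled working set, in tokens
    for turns in (1, 2, 10, 30):
        text = reprefill_tokens(memory, turns)
        kv = kv_reuse_tokens(memory, turns, recompute_fraction=0.15)
        print(f"turns={turns:>2}  re-prefill={text:>8.0f}  kv-reuse={kv:>8.0f}")
```

In this model the two costs are equal at one turn. From the second turn on, the reuse curve sits below the re-prefill curve, and the gap widens with every additional turn.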


What's measured (not estimated)

All numbers below were measured on a single RTX 4070 with bench_real.py / amem_headtohead.py in this repo. Full methodology, per-query data, and honest limits: BENCHMARK.md.

  • Fidelity — 97.5% answer-for-answer agreement with a cold full re-prefill of the same chunks (n = 200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
  • Cost — ~6x to ~54x less prefill. Reusing KV vs re-prefilling the same retrieved set is ~6x cheaper per query (n = 200, the conservative, default behavior); for a stable session the one-time working set amortizes to ~54x (measured, 30 queries). Your real number lands in that range depending on how much the relevant memory changes per query. Break-even ~1 query.
  • Quality — at parity, not a win. Head-to-head vs Mem0 at a matched answerer + injection budget, answer correctness was 60% (amem) vs 55% (Mem0) — within noise at n = 20. We do not claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and this comparison deliberately matches retrieval to isolate cost, so it does not reflect Mem0's stronger production recall pipeline.
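To see why a 60% vs 55% gap at n = 20 counts as within noise, here is a back-of-the-envelope check using the normal approximation for a sample proportion. It is a sketch, not part of the benchmark scripts: the helper names are made up for this example, and it treats the two runs as independent samples, which is a simplifying assumption.

```python
import math


def proportion_stderr(p: float, n: int) -> float:
    """Standard error of a sample proportion (normal approximation)."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z-score of p1 - p2, treating the two samples as independent."""
    se = math.sqrt(proportion_stderr(p1, n1) ** 2 + proportion_stderr(p2, n2) ** 2)
    return (p1 - p2) / se


if __name__ == "__main__":
    n = 20
    for name, p in (("amem", 0.60), ("Mem0", 0.55)):
        half_width = 1.96 * proportion_stderr(p, n)
        print(f"{name}: {p:.0%} +/- {half_width:.0%} (approx. 95% interval)")
    print(f"z for the difference: {difference_z(0.60, 0.55, n, n):.2f}")
```

Each interval is roughly 21 to 22 points wide on either side, and the 5-point gap is about 0.3 standard errors, far below the usual 1.96 threshold. A paired test over the same questions would be the stricter check.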

Honest framing for the cost number: lead with ~6x (rigorous, n = 200). The ~54x is the best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.


Who this is for

  • Full fit: you self-host vLLM / SGLang on a CUDA GPU and run memory-heavy or long-horizon agents. Because you own inference, you can inject KV, which gives you the flat cost curve and the KV moat.
  • Partial fit: closed APIs (OpenAI) or Ollama. You can't inject KV into a model you don't control, so amem reduces the per-turn memory bill but can't flatten it. The memory SDK still helps; the KV-reuse cost curve does not apply.

Install

pip install atelya                 # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]'     # + vLLM / LMCache CacheBlend engine (CUDA GPU) — unlocks KV reuse

The ~6x to ~54x cost win requires the self-host engine (the [selfhost] extra). pip install atelya alone installs only the memory-layer SDK; the KV moat requires inference you control. The Python package runs without the optional Rust engine, which is a commercial performance add-on and not required.


Quickstart

Verified against amem 0.1.18. kv-serve (the KV-reuse moat) needs the atelya[selfhost] extra and a CUDA GPU. No GPU? Swap in amem proxy (a drop-in for Claude / GPT): the SDK is the same, but the KV-reuse cost curve does not apply.

Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):

amem kv-serve            # serves on http://localhost:8000  (needs atelya[selfhost] + a CUDA GPU)

Use it from the tiny SDK — the second question reuses the first one's KV instead of re-prefilling the memory:

from amem import Amem

mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice")                       # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid))   # different q -> reuses KV

Or drop it in behind the standard OpenAI SDK — point base_url at amem and pass user as the memory namespace (per-user residency is on by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)

CLI

amem proxy        # closed-LLM drop-in (Claude / GPT) + cache orchestration (light, no GPU)
amem mcp          # MCP server for Claude Desktop / Cursor / Claude Code (stdio)
amem serve        # self-host serve: local open model + KV residency (needs atelya[selfhost])
amem kv-serve     # KV-native serve: CacheBlend reuse + KvPolicy eviction, the moat (needs atelya[selfhost])
amem bench        # reproducible cost-curve benchmark: KV residency vs re-prefill (needs atelya[selfhost])
amem version      # print version

Run amem --help for the authoritative list and flags.


How it works — the bridge

text memory                amem (the bridge)                 KV memory
re-prefilled every turn  -> recall the relevant set, then  -> its precomputed KV is REUSED
(linear cost)               reuse its KV (not the text)       (one-time prefill, then ~flat)
  • CacheBlend (via LMCache) reuses each chunk's attention KV position-independently, with ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV instead of a full re-prefill.
  • A value-model residency policy (relevance x recency x reuse - size) keeps the hottest memory resident and tiers the rest to CPU/disk.
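The residency rule above can be sketched as a score plus a greedy fill of a GPU budget. This is a minimal illustration of the stated formula (relevance x recency x reuse - size), not amem's actual policy code: the `Chunk` fields, the `size_weight` scaling, and the greedy selection are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    relevance: float  # assumed 0..1, similarity to the current working set
    recency: float    # assumed 0..1, decays as the chunk goes unused
    reuse: float      # assumed 0..1, normalized hit count
    size_mb: float    # size of the chunk's KV in MB


def value(chunk: Chunk, size_weight: float = 0.001) -> float:
    """relevance x recency x reuse - size, with size scaled by a hypothetical weight."""
    return chunk.relevance * chunk.recency * chunk.reuse - size_weight * chunk.size_mb


def plan_residency(chunks: list[Chunk], gpu_budget_mb: float) -> tuple[list[str], list[str]]:
    """Keep the highest-value chunks resident until the budget is full; tier the rest."""
    resident, tiered = [], []
    used = 0.0
    for chunk in sorted(chunks, key=value, reverse=True):
        if used + chunk.size_mb <= gpu_budget_mb:
            resident.append(chunk.name)
            used += chunk.size_mb
        else:
            tiered.append(chunk.name)  # would move to CPU or disk
    return resident, tiered


if __name__ == "__main__":
    chunks = [
        Chunk("a", relevance=0.9, recency=0.9, reuse=0.8, size_mb=100),
        Chunk("b", relevance=0.2, recency=0.5, reuse=0.1, size_mb=50),
        Chunk("c", relevance=0.7, recency=1.0, reuse=0.5, size_mb=200),
    ]
    print(plan_residency(chunks, gpu_budget_mb=300))
```

A greedy fill is the simplest choice here; a real policy would also need to decide when to re-score and how to handle chunks that no longer fit after the working set shifts.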

What this is not

  • Not a recall-accuracy leaderboard claim. Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall; amem composes with a recall layer. Its edge is the cost of serving memory at parity fidelity.
  • Transformer-only. CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not apply to them.
  • A storage trade. KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk. You buy lower compute with more storage.
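To budget storage for your own model, the size of attention KV per token can be estimated from the model configuration. The sketch below uses the standard per-token KV formula for a transformer; the function names, the example configuration, and the 4-bytes-of-text-per-token figure are assumptions for illustration. The exact ratio depends on the model and the KV dtype, so treat this as a way to estimate your own footprint rather than a reproduction of the ~1000x figure.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Attention KV per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_to_text_ratio(kv_bytes: int, text_bytes_per_token: float = 4.0) -> float:
    """How many times larger the KV is than the text it encodes."""
    return kv_bytes / text_bytes_per_token


if __name__ == "__main__":
    # Hypothetical small grouped-query-attention model with a 16-bit KV cache.
    per_token = kv_bytes_per_token(n_layers=24, n_kv_heads=2, head_dim=64)
    print(f"KV per token: {per_token} bytes")
    print(f"KV / text:    {kv_to_text_ratio(per_token):.0f}x")
    print(f"10k tokens:   {per_token * 10_000 / 2**20:.1f} MiB")
```

Larger models, more KV heads, or a wider dtype push the ratio up; a quantized KV cache pulls it down.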

Status & license

Experimental, in active development — expect rough edges. Apache-2.0.
