AI-first memory OS: semantic recall that returns reusable KV. One lifecycle memory, three front doors (self-host serve, closed-LLM proxy, MCP server).

Atelya OS — KV-native memory for self-hosted agents

Agent memory that lives in the model's KV cache, not just in text. Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.

pip install atelya

Most memory systems store text and re-prefill it into the prompt every turn — so the cost of serving memory grows linearly with turns. Atelya OS (amem) keeps the relevant working set as reused KV: compute it once, reuse it, don't recompute memory each turn.
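To make the linear-versus-flat claim concrete, here is a toy cost model in Python. It is a sketch, not the benchmark: the function names, the 2,000-token working set, and the recompute_fraction knob are illustrative assumptions. It ignores question tokens, decoding, and cache-load time, so it will not reproduce the measured numbers below.

```python
def reprefill_cost(memory_tokens: int, turns: int) -> int:
    """Total prefill tokens when the memory text is re-prefilled every turn."""
    return memory_tokens * turns


def reuse_cost(memory_tokens: int, turns: int, recompute_fraction: float = 0.0) -> float:
    """Total prefill tokens when the memory KV is computed once and then reused.

    recompute_fraction is a hypothetical knob: the share of memory tokens
    recomputed on each later turn (0.0 means pure reuse).
    """
    if turns <= 0:
        return 0.0
    return memory_tokens + (turns - 1) * memory_tokens * recompute_fraction


if __name__ == "__main__":
    memory_tokens = 2_000  # assumed working-set size, for illustration only
    for turns in (1, 5, 30):
        text = reprefill_cost(memory_tokens, turns)
        kv = reuse_cost(memory_tokens, turns)
        print(f"turns={turns:>2}  re-prefill={text:>6}  reuse={kv:>6.0f}  ratio={text / kv:.1f}x")
```

In this model the re-prefill total grows in proportion to the number of turns, while the reuse total stays at the one-time prefill when nothing is recomputed. Raising recompute_fraction moves the reuse curve back toward the linear one.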

What's measured (not estimated)

All numbers below were measured on a single RTX 4070 with bench_real.py / amem_headtohead.py in this repo. Full methodology, per-query data, and honest limits: BENCHMARK.md.

Fidelity — 97.5% answer-for-answer agreement with a cold full re-prefill of the same chunks (n = 200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
Cost — ~6x to ~54x less prefill. Reusing KV vs re-prefilling the same retrieved set is ~6x cheaper per query (n = 200, the conservative, default behavior); for a stable session the one-time working set amortizes to ~54x (measured, 30 queries). Your real number lands in that range depending on how much the relevant memory changes per query. Break-even ~1 query.
Quality — at parity, not a win. Head-to-head vs Mem0 at a matched answerer + injection budget, answer correctness was 60% (amem) vs 55% (Mem0) — within noise at n = 20. We do not claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and this comparison deliberately matches retrieval to isolate cost, so it does not reflect Mem0's stronger production recall pipeline.

Honest framing for the cost number: lead with ~6x (rigorous, n = 200). The ~54x is the best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.

Who this is for

Full fit: you self-host vLLM / SGLang on a CUDA GPU, with memory-heavy or long-horizon agents. You own inference, so you can inject KV, which gives you the flat cost curve and the KV moat.
Partial fit: closed APIs (OpenAI) or Ollama. You can't inject KV into a model you don't control, so amem reduces the per-turn memory bill but can't flatten it. The memory SDK still helps; the KV-reuse cost curve does not apply.

Install

pip install atelya                 # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]'     # + vLLM / LMCache CacheBlend engine (CUDA GPU) — unlocks KV reuse

The ~6x–54x cost win needs the self-host engine ([selfhost]). pip install atelya alone is the memory-layer SDK; the KV moat requires inference you control. The Python package runs without the optional Rust engine, which is a commercial performance add-on and not required.

Quickstart

Verified against amem 0.1.18. kv-serve (the KV-reuse moat) needs the [selfhost] extra (pip install 'atelya[selfhost]') and a CUDA GPU. No GPU? Swap in amem proxy (drop-in for Claude / GPT): it uses the same SDK, but the KV-reuse cost curve doesn't apply.

Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):

amem kv-serve            # serves on http://localhost:8000  (needs atelya[selfhost] + a CUDA GPU)

Use it from the tiny SDK — the second question reuses the first one's KV instead of re-prefilling the memory:

from amem import Amem

mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice")                       # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid))   # different q -> reuses KV

Or drop it in behind the standard OpenAI SDK — point base_url at amem and pass user as the memory namespace (per-user residency is on by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)

CLI

amem proxy        # closed-LLM drop-in (Claude / GPT) + cache orchestration   (light, no GPU)
amem mcp          # MCP server for Claude Desktop / Cursor / Claude Code       (stdio)
amem serve        # self-host serve (local open model + KV residency)          [atelya[selfhost]]
amem kv-serve     # KV-native serve: CacheBlend reuse + KvPolicy eviction (moat) [atelya[selfhost]]
amem bench        # reproducible cost-curve benchmark (KV-residency vs re-prefill) [atelya[selfhost]]
amem version      # print version

Run amem --help for the authoritative list and flags.

How it works — the bridge

text memory                amem (the bridge)                 KV memory
re-prefilled every turn  -> recall the relevant set, then  -> its precomputed KV is REUSED
(linear cost)               reuse its KV (not the text)       (one-time prefill, then ~flat)

CacheBlend (via LMCache) reuses each chunk's attention KV position-independently, with ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV instead of a full re-prefill.
A value-model residency policy (relevance × recency × reuse, minus size) keeps the hottest memory resident and tiers the rest to CPU/disk.
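The residency formula above can be sketched as a scoring function plus a greedy fill of a GPU budget. This is an illustration under assumptions, not amem's KvPolicy: the Chunk fields, the size_weight constant, the value and plan_residency names, and the greedy strategy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    relevance: float  # 0..1, match to the current working set (assumed scale)
    recency: float    # 0..1, decays as the chunk ages (assumed scale)
    reuse: float      # how many times the chunk's KV has been reused
    size_mb: float    # KV footprint of the chunk


def value(chunk: Chunk, size_weight: float = 0.01) -> float:
    """relevance x recency x reuse, minus a size term (size_weight is hypothetical)."""
    return chunk.relevance * chunk.recency * chunk.reuse - size_weight * chunk.size_mb


def plan_residency(chunks, budget_mb):
    """Visit chunks from highest to lowest value; keep each one that still fits."""
    resident, tiered, used = [], [], 0.0
    for chunk in sorted(chunks, key=value, reverse=True):
        if used + chunk.size_mb <= budget_mb:
            resident.append(chunk.name)
            used += chunk.size_mb
        else:
            tiered.append(chunk.name)  # would move to CPU/disk
    return resident, tiered


if __name__ == "__main__":
    chunks = [
        Chunk("a", relevance=0.9, recency=0.8, reuse=5, size_mb=40),
        Chunk("b", relevance=0.2, recency=0.9, reuse=1, size_mb=40),
        Chunk("c", relevance=0.7, recency=0.5, reuse=3, size_mb=30),
    ]
    print(plan_residency(chunks, budget_mb=80))
```

With these made-up numbers, chunks a and c stay resident and b is tiered out. A real policy would also need to update recency and reuse over time and decide when to promote tiered chunks back.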

What this is not

Not a recall-accuracy leaderboard claim. Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall; amem composes with a recall layer. Its edge is the cost of serving memory at parity fidelity.
Transformer-only. CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not apply to them.
A storage trade. KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk. You buy lower compute with more storage.
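For a rough sense of where the storage cost comes from, the per-token KV footprint of a standard transformer can be estimated from its shape. The sketch below uses a hypothetical small configuration (24 layers, 2 KV heads, head dimension 64, 16-bit values) and assumes about 4 bytes of text per token; neither is taken from the benchmarks, and the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """One key vector and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_to_text_ratio(kv_bytes: int, text_bytes_per_token: float = 4.0) -> float:
    """How many times larger the KV is than the text it encodes."""
    return kv_bytes / text_bytes_per_token


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    per_token = kv_bytes_per_token(layers=24, kv_heads=2, head_dim=64)
    print(f"KV per token: {per_token} bytes")
    print(f"KV / text:    {kv_to_text_ratio(per_token):.0f}x")
```

Under these assumptions the KV is thousands of times larger than the text. The exact ratio depends on the model shape and on the precision the KV is stored in, so larger models or wider KV heads push it higher.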

Status & license

Experimental, in active development — expect rough edges. Apache-2.0.

Release history

0.1.20 (Jun 9, 2026)
0.1.19 (Jun 9, 2026, this version)
