breathe-memory

Context optimization and associative memory for LLM applications.

Two-phase system built around how memory actually works — not as lookup, but as association.

pip install breathe-memory

What it does

LLMs forget. Context windows are finite and expensive. Most solutions either stuff everything in (burns tokens) or summarize (loses structure).

BREATHE does neither:

  • SYNAPSE (inhale) — before each generation, extracts associative anchors from the user message and injects semantically relevant memories directly into the prompt. The LLM starts thinking with context already loaded. Overhead: 2–20ms.

  • GraphCompactor (exhale) — when context fills up, extracts a structured graph (topics, decisions, open questions, artifacts) instead of a lossy narrative summary. Typically saves 60–80% of tokens while preserving semantic structure.

                    ┌─────────────────────────────────────┐
    User message ──▶│           SYNAPSE (inhale)          │
                    │                                     │
                    │  1. Extract anchors (regex, 2ms)    │
                    │  2. Traverse memory graph (BFS)     │
                    │  3. Vector search (optional)        │
                    │  4. Inject <associative_memory>     │
                    └──────────────────┬──────────────────┘
                                       │
                                       ▼
                              LLM with memory context
                                       │
                    ┌──────────────────▼──────────────────┐
                    │       GraphCompactor (exhale)       │
                    │    (fires when context ~80% full)   │
                    │                                     │
                    │  Compressible messages ──▶ LLM call │
                    │     → Topics, Decisions, Open,      │
                    │       Artifacts, Context, Dropped   │
                    │                                     │
                    │  Protected messages ──▶ kept intact │
                    └─────────────────────────────────────┘

Quick start

import asyncio
from breathe import Synapse, GraphCompactor, BreatheConfig
from breathe.interfaces import MemoryRepository, LLMClient, RetrievedNode

# Implement these two interfaces for your backend
class MyMemoryRepo(MemoryRepository):
    async def get_concepts(self):
        return {"FastAPI": "uuid-001", "Redis": "uuid-002"}

    async def graph_bfs(self, start_ids, **kwargs):
        return []  # implement BFS against your DB

    async def keyword_search(self, keywords, limit=5):
        return []  # implement ILIKE against your memories table

class MyLLMClient(LLMClient):
    async def complete(self, prompt, max_tokens=4000, temperature=0.2):
        # call your LLM API here
        ...

async def main():
    config = BreatheConfig()
    synapse = Synapse(repository=MyMemoryRepo(), config=config)
    await synapse.initialize()

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How should I structure my FastAPI endpoints?"},
    ]

    # Inject associative memory before each LLM call
    messages = await synapse.inject(messages)

    # When context gets full, compress with GraphCompactor
    compactor = GraphCompactor(llm_client=MyLLMClient())
    result = await compactor.compress(messages)
    messages = result["compressed_messages"]

asyncio.run(main())

With Memory Nexus (PostgreSQL + pgvector)

from breathe import Synapse, BreatheConfig
from memory_nexus import PostgresMemoryStore

store = PostgresMemoryStore(dsn="postgresql://localhost/mydb")
await store.initialize()  # top-level await shown for brevity; run inside async code

# Store memories
await store.store("FastAPI handles async requests efficiently")
await store.store("Redis is ideal for session storage and caching")

# Wire into SYNAPSE — store implements VectorSearchClient
synapse = Synapse(vector_client=store, config=BreatheConfig())
await synapse.initialize()

messages = await synapse.inject(messages)

PostgreSQL schema (default — 384-dim):

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE memories (
    id TEXT PRIMARY KEY DEFAULT gen_random_uuid()::text,
    content TEXT NOT NULL,
    embedding vector(384),
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON memories USING ivfflat (embedding vector_cosine_ops);

Embedding models:

The default model (all-MiniLM-L6-v2, 384-dim, ~90 MB) is good for prototyping. For production, we recommend intfloat/multilingual-e5-large (1024-dim, ~1.2 GB) — significantly better retrieval quality, especially for multilingual content.

To switch, pass model_name and adjust your table's vector dimension:

store = PostgresMemoryStore(
    dsn="postgresql://localhost/mydb",
    model_name="intfloat/multilingual-e5-large",  # 1024-dim, multilingual
)
-- For e5-large, use vector(1024) instead of vector(384)
CREATE TABLE memories (
    ...
    embedding vector(1024),
    ...
);

Language support

Built-in: English. Custom languages in ~10 lines:

import re
from breathe import Synapse, BreatheConfig, LanguagePack

GERMAN = LanguagePack(
    code="de",
    stopwords=frozenset({"der", "die", "das", "und", "ist", ...}),
    hub_exclusions=frozenset({"system", "speicher"}),
    temporal_pattern=re.compile(r"\b(gestern|heute|morgen|neulich)\b", re.I),
    emotional_pattern=re.compile(r"\b(müde|glücklich|traurig|wütend)\b", re.I),
    labels={"themes": "Themen", "insights": "Erkenntnisse"},
)

config = BreatheConfig(language_packs=[GERMAN], default_language="de")
synapse = Synapse(config=config, ...)

Language packs control:

  • Stopwords — excluded from relevance scoring
  • Hub exclusions — nodes too generic to be useful for injection (e.g. "system", "memory"). Add your most frequent root concepts here — words that connect to everything are noise in retrieval. The more specific your exclusions, the sharper your injections.
  • Temporal and emotional regex patterns — anchor extraction for time references and emotional signals
  • UI section labels — headers used in the injected <associative_memory> block

Architecture

SYNAPSE pipeline (per-request; 2–60ms, ~300ms when the model extractor fires)

User message
     │
     ▼
AnchorExtractor
  ├─ Match known concepts (regex, 0.9 confidence)
  ├─ Temporal patterns   (0.7)
  ├─ Technical patterns  (0.5)
  └─ Emotional signals   (0.6)
     │
     ▼ [optional Phase 3 — Apple Silicon only]
ModelAnchorExtractor (local LLM via MLX, ~250ms)
  └─ Fires only when regex finds <5 matched nodes
     │
     ▼
Three traversal strategies (in parallel):
  1. Graph BFS      ── memory_nodes + memory_edges (recursive CTE)
  2. Vector search  ── any VectorSearchClient (pgvector, Pinecone, etc.)
  3. Keyword search ── ILIKE on unmatched anchors
     │
     ▼
Relevance filter
  ├─ Hub exclusion (drop super-generic nodes)
  ├─ Session dedup (skip already-injected nodes)
  └─ Keyword overlap scoring (anchor words vs node content)
     │
     ▼
ContextInjector
  └─ <associative_memory> block → prepended to last user message
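
All three traversal strategies run only when their backends are wired in. A sketch of a Synapse with both a MemoryRepository and a VectorSearchClient attached (parameter names follow the Quick start and Memory Nexus examples above; passing both together is assumed to be supported):

synapse = Synapse(
    repository=MyMemoryRepo(),   # enables graph BFS + keyword search
    vector_client=store,         # enables semantic vector search
    config=BreatheConfig(),
)
await synapse.initialize()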

GraphCompactor (when context fills up)

Old messages (compressible zone)
     │
     ▼ preprocess: compress tool call JSON
     ▼
LLM extraction call (your LLMClient)
     │
     ▼
SessionGraph: Topics / Decisions / Open / Artifacts / Context / Dropped
     │
     ▼
[SESSION GRAPH] message + protected recent messages
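
Triggering compaction at ~80% context fill is the application's call (the Quick start invokes compress() directly). A minimal trigger sketch, where CONTEXT_LIMIT depends on your model and the token count is a rough character-based estimate:

CONTEXT_LIMIT = 200_000  # your model's context window, in tokens

def count_tokens(text: str) -> int:
    # rough estimate; swap in a real tokenizer (e.g. tiktoken) for accuracy
    return len(text) // 4

async def maybe_compact(messages, compactor):
    used = sum(count_tokens(m["content"]) for m in messages)
    if used > 0.8 * CONTEXT_LIMIT:
        result = await compactor.compress(messages)
        return result["compressed_messages"]
    return messages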

Configuration

from breathe import BreatheConfig
from breathe.config import ENGLISH

config = BreatheConfig(
    # Language packs (all active simultaneously)
    language_packs=[ENGLISH],
    default_language="en",

    # SYNAPSE tuning
    min_similarity=0.55,       # min vector similarity to accept
    max_injected_nodes=15,     # max nodes per injection
    enable_model_extractor=True,
    model_trigger_threshold=5, # model fires when regex finds <5 nodes

    # Token budgets by conversation mode
    mode_budgets={
        "casual":   1500,
        "work":     2500,
        "deep":     4000,
        "balanced": 2000,
    },

    # GraphCompactor
    compactor_model="claude-sonnet-4-20250514",
    compactor_fallback_model="claude-haiku-4-5-20251001",
    min_tokens_to_compress=300,
    protected_messages_normal=10,
)

Implementing backends

MemoryRepository (for graph BFS + keyword search)

from breathe.interfaces import MemoryRepository, RetrievedNode

class MyRepo(MemoryRepository):
    async def get_concepts(self) -> dict[str, str]:
        # Return {concept_text: uuid} from your knowledge graph
        return {"Redis": "abc-123", "FastAPI": "def-456"}

    async def graph_bfs(self, start_ids, max_depth=2, **kwargs) -> list[RetrievedNode]:
        # BFS from start_ids through your concept graph
        # Recursive CTE on (memory_nodes, memory_edges) works well
        ...

    async def keyword_search(self, keywords, limit=5) -> list[RetrievedNode]:
        # ILIKE search over your memories/documents table
        ...

    async def flush_edges(self, edges) -> int:
        # Optional: persist new session graph edges to long-term storage
        return 0
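
A concrete graph_bfs sketch using a recursive CTE with asyncpg, assuming memory_nodes(id, content) and memory_edges(source_id, target_id) tables, and assuming RetrievedNode takes id and content (check breathe.interfaces for the actual fields):

BFS_SQL = """
WITH RECURSIVE walk AS (
    SELECT id, 0 AS depth
    FROM memory_nodes
    WHERE id = ANY($1::text[])
    UNION
    SELECT e.target_id, w.depth + 1
    FROM memory_edges e
    JOIN walk w ON e.source_id = w.id
    WHERE w.depth < $2
)
SELECT n.id, n.content
FROM memory_nodes n
JOIN walk w ON n.id = w.id;
"""

class PostgresRepo(MyRepo):
    def __init__(self, pool):  # asyncpg.Pool
        self._pool = pool

    async def graph_bfs(self, start_ids, max_depth=2, **kwargs) -> list[RetrievedNode]:
        rows = await self._pool.fetch(BFS_SQL, list(start_ids), max_depth)
        return [RetrievedNode(id=r["id"], content=r["content"]) for r in rows]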

VectorSearchClient (for semantic search)

from breathe.interfaces import VectorSearchClient, RetrievedNode

class PineconeClient(VectorSearchClient):
    async def search(self, query: str, limit: int = 5) -> list[RetrievedNode]:
        # embed query, search your vector index, return RetrievedNode list
        ...
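
A more concrete sketch against the pgvector schema above, assuming an asyncpg pool and the default sentence-transformers embedder (RetrievedNode fields are the same assumption as before):

from sentence_transformers import SentenceTransformer

class PgVectorSearchClient(VectorSearchClient):
    def __init__(self, pool):  # asyncpg.Pool
        self._pool = pool
        self._model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim default

    async def search(self, query: str, limit: int = 5) -> list[RetrievedNode]:
        emb = self._model.encode(query)
        vec = "[" + ",".join(str(x) for x in emb) + "]"  # pgvector text format
        rows = await self._pool.fetch(
            "SELECT id, content FROM memories "
            "ORDER BY embedding <=> $1::vector LIMIT $2",
            vec, limit,
        )
        return [RetrievedNode(id=r["id"], content=r["content"]) for r in rows]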

LLMClient (for GraphCompactor)

from breathe.interfaces import LLMClient

class AnthropicClient(LLMClient):
    def __init__(self, api_key: str):
        import anthropic
        self._client = anthropic.AsyncAnthropic(api_key=api_key)

    async def complete(self, prompt, max_tokens=4000, temperature=0.2):
        msg = await self._client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

class OpenAIClient(LLMClient):
    async def complete(self, prompt, max_tokens=4000, temperature=0.2):
        from openai import AsyncOpenAI
        client = AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o",
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

Performance

Measured in production on Apple M2 Max:

Component                  Latency   Notes
Regex extraction           2ms       always runs
MLX model extraction       ~250ms    conditional (when regex < 5 matches)
Graph BFS (PG)             5–15ms    recursive CTE, depth=2
Vector search (pgvector)   10–30ms   depends on index size
Keyword search (ILIKE)     3–10ms    depends on table size
Total SYNAPSE              2–60ms    without model
Total SYNAPSE              ~300ms    with model
GraphCompactor             3–8s      one LLM call, happens rarely

GraphCompactor fires infrequently (only at ~80% context fill), so its latency doesn't affect per-request response time.


Memory management

BREATHE handles retrieval and injection automatically. Storing memories is your application's responsibility — you decide what to remember and when.

# Your application stores memories explicitly
await store.store("User prefers dark mode and concise answers")
await store.store("Project uses FastAPI + PostgreSQL + Redis stack")

# SYNAPSE retrieves relevant ones automatically before each LLM call
messages = await synapse.inject(messages)

This is intentional: memory storage policies (what to keep, when to forget, privacy rules) vary wildly between applications. BREATHE gives you the retrieval engine — you control the data.

Coming soon: A standalone MCP server wrapping Memory Nexus, so LLMs can store and search memories directly as tool calls.


Optional dependencies

# PostgreSQL + pgvector backend
pip install breathe-memory[pg]

# Apple Silicon local model extractor (MLX)
pip install breathe-memory[mlx]

# Anthropic client for GraphCompactor
pip install breathe-memory[anthropic]

# OpenAI client for GraphCompactor
pip install breathe-memory[openai]

# Everything
pip install breathe-memory[all]

The core package has no dependencies beyond the Python stdlib and typing-extensions.

Model extractor (Phase 3)

The optional ModelAnchorExtractor uses MLX to run a small local LLM for contextual anchor extraction when regex alone isn't enough.

This requires Apple Silicon (M1/M2/M3/M4). MLX is an Apple-only framework and will not work on Linux or Windows. If MLX is not installed, the model extractor is silently skipped — everything else works normally.

The default model is Qwen3-1.7B (4-bit, ~1.2 GB RAM). You can swap it for any MLX-compatible model by passing model_id to ModelAnchorExtractor. If you need cross-platform model extraction, implement your own extractor using any inference backend (ollama, vLLM, API calls) — the interface is a single extract(message) -> list[Anchor] method.
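
A cross-platform extractor sketch using a local Ollama server over HTTP (the /api/generate endpoint and payload follow Ollama's API; the Anchor return type and how SYNAPSE consumes it are assumptions to verify against breathe's interfaces):

import json
import httpx

class OllamaAnchorExtractor:
    """Hypothetical drop-in for ModelAnchorExtractor on non-Apple hardware."""

    def __init__(self, model: str = "qwen2.5:1.5b", url: str = "http://localhost:11434"):
        self._model = model
        self._url = url

    async def extract(self, message: str) -> list[str]:
        prompt = (
            "List up to 5 key concepts from the following message "
            f"as a JSON array of strings.\n\n{message}"
        )
        async with httpx.AsyncClient(timeout=30.0) as client:
            resp = await client.post(
                f"{self._url}/api/generate",
                json={"model": self._model, "prompt": prompt, "stream": False},
            )
        # Assumes the model replies with valid JSON; wrap the strings in
        # breathe's Anchor type before handing them to SYNAPSE.
        return json.loads(resp.json()["response"])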


Monitoring

from breathe import BreatheMetrics

stats = BreatheMetrics.get().to_dict()
# {
#   "synapse": {
#     "total_injections": 142,
#     "hit_rate": 0.87,
#     "latency": {"avg_ms": 18.3, "p95_ms": 45.1},
#     "top_anchors": [{"text": "FastAPI", "count": 23}, ...]
#   },
#   "compaction": {
#     "total": 3,
#     "avg_ratio": 0.71,
#     "total_saved_tokens": 12400
#   }
# }

Expose via your API: GET /api/breathe-stats → BreatheMetrics.get().to_dict()
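
For example, with FastAPI (a minimal sketch):

from fastapi import FastAPI
from breathe import BreatheMetrics

app = FastAPI()

@app.get("/api/breathe-stats")
async def breathe_stats():
    return BreatheMetrics.get().to_dict()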


License

Apache 2.0 — see LICENSE.

Built by Kenaz GmbH — Custom AI Agents, MCP Servers, Semantic Engineering.
