
Persistent memory for LLM coding agents — local-first, model-agnostic, SQLite-backed


callmem

Renamed from llm-mem in v0.2.0. Existing installs keep working — the llm-mem console script, the python -m llm_mem.mcp.server entry point, the .llm-mem/ config dir, and the LLM_MEM_* env vars are all honored as aliases. Run callmem migrate in a project to update its names to the new canonical forms.

Persistent memory for LLM coding agents.

callmem gives coding agents a durable, searchable memory that survives across sessions. It captures what happened, compresses it in the background using a local LLM, and serves a compact briefing when the next session starts — so the agent picks up where you left off without manual context management.

Works with OpenCode and Claude Code side by side. Both tools write to the same memory database, so you can swap between them mid-project (for example, when one hits a rate limit) without losing context. The extraction LLM is also swappable at any time: switch models in config.toml and restart the daemon, and prior memories stay intact and remain usable.

Inspired by claude-mem, but built from the ground up with a different philosophy: model-agnostic, local-first, and designed around pluggable LLM backends rather than a single vendor.


Key Features

Automatic Context at Startup

When a new session begins, callmem generates a structured briefing with context economics, an emoji-coded observation timeline, and a session summary. This is written to SESSION_SUMMARY.md in your project root, where your coding agent picks it up automatically — no manual context management needed.

Real-Time Capture and Extraction

During the session, callmem ingests prompts, responses, tool calls, and file changes from your coding agent — either via OpenCode's SSE stream or by tailing Claude Code's JSONL transcripts at ~/.claude/projects/<slug>/*.jsonl. Both run concurrently, so a project using both agents sees unified history. A background worker runs entity extraction through your local LLM, pulling out decisions, facts, TODOs, bugs, features, and discoveries. New observations appear in the web UI within milliseconds via Server-Sent Events.
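The JSONL transcript tailing described above hinges on tracking a byte offset so a daemon restart resumes mid-file. A minimal sketch of that idea; `tail_jsonl` is a hypothetical helper for illustration, not callmem's actual implementation:

```python
import json
from pathlib import Path


def tail_jsonl(path: Path, offset: int) -> tuple[list[dict], int]:
    """Read complete JSONL records appended since `offset`.

    Returns the parsed records plus the new byte offset, so the caller
    can persist the offset and resume after a restart without replaying
    or dropping events.
    """
    records = []
    with path.open("rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                # Empty read (EOF) or a partial write still in progress:
                # stop here and re-read this line on the next poll.
                break
            offset = f.tell()
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records, offset
```

Only fully newline-terminated lines advance the offset, which keeps a half-written record from being parsed or skipped.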

Layered Compression

Raw events are compressed through multiple layers to keep token usage in check:

  • Entity extraction — structured knowledge pulled from raw conversation
  • Chunk summaries — rolling mid-session compression every N events
  • Session summaries — structured wrap-up when a session ends (Investigated / Learned / Completed / Next Steps)
  • Cross-session summaries — periodic project-level rollups
  • Compaction — old events archived to keep the database lean
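The chunk-summary layer ("every N events") can be pictured as a buffer that flushes through a summarizer callback. A sketch under assumptions; `ChunkSummarizer` and its interface are illustrative, not callmem's real classes:

```python
from typing import Callable


class ChunkSummarizer:
    """Rolling mid-session compression: every `chunk_size` raw events,
    collapse the buffer into one summary via the provided callback
    (in callmem's case, a local LLM call)."""

    def __init__(self, summarize: Callable[[list[str]], str], chunk_size: int = 10):
        self.summarize = summarize
        self.chunk_size = chunk_size
        self.buffer: list[str] = []
        self.summaries: list[str] = []

    def add(self, event: str) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.chunk_size:
            self.summaries.append(self.summarize(self.buffer))
            self.buffer.clear()
```

Session and cross-session summaries then work the same way one level up, consuming chunk summaries instead of raw events.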

Dual Content Views

Each observation has two representations:

  • Key Points — bullet-point summary (~50-100 tokens), cheap for context injection
  • Synopsis — flowing prose paragraph (~200-400 tokens), loaded on demand
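The trade-off between the two views is a context-budget decision: inject the cheap form by default, the expensive one only when it fits. A hedged sketch of that selection logic (the dataclass and word-count token estimate are illustrative, not callmem's real model):

```python
from dataclasses import dataclass


@dataclass
class Observation:
    title: str
    key_points: str   # bullet summary, ~50-100 tokens, cheap default
    synopsis: str     # prose paragraph, ~200-400 tokens, on demand


def render(obs: Observation, budget_tokens: int) -> str:
    """Pick the representation that fits the remaining context budget.

    Token cost is approximated by word count here; a real tokenizer
    would give different (usually larger) numbers.
    """
    synopsis_cost = len(obs.synopsis.split())
    return obs.synopsis if synopsis_cost <= budget_tokens else obs.key_points
```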

Pluggable LLM Backend

callmem doesn't care which model you use for coding. It uses a separate local model for memory maintenance:

  • Ollama (recommended) — fully local, zero API cost, works offline
  • OpenAI-compatible — any /v1/chat/completions API (LM Studio, vLLM, etc.)
  • None — pattern-only mode when no LLM is available
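A pluggable backend like this usually reduces to one narrow interface that every provider implements. A sketch of how such a factory could look; the class names, the `/api/generate` call shape, and the config dict are assumptions for illustration, not callmem's actual code (an `openai_compat` branch would be analogous, posting to `/v1/chat/completions`):

```python
from typing import Protocol


class ExtractionBackend(Protocol):
    """The one method the extraction worker needs from any backend."""
    def complete(self, prompt: str) -> str: ...


class OllamaBackend:
    def __init__(self, model: str, api_base: str = "http://localhost:11434"):
        self.model, self.api_base = model, api_base

    def complete(self, prompt: str) -> str:
        import json
        import urllib.request
        req = urllib.request.Request(
            f"{self.api_base}/api/generate",
            data=json.dumps(
                {"model": self.model, "prompt": prompt, "stream": False}
            ).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["response"]


class NoneBackend:
    """Pattern-only mode: no model available, extraction falls back
    to regex heuristics upstream."""
    def complete(self, prompt: str) -> str:
        return ""


def make_backend(cfg: dict) -> ExtractionBackend:
    if cfg["backend"] == "ollama":
        return OllamaBackend(cfg["model"], cfg.get("api_base", "http://localhost:11434"))
    if cfg["backend"] == "none":
        return NoneBackend()
    raise ValueError(f"unknown backend: {cfg['backend']}")
```

Because callers only ever see `complete()`, swapping the model or the whole provider never touches the stored memories.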

The extraction model can be swapped at any time. Change [ollama].model (or [openai_compat].model) in config.toml and restart the daemon — previously extracted entities remain valid and fully searchable, and new events get extracted with whichever model is configured now. Tested upgrades: qwen3:8b → qwen3:30b → gemma4:e4b. You can also flip [llm].backend between ollama, openai_compat, and none without touching the database.

Web UI

A local web interface for browsing and managing memories:

  • Card-based feed with colour-coded category badges
  • Expandable cards with Key Points / Synopsis toggle
  • Real-time updates via SSE
  • Full-text search across all entities
  • Session browser with event timeline
  • Briefing preview showing exactly what the agent sees
  • Accessible from your Tailscale network (default bind 0.0.0.0)

Sensitive Data Handling

Two-layer detection (pattern matching + LLM classification) catches secrets, credentials, and PII at ingest time. Detected items are redacted from memory and stored in an encrypted vault with configurable false-positive management.
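The first (pattern-matching) layer boils down to a table of regexes applied at ingest time. A minimal sketch; these three patterns are illustrative examples, not callmem's actual rule set:

```python
import re

# Illustrative first-layer patterns; a real rule set is much larger.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace pattern matches with labelled placeholders.

    Returns the redacted text plus the raw hits, which a second layer
    (LLM classification) can confirm before vaulting or marking as a
    false positive.
    """
    hits = []
    for name, pat in SECRET_PATTERNS.items():
        hits.extend(m.group() for m in pat.finditer(text))
        text = pat.sub(f"[REDACTED:{name}]", text)
    return text, hits
```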

MCP Integration

Exposes tools via the Model Context Protocol so compatible agents can query memory on demand:

  • search — full-text search across all observations
  • get_briefing — generate a startup briefing
  • search_by_file — find observations related to specific files
  • timeline — chronological context around an observation
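Conceptually, a tool surface like this is a name-to-handler table dispatched over JSON arguments. A plain-Python sketch of that dispatch pattern (deliberately not using the real MCP SDK; the in-memory corpus stands in for the FTS5-backed search):

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., object]] = {}


def tool(fn):
    """Register a function as a callable memory tool by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def search(query: str, limit: int = 10) -> list[str]:
    # Stand-in corpus; the real tool queries SQLite FTS5.
    corpus = ["decision: use SQLite", "todo: add vector search"]
    return [row for row in corpus if query in row][:limit]


def dispatch(request_json: str) -> object:
    """Route a JSON tool request to the registered handler."""
    req = json.loads(request_json)
    return TOOLS[req["tool"]](**req.get("arguments", {}))
```

An MCP server adds protocol framing and schema advertisement around this core, but the handler table is the essential shape.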

Requirements

  • Python 3.10+
  • Ollama with a summarisation model (recommended: qwen3:8b or qwen3:30b)
  • Linux (tested on Ubuntu/Debian x86_64 and ARM64)
  • SQLite 3.35+ (FTS5 support required)

Tested Environments

Environment                                                Status
Ubuntu 24.04, x86_64 (Hetzner VPS)                         Tested
Ubuntu 24.04, ARM64 (Hetzner CAX31)                        Tested
Ollama with qwen3:8b                                       Tested
Ollama with qwen3:30b                                      Tested
Ollama with gemma4:e4b                                     Tested
OpenCode as coding agent                                   Tested
Claude Code as coding agent                                Tested
Running both OpenCode and Claude Code against one project  Tested

macOS and Windows are untested but should work anywhere Python and Ollama run. The systemd service integration is Linux-only.


Quick Start

1. Prerequisites

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a summarisation model (any instruction-following model works)
ollama pull qwen3:8b

2. Install callmem

# Install with pip (recommended)
pip install callmem

# Or with uv
uv pip install callmem

# Or install from source for development
git clone https://github.com/callmem/callmem.git
cd callmem
uv sync --extra dev

3. Run the Setup Wizard

uv run callmem setup

The wizard walks you through:

  • Choosing your LLM backend and model
  • Setting the web UI port (with multi-project conflict detection)
  • Configuring network bind address
  • Picking your coding tools (OpenCode / Claude Code / both) — writes opencode.json and/or .mcp.json as needed, and patches AGENTS.md / CLAUDE.md with MCP usage instructions
  • Importing existing OpenCode sessions from SQLite
  • Optionally installing a systemd user service for auto-start

The setup script is safe to re-run — it reconfigures without wiping data and backs up your config.toml before changes.

4. Start the Daemon

# All-in-one: web UI + background workers + OpenCode SSE adapter
# + Claude Code JSONL tailer. Each adapter is independently gated
# by [adapters].opencode / [adapters].claude_code in config.toml.
uv run callmem daemon

# Or via make
make daemon

# Or via systemd (if installed during setup)
make start

Restarting after an upgrade or config change. The systemd service holds the old code in memory, so git pull / pip install -e . / config.toml edits only take effect after a restart:

make restart          # this project (preferred)
make logs             # tail journalctl for the service

# Or raw systemctl — the unit is named callmem-<project-dir-name>:
systemctl --user restart callmem-<project>.service
systemctl --user status  callmem-<project>.service

Each project that ran callmem setup with systemd enabled has its own unit, so restart the one matching your project directory.

5. Configure Your Coding Agent

callmem setup and callmem init both write the right config file(s) automatically. If you prefer to wire it up by hand:

OpenCode — add to opencode.json in your project:

{
  "mcp": {
    "callmem": {
      "type": "local",
      "command": ["python3", "-m", "callmem.mcp.server", "--project", "."],
      "enabled": true
    }
  }
}

Claude Code — add to .mcp.json in your project (note the split command/args):

{
  "mcpServers": {
    "callmem": {
      "command": "python3",
      "args": ["-m", "callmem.mcp.server", "--project", "."]
    }
  }
}

Both can coexist; they share the same SQLite database.
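For two agent processes to write to one SQLite file safely, the usual recipe is WAL journaling plus a busy timeout; whether callmem sets exactly these pragmas is an assumption, but the sketch shows the standard pattern:

```python
import sqlite3


def open_shared(db_path: str) -> sqlite3.Connection:
    """Open a shared database so multiple processes can write
    concurrently without immediate 'database is locked' errors."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")   # readers don't block writers
    conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5s instead of failing
    return conn
```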

Once configured, you can type /briefing in OpenCode to display the startup briefing on demand. In Claude Code, ask the agent to call the mem_get_briefing MCP tool, or read SESSION_SUMMARY.md directly.

6. Open the Web UI

Navigate to http://localhost:9090 (or your configured host:port).


How It Works

┌─────────────────┐   SSE        ┌─────────────────┐   Extract      ┌──────────────┐
│  OpenCode       │ ──────────▶  │   callmem       │ ─────────────▶ │  Local LLM   │
└─────────────────┘              │   Adapters      │                │  (Ollama or  │
┌─────────────────┐   JSONL      │  (opencode +    │                │  OpenAI-like)│
│  Claude Code    │ ──────────▶  │   claude_code)  │                └──────────────┘
└─────────────────┘              └────────┬────────┘
                                          │
                                          ▼
                                 ┌─────────────────┐
                                 │   SQLite DB     │
                                 │   + FTS5        │
                                 └────────┬────────┘
                                          │
                              ┌───────────┼───────────┐
                              ▼           ▼           ▼
                        ┌──────────┐ ┌──────────┐ ┌──────────┐
                        │ Web UI   │ │ MCP      │ │ Briefing │
                        │ Feed     │ │ Server   │ │ Writer   │
                        └──────────┘ └──────────┘ └──────────┘
  1. Capture: The OpenCode adapter subscribes to its SSE stream; the Claude Code adapter tails each transcript file under ~/.claude/projects/<slug>/ using a persistent byte-offset, so a restart resumes mid-file instead of replaying or dropping records. Both run inside the same daemon process.
  2. Extract: A background worker sends event batches to your local LLM for entity extraction — decisions, facts, TODOs, bugs, features, discoveries. Switching the extraction model later does not invalidate past entities.
  3. Compress: Summaries are generated at chunk, session, and cross-session levels.
  4. Serve: The briefing writer generates SESSION_SUMMARY.md in your project root; the MCP server responds to on-demand queries from whichever agent you're using; the web UI shows everything in real-time.

CLI Reference

callmem setup              # Interactive setup wizard
callmem daemon             # Start UI + workers + adapter in one process
callmem ui                 # Start web UI only
callmem serve              # Start MCP server only
callmem import --source opencode     --all  # Import OpenCode sessions from SQLite
callmem import --source claude-code  --all  # Import Claude Code transcripts (JSONL)
callmem import --status                     # Show current/last import progress
callmem status             # Show service status
callmem search <query>     # Search memories from the command line
callmem briefing           # Generate and print a briefing
callmem briefing --write   # Write briefing to SESSION_SUMMARY.md

make restart               # Restart the systemd unit for this project
make logs                  # Tail journalctl for the service

Entity Categories

callmem extracts these observation types, each with a colour-coded badge in the UI:

Category   Icon  Description
Feature    🟢    New functionality added
Bugfix     🔴    Bug identified and/or fixed
Discovery  🔵    Notable insight or finding
Decision   ⚖️    Architectural or design choice
Todo       📋    Task to be done
Fact       📝    Durable project knowledge
Failure          Error or failure encountered
Research   🔬    Investigation or analysis
Change     🔄    General code or file change

Configuration

All settings live in .callmem/config.toml in your project root. Key options:

[llm]
backend = "ollama"               # ollama | openai_compat | none
model = "qwen3:8b"
api_base = "http://localhost:11434"

[ui]
host = "0.0.0.0"                 # Bind address (0.0.0.0 for Tailscale access)
port = 9090

[adapters]
opencode = true                  # Listen on OpenCode's SSE stream
claude_code = true               # Tail ~/.claude/projects/<slug>/*.jsonl
claude_code_poll_interval = 2.0  # seconds between disk scans
claude_code_idle_timeout = 300   # seconds before an idle CC session is closed

[briefing]
max_tokens = 2000                # Token budget for startup briefing
auto_write_session_summary = true
session_summary_filename = "SESSION_SUMMARY.md"

[extraction]
batch_size = 10                  # Events per extraction batch

See docs/config.md for the full reference.


Why SQLite, Not a Vector DB?

Most coding memory retrieval is structured:

  • "What decisions did we make about auth?"
  • "What TODOs are still open?"
  • "What happened in the last 3 sessions?"

These are better served by structured tables + full-text search (FTS5) than embedding similarity. SQLite is zero-dependency, single-file, trivially backed up, and fast enough for tens of thousands of memories.
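FTS5 makes those structured queries a few lines of stdlib Python. A self-contained illustration of the approach (hypothetical schema and rows, not callmem's actual tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- fts5 gives full-text indexing over ordinary columns
    CREATE VIRTUAL TABLE obs USING fts5(category, content);
    INSERT INTO obs VALUES ('decision', 'Use JWT for auth tokens');
    INSERT INTO obs VALUES ('todo', 'Rotate auth secrets quarterly');
    INSERT INTO obs VALUES ('fact', 'CI runs on every push');
""")

# "What did we record about auth?" -- best matches first
hits = conn.execute(
    "SELECT category, content FROM obs WHERE obs MATCH ? ORDER BY rank",
    ("auth",),
).fetchall()
```

Category filters ("open TODOs") and recency windows ("last 3 sessions") then compose as ordinary WHERE clauses alongside MATCH.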

Vector embeddings are a planned enhancement for semantic retrieval when keyword search isn't enough. The schema is designed to accommodate them without migration pain.


Project Structure

callmem/
├── src/callmem/
│   ├── adapters/       # OpenCode SSE + import, Claude Code live tailer + import
│   ├── core/           # Engine, extraction, briefing, compression, event bus
│   ├── mcp/            # MCP server and tool definitions
│   ├── models/         # Data models, config, entity types
│   └── ui/             # Web UI (FastAPI + Jinja2 + htmx + SSE)
├── tests/              # 630+ tests (unit + integration)
├── docs/               # Architecture, schema, config, roadmap docs
└── pyproject.toml

Development

# Install with dev dependencies
uv sync --extra dev

# Run tests
make test

# Run linter
make lint

# Run both
make check

Roadmap

See docs/roadmap.md for the full plan. Highlights:

  • Progressive disclosure search — 3-layer MCP search pattern (index → timeline → full details)
  • File-level tracking — associate observations with specific files
  • Knowledge agents — build queryable corpora from observation history
  • Settings panel — web UI for config with live briefing preview
  • Vector search — optional semantic retrieval via sentence-transformers

Acknowledgements

callmem was inspired by claude-mem by Alex Newman. We share the same goal — giving coding agents persistent memory — but callmem is built from scratch with a focus on model-agnostic operation, local-first architecture, and pluggable LLM backends.


Known Issues

Auto-briefing plugin does not trigger on session start

Setup installs an OpenCode plugin (.opencode/plugins/auto-briefing.js) that should auto-display the briefing when a new session starts. However, due to an upstream OpenCode bug where session.created events do not fire for plugins, this does not currently work. Use the /briefing command in OpenCode as a workaround. The plugin will activate automatically once the bug is fixed upstream; no changes are needed.

Claude Code: tool results and thinking blocks are not ingested

The Claude Code adapter maps user prompts, assistant text, and tool_use blocks into the memory feed. tool_result blocks (system-side responses to tool calls) and thinking blocks are skipped in the current release to keep signal-to-noise high. A follow-up will revisit this — until then, a tool call appears in the feed but its outcome does not.

Python 3.10 compatibility shims

callmem supports Python 3.10+ but requires tomli (backport of tomllib) and typing_extensions on Python 3.10. These are installed automatically as conditional dependencies. On Python 3.11+, the stdlib equivalents are used.
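The shim amounts to a conditional import at module load; a sketch of the standard pattern (the real module layout in callmem may differ):

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib            # stdlib since Python 3.11
else:
    import tomli as tomllib   # backport, installed only on 3.10


def load_config(path: str) -> dict:
    # tomllib/tomli both require the file opened in binary mode
    with open(path, "rb") as f:
        return tomllib.load(f)
```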


License

MIT
