callmem
Persistent memory for LLM coding agents — local-first, model-agnostic, SQLite-backed.
Renamed from llm-mem in v0.2.0. Existing installs keep working — the llm-mem console script, the python -m llm_mem.mcp.server entry point, the .llm-mem/ config dir, and the LLM_MEM_* env vars are all honored as aliases. Run callmem migrate in a project to update its names to the new canonical forms.
callmem gives coding agents a durable, searchable memory that survives across sessions. It captures what happened, compresses it in the background using a local LLM, and serves a compact briefing when the next session starts — so the agent picks up where you left off without manual context management.
Works with OpenCode and Claude Code side by side. Both tools write to the same memory database, so you can swap between them mid-project (say, when one hits a rate limit) without losing context. The extraction LLM is also swappable at any time — switch models in config.toml and restart the daemon; prior memories stay intact and remain usable.
Inspired by claude-mem, but built from the ground up with a different philosophy: model-agnostic, local-first, and designed around pluggable LLM backends rather than a single vendor.
Key Features
Automatic Context at Startup
When a new session begins, callmem generates a structured briefing with context economics, an emoji-coded observation timeline, and a session summary. This is written to SESSION_SUMMARY.md in your project root, where your coding agent picks it up automatically — no manual context management needed.
Real-Time Capture and Extraction
During the session, callmem ingests prompts, responses, tool calls, and file changes from your coding agent — either via OpenCode's SSE stream or by tailing Claude Code's JSONL transcripts at ~/.claude/projects/<slug>/*.jsonl. Both run concurrently, so a project using both agents sees unified history. A background worker runs entity extraction through your local LLM, pulling out decisions, facts, TODOs, bugs, features, and discoveries. New observations appear in the web UI within milliseconds via Server-Sent Events.
Layered Compression
Raw events are compressed through multiple layers to keep token usage in check:
- Entity extraction — structured knowledge pulled from raw conversation
- Chunk summaries — rolling mid-session compression every N events (sketched after this list)
- Session summaries — structured wrap-up when a session ends (Investigated / Learned / Completed / Next Steps)
- Cross-session summaries — periodic project-level rollups
- Compaction — old events archived to keep the database lean
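The chunk-summary layer is the one doing rolling work mid-session. A minimal sketch of the idea — function and variable names here are hypothetical, not callmem's internals; the summarise callable stands in for whatever LLM backend is configured:

CHUNK_SIZE = 10  # "every N events"

def maybe_compress(events: list[str], summaries: list[str], summarise) -> None:
    # Once a full chunk of raw events has accumulated, fold it into one
    # rolling summary. The raw events still exist for the later layers;
    # only the working set handed to the briefing shrinks.
    while len(events) >= CHUNK_SIZE:
        chunk = events[:CHUNK_SIZE]
        del events[:CHUNK_SIZE]
        summaries.append(summarise("\n".join(chunk)))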
Dual Content Views
Each observation has two representations:
- Key Points — bullet-point summary (~50-100 tokens), cheap for context injection
- Synopsis — flowing prose paragraph (~200-400 tokens), loaded on demand
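A sketch of how the two views can sit side by side on one record — field names are illustrative, not callmem's actual schema:

from dataclasses import dataclass

@dataclass
class Observation:
    category: str            # e.g. "decision", "bugfix"
    key_points: list[str]    # bullet summary, ~50-100 tokens
    synopsis: str            # prose paragraph, ~200-400 tokens

def briefing_lines(obs: Observation) -> str:
    # Briefings inject the cheap view; the synopsis is fetched on demand.
    return "\n".join(f"- {point}" for point in obs.key_points)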
Pluggable LLM Backend
callmem doesn't care which model you use for coding. It uses a separate local model for memory maintenance:
- Ollama (recommended) — fully local, zero API cost, works offline
- OpenAI-compatible — any /v1/chat/completions API (LM Studio, vLLM, etc.)
- None — pattern-only mode when no LLM is available
The extraction model can be swapped at any time. Change [ollama].model (or [openai_compat].model) in config.toml and restart the daemon — previously extracted entities remain valid and fully searchable, and new events are extracted with whichever model is currently configured. Tested swaps: qwen3:8b ↔ qwen3:30b ↔ gemma4:e4b. You can also flip [llm].backend between ollama, openai_compat, and none without touching the database.
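The "pluggable" part boils down to a narrow interface the memory pipeline calls into. A sketch of the shape of such a boundary — class and method names here are illustrative, not callmem's actual code:

from typing import Protocol
import json, urllib.request

class Backend(Protocol):
    # Anything that can complete a prompt can drive extraction.
    def complete(self, prompt: str) -> str: ...

class OllamaBackend:
    def __init__(self, model: str, api_base: str = "http://localhost:11434"):
        self.model, self.api_base = model, api_base

    def complete(self, prompt: str) -> str:
        # Ollama's non-streaming generate endpoint returns {"response": ...}
        req = urllib.request.Request(
            f"{self.api_base}/api/generate",
            data=json.dumps({"model": self.model, "prompt": prompt,
                             "stream": False}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

Because past entities are stored as extracted rows rather than model-specific state, swapping the class (or the model name) changes only future extractions.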
Web UI
A local web interface for browsing and managing memories:
- Card-based feed with colour-coded category badges
- Expandable cards with Key Points / Synopsis toggle
- Real-time updates via SSE
- Full-text search across all entities
- Session browser with event timeline
- Briefing preview showing exactly what the agent sees
- Accessible from your Tailscale network (default bind 0.0.0.0)
Sensitive Data Handling
Two-layer detection (pattern matching + LLM classification) catches secrets, credentials, and PII at ingest time. Detected items are redacted from memory and stored in an encrypted vault with configurable false-positive management.
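A sketch of what the pattern layer can look like — these regexes are illustrative examples, not callmem's actual rule set, and the LLM classifier forms the second layer on top:

import re

PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def redact(text: str) -> tuple[str, list[tuple[str, str]]]:
    # Replace each hit with a typed placeholder; return the hits so the
    # caller can move them into the encrypted vault.
    hits = []
    for name, pattern in PATTERNS.items():
        hits += [(name, m.group(0)) for m in pattern.finditer(text)]
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text, hits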
MCP Integration
Exposes tools via the Model Context Protocol so compatible agents can query memory on demand:
- search — full-text search across all observations
- get_briefing — generate a startup briefing
- search_by_file — find observations related to specific files
- timeline — chronological context around an observation
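As an illustration, a script can exercise these tools over stdio using the reference MCP Python SDK (the mcp package). The exact argument schema — and whether your client sees the tools with a prefix, as with mem_get_briefing later in this README — are assumptions here:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(
        command="python3",
        args=["-m", "callmem.mcp.server", "--project", "."],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("search", {"query": "auth"})
            print(result)

asyncio.run(main())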
Requirements
- Python 3.10+
- Ollama with a summarisation model (recommended: qwen3:8b or qwen3:30b)
- Linux (tested on Ubuntu/Debian x86_64 and ARM64)
- SQLite 3.35+ (FTS5 support required)
Tested Environments
| Environment | Status |
|---|---|
| Ubuntu 24.04, x86_64 | Tested |
| Ubuntu 24.04, ARM64 | Tested |
| Ollama with qwen3:8b | Tested |
| Ollama with qwen3:30b | Tested |
| Ollama with gemma4:e4b | Tested |
| OpenCode as coding agent | Tested |
| Claude Code as coding agent | Tested |
| Running both OpenCode and Claude Code against one project | Tested |
macOS and Windows are untested but should work anywhere Python and Ollama run. The systemd service integration is Linux-only.
Quick Start
1. Prerequisites
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a summarisation model (any instruction-following model works)
ollama pull qwen3:8b
2. Install callmem
# Install with pip (recommended)
pip install callmem
# Or with uv
uv pip install callmem
# Or install from source for development
git clone https://github.com/DANgerous25/callmem.git
cd callmem
uv sync --extra dev
3. Run the Setup Wizard
uv run callmem setup
The wizard walks you through:
- Choosing your LLM backend and model
- Setting the web UI port (with multi-project conflict detection)
- Configuring network bind address
- Picking your coding tools (OpenCode / Claude Code / both) — writes opencode.json and/or .mcp.json as needed, and patches AGENTS.md / CLAUDE.md with MCP usage instructions
- Importing existing OpenCode sessions from SQLite
- Optionally installing a systemd user service for auto-start
The setup script is safe to re-run — it reconfigures without wiping data and backs up your config.toml before changes.
4. Start the Daemon
# All-in-one: web UI + background workers + OpenCode SSE adapter
# + Claude Code JSONL tailer. Each adapter is independently gated
# by [adapters].opencode / [adapters].claude_code in config.toml.
uv run callmem daemon
# Or via make
make daemon
# Or via systemd (if installed during setup)
make start
Restarting after an upgrade or config change. The systemd service holds the old code in memory, so git pull / pip install -e . / config.toml edits only take effect after a restart:
make restart # this project (preferred)
make logs # tail journalctl for the service
# Or raw systemctl — the unit is named callmem-<project-dir-name>:
systemctl --user restart callmem-<project>.service
systemctl --user status callmem-<project>.service
Each project that ran callmem setup with systemd enabled has its own unit, so restart the one matching your project directory.
5. Configure Your Coding Agent
callmem setup and callmem init both write the right config file(s) automatically. If you prefer to wire it up by hand:
OpenCode — add to opencode.json in your project:
{
"mcp": {
"callmem": {
"type": "local",
"command": ["python3", "-m", "callmem.mcp.server", "--project", "."],
"enabled": true
}
}
}
Claude Code — add to .mcp.json in your project (note the split command/args):
{
"mcpServers": {
"callmem": {
"command": "python3",
"args": ["-m", "callmem.mcp.server", "--project", "."]
}
}
}
Both can coexist; they share the same SQLite database.
Once configured, you can type /briefing in OpenCode to display the startup briefing on demand. In Claude Code, ask the agent to call the mem_get_briefing MCP tool, or read SESSION_SUMMARY.md directly.
6. Open the Web UI
Navigate to http://localhost:9090 (or your configured host:port).
How It Works
┌─────────────────┐ SSE ┌─────────────────┐ Extract ┌──────────────┐
│ OpenCode │ ──────────▶ │ callmem │ ─────────────▶ │ Local LLM │
└─────────────────┘ │ Adapters │ │ (Ollama or │
┌─────────────────┐ JSONL │ (opencode + │ │ OpenAI-like)│
│ Claude Code │ ──────────▶ │ claude_code) │ └──────────────┘
└─────────────────┘ └────────┬────────┘
│
▼
┌─────────────────┐
│ SQLite DB │
│ + FTS5 │
└────────┬────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Web UI │ │ MCP │ │ Briefing │
│ Feed │ │ Server │ │ Writer │
└──────────┘ └──────────┘ └──────────┘
- Capture: The OpenCode adapter subscribes to its SSE stream; the Claude Code adapter tails each transcript file under ~/.claude/projects/<slug>/ using a persistent byte offset, so a restart resumes mid-file instead of replaying or dropping records (see the sketch after this list). Both run inside the same daemon process.
- Extract: A background worker sends event batches to your local LLM for entity extraction — decisions, facts, TODOs, bugs, features, discoveries. Switching the extraction model later does not invalidate past entities.
- Compress: Summaries are generated at chunk, session, and cross-session levels.
- Serve: The briefing writer generates SESSION_SUMMARY.md in your project root; the MCP server responds to on-demand queries from whichever agent you're using; the web UI shows everything in real time.
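The persistent byte offset is what makes Claude Code capture crash-safe. A minimal sketch of the tailing idea — function and variable names are hypothetical, not callmem's actual code:

import json

def tail_jsonl(path: str, offset: int):
    # Yield (new_offset, record) for each complete line past `offset`.
    # A partially written trailing line is left for the next poll.
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            if not line.endswith(b"\n"):
                break
            offset += len(line)
            yield offset, json.loads(line)

# The daemon persists the offset after each record, so a restart resumes
# exactly where it left off instead of replaying or dropping events.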
CLI Reference
callmem setup # Interactive setup wizard
callmem daemon # Start UI + workers + adapter in one process
callmem ui # Start web UI only
callmem serve # Start MCP server only
callmem import --source opencode --all # Import OpenCode sessions from SQLite
callmem import --source claude-code --all # Import Claude Code transcripts (JSONL)
callmem import --status # Show current/last import progress
callmem status # Show service status
callmem search <query> # Search memories from the command line
callmem briefing # Generate and print a briefing
callmem briefing --write # Write briefing to SESSION_SUMMARY.md
make restart # Restart the systemd unit for this project
make logs # Tail journalctl for the service
Entity Categories
callmem extracts these observation types, each with a colour-coded badge in the UI:
| Category | Icon | Description |
|---|---|---|
| Feature | 🟢 | New functionality added |
| Bugfix | 🔴 | Bug identified and/or fixed |
| Discovery | 🔵 | Notable insight or finding |
| Decision | ⚖️ | Architectural or design choice |
| Todo | 📋 | Task to be done |
| Fact | 📝 | Durable project knowledge |
| Failure | ❌ | Error or failure encountered |
| Research | 🔬 | Investigation or analysis |
| Change | 🔄 | General code or file change |
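In code, the category set maps naturally onto an enum — a sketch, not callmem's actual model definitions:

from enum import Enum

class Category(str, Enum):
    FEATURE = "feature"      # 🟢 new functionality added
    BUGFIX = "bugfix"        # 🔴 bug identified and/or fixed
    DISCOVERY = "discovery"  # 🔵 notable insight or finding
    DECISION = "decision"    # ⚖️ architectural or design choice
    TODO = "todo"            # 📋 task to be done
    FACT = "fact"            # 📝 durable project knowledge
    FAILURE = "failure"      # ❌ error or failure encountered
    RESEARCH = "research"    # 🔬 investigation or analysis
    CHANGE = "change"        # 🔄 general code or file change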
Configuration
All settings live in .callmem/config.toml in your project root. Key options:
[llm]
backend = "ollama" # ollama | openai_compat | none
model = "qwen3:8b"
api_base = "http://localhost:11434"
[ui]
host = "0.0.0.0" # Bind address (0.0.0.0 for Tailscale access)
port = 9090
[adapters]
opencode = true # Listen on OpenCode's SSE stream
claude_code = true # Tail ~/.claude/projects/<slug>/*.jsonl
claude_code_poll_interval = 2.0 # seconds between disk scans
claude_code_idle_timeout = 300 # seconds before an idle CC session is closed
[briefing]
max_tokens = 2000 # Token budget for startup briefing
auto_write_session_summary = true
session_summary_filename = "SESSION_SUMMARY.md"
[extraction]
batch_size = 10 # Events per extraction batch
See docs/config.md for the full reference.
Why SQLite, Not a Vector DB?
Most coding memory retrieval is structured:
- "What decisions did we make about auth?"
- "What TODOs are still open?"
- "What happened in the last 3 sessions?"
These are better served by structured tables + full-text search (FTS5) than embedding similarity. SQLite is zero-dependency, single-file, trivially backed up, and fast enough for tens of thousands of memories.
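A self-contained taste of FTS5 via the stdlib sqlite3 module — table and column names here are illustrative, not callmem's actual schema:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE VIRTUAL TABLE obs USING fts5(category, key_points);
    INSERT INTO obs VALUES ('decision', 'use argon2 for password auth');
    INSERT INTO obs VALUES ('todo', 'add rate limiting to the auth endpoint');
""")
# FTS5 supports boolean operators and per-column filters, which covers
# structured questions like "decisions about auth" directly.
rows = con.execute(
    "SELECT category, key_points FROM obs WHERE obs MATCH ? ORDER BY rank",
    ("auth AND category:decision",),
).fetchall()
print(rows)  # [('decision', 'use argon2 for password auth')]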
Vector embeddings are a planned enhancement for semantic retrieval when keyword search isn't enough. The schema is designed to accommodate them without migration pain.
Project Structure
callmem/
├── src/callmem/
│ ├── adapters/ # OpenCode SSE + import, Claude Code live tailer + import
│ ├── core/ # Engine, extraction, briefing, compression, event bus
│ ├── mcp/ # MCP server and tool definitions
│ ├── models/ # Data models, config, entity types
│ └── ui/ # Web UI (FastAPI + Jinja2 + htmx + SSE)
├── tests/ # 630+ tests (unit + integration)
├── docs/ # Architecture, schema, config, roadmap docs
└── pyproject.toml
Development
# Install with dev dependencies
uv sync --extra dev
# Run tests
make test
# Run linter
make lint
# Run both
make check
Roadmap
See docs/roadmap.md for the full plan. Highlights:
- Progressive disclosure search — 3-layer MCP search pattern (index → timeline → full details)
- File-level tracking — associate observations with specific files
- Knowledge agents — build queryable corpora from observation history
- Settings panel — web UI for config with live briefing preview
- Vector search — optional semantic retrieval via sentence-transformers
Acknowledgements
callmem was inspired by claude-mem by Alex Newman. We share the same goal — giving coding agents persistent memory — but callmem is built from scratch with a focus on model-agnostic operation, local-first architecture, and pluggable LLM backends.
Known Issues
Auto-briefing plugin does not trigger on session start
Setup installs an OpenCode plugin (.opencode/plugins/auto-briefing.js) that should auto-display the briefing when a new session starts. However, due to an upstream OpenCode bug where session.created events do not fire for plugins, this does not currently work. Use the /briefing command in OpenCode as a workaround. The plugin will activate automatically once the bug is fixed upstream — no changes needed.
Claude Code: tool results and thinking blocks are not ingested
The Claude Code adapter maps user prompts, assistant text, and tool_use blocks into the memory feed. tool_result blocks (system-side responses to tool calls) and thinking blocks are skipped in the current release to keep signal-to-noise high. A follow-up will revisit this — until then, a tool call appears in the feed but its outcome does not.
Python 3.10 compatibility shims
callmem supports Python 3.10+ but requires tomli (backport of tomllib) and typing_extensions on Python 3.10. These are installed automatically as conditional dependencies. On Python 3.11+, the stdlib equivalents are used.
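The shim itself follows the standard pattern — a sketch, not callmem's exact code:

import sys

if sys.version_info >= (3, 11):
    import tomllib                # stdlib since Python 3.11
else:
    import tomli as tomllib       # PyPI backport with an identical API

with open(".callmem/config.toml", "rb") as f:
    config = tomllib.load(f)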
License