A compact lifelong memory framework for LLM Agents.
LycheeMem is a compact memory framework for LLM agents. It starts from efficient conversational memory—through structured organization, lightweight consolidation, and adaptive retrieval—and gradually extends toward action-aware, usage-aware memory for more capable agentic systems.
🔥 News
- [03/30/2026] We evaluated LycheeMem on PinchBench with the OpenClaw plugin: compared to OpenClaw's native memory, it achieved an ~6% score improvement, while reducing token consumption by ~71% and cost by ~55%!
- [03/28/2026] Semantic memory has been upgraded to Compact Semantic Memory (SQLite + LanceDB), no Neo4j required. See /quick-start for details.
- [03/27/2026] OpenClaw Plugin is now available at /openclaw-plugin ! Setup guide →
- [03/26/2026] MCP support is available at /mcp !
- [03/23/2026] LycheeMem is now open source: GitHub Repository →
📚 Memory Architecture
LycheeMem organizes memory into three complementary stores:
| Working Memory | Semantic Memory | Procedural Memory |
|---|---|---|
| (Episodic) | (Typed Action Store) | (Skills) |
💾 Working Memory
The working memory window holds the active conversation context for a session. It operates under a dual-threshold token budget:
- Warn threshold (70%) — triggers asynchronous background pre-compression; the current request is not blocked.
- Block threshold (90%) — the pipeline pauses and flushes older turns to a compressed summary before proceeding.
Compression produces summary anchors (past context, distilled) + raw recent turns (last N turns, verbatim). Both are passed downstream as the conversation history.
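The dual-threshold check can be sketched in a few lines of Python. This is a minimal illustration, not LycheeMem's actual WMManager code: the names `BudgetAction` and `check_budget` are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
from enum import Enum


class BudgetAction(Enum):
    OK = "ok"        # below the warn threshold: nothing to do
    WARN = "warn"    # schedule background pre-compression, request is not blocked
    BLOCK = "block"  # pause and flush older turns to a summary before proceeding


def check_budget(used_tokens: int, max_tokens: int,
                 warn_ratio: float = 0.70, block_ratio: float = 0.90) -> BudgetAction:
    """Classify current token usage against the warn and block thresholds."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    usage = used_tokens / max_tokens
    # Assumption: a usage ratio exactly at a threshold already triggers it.
    if usage >= block_ratio:
        return BudgetAction.BLOCK
    if usage >= warn_ratio:
        return BudgetAction.WARN
    return BudgetAction.OK
```

Checking the block threshold first means a session that jumps straight past 90% is never misclassified as a mere warning.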
🗺️ Semantic Memory — Compact Semantic Memory
Semantic memory is organised around typed MemoryRecords plus action-grounded retrieval state. The storage layer is SQLite (FTS5 full-text search) + LanceDB (vector index), while retrieval is conditioned on recent context, tentative action, constraints, and missing slots.
Memory Record Types
Each memory entry is stored as a MemoryRecord. The memory_type field distinguishes seven semantic categories:
| Type | Description |
|---|---|
| fact | Objective facts about the user, environment, or world |
| preference | User preferences (style, habits, likes/dislikes) |
| event | Specific events that have occurred |
| constraint | Conditions that must be respected |
| procedure | Reusable step-by-step procedures / methods |
| failure_pattern | Previously failed action paths and their causes |
| tool_affordance | Capabilities and applicable scenarios of tools/APIs |
Beyond text, every MemoryRecord carries action-facing metadata (tool_tags, constraint_tags, failure_tags, affordance_tags) and usage statistics (retrieval_count, action_success_count, etc.) to seed future reinforcement-learning signals. Retrieval logs also persist retrieval_plan, action_state, response excerpts, and later user feedback so the system can close a lightweight action-outcome loop without training.
Related MemoryRecords can be fused online by the Record Fusion Engine into denser CompositeRecords. Composite entries persist direct child_composite_ids, so long-term semantic memory is organised as a hierarchical memory tree instead of a flat bag of summaries.
Four-Module Pipeline
Module 1: Compact Semantic Encoding
A single-pass pipeline that converts conversation turns into a list of MemoryRecords:
- Typed extraction — LLM extracts self-contained facts and assigns a semantic category to each record.
- Decontextualization — Pronouns and context-dependent phrases are expanded into full expressions, so each record is understandable without the original dialogue.
- Action metadata annotation — LLM annotates each record with memory_type, tool_tags, constraint_tags, failure_tags, affordance_tags, and other structured labels.
record_id = SHA256(normalized_text) — naturally idempotent; duplicate content is deduplicated automatically.
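A content-derived ID of this kind can be sketched as follows. The normalisation step shown here (Unicode NFKC, lowercasing, whitespace collapsing) is an assumption made for illustration; the source does not specify how normalized_text is produced, and `normalize_text` is a hypothetical helper.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical normalisation: NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())


def record_id(text: str) -> str:
    """Content-derived ID: SHA-256 hex digest of the normalised text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


# Writing the same content twice lands on the same key, so the store
# deduplicates without any explicit lookup.
store: dict[str, str] = {}
for raw in ["User backs up PostgreSQL with pg_dump.",
            "  user backs up   postgresql with PG_DUMP. "]:
    store[record_id(raw)] = normalize_text(raw)
```

Because the ID depends only on the normalised content, re-ingesting a conversation is an upsert rather than an append.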
Module 2: Record Fusion, Conflict Update, and Hierarchical Consolidation
Triggered online after each consolidation:
- FTS / vector recall gathers related existing atomic records around the new records (candidate pool).
- The existing synthesis judge prompt decides whether each candidate set should produce a new CompositeRecord or perform a conflict_update against an existing atomic record.
- On conflict_update, the existing anchor record is updated in place, conflicting incoming records are soft-expired, and composites covering affected source records are invalidated.
- On synthesis, the engine writes a new CompositeRecord to SQLite + LanceDB.
- Additional hierarchy rounds can synthesize record -> composite and composite -> composite, persisting child_composite_ids so the memory tree can keep growing upward.
Module 3: Action-Grounded Retrieval Planning
Before retrieval, ActionAwareRetrievalPlanner analyses the user query + recent context + ActionState and emits a SearchPlan:
- mode: answer (factual Q&A) / action (needs execution) / mixed
- semantic_queries: content-facing search terms
- pragmatic_queries: action/tool/constraint-facing search terms
- tool_hints: tools likely needed for this request
- required_constraints: constraints that must be respected
- required_affordances: capabilities the retrieved memory should provide
- missing_slots: parameters / slots that are absent
- tree_retrieval_mode / tree_expansion_depth / include_leaf_records: whether retrieval should stay at high-level composites (root_only) or descend into child composites / direct leaf records (balanced / descend)
ActionState can carry fields such as current_subgoal, tentative_action, known_constraints, available_tools, failure_signal, and a recent-context excerpt. The planner merges this state with the LLM-produced plan so retrieval is conditioned on the current decision state rather than the query alone.
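To make the shape of these two objects concrete, here is a rough sketch using dataclasses. The field names come from the description above, but the types, the defaults, and the `merge_state` rule are assumptions for illustration and may differ from what ActionAwareRetrievalPlanner actually does.

```python
from dataclasses import dataclass, field, replace


@dataclass
class ActionState:
    current_subgoal: str = ""
    tentative_action: str = ""
    known_constraints: list[str] = field(default_factory=list)
    available_tools: list[str] = field(default_factory=list)
    failure_signal: str = ""


@dataclass
class SearchPlan:
    mode: str = "answer"  # "answer" | "action" | "mixed"
    semantic_queries: list[str] = field(default_factory=list)
    pragmatic_queries: list[str] = field(default_factory=list)
    tool_hints: list[str] = field(default_factory=list)
    required_constraints: list[str] = field(default_factory=list)
    required_affordances: list[str] = field(default_factory=list)
    missing_slots: list[str] = field(default_factory=list)
    tree_retrieval_mode: str = "root_only"  # "root_only" | "balanced" | "descend"


def merge_state(plan: SearchPlan, state: ActionState) -> SearchPlan:
    """Hypothetical merge rule: fold the decision state into the LLM plan."""
    # Order-preserving union of planned and already-known constraints.
    constraints = list(dict.fromkeys([*plan.required_constraints,
                                      *state.known_constraints]))
    # Assumption: when tool availability is known, drop hints for missing tools.
    tools = plan.tool_hints
    if state.available_tools:
        tools = [t for t in plan.tool_hints if t in state.available_tools]
    return replace(plan, required_constraints=constraints, tool_hints=tools)
```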
The plan drives multi-channel recall:
- FTS channel — SQLite FTS5 keyword recall over MemoryRecord + CompositeRecord
- Semantic vector channel — LanceDB ANN over semantic_text embeddings
- Normalised vector channel — LanceDB ANN over normalized_text embeddings (for pragmatic queries)
- Tag filter channel — exact filter by tool_hints / required_constraints / required_affordances
- Temporal channel — filter by SearchPlan.temporal_filter time window
- Slot-hint supplementation — when missing_slots is non-empty, extra FTS/tag recall is triggered to find records that can fill missing parameters
After base recall, retrieval can also expand along the memory tree. root_only keeps high-level composite summaries, balanced descends one level when tree hints match, and descend pulls child composites plus direct leaf records when the current action needs finer-grained detail.
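The three expansion modes can be illustrated with a small sketch over an in-memory tree. Everything here is a simplification under stated assumptions: the `Composite` class, the `leaf_record_ids` field, and the boolean `hint_matches` flag are hypothetical, and expansion is limited to one level below each root.

```python
from dataclasses import dataclass, field


@dataclass
class Composite:
    composite_id: str
    child_composite_ids: list[str] = field(default_factory=list)
    leaf_record_ids: list[str] = field(default_factory=list)  # hypothetical field name


def expand(roots: list[str], tree: dict[str, Composite],
           mode: str = "root_only", hint_matches: bool = True) -> list[str]:
    """Return the IDs kept after tree expansion, parents before children."""
    if mode not in {"root_only", "balanced", "descend"}:
        raise ValueError(f"unknown tree_retrieval_mode: {mode}")
    selected = list(roots)
    # root_only never descends; balanced descends only when tree hints match.
    if mode == "root_only" or (mode == "balanced" and not hint_matches):
        return selected
    for root_id in roots:
        node = tree[root_id]
        selected.extend(node.child_composite_ids)
        if mode == "descend":
            # descend additionally pulls the root's direct leaf records.
            selected.extend(node.leaf_record_ids)
    return list(dict.fromkeys(selected))  # de-duplicate, keep first occurrence
```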
Module 4: Multi-Dimensional Scorer
Candidates from all channels are de-duplicated and ranked by MemoryScorer using a weighted linear combination. Final top-k selection is composite-first: covering parent composites are preferred, covered child records are folded away unless they add unique value, and near-duplicate fragments are suppressed.
$$\text{Score} = \alpha \cdot S_\text{sem} + \beta \cdot S_\text{action} + \kappa \cdot S_\text{slot} + \gamma \cdot S_\text{temporal} + \delta \cdot S_\text{recency} + \eta \cdot S_\text{evidence} - \lambda \cdot C_\text{token}$$
| Weight | Meaning | Default |
|---|---|---|
| α | SemanticRelevance (vector distance -> similarity) | 0.25 |
| β | ActionUtility (tag match score, mode-aware) | 0.25 |
| κ | SlotUtility (whether the memory helps fill missing action slots) | 0.15 |
| γ | TemporalFit (temporal reference match) | 0.15 |
| δ | Recency (memory freshness) | 0.10 |
| η | EvidenceDensity (evidence span density) | 0.10 |
| λ | TokenCost penalty (text length penalty) | 0.10 |
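The formula and default weights above translate directly into a short scoring function. This is a sketch rather than the real MemoryScorer: the feature names and the assumption that every component is already normalised to the 0-1 range are illustrative.

```python
# Default weights from the table above; the six positive terms sum to 1.0.
DEFAULT_WEIGHTS = {
    "semantic": 0.25,   # alpha
    "action": 0.25,     # beta
    "slot": 0.15,       # kappa
    "temporal": 0.15,   # gamma
    "recency": 0.10,    # delta
    "evidence": 0.10,   # eta
}
TOKEN_COST_PENALTY = 0.10  # lambda


def score(features: dict[str, float], token_cost: float,
          weights: dict[str, float] = DEFAULT_WEIGHTS,
          penalty: float = TOKEN_COST_PENALTY) -> float:
    """Weighted linear combination minus the token-cost penalty."""
    # Components missing from `features` contribute zero.
    positive = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return positive - penalty * token_cost


def rank(candidates: list[tuple[str, dict[str, float], float]]) -> list[str]:
    """Sort (record_id, features, token_cost) triples by descending score."""
    ordered = sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True)
    return [record_id for record_id, _, _ in ordered]
```

The composite-first folding and near-duplicate suppression described above would run after this ranking step and are not shown.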
🛠️ Procedural Memory — Skill Store
The skill store preserves reusable how-to knowledge as structured skill entries, each carrying:
- Intent — a short description of what the skill does.
- doc_markdown — a full Markdown document describing the procedure, commands, parameters, and caveats.
- Embedding — a dense vector of the intent text, used for similarity search.
- Metadata — usage counters, last-used timestamp, preconditions.
Skill retrieval uses HyDE (Hypothetical Document Embeddings): the query is first expanded into a hypothetical ideal answer by the LLM, then that draft text is embedded to produce a query vector that matches well against stored procedure descriptions, even when the user's original phrasing is vague.
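The HyDE flow can be sketched with the LLM and the embedder passed in as plain callables. The `hyde_search` function, the toy bag-of-words embedder, and the canned `fake_generate` draft are all stand-ins chosen so the example runs offline; a real deployment would call an actual LLM and embedding model.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(query: str, skills: list[tuple[str, list[float]]],
                generate: Callable[[str], str],
                embed: Callable[[str], list[float]],
                top_k: int = 3) -> list[tuple[float, str]]:
    """Embed a hypothetical answer instead of the raw query, then rank skills."""
    draft = generate(query)    # step 1: LLM drafts an ideal answer
    query_vec = embed(draft)   # step 2: embed the draft, not the query
    scored = [(cosine(query_vec, vec), intent) for intent, vec in skills]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy stand-ins so the sketch is runnable without any external service.
VOCAB = ["pg_dump", "backup", "s3", "deploy", "docker"]


def toy_embed(text: str) -> list[float]:
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def fake_generate(query: str) -> str:
    return "Run pg_dump nightly and upload the backup to s3"


skills = [(intent, toy_embed(intent))
          for intent in ["deploy with docker", "pg_dump backup to S3"]]
hits = hyde_search("how do I save my database?", skills,
                   fake_generate, toy_embed, top_k=1)
```

In this toy setup the raw query shares no vocabulary with either skill, so embedding it directly would score zero against both; the drafted answer is what makes the backup skill rank first.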
⚙️ Pipeline
Every request passes through a fixed sequence of five agents. Four are synchronous stages in the LangGraph pipeline; one is a background post-processing task.
Stage 1 — WMManager
Rule-based agent (no LLM prompt). Appends the user turn to the session log, counts tokens, and fires compression if either threshold is crossed. Produces compressed_history and raw_recent_turns for downstream stages.
Stage 2 — SearchCoordinator
SearchCoordinator first builds recent_context from compressed summaries + raw recent turns, then derives an ActionState from the current query, constraints, recent failures, token budget, and recent tool use. ActionAwareRetrievalPlanner uses that state to produce a SearchPlan containing mode, semantic_queries, pragmatic_queries, tool_hints, required_affordances, missing_slots, tree-traversal strategy, and more. Multi-channel recall (FTS, semantic vector, normalised vector, tag/affordance filter, temporal filter, slot-hint supplementation, plus tree expansion when needed) then queries SQLite + LanceDB. This stage returns raw semantic fragments, skill hits, retrieval provenance, and a dedicated novelty_retrieved_context built from pre-synthesis semantic fragments for later novelty checking; it does not build the final background_context yet. Skill retrieval is mode-aware (answer / action / mixed) and uses HyDE against the skill store only when it is likely to help.
When a new user turn arrives, SearchCoordinator also tries to apply lightweight feedback to the most recent unresolved action/mixed retrieval log, so the next turn can mark the prior memory usage as success / fail / correction.
Stage 3 — SynthesizerAgent
Acts as an LLM-as-Judge: scores every retrieved memory fragment on an absolute 0-1 relevance scale, discards fragments below the threshold (default 0.6), and fuses the survivors into a single dense background_context string. It also identifies skill_reuse_plan entries that can directly guide the final response. This stage is where the final answer-time context is built; it outputs provenance — a citation list containing scoring breakdown and source references for each kept memory item.
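The threshold-and-fuse step can be approximated as below. The relevance scores are taken as given (in the real stage they come from the LLM judge), and joining the surviving fragments with newlines is a placeholder for the LLM fusion; `filter_and_fuse` is a hypothetical name.

```python
def filter_and_fuse(fragments: list[str], scores: list[float],
                    threshold: float = 0.6) -> dict:
    """Drop fragments scored below the threshold and join the rest."""
    if len(fragments) != len(scores):
        raise ValueError("fragments and scores must have the same length")
    kept = [(s, f) for f, s in zip(fragments, scores) if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    return {
        # Placeholder fusion: the real stage produces a dense LLM-written context.
        "background_context": "\n".join(f for _, f in kept),
        "kept_count": len(kept),
        "dropped_count": len(fragments) - len(kept),
    }
```

The kept_count and dropped_count keys mirror the fields returned by /memory/synthesize.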
Stage 4 — ReasoningAgent
Receives compressed_history, background_context, and skill_reuse_plan and generates the final assistant reply. It appends the assistant turn back to the session store, and the pipeline finalizes the semantic usage log with a response excerpt so the next user turn can provide outcome feedback.
Background — ConsolidatorAgent
Triggered immediately after ReasoningAgent completes; it runs in a thread pool and does not block the response. It:
- Performs a novelty check — LLM judges whether the conversation introduced new information worth persisting. Skips consolidation for pure retrieval exchanges.
- Compact consolidation — calls CompactSemanticEngine.ingest_conversation(), which runs a single-pass encoder (typed extraction → decontextualization → action metadata annotation), writes MemoryRecords to SQLite + LanceDB, then triggers conflict-aware Record Fusion. Novelty check uses the search-stage novelty_retrieved_context (raw semantic fragments), not the answer-time background_context, so query-conditioned synthesis does not suppress valid new-memory ingestion.
- Skill extraction — identifies successful tool-usage patterns in the conversation and adds skill entries to the skill store. Runs in parallel with compact consolidation (ThreadPoolExecutor).
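The control flow of the three steps above (novelty gate first, then two parallel tasks) might look roughly like this. The callables `is_novel`, `ingest`, and `extract_skills` and the returned dictionary shape are hypothetical stand-ins, not ConsolidatorAgent's real interface.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def consolidate(turns: list[str],
                is_novel: Callable[[list[str]], bool],
                ingest: Callable[[list[str]], list[str]],
                extract_skills: Callable[[list[str]], list[str]]) -> dict:
    """Novelty gate, then compact consolidation and skill extraction in parallel."""
    if not is_novel(turns):
        # Pure retrieval exchange: nothing new worth persisting.
        return {"status": "skipped", "records": [], "skills": []}
    with ThreadPoolExecutor(max_workers=2) as pool:
        records_future = pool.submit(ingest, turns)
        skills_future = pool.submit(extract_skills, turns)
        # result() re-raises any exception raised inside the worker thread.
        return {
            "status": "done",
            "records": records_future.result(),
            "skills": skills_future.result(),
        }
```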
⚡ Quick Start
Prerequisites
- Python 3.11+
- An LLM API key (OpenAI, Gemini, or any litellm-compatible provider)
Installation
git clone https://github.com/LycheeMem/LycheeMem.git
cd LycheeMem
pip install -e .
Configuration
Copy .env.example to .env and fill in your values. The full template in .env.example also includes session/user DB paths, JWT settings, and working-memory thresholds; the snippet below shows the most important ones:
# LLM — litellm format: provider/model
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
LLM_API_BASE= # optional
# Embedder
EMBEDDING_MODEL=openai/text-embedding-3-small
EMBEDDING_DIM=1536
EMBEDDING_API_KEY= # optional
EMBEDDING_API_BASE= # optional
Supported LLM providers (via litellm):
openai/gpt-4o-mini · gemini/gemini-2.0-flash · ollama_chat/qwen2.5 · any OpenAI-compatible endpoint
Start the Server
python main.py
The API is served at http://localhost:8000. Interactive docs at /docs.
main.py currently starts Uvicorn without enabling live reload. For development reload, run Uvicorn directly, for example: uvicorn src.api.server:create_app --factory --reload
🎨 Web Demo
A frontend demo is included under web-demo/. It provides a chat interface alongside live views of the semantic memory tree, skill library, and working memory state.
cd web-demo
npm install
npm run dev # served at http://localhost:5173
Make sure the backend is running on port 8000 (or update proxy settings in web-demo/vite.config.ts) before starting the frontend.
🦞 OpenClaw Plugin
LycheeMem ships a native OpenClaw plugin that gives any OpenClaw session persistent long-term memory with zero manual wiring.
What the plugin provides:
- lychee_memory_smart_search — default long-term memory retrieval entry point
- Automatic turn mirroring via hooks — the model does not need to call append_turn manually
  - User messages are appended automatically
  - Assistant messages are appended automatically
- /new, /reset, /stop, and session_end automatically trigger boundary consolidation
- Proactive consolidation on strong long-term knowledge signals
Under normal operation:
- The model only calls lychee_memory_smart_search when recalling long-term context
- The model may call lychee_memory_consolidate manually when an immediate persist is warranted
- The model does not need to call lychee_memory_append_turn at all
Quick Install
openclaw plugins install "/path/to/LycheeMem/openclaw-plugin"
openclaw gateway restart
See the full setup guide: openclaw-plugin/INSTALL_OPENCLAW.md
🔧 MCP
LycheeMem also exposes an HTTP MCP endpoint at http://localhost:8000/mcp.
- Available tools: lychee_memory_smart_search, lychee_memory_search, lychee_memory_append_turn, lychee_memory_synthesize, lychee_memory_consolidate
- Use Authorization: Bearer <token> if you want per-user memory isolation
- lychee_memory_consolidate works for sessions that already contain mirrored turns from /chat, /memory/reason, or lychee_memory_append_turn
MCP Transport
- POST /mcp handles JSON-RPC requests
- GET /mcp exposes the SSE stream used by some MCP clients
- The server returns Mcp-Session-Id during initialize; reuse that header on later requests
Authentication
If you want isolated memory per user, first obtain a JWT token from /auth/register or /auth/login, then send:
Authorization: Bearer <token>
Without a token, requests run with an empty user_id, so anonymous traffic shares the same namespace.
Client Configuration
For any MCP client that supports remote HTTP servers, configure the MCP URL as:
http://localhost:8000/mcp
Generic config example:
{
"mcpServers": {
"lycheemem": {
"url": "http://localhost:8000/mcp",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
Manual JSON-RPC Flow
- Call initialize
- Reuse the returned Mcp-Session-Id
- Send initialized
- Call tools/list
- Call tools/call
Initialize example:
curl -i -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2025-03-26",
"capabilities": {},
"clientInfo": {
"name": "debug-client",
"version": "0.1.0"
}
}
}'
Tool call example:
curl -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-H "Mcp-Session-Id: <session-id>" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "lychee_memory_smart_search",
"arguments": {
"query": "what tools do I use for database backups",
"top_k": 5,
"mode": "compact",
"include_graph": true,
"include_skills": true
}
}
}'
Recommended MCP Usage Pattern
- Use /chat or /memory/reason with a stable session_id to write conversation turns, or mirror external host turns with lychee_memory_append_turn.
- Use lychee_memory_smart_search in compact mode for the default one-shot recall path.
- Use lychee_memory_search + lychee_memory_synthesize only when you explicitly want search and synthesis as separate stages.
- After the conversation ends, call lychee_memory_consolidate with the same session_id.
🔌 API Reference
POST /memory/search — Unified Memory Retrieval
Query both the semantic memory channel and the skill store in a single call. New integrations should prefer semantic_results; graph_results is kept as a backward-compatible alias. The response also includes novelty_retrieved_context, which is the correct input for later /memory/consolidate calls.
// Request
{
"query": "what tools do I use for database backups",
"top_k": 5,
"include_graph": true,
"include_skills": true
}
// Response
{
"query": "...",
"graph_results": [
{
"anchor": {
"node_id": "compact_context",
"name": "CompactSemanticMemory",
"label": "SemanticContext",
"score": 1.0
},
"constructed_context": "...",
"provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
}
],
"semantic_results": [
{
"anchor": { "node_id": "compact_context", "name": "CompactSemanticMemory", "label": "SemanticContext", "score": 1.0 },
"constructed_context": "...",
"provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
}
],
"novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"skill_results": [ { "id": "...", "intent": "pg_dump backup to S3", "score": 0.87, ... } ],
"total": 6
}
POST /memory/smart-search — One-Shot Recall
Runs search and, optionally, synthesis in one API call. mode=compact is the default integration path when you want a concise background_context without handling intermediate payloads yourself. Even in compact mode, the response still returns novelty_retrieved_context so a host can consolidate against raw retrieved memory instead of answer-time synthesis.
// Request
{
"query": "what tools do I use for database backups",
"top_k": 5,
"synthesize": true,
"mode": "compact"
}
// Response
{
"query": "...",
"mode": "compact",
"synthesized": true,
"background_context": "User regularly uses pg_dump with a cron job...",
"skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
"provenance": [ { "record_id": "...", "source": "record", "score": 0.91, ... } ],
"novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"kept_count": 4,
"dropped_count": 2,
"total": 6
}
POST /memory/synthesize — Memory Fusion
Takes raw retrieval results and produces a fused memory context using LLM-as-Judge.
// Request
{
"user_query": "what tools do I use for database backups",
"semantic_results": [...], // preferred from /memory/search
"graph_results": [...], // compatibility alias also accepted
"skill_results": [...]
}
// Response
{
"background_context": "User regularly uses pg_dump with a cron job...",
"skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
"provenance": [ { "record_id": "...", "source": "semantic", "semantic_source_type": "record", "score": 0.91, ... } ],
"kept_count": 4,
"dropped_count": 2
}
POST /memory/reason — Grounded Reasoning
Runs the ReasoningAgent given pre-synthesized context. Can be chained after /memory/synthesize for full pipeline control.
// Request
{
"session_id": "my-session",
"user_query": "what tools do I use for database backups",
"background_context": "User regularly uses pg_dump...",
"skill_reuse_plan": [...],
"append_to_session": true // write result to session history (default: true)
}
// Response
{
"response": "You typically use pg_dump scheduled via cron...",
"session_id": "my-session",
"wm_token_usage": 3412
}
POST /memory/append-turn — Mirror External Host Turns
Appends one user or assistant turn into LycheeMem's session store so it can be consolidated later.
// Request
{
"session_id": "my-session",
"role": "user",
"content": "I usually back up PostgreSQL with pg_dump to S3."
}
// Response
{
"status": "appended",
"session_id": "my-session",
"turn_count": 3
}
POST /memory/consolidate — Trigger Consolidation
Manually trigger memory consolidation for a session. This is the primary consolidation endpoint and supports both background and synchronous modes.
retrieved_context should preferably be the novelty_retrieved_context returned by /memory/search or /memory/smart-search, i.e. the search-stage raw semantic fragments, not /memory/synthesize's background_context.
// Request
{
"session_id": "my-session",
"retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
"background": true
}
// Response (background mode)
{
"status": "started",
"entities_added": 0,
"skills_added": 0,
"facts_added": 0
}
Legacy compatibility endpoint: POST /memory/consolidate/{session_id}.
GET /memory/graph — Semantic Memory Tree
Returns the current semantic memory as a hierarchy. mode=cleaned (default) emits tree_roots plus direct tree edges for the frontend memory-tree view; mode=debug exposes the lower-level flattened relations for inspection.
GET /pipeline/status and GET /pipeline/last-consolidation
Use these endpoints for operational checks and background consolidation polling:
- GET /pipeline/status returns aggregate counts for sessions, semantic memory, and skills.
- GET /pipeline/last-consolidation?session_id=<id> returns the latest consolidation result for a session, or pending if the background task has not finished yet.
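Polling for a background consolidation result could be structured as below. The response shape is an assumption: this sketch supposes the endpoint returns a JSON object whose status field equals "pending" while the task is still running. The `fetch` callable is injected so the loop can be exercised without a server.

```python
import time
from typing import Callable


def wait_for_consolidation(fetch: Callable[[], dict],
                           timeout_s: float = 30.0,
                           interval_s: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep,
                           clock: Callable[[], float] = time.monotonic) -> dict:
    """Call fetch() until the result is no longer pending or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        # Assumption: an unfinished task is reported as {"status": "pending"}.
        if result.get("status") != "pending":
            return result
        if clock() >= deadline:
            raise TimeoutError("consolidation still pending after timeout")
        sleep(interval_s)
```

In practice `fetch` would issue GET /pipeline/last-consolidation?session_id=<id> and decode the JSON body.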
Usage Examples
# Basic single-turn demo (automatically registers 'demo_user')
python examples/api_pipeline_demo.py
# Multi-turn chat demo (3 consecutive turns, followed by consolidation)
python examples/api_pipeline_demo.py --multi-turn
# Custom query and user credentials
python examples/api_pipeline_demo.py --username alice --password secret123 \
--query "How do I backup my database with pg_dump?"
# Use a fixed session_id (useful for accumulating history across multiple runs)
python examples/api_pipeline_demo.py --session-id my-test-session