Recursive reasoning engine for AI agents and vector databases, powered by RLM.
DeepRecall
Recursive reasoning over your data. Plug into any vector DB or agent framework.
Standard RAG retrieves documents once and stuffs them into a prompt. DeepRecall uses MIT's Recursive Language Models to let your LLM search, reason, search again, and repeat -- until it actually has enough information to answer properly.
The LLM gets a search_db() function injected into a sandboxed Python REPL. It decides what to search for, analyzes results with code, refines its queries based on what it found, and synthesizes a final answer. This is not a fixed pipeline -- the LLM drives the retrieval strategy.
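The idea can be illustrated with a toy sketch that assumes nothing about DeepRecall's internals. A `search_db()` function is placed in the namespace of code run with `exec`, and a scripted list of snippets stands in for the LLM's turns. `CORPUS`, `run_session`, and the keyword-matching search are all hypothetical, and `exec` is not a real sandbox.

```python
# Toy illustration only: exec() is NOT a sandbox, and the "LLM" here is a
# fixed script of code snippets rather than a model.
CORPUS = {
    "d1": "The outage was caused by an expired TLS certificate.",
    "d2": "Certificates are renewed by the cert-rotator job.",
    "d3": "The cert-rotator job was disabled during the March migration.",
}


def search_db(query, top_k=2):
    """Hypothetical keyword search: rank docs by how many query words they contain."""
    words = query.lower().split()
    scored = []
    for doc_id, text in CORPUS.items():
        hits = sum(1 for w in words if w in text.lower())
        if hits:
            scored.append((hits, doc_id, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [{"id": doc_id, "text": text} for _, doc_id, text in scored[:top_k]]


def run_session(snippets):
    """Execute each snippet in one shared namespace until FINAL() is called."""
    state = {"answer": None, "searches": []}

    def tracked_search(query, top_k=2):
        state["searches"].append(query)
        return search_db(query, top_k)

    def final(answer):
        state["answer"] = answer

    namespace = {"search_db": tracked_search, "FINAL": final}
    for code in snippets:
        exec(code, namespace)  # variables persist between turns
        if state["answer"] is not None:
            break
    return state


# Each string plays the role of one LLM turn; later turns build on earlier results.
scripted_turns = [
    "first = search_db('outage caused')",
    "second = search_db('cert-rotator disabled')",
    "FINAL(first[0]['text'] + ' ' + second[0]['text'])",
]
session = run_session(scripted_turns)
print(session["searches"])
print(session["answer"])
```

The second search only makes sense after reading the result of the first, which is the behavior a single-shot retrieval step cannot express.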
Install
pip install deeprecall[chroma] # ChromaDB (local, zero-config)
pip install deeprecall[milvus] # Milvus
pip install deeprecall[qdrant] # Qdrant
pip install deeprecall[pinecone] # Pinecone
pip install deeprecall[redis] # Redis distributed cache
pip install deeprecall[otel] # OpenTelemetry tracing
pip install deeprecall[all] # Everything
Quick Start
from deeprecall import DeepRecall
from deeprecall.vectorstores import ChromaStore
store = ChromaStore(collection_name="my_docs")
store.add_documents(["doc 1 text...", "doc 2 text...", "doc 3 text..."])
engine = DeepRecall(
    vectorstore=store,
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini", "api_key": "sk-..."},
)
result = engine.query("What are the key themes across these documents?")
print(result.answer)
print(f"Sources: {len(result.sources)}")
print(f"Steps: {len(result.reasoning_trace)}")
print(f"Time: {result.execution_time:.1f}s")
What's New in v0.2
Budget Guardrails
Control exactly how much a query can spend -- tokens, time, searches, or dollars.
from deeprecall import DeepRecall, QueryBudget
engine = DeepRecall(vectorstore=store, backend="openai",
                    backend_kwargs={"model_name": "gpt-4o-mini"})
result = engine.query(
    "Complex multi-hop question?",
    budget=QueryBudget(
        max_search_calls=10,    # Stop after 10 vector DB searches
        max_tokens=50000,       # Total token budget
        max_time_seconds=30.0,  # Wall-clock timeout
    ),
)
# Check what was used
print(result.budget_status) # {"iterations_used": 5, "search_calls_used": 8, ...}
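To make the guardrail idea concrete, here is a minimal sketch of how a search limit and a wall-clock limit could be enforced. It is not DeepRecall's implementation: `ToyBudget`, `BudgetExceeded`, and `guarded_search` are hypothetical names, and only the field names `max_search_calls`, `max_time_seconds`, and `search_calls_used` echo the example above.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a limit is hit (hypothetical name)."""


@dataclass
class ToyBudget:
    max_search_calls: int = 10
    max_time_seconds: float = 30.0
    search_calls_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge_search(self):
        """Check both limits before a search, then count it."""
        if time.monotonic() - self.started_at > self.max_time_seconds:
            raise BudgetExceeded("time limit reached")
        if self.search_calls_used >= self.max_search_calls:
            raise BudgetExceeded("search limit reached")
        self.search_calls_used += 1

    def status(self):
        return {
            "search_calls_used": self.search_calls_used,
            "search_calls_left": self.max_search_calls - self.search_calls_used,
        }


def guarded_search(budget, query):
    budget.charge_search()
    return f"results for {query!r}"  # stand-in for a vector DB call


budget = ToyBudget(max_search_calls=2)
completed = []
try:
    for q in ["first", "second", "third"]:
        completed.append(guarded_search(budget, q))
except BudgetExceeded as exc:
    stop_reason = str(exc)
else:
    stop_reason = None

print(completed, stop_reason, budget.status())
```

Checking the limit before each call, rather than after, means the third search is never issued once the budget of two is spent.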
Reasoning Trace
Full visibility into what the LLM did at every step -- code executed, outputs, searches made.
result = engine.query("What caused the 2008 financial crisis?")
for step in result.reasoning_trace:
    print(f"Step {step.iteration}: {step.action}")
    if step.searches:
        print(f"  Searched: {[s['query'] for s in step.searches]}")
    if step.code:
        print(f"  Code: {step.code[:100]}...")
Callbacks
Hook into the reasoning pipeline for monitoring, logging, or custom integrations.
from deeprecall import DeepRecall, DeepRecallConfig, ConsoleCallback, JSONLCallback
config = DeepRecallConfig(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    callbacks=[
        ConsoleCallback(),                 # Live step-by-step output
        JSONLCallback(log_dir="./logs"),   # Structured logging
    ],
)
engine = DeepRecall(vectorstore=store, config=config)
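For readers unfamiliar with the pattern, the sketch below shows the general shape of a callback hook system. The hook names `on_step` and `on_finish`, and the classes `ToyCallback` and `CollectingCallback`, are hypothetical; DeepRecall's actual callback interface may differ, so check the library before subclassing anything.

```python
class ToyCallback:
    """Base class with no-op hooks (hypothetical hook names)."""

    def on_step(self, iteration, action):
        pass

    def on_finish(self, answer):
        pass


class CollectingCallback(ToyCallback):
    """Records every event so it can be inspected or logged later."""

    def __init__(self):
        self.events = []

    def on_step(self, iteration, action):
        self.events.append(("step", iteration, action))

    def on_finish(self, answer):
        self.events.append(("finish", answer))


def run_with_callbacks(actions, callbacks):
    """Stand-in for an engine loop that notifies each callback in order."""
    for iteration, action in enumerate(actions, start=1):
        for cb in callbacks:
            cb.on_step(iteration, action)
    answer = f"done after {len(actions)} steps"
    for cb in callbacks:
        cb.on_finish(answer)
    return answer


collector = CollectingCallback()
run_with_callbacks(["search", "analyze"], [ToyCallback(), collector])
print(collector.events)
```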
OpenTelemetry Tracing
Emit distributed traces to Jaeger, Datadog, Grafana Tempo, Honeycomb, or any OTLP backend.
from deeprecall import DeepRecall, DeepRecallConfig, OpenTelemetryCallback
otel = OpenTelemetryCallback(
    service_name="my-rag-service",
    # endpoint="https://otlp.datadoghq.com:4317",  # Datadog
    # headers={"DD-API-KEY": "your-key"},
)
config = DeepRecallConfig(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    callbacks=[otel],
)
# Every query() call emits a trace with child spans for each reasoning step and search
Caching (In-Memory, Disk, Redis)
Avoid redundant LLM and vector DB calls. Three backends: in-memory (dev), SQLite (single-machine), Redis (distributed/production).
from deeprecall import DeepRecall, DeepRecallConfig, InMemoryCache, RedisCache
# In-memory (fastest, ephemeral)
config = DeepRecallConfig(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    cache=InMemoryCache(max_size=500, default_ttl=3600),
)
# Redis (distributed, production -- works with AWS ElastiCache, GCP Memorystore, etc.)
config = DeepRecallConfig(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    cache=RedisCache(url="redis://localhost:6379/0"),
    # Or: RedisCache(url="rediss://my-cluster.abc123.cache.amazonaws.com:6379/0")
)
engine = DeepRecall(vectorstore=store, config=config)
# Second identical query hits cache -- zero LLM cost
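As a rough picture of what `max_size` and `default_ttl` typically mean for an in-memory cache, the sketch below combines least-recently-used eviction with per-entry expiry. `ToyTTLCache` and `make_key` are hypothetical, and nothing here is taken from DeepRecall's `InMemoryCache`; the key-hashing scheme in particular is an assumption.

```python
import hashlib
import time
from collections import OrderedDict


class ToyTTLCache:
    """LRU cache with per-entry expiry; the clock is injectable for testing."""

    def __init__(self, max_size=500, default_ttl=3600, clock=time.monotonic):
        self.max_size = max_size
        self.default_ttl = default_ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    @staticmethod
    def make_key(query, **params):
        """Hash the query plus sorted parameters into a stable key."""
        raw = repr((query, sorted(params.items())))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self.default_ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used


fake_now = [0.0]
cache = ToyTTLCache(max_size=2, default_ttl=10, clock=lambda: fake_now[0])
key = cache.make_key("What is X?", top_k=5)
cache.set(key, "cached answer")
hit = cache.get(key)
fake_now[0] = 11.0  # move the fake clock past the TTL
miss_after_expiry = cache.get(key)
print(hit, miss_after_expiry)
```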
Reranking
Improve search quality with Cohere or cross-encoder rerankers.
from deeprecall import DeepRecallConfig
from deeprecall.core.reranker import CohereReranker
config = DeepRecallConfig(
    backend="openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    reranker=CohereReranker(api_key="co-..."),
)
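Reranking is a second pass over the first-stage hits: each candidate is re-scored against the query and the list is re-ordered. The sketch below uses a toy word-overlap score in place of a Cohere or cross-encoder model, so `overlap_score` and `rerank` are hypothetical stand-ins rather than anything DeepRecall ships.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, top_n=2, score_fn=overlap_score):
    """Re-order first-stage hits by score_fn and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # list.sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


first_stage_hits = [
    "interest rates over time",
    "subprime mortgage defaults rose sharply",
    "mortgage lending standards and subprime loans",
]
reranked = rerank("subprime mortgage defaults", first_stage_hits)
print(reranked)
```

Swapping `score_fn` for a call to a real model is the only change a production reranker would need in this shape.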
Async Support & Thread Safety
DeepRecall is designed for high-concurrency production use. Every blocking operation (LLM calls, vector DB searches, cache I/O, file writes) is offloaded from the async event loop via asyncio.to_thread(). Shared state is either lock-protected or thread-local.
import asyncio
from deeprecall import AsyncDeepRecall
async def main():
    engine = AsyncDeepRecall(vectorstore=store, backend="openai",
                             backend_kwargs={"model_name": "gpt-4o-mini"})
    # Non-blocking -- multiple queries can run concurrently
    result = await engine.query("question")
    await engine.add_documents(["new doc..."])
asyncio.run(main())
Thread safety highlights:
- Server endpoints -- query, add_documents, cache/clear all run in the thread pool, never blocking the event loop
- Callbacks -- UsageTrackingCallback counters and JSONLCallback file writes are lock-protected for concurrent queries
- OpenTelemetry -- span state is thread-local, so parallel queries produce isolated traces
- Rate limiter -- bucket state is lock-protected against concurrent access
- Redis cache -- uses the thread-safe redis-py client; hit/miss counters are lock-protected
- Auth middleware -- supports both sync and async validate_fn; sync validators run in a thread
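The two techniques named above, offloading with `asyncio.to_thread()` and guarding shared counters with a lock, can be sketched with the standard library alone. `blocking_search` and `LockedCounter` are hypothetical stand-ins for a vector DB call and a usage counter; this is an illustration of the pattern, not DeepRecall's code. It needs Python 3.9 or later.

```python
import asyncio
import threading
import time


class LockedCounter:
    """Counter that is safe to bump from several worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:
            self.value += 1


calls = LockedCounter()


def blocking_search(query):
    """Stand-in for a blocking vector DB or LLM call."""
    time.sleep(0.05)
    calls.increment()
    return f"results for {query}"


async def run_queries(queries):
    # Each blocking call runs in the default thread pool, so the event loop
    # stays free and the calls overlap instead of running back to back.
    tasks = [asyncio.to_thread(blocking_search, q) for q in queries]
    return await asyncio.gather(*tasks)


results = asyncio.run(run_queries(["a", "b", "c"]))
print(results, calls.value)
```

`asyncio.gather` returns results in the order the tasks were passed in, regardless of which thread finishes first.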
Server Auth & Rate Limiting
deeprecall serve --api-keys "key1,key2" --rate-limit 60 --port 8000
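A lock-protected token bucket is one common way to implement this kind of limit. The sketch below assumes the limit is expressed in requests per minute, which this README does not actually state for `--rate-limit`; `TokenBucket` and its `allow()` method are hypothetical and not taken from DeepRecall's middleware.

```python
import threading
import time


class TokenBucket:
    """Allow bursts up to the per-minute rate, refilling continuously."""

    def __init__(self, rate_per_minute, clock=time.monotonic):
        self.rate_per_minute = rate_per_minute
        self.capacity = float(rate_per_minute)
        self._tokens = self.capacity
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()  # bucket state is shared across threads

    def allow(self):
        """Return True and consume a token if one is available."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._last
            self._last = now
            refill = elapsed * self.rate_per_minute / 60.0
            self._tokens = min(self.capacity, self._tokens + refill)
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 45.0  # 45 seconds later: 1.5 tokens have been refilled
after_wait = bucket.allow()
print(burst, after_wait)
```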
How It Works
- A lightweight HTTP server wraps your vector store on a random port
- A search_db(query, top_k) function is injected into the RLM's sandboxed REPL
- The LLM enters a recursive loop -- it can search, write Python, call sub-LLMs, and search again
- When it has enough info, it returns a FINAL() answer
- You get back the answer, sources, full reasoning trace, budget usage, and confidence score
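The first two steps can be sketched with the standard library: bind a small HTTP server to port 0 so the OS picks a free port, then define a `search_db()` helper that calls it. The `/search` route, the `q` and `top_k` query parameters, and the substring matching over `DOCS` are assumptions made for illustration; DeepRecall's actual wire format is not documented here.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DOCS = ["alpha beta", "beta gamma", "delta"]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        query = params.get("q", [""])[0]
        top_k = int(params.get("top_k", ["5"])[0])
        hits = [d for d in DOCS if query in d][:top_k]
        body = json.dumps(hits).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Port 0 asks the OS for any free port; the chosen one is read back afterwards.
server = ThreadingHTTPServer(("127.0.0.1", 0), SearchHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore any proxy environment variables for this localhost call.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def search_db(query, top_k=5):
    """Client-side helper of the kind that could be injected into a REPL."""
    qs = urllib.parse.urlencode({"q": query, "top_k": top_k})
    url = f"http://127.0.0.1:{port}/search?{qs}"
    with opener.open(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


hits = search_db("beta", top_k=1)
server.shutdown()
server.server_close()
print(port, hits)
```

Putting the store behind HTTP means the code running in the REPL only needs a URL, not a reference to the store object itself.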
Vector Stores
| Store | Install | Needs embedding_fn? |
|---|---|---|
| ChromaDB | deeprecall[chroma] | No (built-in) |
| Milvus | deeprecall[milvus] | Yes |
| Qdrant | deeprecall[qdrant] | Yes |
| Pinecone | deeprecall[pinecone] | Yes |
All stores implement the same interface: add_documents(), search(), delete(), count().
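A shared interface like this can be pictured as a `typing.Protocol` with the four methods. The parameter names and return types below are assumptions, since the README only lists the method names, and `InMemoryToyStore` is a hypothetical substring-matching stand-in, not a DeepRecall class.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class VectorStoreLike(Protocol):
    # Signatures are guesses; only the four method names come from the README.
    def add_documents(self, documents: list[str]) -> list[str]: ...
    def search(self, query: str, top_k: int = 5) -> list[dict]: ...
    def delete(self, ids: list[str]) -> None: ...
    def count(self) -> int: ...


class InMemoryToyStore:
    """Substring-matching stand-in; a real store would compare embeddings."""

    def __init__(self):
        self._docs: dict[str, str] = {}
        self._next_id = 0

    def add_documents(self, documents):
        ids = []
        for text in documents:
            doc_id = f"doc_{self._next_id}"
            self._next_id += 1
            self._docs[doc_id] = text
            ids.append(doc_id)
        return ids

    def search(self, query, top_k=5):
        hits = [
            {"id": doc_id, "text": text}
            for doc_id, text in self._docs.items()
            if query.lower() in text.lower()
        ]
        return hits[:top_k]

    def delete(self, ids):
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    def count(self):
        return len(self._docs)


toy_store = InMemoryToyStore()
ids = toy_store.add_documents(["Recursive models", "Vector search", "Recursive search"])
matches = toy_store.search("recursive")
toy_store.delete([ids[0]])
print(isinstance(toy_store, VectorStoreLike), [m["id"] for m in matches], toy_store.count())
```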
Framework Adapters
LangChain / LlamaIndex / OpenAI-compatible API -- see docs/adapters.md.
deeprecall serve --vectorstore chroma --collection my_docs --port 8000
CLI
deeprecall init # Generate starter config
deeprecall ingest --path ./docs/ # Ingest documents
deeprecall query "question" --max-searches 10 --max-time 30
deeprecall serve --port 8000 --api-keys "key1,key2"
deeprecall delete doc_id_1 doc_id_2 # Delete documents
Project Structure
deeprecall/
├── core/ # Engine, config, guardrails, tracer, cache, callbacks, reranker
│ ├── cache.py # InMemoryCache, DiskCache (SQLite)
│ ├── cache_redis.py # RedisCache (distributed)
│ ├── callbacks.py # ConsoleCallback, JSONLCallback, UsageTrackingCallback
│ ├── callback_otel.py # OpenTelemetry distributed tracing
│ ├── async_engine.py # AsyncDeepRecall (non-blocking wrapper)
│ └── ...
├── vectorstores/ # ChromaDB, Milvus, Qdrant, Pinecone adapters
├── adapters/ # LangChain, LlamaIndex, OpenAI-compatible server
├── middleware/ # API key auth (sync + async), rate limiting (thread-safe)
├── prompts/ # System prompts for the RLM
└── cli.py # CLI entry point
tests/
├── test_concurrency.py # Thread safety & race condition tests
├── test_cache_redis.py # Redis cache unit tests
├── test_callback_otel.py # OpenTelemetry callback unit tests
└── ... # 114+ tests total
Contributing
git clone https://github.com/kothapavan1998/deeprecall.git
cd deeprecall
pip install -e ".[all]"
make check
See CONTRIBUTING.md.
Citation
Built on Recursive Language Models by Zhang, Kraska, and Khattab (MIT).
License
MIT