Minimal Recursive Language Model - Let LLMs think through code
minrlm
minRLM is a token- and latency-efficient implementation of Recursive Language Models, benchmarked across 12 tasks against a vanilla LLM and the reference implementation. On GPT-5-mini it scores 72.7% (vs 69.7% official, 69.5% vanilla) using 3.6x fewer tokens. On GPT-5.2 the gap grows to +30pp over vanilla, winning 11 of 12 tasks. The data never enters the prompt. The cost stays roughly flat regardless of context size. Every intermediate step is Python code you can read, rerun, and debug.
The production case: 3.6x fewer tokens directly translates to the KPIs production systems are measured by - cost per query, p99 latency, and throughput. The flat token cost - independent of input size - makes capacity planning predictable rather than a function of whatever data the user sends. When a query returns the wrong answer, you read the generated code and see exactly where the retrieval went wrong.
How is this different from agents? An RLM is an agent with exactly one tool (Python REPL) that never sees the raw input. It tells the model "you have input_0 with 500K chars" and lets it write code to answer the question. Some agents already do this internally - Claude Code processes web search results through code, Cursor chunks large files instead of pasting them whole. But these are proprietary backend optimizations. RLMs make this a commodity: agentic exploration of data in a single LLM call, where context is dynamic and determined at runtime based on the task and data.
Blog post: minRLM: A Token-Efficient Recursive Language Model Implementation and Benchmark
What's in this repo
| Component | Location | What it does |
|---|---|---|
| RLM client | `minrlm/` | Core `RLM` and `RLMReasoning` classes - the LLM <-> REPL loop |
| DockerREPL | `minrlm/docker_repl.py` | Sandboxed code execution via Docker + custom seccomp |
| Evals | `eval/` | 12-task benchmark framework, runners, metrics, plot generation |
| Examples | `examples/` | Quickstart scripts, proxy server, Gradio side-by-side UI |
Benchmarks
GPT-5-mini (primary benchmark)
1,800 evaluations | 12 tasks | 50 runs per task | 3 runners
| minRLM | Vanilla LLM | Official RLM | |
|---|---|---|---|
| Accuracy | 72.7% | 69.5% | 69.7% |
| Avg Tokens | 8,151 | 20,967 | 29,327 |
| Total Cost | $2.86 | $4.74 | $7.92 |
2.6x fewer tokens than vanilla | 3.6x fewer than official | 1.7x cheaper than vanilla | 2.8x cheaper than official
Model scaling
| Model | minRLM | Vanilla | Delta (pp) | Tasks won by minRLM |
|---|---|---|---|---|
| GPT-5-nano (small) | 53.7% | 63.2% | -9.5 | 4 of 12 |
| GPT-5-mini (mid) | 72.7% | 69.5% | +3.2 | 7 of 12 |
| GPT-5.2 (frontier) | 78.2% | 48.2% | +30.0 | 11 of 12 |
The advantage grows with model capability. On GPT-5.2, minRLM wins 11 of 12 tasks - AIME 2025: 96% vs 0%, BrowseComp: 72% vs 14%, OOLONG: 96% vs 64%. The only consistent loss is RepoQA (code retrieval), where vanilla wins across all model sizes.
Per task (GPT-5-mini)
| Task | minRLM | Vanilla | Official | minRLM Tokens | vs Official Tokens |
|---|---|---|---|---|---|
| SNIAH | 94% | 100% | 76% | 6,328 | 2.6x fewer |
| OOLONG | 92% | 78% | 80% | 6,184 | 2.3x fewer |
| GDP Val | 86% | 54% | 50% | 12,007 | 1.7x fewer |
| IFEval | 84% | 78% | 78% | 5,963 | 1.6x fewer |
| MMLU-Pro | 82% | 90% | 86% | 6,341 | 1.3x fewer |
| LiveCodeBench | 80% | 64% | 60% | 7,106 | 1.3x fewer |
| AIME 2025 | 74% | 88% | 84% | 7,951 | 1.4x fewer |
| GPQA Diamond | 70% | 66% | 74% | 6,679 | 2.1x fewer |
| BrowseComp | 62% | 16% | 66% | 10,740 | 6.4x fewer |
| RepoQA | 62% | 98% | 96% | 8,026 | 2.2x fewer |
| LongBench V2 | 46% | 56% | 48% | 10,767 | 7.8x fewer |
| CodeQA | 40% | 46% | 38% | 9,724 | 8.0x fewer |
minRLM uses fewer tokens than Official RLM on every task (1.3x-8.0x). Vanilla fails on BrowseComp (16%) because the context exceeds the token limit.
Full results and reproduction: eval/README.md
How it works
+-------------------------------------------------------------+
| LLM sees:                                                   |
|                                                             |
|   input_0 = "string with 500000 chars"                      |
|   Task: Count errors in last hour                           |
+-------------------------------------------------------------+
| LLM writes:                                                 |
|                                                             |
|   import re                                                 |
|   from datetime import datetime, timedelta                  |
|   errors = re.findall(r'\[ERROR\].*', input_0)              |
|   cutoff = datetime.now() - timedelta(hours=1)              |
|   FINAL(len([e for e in errors if parse_time(e) > cutoff])) |
+-------------------------------------------------------------+
- Context is stored as `input_0` in a sandboxed Python REPL
- The model writes code to search/filter/aggregate it
- Code runs, output goes back to the model
- Repeat until `FINAL(answer)` is called
The data never enters the conversation. Token cost stays flat regardless of context size.
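The loop above can be sketched in a few lines. This is a toy illustration, not minrlm's actual implementation: `ask_llm` stands in for a real model call, and output capture and error handling are stripped to the bare minimum.

```python
import contextlib
import io


class _Final(Exception):
    """Raised by FINAL(answer) to stop the loop and carry the answer out."""
    def __init__(self, answer):
        self.answer = answer


def run_rlm(ask_llm, context, task, max_iters=10):
    """Toy RLM loop: the model sees only metadata about the context,
    never the raw data, and iterates by writing code against the namespace."""
    def FINAL(answer):
        raise _Final(answer)

    ns = {"input_0": context, "FINAL": FINAL}
    transcript = f"input_0 = <string with {len(context)} chars>\nTask: {task}"
    for _ in range(max_iters):
        code = ask_llm(transcript)          # model returns Python source
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, ns)              # run it against the REPL state
        except _Final as f:
            return f.answer                 # FINAL(answer) ends the loop
        # Only the code and its printed output re-enter the conversation.
        transcript += f"\n>>> {code}\n{buf.getvalue()}"
    return None
```

Because only the code and its printed output are appended to the transcript, the token cost depends on what the model chooses to print, not on the size of `input_0`.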
Install
pip install minrlm # minimal - only openai required
# or
uv add minrlm
From source:
git clone https://github.com/avilum/minrlm
cd minrlm
uv sync # base (openai only)
uv sync --extra eval # + benchmark runner (datasets, matplotlib, tqdm)
uv sync --extra visualizer # + Gradio UI (gradio, plotly, pandas)
uv sync --extra proxy # + OpenAI-compatible proxy (fastapi, uvicorn)
uv sync --extra all # everything
1. minrlm - RLM Client
minrlm/ contains the core library:
| File | Purpose |
|---|---|
| `core.py` | `RLMBase` - base recursive LLM loop |
| `core_reasoning.py` | `RLMReasoning` - reasoning-enhanced version (the default `RLM`) |
| `prompts.py` | System prompt for the base runner |
| `prompts_reasoning.py` | System prompt for the reasoning runner (used by benchmarks) |
| `docker_repl.py` | `DockerREPL` - sandboxed execution backend (see section 2) |
Basic usage
`from minrlm import RLM` gives you `RLMReasoning` - the version with task-adaptive reasoning that produces the benchmark numbers above. Use `RLMBase` if you want the bare-bones loop without reasoning prompts.
from minrlm import RLM
rlm = RLM(model="gpt-5-mini")
result = rlm.completion(
task="How many ERROR logs in the last hour?",
context=server_logs, # 500K chars - never sent to the LLM
)
print(result.response) # "147"
print(result.total_tokens) # ~2K tokens (vs ~93K for vanilla)
print(result.iterations) # number of code->execute cycles
Available REPL functions
| Function | What it does |
|---|---|
| `input_0` | Your context data (string) |
| `search(text, pattern)` | Case-insensitive substring search with context windows |
| `peek(data)` | Preview structure of large data without printing all of it |
| `sub_llm(task, context)` | Recursive LLM call on a sub-chunk |
| `sub_llm_batch([(t, c), ...])` | Parallel batch of recursive calls |
| `FINAL(answer)` | Return the final answer and stop |
| `FINAL_var("name")` | Return a variable from the namespace |
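To make the helpers concrete, here is a rough sketch of what `search` and `peek` might look like. These are hypothetical simplifications for illustration, not minrlm's actual implementations - the real signatures and return shapes may differ.

```python
def search(text, pattern, window=80):
    """Case-insensitive substring search; each hit comes with a
    context window of surrounding characters."""
    hits = []
    low, pat = text.lower(), pattern.lower()
    start = 0
    while (i := low.find(pat, start)) != -1:
        lo, hi = max(0, i - window), i + len(pat) + window
        hits.append(text[lo:hi])
        start = i + 1
    return hits


def peek(data, head=200):
    """Preview large data without printing all of it."""
    s = str(data)
    if len(s) <= head:
        return s
    return f"{s[:head]}... ({len(s)} chars total)"
```

The point of both helpers is the same: give the model enough signal to decide what to look at next while keeping the printed output - and therefore the token cost - small.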
Custom endpoints
rlm = RLM(
model="llama-3.1-70b",
base_url="http://localhost:8000/v1",
api_key="sk-...",
)
When to use RLM vs vanilla
| Use RLM when... | Use vanilla LLM when... |
|---|---|
| Context > 50K chars | Context is short (<50K chars) |
| Searching or filtering data | Summarization or open-ended generation |
| Counting, aggregating, extracting | Holistic understanding needed |
| Context doesn't fit in the window | Simple Q&A on short documents |
2. DockerREPL - Sandboxed Code Execution
LLM-generated code runs in an isolated Docker container with a custom seccomp profile. Docker is auto-detected and enabled if available.
from minrlm import RLM, check_docker_available
# Auto-detects Docker
rlm = RLM(model="gpt-5-mini")
# Explicit control
if check_docker_available():
rlm = RLM(
model="gpt-5-mini",
use_docker=True,
docker_memory="256m",
docker_timeout=60,
)
What the sandbox blocks
| Restriction | How |
|---|---|
| No network access | --network=none + seccomp blocks socket, connect, bind, ... |
| Read-only filesystem | --read-only (writable /tmp only) |
| Memory cap | --memory=256m (configurable) |
| CPU cap | --cpus=1.0 (configurable) |
| Process limit | --pids-limit=100 |
| Kernel module loading | seccomp: init_module, finit_module blocked |
| Mount operations | seccomp: mount, umount blocked |
| ptrace / debugging | seccomp: ptrace blocked |
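Assembled into a single invocation, the restrictions above correspond roughly to a `docker run` like the following. This is illustrative, not the exact command minrlm builds - the image name, profile path, and script path are placeholders.

```shell
docker run --rm \
  --network=none \
  --read-only --tmpfs /tmp \
  --memory=256m \
  --cpus=1.0 \
  --pids-limit=100 \
  --security-opt seccomp=seccomp_profile.json \
  python:3.11-slim python /tmp/snippet.py
```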
Container lifecycle
Every container is assigned a unique name (`minrlm_<pid>_<n>`) and tracked process-wide. Containers are automatically killed when:
- The container finishes (normal exit via `--rm`)
- The execution times out (`subprocess.TimeoutExpired` -> `docker kill`)
- The parent Python process exits normally (`atexit` hook)
- The parent process receives `SIGTERM` or `SIGINT` (signal handlers)
No zombie containers after a crash or Ctrl+C.
Custom seccomp policy
Extend or replace the seccomp profile
Edit SECCOMP_PROFILE in minrlm/docker_repl.py:
SECCOMP_PROFILE = {
"defaultAction": "SCMP_ACT_ALLOW",
"syscalls": [
{"names": ["socket"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1},
# add more restrictions...
],
}
Or subclass DockerREPL to inject a different profile at runtime.
Tip: use gVisor as the Docker runtime for an additional kernel isolation layer.
Note: `sub_llm()` is supported in Docker mode via a retry protocol - the container signals requests to the host, which calls the LLM and re-runs the container with cached results.
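A toy version of that retry protocol, heavily simplified and with hypothetical names: the sandboxed code raises when it hits an unresolved `sub_llm` call, the host resolves it, and the whole run is repeated with the answer cached.

```python
class NeedsLLM(Exception):
    """Signal from the sandbox to the host: a sub_llm call needs resolving."""
    def __init__(self, task, context):
        self.task, self.context = task, context


def run_with_retries(sandboxed_fn, call_llm, max_retries=10):
    """Re-run sandboxed_fn until every sub_llm call it makes is cached."""
    cache = {}  # (task, context) -> answer

    def sub_llm(task, context):
        key = (task, context)
        if key not in cache:
            raise NeedsLLM(task, context)  # abort the run; host takes over
        return cache[key]

    for _ in range(max_retries):
        try:
            return sandboxed_fn(sub_llm)   # re-run from scratch each time
        except NeedsLLM as req:
            # Host-side: call the LLM, cache the answer, then retry.
            cache[(req.task, req.context)] = call_llm(req.task, req.context)
    raise RuntimeError("too many sub_llm retries")
```

Re-running from scratch is what makes this work across a process boundary: the container never needs a live channel to the host, only the cache injected into the next run.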
3. Evals
eval/ is a self-contained benchmark framework covering 12 tasks across 3 model sizes (GPT-5-nano, GPT-5-mini, GPT-5.2).
| File | Purpose |
|---|---|
| `quickstart.py` | Smoke test - one task, two runners, instant feedback |
| `run.py` | Full benchmark runner with parallelism, logging, and result export |
| `tasks.py` | 12 benchmark tasks (S-NIAH, OOLONG, CodeQA, LongBench-v2, RepoQA, BrowseComp+, GDP Val, AIME 2025, GPQA Diamond, MMLU-Pro, IFEval, LiveCodeBench) |
| `runners.py` | Runner implementations: vanilla, minrlm, minrlm-reasoning, official |
| `metrics.py` | `EvalResult`, `AggregatedMetrics`, cost calculation, markdown report generation |
| `plotting.py` | 8 standalone plots (accuracy, tokens, latency, cost, efficiency scatter) |
| `README.md` | Full benchmark results and reproduction steps |
Quick start
uv sync --extra eval
export OPENAI_API_KEY="your-key"
# Smoke test (one task, ~1 min)
uv run python eval/quickstart.py
# Single task, 10 runs
uv run python eval/run.py --model gpt-5-mini --tasks official_sniah --runs 10
# All tasks, single runner, 50 runs each
uv run python eval/run.py \
--model gpt-5-mini \
--tasks all \
--runners minrlm-reasoning \
--runs 50 \
--parallel 5 \
--output-dir logs/my_eval
# Full multi-runner benchmark (reproduces the table above)
uv run python eval/run.py \
--tasks all \
--runners minrlm-reasoning,vanilla,official \
--runs 50 --parallel 12 --task-parallel 12 \
--output-dir logs/my_eval
Visualize results
# Generate 8 plots from any eval JSON
uv run python -m eval.plotting logs/my_eval/eval_20260302.json
# Auto-discover newest JSON in a directory tree
uv run python -m eval.plotting logs/my_eval/
# Custom output directory
uv run python -m eval.plotting logs/my_eval/ reports/my_eval_plots/
Plots generated: accuracy per task, tokens per task, latency per task, cost per task, accuracy vs cost (efficiency frontier), accuracy vs latency, token savings vs baselines, summary dashboard.
See eval/README.md for all tasks, flags, and full results.
4. Examples
examples/ contains runnable scripts for common use cases.
minimal.py - Vanilla LLM vs RLM
Side-by-side comparison on a single task. Good starting point.
uv run python examples/minimal.py
MINRLM_MODEL=gpt-5-mini uv run python examples/minimal.py
advanced_usage.py - Search, sub_llm, callbacks
Demonstrates search(), sub_llm(), step callbacks, and multi-context usage.
uv run python examples/advanced_usage.py
visualizer.py - Gradio side-by-side UI
Interactive web app for comparing runners on evaluation tasks or custom prompts. Shows generated code, token usage, and timing for each step.
uv sync --extra visualizer
uv run python examples/visualizer.py # http://localhost:7860
proxy.py - OpenAI-compatible proxy server
Drop-in replacement for the OpenAI API. Large contexts (>50K chars) are automatically routed through RLM; short contexts pass through directly.
uv sync --extra proxy
uv run uvicorn examples.proxy:app --host 0.0.0.0 --port 8000
MINRLM_VERBOSE=1 uv run uvicorn examples.proxy:app --port 8000 # verbose
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="gpt-5-mini",
messages=[{"role": "user", "content": "Print powers of 2 up to 1M"}],
)
See examples/proxy_example.py for more.
Environment variables for the proxy:
export OPENAI_API_KEY="your-key"
export RLM_MODEL="gpt-5-mini"
export RLM_USE_DOCKER="true"
export PORT="8000"
export MINRLM_VERBOSE="1"
Why RLMs?
- No context window limit - data lives in the REPL, not the prompt. 10M chars costs the same as 10K
- Flat token cost - ~5-8K tokens regardless of input size. Predictable cost per query at scale
- Measurable KPIs - accuracy, tokens, latency, and cost tracked per query. No black-box hope
- Deterministic retrieval - Python code extracts data, not attention. Inspectable, reproducible
- Dynamic context - the LLM decides what to look at based on the task, not you
- Any LLM - works with any OpenAI-compatible endpoint (OpenAI, Anthropic, local models)
Credits
minrlm is built by Avi Lumelsky. This is an independent implementation - not a fork of the official code. The prompts, reasoning engine, eval framework, Docker sandboxing, and proxy server are all original work.
The RLM concept comes from Zhang, Kraska, and Khattab:
@misc{zhang2025recursivelanguagemodels,
title={Recursive Language Models},
author={Alex L. Zhang and Tim Kraska and Omar Khattab},
year={2025},
eprint={2512.24601},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.24601},
}
Paper: arxiv.org/abs/2512.24601
Official implementation: github.com/alexzhang13/rlm
License
MIT
I'm a security researcher. This is far from production-grade security - but it's fucking cool. Use Docker mode (default when Docker is installed) - the custom seccomp policy blocks network syscalls and most dangerous operations. For extra isolation, use gVisor as the Docker runtime.