Minimal Recursive Language Model - Let LLMs think through code

minrlm

minRLM is a token- and latency-efficient implementation of Recursive Language Models, benchmarked across 12 tasks against a vanilla LLM and the reference implementation. On GPT-5-mini it scores 72.7% (vs 69.7% official, 69.5% vanilla) using 3.6x fewer tokens. On GPT-5.2 the gap grows to +30pp over vanilla, winning 11 of 12 tasks. The data never enters the prompt. The cost stays roughly flat regardless of context size. Every intermediate step is Python code you can read, rerun, and debug.

The production case: 3.6x fewer tokens directly translates to the KPIs production systems are measured by - cost per query, p99 latency, and throughput. The flat token cost - independent of input size - makes capacity planning predictable rather than a function of whatever data the user sends. When a query returns the wrong answer, you read the generated code and see exactly where the retrieval went wrong.

How is this different from agents? An RLM is an agent with exactly one tool (Python REPL) that never sees the raw input. It tells the model "you have input_0 with 500K chars" and lets it write code to answer the question. Some agents already do this internally - Claude Code processes web search results through code, Cursor chunks large files instead of pasting them whole. But these are proprietary backend optimizations. RLMs make this a commodity: agentic exploration of data in a single LLM call, where context is dynamic and determined at runtime based on the task and data.

Blog post: minRLM: A Token-Efficient Recursive Language Model Implementation and Benchmark


What's in this repo

Component Location What it does
RLM client minrlm/ Core RLM and RLMReasoning classes - the LLM <-> REPL loop
DockerREPL minrlm/docker_repl.py Sandboxed code execution via Docker + custom seccomp
Evals eval/ 12-task benchmark framework, runners, metrics, plot generation
Examples examples/ Quickstart scripts, proxy server, Gradio side-by-side UI

Benchmarks

GPT-5-mini (primary benchmark)

1,800 evaluations | 12 tasks | 50 runs per task | 3 runners

minRLM Vanilla LLM Official RLM
Accuracy 72.7% 69.5% 69.7%
Avg Tokens 8,151 20,967 29,327
Total Cost $2.86 $4.74 $7.92

2.6x fewer tokens than vanilla | 3.6x fewer than official | 1.7x cheaper than vanilla | 2.8x cheaper than official

Model scaling

Model minRLM Vanilla Delta (pp) Tasks won by minRLM
GPT-5-nano (small) 53.7% 63.2% -9.5 4 of 12
GPT-5-mini (mid) 72.7% 69.5% +3.2 7 of 12
GPT-5.2 (frontier) 78.2% 48.2% +30.0 11 of 12

The advantage grows with model capability. On GPT-5.2, minRLM wins 11 of 12 tasks - AIME 2025: 96% vs 0%, BrowseComp: 72% vs 14%, OOLONG: 96% vs 64%. The only consistent loss is RepoQA (code retrieval), where vanilla wins across all model sizes.

Charts (GPT-5-mini)

  • Summary Dashboard
  • Accuracy per Task
  • Token Savings vs Baselines
  • Tokens per Task
  • Cost per Query by Task
  • Latency per Task
  • Accuracy vs Cost - Efficiency Frontier
  • Accuracy vs Latency

Per task (GPT-5-mini)

Task minRLM Acc. Vanilla Acc. Official Acc. minRLM Tokens vs Official Tokens
SNIAH 94% 100% 76% 6,328 2.6x fewer
OOLONG 92% 78% 80% 6,184 2.3x fewer
GDP Val 86% 54% 50% 12,007 1.7x fewer
IFEval 84% 78% 78% 5,963 1.6x fewer
MMLU-Pro 82% 90% 86% 6,341 1.3x fewer
LiveCodeBench 80% 64% 60% 7,106 1.3x fewer
AIME 2025 74% 88% 84% 7,951 1.4x fewer
GPQA Diamond 70% 66% 74% 6,679 2.1x fewer
BrowseComp 62% 16% 66% 10,740 6.4x fewer
RepoQA 62% 98% 96% 8,026 2.2x fewer
LongBench V2 46% 56% 48% 10,767 7.8x fewer
CodeQA 40% 46% 38% 9,724 8.0x fewer

minRLM uses fewer tokens than Official RLM on every task (1.3x-8.0x). Vanilla fails on BrowseComp (16%) because the context exceeds the token limit.

Full results and reproduction: eval/README.md


How it works

+------------------------------------------------------------+
|  LLM sees:                                                  |
|                                                             |
|  input_0 = "string with 500000 chars"                       |
|  Task: Count errors in last hour                            |
+------------------------------------------------------------+
|  LLM writes:                                                |
|                                                             |
|  import re                                                  |
|  from datetime import datetime, timedelta                   |
|  errors = re.findall(r'\[ERROR\].*', input_0)               |
|  cutoff = datetime.now() - timedelta(hours=1)               |
|  FINAL(len([e for e in errors if parse_time(e) > cutoff]))  |
+------------------------------------------------------------+
  1. Context is stored as input_0 in a sandboxed Python REPL
  2. The model writes code to search/filter/aggregate it
  3. Code runs, output goes back to the model
  4. Repeat until FINAL(answer) is called

The data never enters the conversation. Token cost stays flat regardless of context size.
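For intuition, the whole loop can be sketched in a few lines of Python. This is a simplified illustration, not the minrlm source: it assumes an OpenAI-style client, strips markdown fences naively, and uses a plain exec() namespace instead of the Docker sandbox.

import contextlib
import io
import re
from openai import OpenAI

def rlm_answer(task: str, context: str, model: str = "gpt-5-mini", max_iters: int = 8):
    client = OpenAI()
    namespace = {"input_0": context}        # the data lives here, never in the prompt
    messages = [
        {"role": "system", "content": (
            f"You have a Python REPL with a variable input_0 ({len(context)} chars). "
            "Reply with code only; call FINAL(answer) when you are done."
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_iters):
        reply = client.chat.completions.create(model=model, messages=messages)
        code = re.sub(r"^```(?:python)?\s*|```\s*$", "",
                      reply.choices[0].message.content.strip(), flags=re.M)
        final = {}
        namespace["FINAL"] = lambda ans: final.setdefault("answer", ans)
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)           # run the model's code against the data
        if "answer" in final:               # (the real loop also halts execution here)
            return final["answer"]
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"Output:\n{buf.getvalue()[:2000]}"},
        ]
    return None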


Install

pip install minrlm          # minimal - only openai required
# or
uv add minrlm

From source:

git clone https://github.com/avilum/minrlm
cd minrlm
uv sync                     # base (openai only)
uv sync --extra eval        # + benchmark runner (datasets, matplotlib, tqdm)
uv sync --extra visualizer  # + Gradio UI (gradio, plotly, pandas)
uv sync --extra proxy       # + OpenAI-compatible proxy (fastapi, uvicorn)
uv sync --extra all         # everything

1. minrlm - RLM Client

minrlm/ contains the core library:

File Purpose
core.py RLMBase - base recursive LLM loop
core_reasoning.py RLMReasoning - reasoning-enhanced version (the default RLM)
prompts.py System prompt for the base runner
prompts_reasoning.py System prompt for the reasoning runner (used by benchmarks)
docker_repl.py DockerREPL - sandboxed execution backend (see section 2)

Basic usage

from minrlm import RLM gives you RLMReasoning - the version with task-adaptive reasoning that produces the benchmark numbers above. Use RLMBase if you want the bare-bones loop without reasoning prompts.

from minrlm import RLM

rlm = RLM(model="gpt-5-mini")

result = rlm.completion(
    task="How many ERROR logs in the last hour?",
    context=server_logs,          # 500K chars - never sent to the LLM
)
print(result.response)            # "147"
print(result.total_tokens)        # ~2K tokens (vs ~93K for vanilla)
print(result.iterations)          # number of code->execute cycles

Available REPL functions

Function What it does
input_0 Your context data (string)
search(text, pattern) Case-insensitive substring search with context windows
peek(data) Preview structure of large data without printing all of it
sub_llm(task, context) Recursive LLM call on a sub-chunk
sub_llm_batch([(t,c), ...]) Parallel batch of recursive calls
FINAL(answer) Return the final answer and stop
FINAL_var("name") Return a variable from the namespace
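A hypothetical snippet the model might write using these helpers (illustrative only; the signatures are as listed above):

# Preview the structure of the data before committing to a strategy
peek(input_0)

# Case-insensitive search with surrounding context windows
matches = search(input_0, "timeout")
print(matches)

# Delegate a fuzzy sub-question about one slice of the raw data
summary = sub_llm("Summarize the failure modes in this log excerpt", input_0[:50_000])

# Return the answer and stop the loop
FINAL(summary)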

Custom endpoints

rlm = RLM(
    model="llama-3.1-70b",
    base_url="http://localhost:8000/v1",
    api_key="sk-...",
)

When to use RLM vs vanilla

Use RLM when... Use vanilla LLM when...
Context > 50K chars Context is short (<50K chars)
Searching or filtering data Summarization or open-ended generation
Counting, aggregating, extracting Holistic understanding needed
Context doesn't fit in the window Simple Q&A on short documents

2. DockerREPL - Sandboxed Code Execution

LLM-generated code runs in an isolated Docker container with a custom seccomp profile. Docker is auto-detected and enabled if available.

from minrlm import RLM, check_docker_available

# Auto-detects Docker
rlm = RLM(model="gpt-5-mini")

# Explicit control
if check_docker_available():
    rlm = RLM(
        model="gpt-5-mini",
        use_docker=True,
        docker_memory="256m",
        docker_timeout=60,
    )

What the sandbox blocks

Restriction How
No network access --network=none + seccomp blocks socket, connect, bind, ...
Read-only filesystem --read-only (writable /tmp only)
Memory cap --memory=256m (configurable)
CPU cap --cpus=1.0 (configurable)
Process limit --pids-limit=100
Kernel module loading seccomp: init_module, finit_module blocked
Mount operations seccomp: mount, umount blocked
ptrace / debugging seccomp: ptrace blocked
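The restrictions above map roughly onto a docker run invocation. A minimal sketch of how such a launch could look from Python is below; the image name, trimmed-down profile, and error handling are assumptions for illustration, not the actual DockerREPL code.

import json
import subprocess
import tempfile

def run_sandboxed(code: str, timeout: int = 60) -> str:
    # Write a (trimmed-down) seccomp profile to a temp file for --security-opt
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({
            "defaultAction": "SCMP_ACT_ALLOW",
            "syscalls": [{"names": ["socket", "connect", "bind"],
                          "action": "SCMP_ACT_ERRNO", "errnoRet": 1}],
        }, f)
        profile_path = f.name

    cmd = [
        "docker", "run", "--rm",
        "--network=none",                    # no network access
        "--read-only", "--tmpfs", "/tmp",    # read-only filesystem, writable /tmp only
        "--memory=256m", "--cpus=1.0",       # resource caps
        "--pids-limit=100",                  # process limit
        f"--security-opt=seccomp={profile_path}",
        "python:3.11-slim",                  # assumed image
        "python", "-c", code,
    ]
    # On timeout the real implementation also issues `docker kill` (see lifecycle below)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout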

Container lifecycle

Every container is assigned a unique name (minrlm_<pid>_<n>) and tracked process-wide. Containers are automatically killed when:

  • The container finishes (normal exit via --rm)
  • The execution times out (subprocess.TimeoutExpired -> docker kill)
  • The parent Python process exits normally (atexit hook)
  • The parent process receives SIGTERM or SIGINT (signal handlers)

No zombie containers after a crash or Ctrl+C.
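A minimal sketch of how those cleanup guarantees can be wired up with atexit and signal handlers (illustrative; the real hooks live in minrlm/docker_repl.py):

import atexit
import signal
import subprocess
import sys

_active_containers: set[str] = set()        # container names, tracked process-wide

def _kill_all():
    # Best-effort kill of every tracked container
    for name in list(_active_containers):
        subprocess.run(["docker", "kill", name], capture_output=True)
    _active_containers.clear()

def _handle_signal(signum, frame):
    _kill_all()
    sys.exit(128 + signum)

atexit.register(_kill_all)                  # normal interpreter exit
signal.signal(signal.SIGTERM, _handle_signal)
signal.signal(signal.SIGINT, _handle_signal)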

Custom seccomp policy

Extend or replace the seccomp profile

Edit SECCOMP_PROFILE in minrlm/docker_repl.py:

SECCOMP_PROFILE = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {"names": ["socket"], "action": "SCMP_ACT_ERRNO", "errnoRet": 1},
        # add more restrictions...
    ],
}

Or subclass DockerREPL to inject a different profile at runtime.

Tip: use gVisor as the Docker runtime for an additional kernel isolation layer.

Note: sub_llm() is supported in Docker mode via a retry protocol - the container signals requests to the host, which calls the LLM and re-runs the container with cached results.
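A toy illustration of that idea - not the actual protocol or wire format - using an exception as the "signal" and an in-memory cache:

import hashlib

def fake_llm(task: str, context: str) -> str:
    # Stand-in for the LLM call the host makes on the container's behalf
    return f"answer to {task!r}"

class SubLLMRequest(Exception):
    # Raised on the sandboxed side when sub_llm() has no cached answer yet
    def __init__(self, key, task, context):
        self.key, self.task, self.context = key, task, context

def run_container(code: str, cache: dict):
    # Stand-in for (re-)running the container with a cache of sub_llm answers
    def sub_llm(task, context):
        key = hashlib.sha256(f"{task}|{context}".encode()).hexdigest()
        if key not in cache:
            raise SubLLMRequest(key, task, context)     # signal the host and abort
        return cache[key]
    ns = {"sub_llm": sub_llm, "result": None}
    exec(code, ns)
    return ns["result"]

cache = {}
code = 'result = sub_llm("summarize this chunk", "some slice of input_0")'
while True:
    try:
        print(run_container(code, cache))               # succeeds once the cache is warm
        break
    except SubLLMRequest as req:
        cache[req.key] = fake_llm(req.task, req.context)    # host answers, then re-runs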


3. Evals

eval/ is a self-contained benchmark framework covering 12 tasks across 3 model sizes (GPT-5-nano, GPT-5-mini, GPT-5.2).

File Purpose
quickstart.py Smoke test - one task, two runners, instant feedback
run.py Full benchmark runner with parallelism, logging, and result export
tasks.py 12 benchmark tasks (S-NIAH, OOLONG, CodeQA, LongBench-v2, RepoQA, BrowseComp+, GDP Val, AIME 2025, GPQA Diamond, MMLU-Pro, IFEval, LiveCodeBench)
runners.py Runner implementations: vanilla, minrlm, minrlm-reasoning, official
metrics.py EvalResult, AggregatedMetrics, cost calculation, markdown report generation
plotting.py 8 standalone plots (accuracy, tokens, latency, cost, efficiency scatter)
README.md Full benchmark results and reproduction steps

Quick start

uv sync --extra eval
export OPENAI_API_KEY="your-key"

# Smoke test (one task, ~1 min)
uv run python eval/quickstart.py

# Single task, 10 runs
uv run python eval/run.py --model gpt-5-mini --tasks official_sniah --runs 10

# All tasks, single runner, 50 runs each
uv run python eval/run.py \
    --model gpt-5-mini \
    --tasks all \
    --runners minrlm-reasoning \
    --runs 50 \
    --parallel 5 \
    --output-dir logs/my_eval

# Full multi-runner benchmark (reproduces the table above)
uv run python eval/run.py \
    --tasks all \
    --runners minrlm-reasoning,vanilla,official \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval

Visualize results

# Generate 8 plots from any eval JSON
uv run python -m eval.plotting logs/my_eval/eval_20260302.json

# Auto-discover newest JSON in a directory tree
uv run python -m eval.plotting logs/my_eval/

# Custom output directory
uv run python -m eval.plotting logs/my_eval/ reports/my_eval_plots/

Plots generated: accuracy per task, tokens per task, latency per task, cost per task, accuracy vs cost (efficiency frontier), accuracy vs latency, token savings vs baselines, summary dashboard.

See eval/README.md for all tasks, flags, and full results.


4. Examples

examples/ contains runnable scripts for common use cases.

minimal.py - Vanilla LLM vs RLM

Side-by-side comparison on a single task. Good starting point.

uv run python examples/minimal.py
MINRLM_MODEL=gpt-5-mini uv run python examples/minimal.py

advanced_usage.py - Search, sub_llm, callbacks

Demonstrates search(), sub_llm(), step callbacks, and multi-context usage.

uv run python examples/advanced_usage.py

visualizer.py - Gradio side-by-side UI

Interactive web app for comparing runners on evaluation tasks or custom prompts. Shows generated code, token usage, and timing for each step.

uv sync --extra visualizer
uv run python examples/visualizer.py      # http://localhost:7860

proxy.py - OpenAI-compatible proxy server

Drop-in replacement for the OpenAI API. Large contexts (>50K chars) are automatically routed through RLM; short contexts pass through directly.

uv sync --extra proxy
uv run uvicorn examples.proxy:app --host 0.0.0.0 --port 8000
MINRLM_VERBOSE=1 uv run uvicorn examples.proxy:app --port 8000   # verbose

Point any OpenAI-compatible client at the proxy:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Print powers of 2 up to 1M"}],
)

See examples/proxy_example.py for more.
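The routing rule itself boils down to a length check. Conceptually it looks like this (a sketch, not the proxy's actual code; the 50K threshold matches the description above):

def route(messages: list[dict], threshold: int = 50_000) -> str:
    # Sketch of the proxy's routing rule: large contexts go through RLM
    text = "\n".join(m["content"] for m in messages if isinstance(m.get("content"), str))
    if len(text) > threshold:
        return "rlm"            # store the context in the REPL, answer via code
    return "passthrough"        # short prompt: forward to the upstream LLM as-is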

Environment variables for the proxy:

export OPENAI_API_KEY="your-key"
export RLM_MODEL="gpt-5-mini"
export RLM_USE_DOCKER="true"
export PORT="8000"
export MINRLM_VERBOSE="1"

Why RLMs?

  • No context window limit - data lives in the REPL, not the prompt. 10M chars costs the same as 10K
  • Flat token cost - ~5-8K tokens regardless of input size. Predictable cost per query at scale
  • Measurable KPIs - accuracy, tokens, latency, and cost tracked per query. No black-box hope
  • Deterministic retrieval - Python code extracts data, not attention. Inspectable, reproducible
  • Dynamic context - the LLM decides what to look at based on the task, not you
  • Any LLM - works with any OpenAI-compatible endpoint (OpenAI, Anthropic, local models)

Credits

minrlm is built by Avi Lumelsky. This is an independent implementation - not a fork of the official code. The prompts, reasoning engine, eval framework, Docker sandboxing, and proxy server are all original work.

The RLM concept comes from Zhang, Kraska, and Khattab:

@misc{zhang2025recursivelanguagemodels,
      title={Recursive Language Models},
      author={Alex L. Zhang and Tim Kraska and Omar Khattab},
      year={2025},
      eprint={2512.24601},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.24601},
}

Paper: arxiv.org/abs/2512.24601
Official implementation: github.com/alexzhang13/rlm

License

MIT


I'm a security researcher. This is far from production-grade security - but it's fucking cool. Use Docker mode (default when Docker is installed) - the custom seccomp policy blocks network syscalls and most dangerous operations. For extra isolation, use gVisor as the Docker runtime.
