replm = REPL + LM

What is replm?
replm is a lightweight Python library that wraps any OpenAI-compatible client and turns your LLM into an RLM: a Recursive Language Model that processes arbitrarily long prompts by offloading context into a REPL and enabling symbolic recursion via sub-LLM calls.
Based on the paper Recursive Language Models (Zhang, Kraska & Khattab, 2025).
What is an RLM?
Standard LLMs break down when prompts exceed their context window, and quality degrades well before the hard limit (context rot). An RLM fixes this by treating the prompt as a variable in a persistent REPL environment rather than feeding it into the model's token budget. The model writes code to peek at, decompose, and recursively call itself over slices of the context.
User prompt (arbitrarily long)
│
▼
┌───────────────────────┐
│ REPL Environment │
│ context = <prompt> │◄── model writes code here
│ llm_query(...) │ (peek, chunk, sub-call)
└───────┬───────────────┘
│
only metadata
(length, prefix)
│
▼
┌───────────────────────┐
│ Root LLM │
│ generates code + │
│ reasoning each turn │
└───────────────────────┘
Key properties:
- The full prompt never enters the LLM's context window. Only metadata (length, a short prefix) does.
- Stdout is truncated before being shown to the root model, forcing it to use variables and sub-calls.
- Symbolic recursion: the llm_query() function is callable inside REPL code, so the model can launch O(|P|) or even O(|P|^2) sub-LLM calls over programmatic slices of the input.
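For intuition, the code the root model writes inside the REPL might look roughly like the sketch below. Only context and llm_query are part of the documented REPL interface; the chunk size and prompt wording are illustrative, and context is assumed here to have been passed as a single string:

# Illustrative only: the kind of code the root model might emit inside the REPL.
chunk_size = 50_000
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

notes = []
for chunk in chunks:
    # Each slice is handled by a sub-LLM call instead of entering the root context.
    notes.append(llm_query(f"Summarize the key claims in this excerpt:\n{chunk}"))

summary = "\n".join(notes)
print(summary[:500])  # stdout is truncated before the root model sees it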
Installation
uv add replm
Or from source with development dependencies:
uv sync --group dev
Requirements: Python 3.10+. The openai package is needed for OpenAI-compatible providers; alternatively, implement the LLMClient protocol for any other backend.
Quick Start
from openai import OpenAI
from replm import RLMWrapper, RLMConfig
client = RLMWrapper(
OpenAI(api_key="sk-..."),
root_model="gpt-5.2",
)
response = client.generate(
query="What is the main argument of this book?",
context=very_long_text, # string or list[str]
)
print(response.answer) # final answer string
print(response.iterations) # REPL loop iterations used
print(response.sub_calls) # sub-LLM calls made
print(response.elapsed_seconds) # wall-clock time
print(response.cost) # USD cost (if pricing configured)
Advanced Usage
Separate root and sub-call models
Use a powerful model for orchestration and a cheaper one for sub-calls:
client = RLMWrapper(
OpenAI(api_key="sk-..."),
root_model="gpt-5.2",
sub_model="gpt-5-mini",
config=RLMConfig(
max_iterations=30,
max_sub_calls=1000,
verbose=True,
),
)
Async generation
Use agenerate() with an async client for concurrent sub-calls via llm_query_batch:
from openai import AsyncOpenAI
from replm import RLMWrapper
client = RLMWrapper(
AsyncOpenAI(api_key="sk-..."),
root_model="gpt-5.2",
sub_model="gpt-5-mini",
)
response = await client.agenerate(
query="Summarize all documents.",
context=list_of_documents,
)
Token-by-token streaming
Stream root model tokens as they arrive using astream_generate():
from openai import AsyncOpenAI
from replm import RLMWrapper
client = RLMWrapper(AsyncOpenAI(api_key="sk-..."), root_model="gpt-5.2")
async for chunk in client.astream_generate("Summarize.", very_long_text):
if chunk.type == "token":
print(chunk.content, end="", flush=True)
elif chunk.type == "final_answer":
response = chunk.detail["response"]
print(f"\n\nTokens: {response.total_input_tokens + response.total_output_tokens}")
Chunk types: "token", "iteration_start", "code_executed", "final_answer".
If the client supports native streaming (i.e. it has an astream() method), tokens arrive in real time. Otherwise, the wrapper falls back to acomplete() and yields the full response as a single chunk.
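A minimal sketch that dispatches on all four documented chunk types; it relies only on chunk.type, chunk.content (for tokens), and chunk.detail["response"] (for the final answer), exactly as in the example above:

# Sketch: consume the stream and react to each documented chunk type.
iterations = 0
async for chunk in client.astream_generate("Summarize.", very_long_text):
    if chunk.type == "iteration_start":
        iterations += 1
        print(f"\n--- iteration {iterations} ---")
    elif chunk.type == "token":
        print(chunk.content, end="", flush=True)
    elif chunk.type == "code_executed":
        print("\n[code block executed]")
    elif chunk.type == "final_answer":
        response = chunk.detail["response"]
        print(f"\n\nAnswer: {response.answer}")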
Event callbacks for observability
def on_event(event):
print(f"[iter {event.iteration}] {event.type}: {event.preview[:80]}")
response = client.generate(
query="Find all entities mentioned in these documents.",
context=list_of_documents,
on_event=on_event,
)
OpenAI-compatible providers
Works with any provider that exposes the OpenAI chat completions API:
# Together AI
client = RLMWrapper(
OpenAI(api_key="...", base_url="https://api.together.xyz/v1"),
root_model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
)
# Fireworks
client = RLMWrapper(
OpenAI(api_key="fw-...", base_url="https://api.fireworks.ai/inference/v1"),
root_model="accounts/fireworks/models/qwen3-coder-480b-a35b",
)
Custom providers
For non-OpenAI backends, implement the LLMClient protocol directly:
from replm import RLMWrapper, CompletionResult
class MyClient:
def complete(self, model, messages, temperature, max_tokens):
# Call your LLM here
return CompletionResult(content="...", input_tokens=0, output_tokens=0)
client = RLMWrapper(MyClient(), root_model="my-model")
OpenAI SDK clients are auto-wrapped in OpenAIAdapter — no changes needed for existing code.
Multi-document context
Pass a list of strings to process many documents:
response = client.generate(
query="Which documents mention climate change?",
context=["doc 1 text...", "doc 2 text...", ...],
)
Cost tracking
Configure per-token pricing to get cost estimates:
config = RLMConfig(
cost_per_input_token=2.50 / 1_000_000,
cost_per_output_token=10.0 / 1_000_000,
)
client = RLMWrapper(OpenAI(api_key="sk-..."), root_model="gpt-5.2", config=config)
response = client.generate(query="...", context=long_text)
print(f"Cost: ${response.cost:.4f}")
Sub-call caching
Avoid redundant API calls when the same sub-call prompt is issued multiple times within a single generation:
config = RLMConfig(cache_sub_calls=True)
Cache hits are free — they don't count against the sub-call budget. The cache is per-generation (not persisted across calls) and uses LRU eviction with a 10,000-entry default.
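For example, cache effectiveness can be checked after a run via the cache_hits counter on the response (the query below is arbitrary):

config = RLMConfig(cache_sub_calls=True)
client = RLMWrapper(OpenAI(api_key="sk-..."), root_model="gpt-5.2", config=config)

response = client.generate(
    query="List every person named in these reports.",
    context=list_of_documents,
)
# Cache hits don't count against max_sub_calls
print(f"sub-calls: {response.sub_calls}, cache hits: {response.cache_hits}")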
OpenTelemetry tracing
Install the optional tracing dependency to get automatic span instrumentation:
uv add "replm[tracing]"
When opentelemetry-api is installed, spans are emitted automatically:
- rlm.generate — root generation run (attributes: query length, model, iterations, tokens, elapsed time)
- rlm.sub_call — each sub-LLM call (attributes: depth, prompt length, tokens)
When OTel is not installed, tracing is a zero-cost no-op — no code changes needed.
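To actually export those spans you also need an OpenTelemetry SDK and exporter configured in your application; this is standard OTel setup, not replm-specific. A minimal sketch, assuming opentelemetry-sdk is installed alongside the [tracing] extra:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Register a global tracer provider; replm's spans are picked up automatically.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

response = client.generate(query="...", context=long_text)  # emits rlm.generate / rlm.sub_call spans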
No-sub-calls ablation
Reproduce the paper's "RLM (no sub-calls)" ablation — the model uses REPL code only, no llm_query:
config = RLMConfig(enable_sub_calls=False)
Configuration
All options live in RLMConfig:
| Parameter | Default | Description |
|---|---|---|
| max_iterations | 25 | Max REPL loop iterations for the root model |
| max_sub_calls | 500 | Max total sub-LLM calls per generation |
| max_recursion_depth | 1 | Nesting depth (1 = plain sub-calls, 2+ = recursive) |
| cache_sub_calls | False | Cache identical sub-call prompts within a run |
| enable_sub_calls | True | Set False for the no-sub-calls ablation |
| metadata_prefix_chars | 1000 | Characters of stdout shown to the root model |
| sub_call_max_input_chars | 500000 | Max chars per sub-call input |
| temperature | 0.6 | Root model temperature |
| sub_temperature | 0.4 | Sub-call temperature |
| reasoning_effort | None | Root model reasoning effort ("low", "medium", "high") |
| root_max_tokens | 16384 | Max output tokens per root iteration |
| sub_max_tokens | 8192 | Max output tokens per sub-call |
| sandbox_timeout | 120 | Timeout (seconds) per REPL execution |
| sandbox_mode | "restricted" | "restricted", "subprocess", or "none" |
| prompt_variant | "default" | "default", "cost_warning", or "small_context" |
| cost_per_input_token | 0.0 | USD per input token (enables response.cost) |
| cost_per_output_token | 0.0 | USD per output token (enables response.cost) |
| verbose | False | Print debug logs |
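Putting several of these together (the values below are arbitrary; all parameter names come from the table above):

config = RLMConfig(
    max_iterations=40,
    max_sub_calls=2000,
    max_recursion_depth=2,      # allow sub-calls to recurse one level deeper
    cache_sub_calls=True,
    sandbox_mode="subprocess",
    sandbox_timeout=60,
    reasoning_effort="medium",
    verbose=True,
)
client = RLMWrapper(OpenAI(api_key="sk-..."), root_model="gpt-5.2", config=config)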
Response Object
RLMResponse contains:
- answer — the final answer string
- iterations — number of root loop iterations
- sub_calls — total sub-LLM invocations
- total_input_tokens / total_output_tokens — aggregated token usage
- cache_hits — sub-call cache hits (when cache_sub_calls=True)
- cost — estimated USD cost (based on configured per-token pricing)
- elapsed_seconds — wall-clock time for the generation
- history — full execution trace (list[HistoryEntry])
- repl_variables — final REPL state (variable names to repr strings)
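For example, a post-run report built only from the fields listed above (it assumes repl_variables behaves like a dict of variable name to repr string, per the description, and does not rely on HistoryEntry's own attributes):

response = client.generate(query="...", context=long_text)

print(f"Answer: {response.answer}")
print(f"Root iterations: {response.iterations}, sub-calls: {response.sub_calls} "
      f"({response.cache_hits} served from cache)")
print(f"Tokens: {response.total_input_tokens} in / {response.total_output_tokens} out")
print(f"Took {response.elapsed_seconds:.1f}s, history entries: {len(response.history)}")

# Final REPL state: variable names mapped to repr strings
for name, value_repr in response.repl_variables.items():
    print(f"  {name} = {value_repr[:60]}")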
Sandboxing
The REPL executes model-generated code, so sandboxing is on by default. Three modes are available via sandbox_mode:
"restricted" (default)
In-process sandbox that blocks dangerous operations while allowing the standard-library modules needed for data processing:
- Blocked: os, subprocess, sys, shutil, socket, file I/O (open), code execution (eval, exec, compile), and all other non-whitelisted modules
- Allowed: re, json, math, collections, itertools, functools, datetime, hashlib, csv, statistics, random, textwrap, copy, base64, urllib.parse, and more
Zero overhead — runs in the same process with a restricted __builtins__ dict and a custom import hook.
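As an illustration of what restricted mode permits, model-written REPL code like the following runs fine because it only touches whitelisted modules, whereas an import os or an open(...) call in the same snippet would be rejected. The code itself is a sketch, not library API:

# Illustrative model-written REPL code under the "restricted" sandbox.
import re
import json
from collections import Counter

# `context` is provided by the REPL environment.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", context)
counts = Counter(dates)
print(json.dumps(counts.most_common(5)))  # truncated before the root model sees it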
"subprocess"
Full process isolation. Code runs in a child process via multiprocessing:
- Real timeout enforcement — process.kill() terminates stuck code
- Auto-recovery — a new child is spawned after a timeout, with user variables restored
- llm_query and llm_query_batch are proxied back to the parent through IPC
- Restricted builtins are also applied inside the child
config = RLMConfig(sandbox_mode="subprocess", sandbox_timeout=30)
"none"
No restrictions. Code runs with full access to the Python runtime. Use only in trusted environments or when you need access to blocked modules.
config = RLMConfig(sandbox_mode="none")
Architecture
src/replm/
├── __init__.py # Public API
├── wrapper.py # RLMWrapper — main entry point
├── client.py # LLMClient protocol + OpenAIAdapter
├── orchestrator.py # Root REPL loop (Algorithm 1)
├── async_orchestrator.py # Async variant with concurrent sub-calls
├── stream.py # StreamOrchestrator + StreamChunk (token streaming)
├── repl.py # REPL environment: exec, variables
├── sub_caller.py # Sub-LLM call manager (sync)
├── async_sub_caller.py # Sub-LLM call manager (async)
├── budget.py # SharedBudget for global sub-call limits
├── cache.py # LRU cache for sub-call responses
├── tracing.py # OpenTelemetry spans (no-op when OTel absent)
├── parser.py # Parse code blocks + FINAL directives
├── prompt.py # System prompt templates (Appendix C.1)
├── metadata.py # Truncation logic
├── config.py # RLMConfig dataclass
├── types.py # RLMResponse, RLMEvent, HistoryEntry
├── exceptions.py # RLMError hierarchy
└── sandbox/
├── __init__.py # Sandbox public API
├── restricted.py # Safe builtins + import whitelist
└── subprocess_executor.py # Child process with IPC
Development
git clone https://github.com/dschulmeist/replm.git
cd replm
uv sync --group dev
uv run pytest
Running tests
uv run pytest # all tests
uv run pytest tests/test_parser.py # specific module
uv run pytest -v --tb=short # verbose with short tracebacks
Linting
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run mypy src/
Limitations
replm uses prompted RLMs — it wraps any existing LLM without fine-tuning. The original paper also trained a dedicated model (RLM-Qwen3-8B) that outperformed the prompted approach by 28.3% on average. Key differences and current limitations:
- No fine-tuning. The paper's fine-tuned RLM-Qwen3-8B made fewer REPL mistakes, used sub-calls more efficiently, and achieved lower inference costs. replm relies on prompting alone, so quality depends heavily on the base model's coding ability and instruction following.
- Model-dependent behavior. The system prompt was originally designed for GPT-5. Different models exhibit different decomposition strategies — some (e.g. Qwen3-Coder) tend to launch excessive sub-calls, while smaller models may struggle with REPL interaction entirely. Use prompt_variant to mitigate this.
- Cost variance. While the median RLM run is comparable in cost to vanilla LLM calls, the distribution has a long tail. Complex queries can trigger deep chains of sub-calls, making individual runs significantly more expensive than the average.
- Output parsing brittleness. Distinguishing the final answer from intermediate reasoning can be fragile. Models occasionally output their plan as the final answer, or misuse the FINAL / FINAL_VAR syntax.
- Latency. Each REPL iteration requires a full round-trip to the LLM. Sync mode processes sub-calls sequentially — use agenerate() with llm_query_batch for parallelism.
For tasks within the model's native context window, a vanilla LLM call is simpler and faster. RLMs shine when inputs exceed the context window or when quality degrades due to context rot (the paper shows GPT-5 dropping to near 0% on quadratic-complexity tasks at 256K tokens, while RLMs maintain ~43%).
Roadmap
- External sandboxing backends (Docker, E2B)
- Anthropic / Google adapters (beyond OpenAI-compatible)
- Structured logging with timing data per event
- Multi-modal context (images/PDFs as context chunks)
- Tool use integration (web search, calculator, databases)
Citations
@article{zhang2025rlm,
title={Recursive Language Models},
author={Zhang, Alex L. and Kraska, Tim and Khattab, Omar},
journal={arXiv preprint arXiv:2512.24601},
year={2025}
}
License
MIT