rllm-model-gateway

Lightweight model gateway for capturing LLM call traces during RL agent training. It sits between agents and inference servers (vLLM) and transparently records token IDs, logprobs, and conversation data, with zero modifications to agent code.

Quick Start

# Create a uv environment
uv venv --python 3.11
source .venv/bin/activate

# Install
uv pip install -e .

# Set up pre-commit hooks (one-time, from the rllm repo root)
cd .. && pre-commit install && cd rllm-model-gateway

# Start with a vLLM worker
rllm-model-gateway --port 9090 --worker http://localhost:8000/v1

# Or with a config file
rllm-model-gateway --config gateway.yaml

Agent Side (Zero rLLM Dependencies)

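from openai import OpenAI

# session_id comes from the training side (GatewayClient.create_session());
# the value here is only a placeholder.
session_id = "example-session"

client = OpenAI(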
    base_url=f"http://localhost:9090/sessions/{session_id}/v1",
    api_key="EMPTY",
)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B",
    messages=[{"role": "user", "content": "Hello"}],
)

Works with any OpenAI-compatible agent framework (ADK, Strands, LangChain, OpenAI Agents SDK, etc.).

Training Side

from rllm_model_gateway import GatewayClient

client = GatewayClient("http://localhost:9090")

# Create session and get URL for the agent
session_id = client.create_session()
agent_url = client.get_session_url(session_id)
# → "http://localhost:9090/sessions/{session_id}/v1"

# After agent runs, retrieve traces with full token data
traces = client.get_session_traces(session_id)
for trace in traces:
    print(trace.prompt_token_ids)       # From vLLM's return_token_ids
    print(trace.completion_token_ids)   # Per-token IDs, no retokenization needed
    print(trace.logprobs)               # Per-token logprobs
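As a rough illustration of how such traces might feed a trainer, the sketch below flattens one LLM call into token IDs, a loss mask that covers only completion tokens, and position-aligned logprobs. `TraceLike` and `flatten_trace` are hypothetical names that mirror only the three attributes printed above. The sketch also assumes `logprobs` is a flat list with one float per completion token, which this README does not specify.

```python
from dataclasses import dataclass


@dataclass
class TraceLike:
    """Hypothetical stand-in exposing only the three attributes shown above."""
    prompt_token_ids: list[int]
    completion_token_ids: list[int]
    logprobs: list[float]  # assumption: one value per completion token


def flatten_trace(trace: TraceLike) -> tuple[list[int], list[int], list[float]]:
    """Return (token_ids, loss_mask, logprobs) for a single LLM call."""
    if len(trace.logprobs) != len(trace.completion_token_ids):
        raise ValueError("expected one logprob per completion token")
    n_prompt = len(trace.prompt_token_ids)
    n_completion = len(trace.completion_token_ids)
    token_ids = trace.prompt_token_ids + trace.completion_token_ids
    # Only completion tokens were sampled by the policy, so only they get mask=1.
    loss_mask = [0] * n_prompt + [1] * n_completion
    # Pad prompt positions with 0.0 so all three lists have the same length.
    logprobs = [0.0] * n_prompt + list(trace.logprobs)
    return token_ids, loss_mask, logprobs


example = TraceLike([1, 2, 3], [4, 5], [-0.1, -0.7])
token_ids, loss_mask, logprobs = flatten_trace(example)
print(token_ids)
print(loss_mask)
```

Because the token IDs come straight from the recorded response, nothing here depends on a tokenizer.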

Features

  • Zero agent coupling — Agents use standard OpenAI(base_url=...), no rLLM imports
  • Zero retokenization — Token IDs captured directly from vLLM responses
  • Partial rollout recovery — Traces persisted per-call, survive agent crashes
  • Session-sticky routing — Multi-turn sessions routed to the same worker for prefix caching
  • Streaming support — SSE streaming with real-time chunk forwarding and trace assembly
  • Pluggable storage — SQLite (default), in-memory (testing), extensible to DynamoDB/PostgreSQL
  • Lightweight — 6 dependencies, no torch/ray/verl/transformers
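The README does not say how session-sticky routing is implemented. One common way to get this behaviour, shown here purely as an assumption, is to hash the session ID onto the worker list so that every call in a session lands on the same worker. `pick_worker` is a hypothetical helper, not part of the package.

```python
import hashlib


def pick_worker(session_id: str, workers: list[str]) -> str:
    """Map a session ID to one worker URL deterministically."""
    if not workers:
        raise ValueError("no workers registered")
    # Use a stable hash: Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["http://vllm-0:8000/v1", "http://vllm-1:8000/v1"]
first = pick_worker("session-abc", workers)
second = pick_worker("session-abc", workers)
print(first == second)
```

A plain modulo mapping reshuffles sessions whenever the worker list changes, so a gateway that also supports runtime registration would likely need to remember each session's assignment instead.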

Development

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[dev]"

# Unit tests
python -m pytest tests/unit/ -x -q

# Integration tests (require vLLM on localhost:4000; auto-skipped otherwise)
python -m pytest tests/integration/ -x -v

Configuration

CLI

rllm-model-gateway \
  --port 9090 \
  --db-path ./traces.db \
  --worker http://vllm-0:8000/v1 \
  --worker http://vllm-1:8000/v1

YAML (--config gateway.yaml)

host: "0.0.0.0"
port: 9090
db_path: "~/.rllm/gateway.db"

workers:
  - url: "http://vllm-0:8000/v1"
    model_name: "Qwen/Qwen2.5-7B-Instruct"
  - url: "http://vllm-1:8000/v1"
    model_name: "Qwen/Qwen2.5-7B-Instruct"

Environment Variables

RLLM_GATEWAY_HOST, RLLM_GATEWAY_PORT, RLLM_GATEWAY_DB_PATH, RLLM_GATEWAY_LOG_LEVEL, RLLM_GATEWAY_STORE
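A launcher script could assemble these variables before starting the gateway process. The sketch below uses only the variable names listed above. `gateway_env` is a hypothetical helper, and the example values (such as `"DEBUG"`) are assumptions, since this README does not document the accepted values.

```python
import os
from typing import Mapping, Optional


def gateway_env(
    host: Optional[str] = None,
    port: Optional[int] = None,
    db_path: Optional[str] = None,
    log_level: Optional[str] = None,
    store: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Return a copy of the environment with gateway overrides applied."""
    env = dict(os.environ if base is None else base)
    overrides = {
        "RLLM_GATEWAY_HOST": host,
        "RLLM_GATEWAY_PORT": port,
        "RLLM_GATEWAY_DB_PATH": db_path,
        "RLLM_GATEWAY_LOG_LEVEL": log_level,
        "RLLM_GATEWAY_STORE": store,
    }
    for name, value in overrides.items():
        if value is not None:
            env[name] = str(value)  # environment values must be strings
    return env


# Example values are assumptions, not documented settings.
env = gateway_env(port=9191, log_level="DEBUG", base={})
print(env)
```

The resulting mapping can be passed as `env=` to `subprocess.Popen` when launching the `rllm-model-gateway` command.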

Embedded Usage

import threading

import uvicorn
from rllm_model_gateway import create_app, GatewayConfig

config = GatewayConfig(port=9090, workers=[...])
app = create_app(config)

# Run the gateway in a background daemon thread so the caller keeps control
threading.Thread(target=uvicorn.run, args=(app,), kwargs={"port": 9090}, daemon=True).start()
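When the gateway runs in a background thread, the caller usually needs to wait until it accepts requests before creating sessions. A minimal sketch of such a wait loop follows, polling the `GET /health` endpoint from the API overview. `wait_until_healthy` is a hypothetical helper, and treating HTTP 200 as "healthy" is an assumption about the endpoint's behaviour.

```python
import http.client
import time
import urllib.request
from typing import Callable


def _http_ok(url: str) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all land here.
        return False


def wait_until_healthy(
    base_url: str,
    timeout: float = 10.0,
    interval: float = 0.2,
    probe: Callable[[str], bool] = _http_ok,
) -> bool:
    """Poll <base_url>/health until it succeeds or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False


# Usage, after starting the thread above:
#   if not wait_until_healthy("http://localhost:9090"):
#       raise RuntimeError("gateway did not come up in time")
```

The `probe` parameter exists only so the loop can be exercised without a running server.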

Dynamic Worker Registration

Workers can be added at runtime via the admin API — useful for verl integration where vLLM addresses are only known after initialization:

from rllm_model_gateway import GatewayClient

client = GatewayClient("http://localhost:9090")
client.add_worker(url="http://vllm-worker-0:8000/v1", model_name="Qwen/Qwen2.5-7B")

API Overview

Endpoint                                   Description
POST /sessions/{sid}/v1/chat/completions   Proxy (agent-facing, OpenAI-compatible)
POST /sessions                             Create session with metadata
GET  /sessions/{sid}/traces                Retrieve traces for a session
POST /admin/workers                        Register a worker
GET  /health                               Gateway health check
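For clients that cannot import `rllm_model_gateway`, the paths above can be assembled by hand. The sketch below builds only the method and URL for each endpoint. Request and response bodies are not documented in this README, so they are left out. `gateway_endpoints` is a hypothetical helper name.

```python
from urllib.parse import quote


def gateway_endpoints(base_url: str, session_id: str) -> dict[str, tuple[str, str]]:
    """Return {name: (HTTP method, URL)} for the endpoints listed above."""
    base = base_url.rstrip("/")
    sid = quote(session_id, safe="")  # keep unusual IDs from breaking the path
    return {
        "chat_completions": ("POST", f"{base}/sessions/{sid}/v1/chat/completions"),
        "create_session": ("POST", f"{base}/sessions"),
        "traces": ("GET", f"{base}/sessions/{sid}/traces"),
        "add_worker": ("POST", f"{base}/admin/workers"),
        "health": ("GET", f"{base}/health"),
    }


endpoints = gateway_endpoints("http://localhost:9090/", "abc123")
for name, (method, url) in endpoints.items():
    print(f"{name}: {method} {url}")
```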
