RWKVServe
A high-performance RWKV inference and serving framework, aligned with vLLM's design, providing an OpenAI-compatible API with Continuous Batching.
Features
- Continuous Batching — Dynamic scheduling via SchedulerCore; short requests are never blocked by long ones; Chunked Prefill to control peak memory
- OpenAI-Compatible API — Full implementation of /v1/chat/completions and /v1/completions; works directly with the OpenAI SDK
- State Cache — Trie-based prefix-level state caching for accelerated repeated-prefix inference
- LoRA Adapter — Load LoRA adapters and serve them online (vLLM-style --enable-lora --lora-modules name=path)
- Reasoning Output — Thinking-mode support (<think>...</think>); separates reasoning from the final answer via the reasoning_content field
- Data Parallel — Multi-GPU data-parallel inference with automatic load balancing
- Multi-Model — Serve multiple models simultaneously, auto-routed by the model field
- Structured Output — JSON Schema enforcement for constrained generation
- vLLM-style Python API — LLM.generate() for offline batch inference with Continuous Batching over an arbitrary number of prompts
- API Key Auth — Multi-key authentication, configurable via CLI or environment variable
Installation
# Install from source
pip install -e .
# With structured output support
pip install -e ".[structured-output]"
# With all extras (dev tools included)
pip install -e ".[all]"
Quick Start
1. Start the API Server
# Single model
rwkvserve --model-path /path/to/model --max-batch-size 32
# With model name and dtype
rwkvserve --model-path /path/to/model --model-name rwkv-7 --dtype bf16
# Multi-model deployment
rwkvserve \
--model model1:/path/to/model1 \
--model model2:/path/to/model2:cuda:0
# Data parallel (multi-GPU)
rwkvserve --model model1:/path/to/model1 --gpus 0,1,2,3
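Once the server is up, you can sanity-check it from Python. A minimal sketch using the OpenAI SDK against the default host and port (the placeholder "dummy" key follows the SDK example below and applies when no --api-key is configured):

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# List the models the server is currently routing (see /v1/models below).
print([m.id for m in client.models.list().data])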
2. Serve with LoRA Adapter
rwkvserve \
--model-path /path/to/base_model \
--enable-lora \
--lora-modules my-lora=/path/to/lora_adapter
LoRA weights are merged into the base model at startup — zero runtime overhead. API requests select the adapter by its name via the model field (e.g., "my-lora").
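For example, with the server above running, a request targets the adapter purely through the model field. A minimal sketch, where the adapter name my-lora matches the --lora-modules mapping above:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Route the request to the merged LoRA adapter by its registered name.
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)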
3. Enable Reasoning Mode
rwkvserve \
--model-path /path/to/model \
--enable-reasoning --reasoning-parser deepseek_r1
When enabled, <think>...</think> content in model output is automatically extracted into the reasoning_content field, consistent with vLLM's reasoning output.
4. Call with OpenAI SDK
from openai import OpenAI
client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="rwkv-7",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Access reasoning_content (requires --enable-reasoning on server)
msg = response.choices[0].message
if hasattr(msg, "reasoning_content") and msg.reasoning_content:
print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)
5. Offline Batch Inference (LLM.generate)
from rwkvserve import LLM, SamplingParams
# Basic
llm = LLM(model="/path/to/model", max_batch_size=256, dtype="bf16")
# With LoRA adapter
llm = LLM(
model="/path/to/base_model",
enable_lora=True,
lora_path="/path/to/lora_adapter",
dtype="bf16",
)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Hello, world!"] * 1000, params, use_tqdm=True)
for output in outputs:
print(output.outputs[0].text)
6. Command-line Inference
# Single prompt
rwkvserve-infer --model /path/to/model --prompt "Hello!" --stream
# Interactive chat
rwkvserve-infer --model /path/to/model --chat
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completion (streaming supported) |
| POST | /v1/completions | Text completion (streaming supported) |
| GET | /health | Health check |
| GET | /docs | Swagger API docs |
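The /health endpoint can back a simple liveness probe. A sketch using only the Python standard library, assuming a healthy server answers with HTTP 200 on the default port:

import urllib.request

# Expect status 200 from a healthy server.
with urllib.request.urlopen("http://localhost:8000/health") as resp:
    print(resp.status)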
Request Example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "rwkv-7",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256,
"temperature": 0.8,
"stream": false
}'
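The Structured Output feature enforces a JSON Schema during generation. The exact request field is not documented here; the sketch below assumes the server follows OpenAI's standard response_format convention for JSON Schema, so the field layout is an assumption rather than confirmed API:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Hypothetical: constrain output to a small JSON Schema via response_format.
response = client.chat.completions.create(
    model="rwkv-7",
    messages=[{"role": "user", "content": "Give me a user as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(response.choices[0].message.content)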
CLI Reference
rwkvserve [options]
Model:
--model-path PATH Path to model directory
--model-name NAME Model name in API (default: rwkv-7)
--model NAME:PATH[:DEVICE] Multi-model config (repeatable)
--model-config FILE YAML model config file
LoRA:
--enable-lora Enable LoRA adapter support
--lora-modules NAME=PATH LoRA module to load (repeatable)
Reasoning:
--enable-reasoning Enable reasoning content extraction
--reasoning-parser NAME Parser name (default: deepseek_r1)
Runtime:
--device {auto,cuda,cpu} Compute device (default: auto)
--dtype {fp32,fp16,bf16} Model precision
--max-batch-size N Max batch size (default: 32)
--prefill-chunk-size N Chunked prefill block size (default: 512)
Server:
--host HOST Listen address (default: 0.0.0.0)
--port PORT Listen port (default: 8000)
--gpus IDS Data-parallel GPU list (e.g. 0,1,2,3)
--stop Stop running service and clean up resources
--api-key KEY API key for auth (repeatable)
State Cache:
--max-cache-memory GB State cache memory limit (default: 4.0)
--cache-level LEVEL Cache level: none / exact / prefix (default: prefix)
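When one or more --api-key values are set, clients must present a matching key. A sketch, assuming the server checks the standard Authorization bearer header that the OpenAI SDK sends for its api_key argument:

from openai import OpenAI

# Replace "my-secret-key" with one of the keys passed via --api-key.
client = OpenAI(api_key="my-secret-key", base_url="http://localhost:8000/v1")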
Project Structure
rwkvserve/
├── models/ # RWKV model implementation (RWKV-7)
│ └── rwkv7/ # Model definition, config, CUDA operators
├── inference/ # Inference engine
│ ├── scheduler_core.py # Continuous Batching scheduler
│ ├── state_cache.py # Trie-based State Cache
│ ├── pipeline.py # Inference pipeline
│ └── structured_output.py # Structured output enforcement
├── api/ # OpenAI-compatible API server
│ ├── api_server.py # FastAPI application
│ ├── async_serving_chat.py # Chat completions handler
│ ├── async_serving_completion.py # Text completions handler
│ ├── model_manager.py # Multi-model management & routing
│ └── protocol.py # Request / response protocol
├── entrypoints/ # Entrypoints
│ └── llm.py # LLM.generate() offline batch inference
├── reasoning/ # Reasoning output parsing
│ ├── base.py # Abstract parser & registry
│ └── deepseek_r1.py # <think>...</think> parser
├── peft.py # LoRA adapter loading & weight merging
├── sampling_params.py # Sampling parameters (vLLM-style)
├── outputs.py # Output type definitions
├── cli/ # CLI tools
│ ├── serve.py # rwkvserve command
│ └── infer.py # rwkvserve-infer command
└── data/tokenizers/ # Tokenizer implementations
Examples
The examples/ directory provides ready-to-use scripts:
| Script | Description |
|---|---|
| start_server.sh | Start the API server with LoRA and Reasoning config |
| test_server.sh | Test API endpoints with curl |
| test_openai_sdk.py | Test chat inference with the OpenAI SDK |
| test_llm_generate.py | Test offline batch inference with LLM.generate() |
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Acknowledgments