
RWKV Inference & Serving - OpenAI-Compatible API Server with Continuous Batching

Project description

RWKVServe

Python 3.8+ License

High-performance RWKV inference and serving framework, aligned with vLLM design, providing an OpenAI-compatible API with Continuous Batching.

Chinese documentation (中文文档)

Features

  • Continuous Batching — Dynamic scheduling via SchedulerCore; short requests are never blocked by long ones; Chunked Prefill to control peak memory
  • OpenAI-Compatible API — Full implementation of /v1/chat/completions and /v1/completions, works directly with the OpenAI SDK
  • State Cache — Trie-based prefix-level state caching for accelerated repeated-prefix inference
  • LoRA Adapter — Load LoRA adapters and serve them online (vLLM-style --enable-lora --lora-modules name=path)
  • Reasoning Output — Thinking mode support (<think>...</think>); reasoning is separated from the final answer via the reasoning_content field
  • Data Parallel — Multi-GPU data-parallel inference with automatic load balancing
  • Multi-Model — Serve multiple models simultaneously, auto-routed by the model field
  • Structured Output — JSON Schema enforcement for constrained generation
  • vLLM-style Python API — LLM.generate() for offline batch inference with Continuous Batching over an arbitrary number of prompts
  • API Key Auth — Multi-key authentication, configurable via CLI or environment variable

Installation

# Install from source
pip install -e .

# With structured output support
pip install -e ".[structured-output]"

# With all extras (dev tools included)
pip install -e ".[all]"

Quick Start

1. Start the API Server

# Single model
rwkvserve --model-path /path/to/model --max-batch-size 32

# With model name and dtype
rwkvserve --model-path /path/to/model --model-name rwkv-7 --dtype bf16

# Multi-model deployment
rwkvserve \
    --model model1:/path/to/model1 \
    --model model2:/path/to/model2:cuda:0

# Data parallel (multi-GPU)
rwkvserve --model model1:/path/to/model1 --gpus 0,1,2,3

2. Serve with LoRA Adapter

rwkvserve \
    --model-path /path/to/base_model \
    --enable-lora \
    --lora-modules my-lora=/path/to/lora_adapter

LoRA weights are merged into the base model at startup — zero runtime overhead. API requests select the adapter by its name via the model field (e.g., "my-lora").
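
For example, a request can target the adapter instead of the base model by passing the adapter name in the model field (a minimal sketch using the OpenAI SDK, assuming the server above is running on the default port):

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Select the LoRA adapter by the name registered via --lora-modules
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)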

3. Enable Reasoning Mode

rwkvserve \
    --model-path /path/to/model \
    --enable-reasoning --reasoning-parser deepseek_r1

When enabled, <think>...</think> content in model output is automatically extracted into the reasoning_content field, consistent with vLLM's reasoning output.

4. Call with OpenAI SDK

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="rwkv-7",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# Access reasoning_content (requires --enable-reasoning on server)
msg = response.choices[0].message
if hasattr(msg, "reasoning_content") and msg.reasoning_content:
    print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)

5. Offline Batch Inference (LLM.generate)

from rwkvserve import LLM, SamplingParams

# Basic
llm = LLM(model="/path/to/model", max_batch_size=256, dtype="bf16")

# With LoRA adapter
llm = LLM(
    model="/path/to/base_model",
    enable_lora=True,
    lora_path="/path/to/lora_adapter",
    dtype="bf16",
)

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Hello, world!"] * 1000, params, use_tqdm=True)

for output in outputs:
    print(output.outputs[0].text)

6. Command-line Inference

# Single prompt
rwkvserve-infer --model /path/to/model --prompt "Hello!" --stream

# Interactive chat
rwkvserve-infer --model /path/to/model --chat

API Endpoints

Method  Path                  Description
GET     /v1/models            List available models
POST    /v1/chat/completions  Chat completion (streaming supported)
POST    /v1/completions       Text completion (streaming supported)
GET     /health               Health check
GET     /docs                 Swagger API docs
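
A quick way to check the server and enumerate the deployed models (a minimal sketch assuming the default host and port; response payloads may differ from what is printed here):

import requests

# Liveness check against the default listen address
print(requests.get("http://localhost:8000/health").status_code)

# List the models currently served (requests are routed by the "model" field)
print(requests.get("http://localhost:8000/v1/models").json())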

Request Example

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rwkv-7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
    "temperature": 0.8,
    "stream": false
  }'
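
Structured Output (JSON Schema enforcement) is requested through the same endpoint. The response_format field below follows the OpenAI json_schema convention and is an assumption here, not a documented rwkvserve parameter; check the Swagger docs at /docs for the exact request schema:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Hypothetical request: constrain the reply to a small JSON object
response = client.chat.completions.create(
    model="rwkv-7",
    messages=[{"role": "user", "content": "Name a city and its country."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
)
print(response.choices[0].message.content)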

CLI Reference

rwkvserve [options]

Model:
  --model-path PATH           Path to model directory
  --model-name NAME           Model name in API (default: rwkv-7)
  --model NAME:PATH[:DEVICE]  Multi-model config (repeatable)
  --model-config FILE         YAML model config file

LoRA:
  --enable-lora               Enable LoRA adapter support
  --lora-modules NAME=PATH    LoRA module to load (repeatable)

Reasoning:
  --enable-reasoning          Enable reasoning content extraction
  --reasoning-parser NAME     Parser name (default: deepseek_r1)

Runtime:
  --device {auto,cuda,cpu}    Compute device (default: auto)
  --dtype {fp32,fp16,bf16}    Model precision
  --max-batch-size N          Max batch size (default: 32)
  --prefill-chunk-size N      Chunked prefill block size (default: 512)

Server:
  --host HOST                 Listen address (default: 0.0.0.0)
  --port PORT                 Listen port (default: 8000)
  --gpus IDS                  Data-parallel GPU list (e.g. 0,1,2,3)
  --stop                      Stop running service and clean up resources
  --api-key KEY               API key for auth (repeatable)

State Cache:
  --max-cache-memory GB       State cache memory limit (default: 4.0)
  --cache-level LEVEL         Cache level: none / exact / prefix (default: prefix)
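
When one or more --api-key values are set, clients must authenticate. Since the server is OpenAI-compatible, the sketch below assumes the key is passed as a standard Bearer token via the SDK's api_key argument:

from openai import OpenAI

# Use one of the keys that was passed to the server via --api-key
client = OpenAI(api_key="my-secret-key", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="rwkv-7",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)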

Project Structure

rwkvserve/
├── models/            # RWKV model implementation (RWKV-7)
│   └── rwkv7/         #   Model definition, config, CUDA operators
├── inference/         # Inference engine
│   ├── scheduler_core.py    # Continuous Batching scheduler
│   ├── state_cache.py       # Trie-based State Cache
│   ├── pipeline.py          # Inference pipeline
│   └── structured_output.py # Structured output enforcement
├── api/               # OpenAI-compatible API server
│   ├── api_server.py        # FastAPI application
│   ├── async_serving_chat.py      # Chat completions handler
│   ├── async_serving_completion.py # Text completions handler
│   ├── model_manager.py    # Multi-model management & routing
│   └── protocol.py         # Request / response protocol
├── entrypoints/       # Entrypoints
│   └── llm.py         #   LLM.generate() offline batch inference
├── reasoning/         # Reasoning output parsing
│   ├── base.py        #   Abstract parser & registry
│   └── deepseek_r1.py #   <think>...</think> parser
├── peft.py            # LoRA adapter loading & weight merging
├── sampling_params.py # Sampling parameters (vLLM-style)
├── outputs.py         # Output type definitions
├── cli/               # CLI tools
│   ├── serve.py       #   rwkvserve command
│   └── infer.py       #   rwkvserve-infer command
└── data/tokenizers/   # Tokenizer implementations

Examples

The examples/ directory provides ready-to-use scripts:

Script                Description
start_server.sh       Start the API server with LoRA and Reasoning config
test_server.sh        Test API endpoints with curl
test_openai_sdk.py    Test chat inference with OpenAI SDK
test_llm_generate.py  Test offline batch inference with LLM.generate()

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

Download files

Download the file for your platform.

Source Distribution

rwkvserve-0.1.0.tar.gz (494.9 kB)

Uploaded Source

Built Distribution


rwkvserve-0.1.0-py3-none-any.whl (512.2 kB)

Uploaded Python 3

File details

Details for the file rwkvserve-0.1.0.tar.gz.

File metadata

  • Download URL: rwkvserve-0.1.0.tar.gz
  • Upload date:
  • Size: 494.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rwkvserve-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       cef3cd6d50a93a8d6897163dc05a51853c44259e0f04f0dc6644557019dc9e70
MD5          3d34b0baf082f1aca5cece2db78e0e36
BLAKE2b-256  c5e72780b5aedddb7d140eacef9d49ae8723d396138ffa1ad94652cc292c37d0


File details

Details for the file rwkvserve-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rwkvserve-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 512.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rwkvserve-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       008cbf10387e245a9e105592beaec1802ef5ff0fcf9efd5a20050e9acc7a4c15
MD5          254554277e835f6da2144ab2037ba1a2
BLAKE2b-256  c9b3e550be16d23ea655dbfde252ab5d1730708e3147fb19ad488b507b711202

