RWKVServe
A high-performance RWKV inference and serving framework, aligned with vLLM's design, providing an OpenAI-compatible API with Continuous Batching.
Features
- Continuous Batching — Dynamic scheduling via SchedulerCore; short requests are never blocked by long ones; Chunked Prefill to control peak memory
- OpenAI-Compatible API — Full implementation of /v1/chat/completions and /v1/completions; works directly with the OpenAI SDK
- State Cache — Trie-based prefix-level state caching for accelerated repeated-prefix inference
- LoRA Adapter — Load LoRA adapters and serve them online (vLLM-style --enable-lora --lora-modules name=path)
- Reasoning Output — Thinking-mode support (<think>...</think>); separates reasoning from the final answer via the reasoning_content field
- Data Parallel — Multi-GPU data-parallel inference with automatic load balancing
- Multi-Model — Serve multiple models simultaneously, auto-routed by the model field
- Structured Output — JSON Schema enforcement for constrained generation
- vLLM-style Python API — LLM.generate() for offline batch inference with Continuous Batching over an arbitrary number of prompts
- API Key Auth — Multi-key authentication, configurable via CLI or environment variable
Installation
# Install from source
pip install -e .
# With structured output support
pip install -e ".[structured-output]"
# With all extras (dev tools included)
pip install -e ".[all]"
Quick Start
1. Start the API Server
# Single model
rwkvserve --model-path /path/to/model --max-batch-size 32
# With model name and dtype
rwkvserve --model-path /path/to/model --model-name rwkv-7 --dtype bf16
# Multi-model deployment
rwkvserve \
--model model1:/path/to/model1 \
--model model2:/path/to/model2:cuda:0
# Data parallel (multi-GPU)
rwkvserve --model model1:/path/to/model1 --gpus 0,1,2,3
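Once the server is up, you can sanity-check it from Python. A minimal sketch using the OpenAI SDK against the default host and port (the placeholder "dummy" key follows the SDK example below and applies when no --api-key is configured):

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# List the models the server is currently routing (see /v1/models below).
print([m.id for m in client.models.list().data])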
2. Serve with LoRA Adapter
rwkvserve \
--model-path /path/to/base_model \
--enable-lora \
--lora-modules my-lora=/path/to/lora_adapter
LoRA weights are merged into the base model at startup — zero runtime overhead. API requests select the adapter by its name via the model field (e.g., "my-lora").
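For example, with the server above running, a request targets the adapter purely through the model field. A minimal sketch, where the adapter name my-lora matches the --lora-modules mapping above:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Route the request to the merged LoRA adapter by its registered name.
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)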
3. Enable Reasoning Mode
rwkvserve \
--model-path /path/to/model \
--enable-reasoning --reasoning-parser deepseek_r1
When enabled, <think>...</think> content in model output is automatically extracted into the reasoning_content field, consistent with vLLM's reasoning output.
4. Call with OpenAI SDK
from openai import OpenAI
client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="rwkv-7",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Access reasoning_content (requires --enable-reasoning on server)
msg = response.choices[0].message
if hasattr(msg, "reasoning_content") and msg.reasoning_content:
print("Thinking:", msg.reasoning_content)
print("Answer:", msg.content)
5. Offline Batch Inference (LLM.generate)
from rwkvserve import LLM, SamplingParams
# Basic
llm = LLM(model="/path/to/model", max_batch_size=256, dtype="bf16")
# With LoRA adapter
llm = LLM(
model="/path/to/base_model",
enable_lora=True,
lora_path="/path/to/lora_adapter",
dtype="bf16",
)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Hello, world!"] * 1000, params, use_tqdm=True)
for output in outputs:
print(output.outputs[0].text)
6. Command-line Inference
# Single prompt
rwkvserve-infer --model /path/to/model --prompt "Hello!" --stream
# Interactive chat
rwkvserve-infer --model /path/to/model --chat
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completion (streaming supported) |
| POST | /v1/completions | Text completion (streaming supported) |
| GET | /health | Health check |
| GET | /docs | Swagger API docs |
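The /health endpoint can back a simple liveness probe. A sketch using only the Python standard library, assuming a healthy server answers with HTTP 200 on the default port:

import urllib.request

# Expect status 200 from a healthy server.
with urllib.request.urlopen("http://localhost:8000/health") as resp:
    print(resp.status)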
Request Example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "rwkv-7",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256,
"temperature": 0.8,
"stream": false
}'
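The Structured Output feature enforces a JSON Schema during generation. The exact request field is not documented here; the sketch below assumes the server follows OpenAI's standard response_format convention for JSON Schema, so the field layout is an assumption rather than confirmed API:

from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Hypothetical: constrain output to a small JSON Schema via response_format.
response = client.chat.completions.create(
    model="rwkv-7",
    messages=[{"role": "user", "content": "Give me a user as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(response.choices[0].message.content)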
CLI Reference
rwkvserve [options]
Model:
--model-path PATH Path to model directory
--model-name NAME Model name in API (default: rwkv-7)
--model NAME:PATH[:DEVICE] Multi-model config (repeatable)
--model-config FILE YAML model config file
LoRA:
--enable-lora Enable LoRA adapter support
--lora-modules NAME=PATH LoRA module to load (repeatable)
Reasoning:
--enable-reasoning Enable reasoning content extraction
--reasoning-parser NAME Parser name (default: deepseek_r1)
Runtime:
--device {auto,cuda,cpu} Compute device (default: auto)
--dtype {fp32,fp16,bf16} Model precision
--max-batch-size N Max batch size (default: 32)
--prefill-chunk-size N Chunked prefill block size (default: 512)
Server:
--host HOST Listen address (default: 0.0.0.0)
--port PORT Listen port (default: 8000)
--gpus IDS Data-parallel GPU list (e.g. 0,1,2,3)
--stop Stop running service and clean up resources
--api-key KEY API key for auth (repeatable)
State Cache:
--max-cache-memory GB State cache memory limit (default: 4.0)
--cache-level LEVEL Cache level: none / exact / prefix (default: prefix)
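When one or more --api-key values are set, clients must present a matching key. A sketch, assuming the server checks the standard Authorization bearer header that the OpenAI SDK sends for its api_key argument:

from openai import OpenAI

# Replace "my-secret-key" with one of the keys passed via --api-key.
client = OpenAI(api_key="my-secret-key", base_url="http://localhost:8000/v1")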
Project Structure
rwkvserve/
├── models/ # RWKV model implementation (RWKV-7)
│ └── rwkv7/ # Model definition, config, CUDA operators
├── inference/ # Inference engine
│ ├── scheduler_core.py # Continuous Batching scheduler
│ ├── state_cache.py # Trie-based State Cache
│ ├── pipeline.py # Inference pipeline
│ └── structured_output.py # Structured output enforcement
├── api/ # OpenAI-compatible API server
│ ├── api_server.py # FastAPI application
│ ├── async_serving_chat.py # Chat completions handler
│ ├── async_serving_completion.py # Text completions handler
│ ├── model_manager.py # Multi-model management & routing
│ └── protocol.py # Request / response protocol
├── entrypoints/ # Entrypoints
│ └── llm.py # LLM.generate() offline batch inference
├── reasoning/ # Reasoning output parsing
│ ├── base.py # Abstract parser & registry
│ └── deepseek_r1.py # <think>...</think> parser
├── peft.py # LoRA adapter loading & weight merging
├── sampling_params.py # Sampling parameters (vLLM-style)
├── outputs.py # Output type definitions
├── cli/ # CLI tools
│ ├── serve.py # rwkvserve command
│ └── infer.py # rwkvserve-infer command
└── data/tokenizers/ # Tokenizer implementations
Examples
The examples/ directory provides ready-to-use scripts:
| Script | Description |
|---|---|
| start_server.sh | Start the API server with LoRA and Reasoning config |
| test_server.sh | Test API endpoints with curl |
| test_openai_sdk.py | Test chat inference with the OpenAI SDK |
| test_llm_generate.py | Test offline batch inference with LLM.generate() |
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Acknowledgments