exllamav3-inference
Single-user optimized inference wrapper for ExLlamaV3. Replaces the upstream 1000+ line Generator with a minimal prefill-decode loop, custom CUDA kernels, and optional FP8 KV cache.
Tested with Qwen3.5-VL-27B @ 3.0bpw EXL3 on RTX 5090.
Features
- SlimGenerator: direct prefill->decode loop, no job queue, no page table, no defrag
- Fused RMSNorm+Residual: CUDA kernel fusing `x += attn_out; y = rmsnorm(x)` into a single launch (36 fewer kernel launches per token); see the reference sketch after this list
- Fused Sampling: temperature + top-k + Gumbel noise + argmax in one kernel
- FP8 KV Cache: E4M3FN quantized KV storage (~50% VRAM savings)
- PrefixCache: snapshots system prompt KV to CPU pinned memory
- OptimizedLLM: high-level async wrapper with chat template and vision support
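For reference, the fused RMSNorm+Residual kernel replaces the usual two-step residual add and normalization. A minimal PyTorch sketch of the semantics it implements (not the kernel itself; `weight` and `eps` follow the standard RMSNorm convention):

```python
import torch

def rmsnorm_residual_reference(x, attn_out, weight, eps=1e-6):
    """Unfused reference: residual add, then RMSNorm, as two separate ops."""
    x = x + attn_out                                  # residual update (x += attn_out)
    rms = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
    y = (x.float() * rms).to(x.dtype) * weight        # y = rmsnorm(x)
    return x, y                                       # x carries the residual stream forward
```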
Requirements
- Python >= 3.13
- CUDA >= 12.8
- PyTorch >= 2.6
- ExLlamaV3 (installed separately)
Install
```bash
# Install exllamav3 first
pip install --no-build-isolation -e /path/to/exllamav3

# Install this package (compiles CUDA kernels)
MAX_JOBS=1 uv pip install --no-build-isolation -e .
```
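A quick way to sanity-check the environment before running the examples below (a minimal sketch; it only assumes the imports shown in the Usage section and a CUDA-capable GPU):

```python
import torch
from exllamav3_opt.generator import SlimGenerator  # should import cleanly once the package built

assert torch.cuda.is_available(), "a CUDA device is required"
print(torch.version.cuda, torch.cuda.get_device_name(0))
```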
Usage
OptimizedLLM (high-level API)
```python
import asyncio

from exllamav3_opt.integration import LLMConfig, OptimizedLLM

config = LLMConfig(
    max_new_tokens=256,
    cache_size=4096,
    temperature=0.6,
    top_k=20,
)

llm = OptimizedLLM(config)
llm.load("/path/to/model")

# build_prompt uses the model's chat template (via HF tokenizer)
ids = llm.build_prompt([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
])

# Async generate
result = asyncio.run(llm.generate(ids))
print(result)

# Async streaming
async def main():
    async for token in await llm.stream(ids):
        print(token, end="", flush=True)

asyncio.run(main())
```
SlimGenerator (low-level API)
```python
from exllamav3 import Cache, Config, Model, Tokenizer
from exllamav3_opt.generator import SlimGenerator

config = Config.from_directory("/path/to/model")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=4096)
model.load()
tokenizer = Tokenizer.from_config(config)

gen = SlimGenerator(model, cache, tokenizer)

# Text prompt (encoded internally)
for chunk in gen.stream_tokens("Hello!", max_new_tokens=256):
    print(chunk.text, end="", flush=True)

# Or pass pre-tokenized input_ids directly
ids = tokenizer.hf_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
for chunk in gen.stream_tokens(input_ids=ids, max_new_tokens=256):
    print(chunk.text, end="", flush=True)
```
FP8 KV Cache
```python
from exllamav3_opt.fp8_cache import CacheLayer_fp8

cache = Cache(model, max_num_tokens=8192, layer_type=CacheLayer_fp8)
gen = SlimGenerator(model, cache, tokenizer)
```
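The FP8 layer stores keys and values as E4M3FN, halving KV memory relative to FP16. A rough PyTorch sketch of the quant/dequant semantics (illustrative only; the actual kernels in `_ext/fp8_cache_kernels.cu` may apply scaling and layout details not shown here):

```python
import torch

k = torch.randn(1, 4096, 8, 128, dtype=torch.float16, device="cuda")
k_fp8 = k.to(torch.float8_e4m3fn)      # 1 byte/element instead of 2 -> ~50% KV VRAM
k_deq = k_fp8.to(torch.float16)        # dequantized on read for attention
print(k.element_size(), k_fp8.element_size())   # 2 1
```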
Fused Sampling
```python
gen = SlimGenerator(
    model, cache, tokenizer,
    use_fused_sampling=True,
    fused_temperature=0.6,
    fused_top_k=20,
)
```
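The fused sampler is equivalent to drawing from the temperature-scaled, top-k-truncated softmax via the Gumbel-max trick. An unfused PyTorch sketch of what the single kernel computes (reference semantics only, not the kernel in `_ext/fused_sampling.cu`):

```python
import torch

def fused_sampling_reference(logits, temperature=0.6, top_k=20):
    """Temperature + top-k + Gumbel noise + argmax == sample from the top-k softmax."""
    scaled = logits / temperature
    topk_vals, topk_idx = torch.topk(scaled, top_k, dim=-1)
    u = torch.rand_like(topk_vals).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))                 # Gumbel(0, 1) noise
    choice = (topk_vals + gumbel).argmax(dim=-1, keepdim=True)
    return topk_idx.gather(-1, choice)                 # sampled token id(s)
```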
Benchmark
RTX 5090, Qwen3.5-VL-27B @ 3.0bpw EXL3:
| Configuration | tok/s | vs upstream |
|---|---|---|
| Upstream Generator | 46.8 | baseline |
| SlimGenerator + fused norm | 48.3 (decode) | +3.2% |
Project Structure
```
src/exllamav3_opt/
    _ext/                          # CUDA kernels
        fused_rmsnorm_residual.cu  # fused residual add + RMSNorm
        fused_sampling.cu          # fused temperature + top-k + Gumbel + argmax
        fp8_cache_kernels.cu       # FP8 E4M3FN quant/dequant
        bindings.cpp               # pybind11 bindings
    generator.py                   # SlimGenerator + fused norm monkey-patch
    fp8_cache.py                   # CacheLayer_fp8 (CacheLayer ABC)
    prefix_cache.py                # PrefixCache
    tensor_pool.py                 # Pre-allocated tensor pool
    integration.py                 # OptimizedLLM (async wrapper)
    compile.py                     # torch.compile wrappers
```
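prefix_cache.py snapshots the system-prompt KV to CPU pinned memory so repeated requests can skip re-prefilling the shared prefix. A concept sketch of that idea in plain PyTorch (hypothetical helpers, not the actual PrefixCache API):

```python
import torch

def snapshot_prefix(kv_tensors):
    """Copy per-layer prefix KV tensors from the GPU cache into pinned host memory."""
    snaps = []
    for t in kv_tensors:
        host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
        host.copy_(t, non_blocking=True)        # async device-to-host copy
        snaps.append(host)
    torch.cuda.synchronize()                    # ensure copies completed before reuse
    return snaps

def restore_prefix(snaps, kv_tensors):
    """Copy the snapshot back into the device cache before decoding a new request."""
    for dst, src in zip(kv_tensors, snaps):
        dst.copy_(src, non_blocking=True)       # async host-to-device copy from pinned memory
```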
Project details
Download files
Source Distribution
exllamav3_inference-0.3.2.tar.gz (19.9 kB)
File details
Details for the file exllamav3_inference-0.3.2.tar.gz.
File metadata
- Download URL: exllamav3_inference-0.3.2.tar.gz
- Upload date:
- Size: 19.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d4f4a8bddf1e82e230b07839821c8be4a3cedeba5054d927d5d1d90df3481d70 |
| MD5 | 52d351209c2c2687492babdd161bb0ea |
| BLAKE2b-256 | 34131d3e908e6bc1d90dd634d1e350e261897423c4f41340e40a1218308003e8 |
Provenance
The following attestation bundles were made for exllamav3_inference-0.3.2.tar.gz:
Publisher: release.yml on nicokim/exllamav3-inference

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exllamav3_inference-0.3.2.tar.gz
- Subject digest: d4f4a8bddf1e82e230b07839821c8be4a3cedeba5054d927d5d1d90df3481d70
- Sigstore transparency entry: 1008309638
- Permalink: nicokim/exllamav3-inference@2bc0ceaf81fcfb8ac7fd67ee0a5e891b6cd0bbe5
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/nicokim
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2bc0ceaf81fcfb8ac7fd67ee0a5e891b6cd0bbe5
- Trigger Event: push