exllamav3-inference
Single-user optimized inference wrapper for ExLlamaV3
Self-contained optimized inference for ExLlamaV3 models. Vendorizes the ExLlamaV3 runtime (from lesj0610/exllamav3 feat/flashinfer-backend-migration) so there is no external dependency — a single pip install compiles everything.
Replaces the upstream 1000+ line Generator with a minimal prefill-decode loop, custom CUDA kernels, and optional FP8 KV cache.
Features
- Vendorized ExLlamaV3: all inference modules included — no separate install needed
- SlimGenerator: direct prefill->decode loop, no job queue, no page table, no defrag
- Fused RMSNorm+Residual: CUDA kernel fusing `x += attn_out; y = rmsnorm(x)` into a single launch (36 fewer kernel launches per token); see the reference sketch after this list
- Fused Sampling: temperature + top-k + Gumbel noise + argmax in one kernel
- FP8 KV Cache: E4M3FN quantized KV storage (~50% VRAM savings)
- PrefixCache: snapshots system prompt KV + recurrent state (GatedDeltaNet) to CPU pinned memory
- OptimizedLLM: high-level async wrapper with chat template and vision support
- Vision + Video: multimodal inference with image and multi-frame video embeddings
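For reference, the operation the fused RMSNorm+Residual kernel collapses into one launch is a residual add followed by RMSNorm. A minimal PyTorch sketch of the semantics (not the CUDA implementation; weight and eps are the usual RMSNorm parameters, assumed here):

import torch

def fused_rmsnorm_residual_reference(x, attn_out, weight, eps=1e-6):
    # Residual add followed by RMSNorm, which the fused kernel performs in one launch
    x = x + attn_out
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    y = x * rms * weight
    return x, y  # x: updated residual stream, y: normalized output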
Requirements
- Python >= 3.13
- CUDA >= 12.8
- PyTorch >= 2.6
Install
# Single install — compiles both exllamav3_ext and exllamav3_opt_ext CUDA kernels
MAX_JOBS=1 uv pip install --no-build-isolation -e .
# Optional: vision support
uv pip install pillow
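A quick sanity check after installing (package names as used in the examples below):

import torch
import exllamav3       # vendorized runtime
import exllamav3_opt   # optimized inference layer

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())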
Usage
OptimizedLLM (high-level API)
import asyncio
from exllamav3_opt.integration import LLMConfig, OptimizedLLM
config = LLMConfig(
    model_repo="your-org/your-model-EXL3",
    model_revision="bpw3.0",
    max_new_tokens=256,
    cache_size=4096,
    temperature=0.6,
    top_k=20,
)
llm = OptimizedLLM(config, hf_token="...")
model_path = llm.download()
llm.load(model_path)
# build_prompt uses the model's chat template (via HF tokenizer)
ids = llm.build_prompt([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
])
# Async generate
result = asyncio.run(llm.generate(ids))
print(result)
# Async streaming
async def main():
    async for token in await llm.stream(ids):
        print(token, end="", flush=True)
asyncio.run(main())
SlimGenerator (low-level API)
from exllamav3 import Cache, Config, Model, Tokenizer
from exllamav3_opt.generator import SlimGenerator
config = Config.from_directory("/path/to/model")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=4096)
model.load()
tokenizer = Tokenizer.from_config(config)
gen = SlimGenerator(model, cache, tokenizer)
# Text prompt (encoded internally)
for chunk in gen.stream_tokens("Hello!", max_new_tokens=256):
    print(chunk.text, end="", flush=True)
# Or pass pre-tokenized input_ids directly
ids = tokenizer.hf_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
for chunk in gen.stream_tokens(input_ids=ids, max_new_tokens=256):
    print(chunk.text, end="", flush=True)
Vision / Video
from PIL import Image
# Load vision model
vision_model = Model.from_config(config, component="vision")
vision_model.load()
# Single image
image = Image.open("photo.jpg").convert("RGB")
image.thumbnail((512, 512))
emb = vision_model.get_image_embeddings(tokenizer, image)
# Build prompt with embedding alias
prompt = f"<|im_start|>user\n{emb.text_alias}Describe this image<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, add_bos=True, encode_special_tokens=True, embeddings=[emb])
response = gen.generate(input_ids=input_ids, embeddings=[emb], max_new_tokens=256)
# Video (multiple frames)
frames = [Image.open(f"frame_{i}.jpg").convert("RGB") for i in range(4)]
frame_embs = vision_model.get_image_embeddings(tokenizer, frames) # returns list
aliases = "".join(e.text_alias for e in frame_embs)
prompt = f"<|im_start|>user\n{aliases}What happens in this video?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, add_bos=True, encode_special_tokens=True, embeddings=frame_embs)
response = gen.generate(input_ids=input_ids, embeddings=frame_embs, max_new_tokens=300)
FP8 KV Cache
from exllamav3_opt.fp8_cache import CacheLayer_fp8
cache = Cache(model, max_num_tokens=8192, layer_type=CacheLayer_fp8)
gen = SlimGenerator(model, cache, tokenizer)
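The FP8 layer stores keys and values as E4M3FN, roughly halving KV VRAM versus fp16. A minimal sketch of the idea in plain PyTorch (not the actual fp8_cache_kernels.cu code; the per-tensor scale is an assumption for illustration):

import torch

def quantize_kv_fp8(kv: torch.Tensor):
    # Scale into the E4M3FN range (max finite value 448), then store at 1 byte per element
    scale = kv.abs().amax().clamp(min=1e-12) / 448.0
    return (kv / scale).to(torch.float8_e4m3fn), scale

def dequantize_kv_fp8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float16) * scale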
Fused Sampling
gen = SlimGenerator(
    model, cache, tokenizer,
    use_fused_sampling=True,
    fused_temperature=0.6,
    fused_top_k=20,
)
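For reference, the semantics the fused kernel implements in a single launch can be written in plain PyTorch as below (a sketch only, not the kernel itself); the Gumbel-max step is equivalent to sampling from the softmax of the scaled, top-k-masked logits:

import torch

def fused_sampling_reference(logits, temperature=0.6, top_k=20):
    logits = logits / temperature
    if top_k > 0:
        # Mask everything below the k-th largest logit
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Gumbel-max trick: argmax(logits + Gumbel noise) samples from softmax(logits)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits).clamp_min(1e-20)))
    return torch.argmax(logits + gumbel, dim=-1)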
Project Structure
src/exllamav3/                     # Vendorized ExLlamaV3 runtime (inference-only subset)
    architecture/                  # ~43 model architectures (Qwen3.5, Llama, DeepSeek, etc.)
    modules/                       # Attention, MLP, GatedDeltaNet, linear, norms
    cache/                         # KV cache layers (fp16, quant, recurrent)
    model/                         # Model loading and config
    tokenizer/                     # Tokenizer + MMEmbedding (multimodal)
    exllamav3_ext/                 # CUDA kernels (~52 sources)
    util/                          # RoPE, progress, tensors, vision
src/exllamav3_opt/                 # Optimized inference layer
    _ext/                          # Custom CUDA kernels
        fused_rmsnorm_residual.cu  # Fused residual add + RMSNorm
        fused_sampling.cu          # Fused temperature + top-k + Gumbel + argmax
        fp8_cache_kernels.cu       # FP8 E4M3FN quant/dequant
        bindings.cpp               # pybind11 bindings
    generator.py                   # SlimGenerator + fused norm monkey-patch
    fp8_cache.py                   # CacheLayer_fp8 (CacheLayer ABC)
    prefix_cache.py                # PrefixCache
    tensor_pool.py                 # Pre-allocated tensor pool
    integration.py                 # OptimizedLLM (async wrapper)
    compile.py                     # torch.compile wrappers
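prefix_cache.py keeps the system prompt's KV (and GatedDeltaNet recurrent state) in CPU pinned memory so it can be restored without re-prefilling. The underlying pinned-memory round trip, sketched with plain PyTorch and hypothetical helper names (not the module's actual API):

import torch

def snapshot_to_pinned(tensors):
    # Device -> page-locked host copies; pinned memory makes the transfers async-capable
    out = []
    for t in tensors:
        host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        host.copy_(t, non_blocking=True)
        out.append(host)
    torch.cuda.synchronize()  # ensure the snapshot is complete before reuse
    return out

def restore_from_pinned(snapshots, device="cuda"):
    # Pinned host -> device copies can overlap with compute
    return [s.to(device, non_blocking=True) for s in snapshots]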