Single-user optimized inference wrapper for ExLlamaV3

exllamav3-inference

Single-user optimized inference wrapper for ExLlamaV3. Replaces the upstream 1000+ line Generator with a minimal prefill-decode loop, custom CUDA kernels, and optional FP8 KV cache.

Built for the kohai-v2 VTuber AI system, which runs Qwen3.5-VL-27B at 3.0 bpw EXL3.

Features

  • SlimGenerator: direct prefill->decode loop, no job queue, no page table, no defrag
  • Fused RMSNorm+Residual: CUDA kernel fusing x += attn_out; y = rmsnorm(x) into a single launch (36 fewer kernel launches per token)
  • Fused Sampling: temperature + top-k + Gumbel noise + argmax in one kernel
  • FP8 KV Cache: E4M3FN quantized KV storage (~50% less KV cache VRAM than 16-bit storage)
  • PrefixCache: snapshots system prompt KV to CPU pinned memory
  • OptimizedLLM: drop-in replacement for kohai-v2's LLM class
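
As a rough illustration of what the fused RMSNorm+Residual step computes, here is a pure-Python reference of the two operations named above (x += attn_out, then y = rmsnorm(x)). It is a sketch, not the package's CUDA kernel: the function name, the eps default, and the plain multiplicative weight are assumptions made for the example.

```python
import math

def add_then_rmsnorm(x, attn_out, weight, eps=1e-6):
    """Reference for the fused op: residual add, then RMSNorm.

    Hypothetical helper for illustration; eps and the weight
    convention are assumptions, not taken from the package.
    Returns (updated residual, normalized output).
    """
    # Step 1: residual add (x += attn_out).
    residual = [xi + ai for xi, ai in zip(x, attn_out)]
    # Step 2: RMSNorm over the hidden dimension.
    mean_sq = sum(v * v for v in residual) / len(residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    y = [v * inv_rms * w for v, w in zip(residual, weight)]
    return residual, y

residual, y = add_then_rmsnorm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0], eps=0.0)
print(residual)                   # [3.0, 4.0]
print([round(v, 4) for v in y])   # [0.8485, 1.1314]
```

Run unfused, these are two separate passes over the hidden state; doing both in one kernel avoids writing the updated residual and reading it back between launches.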

Requirements

  • Python >= 3.13
  • CUDA >= 12.8
  • PyTorch >= 2.6
  • ExLlamaV3 (installed separately)

Install

# Install exllamav3 fork first
pip install --no-build-isolation -e /path/to/exllamav3

# Install this package (compiles CUDA kernels)
MAX_JOBS=1 uv pip install --no-build-isolation -e .

Usage

from exllamav3 import Cache, Config, Model, Tokenizer
from exllamav3_opt.generator import SlimGenerator

config = Config.from_directory("/path/to/model")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=4096)
model.load()
tokenizer = Tokenizer.from_config(config)

gen = SlimGenerator(model, cache, tokenizer)

# Streaming
for chunk in gen.stream_tokens("Hello!", max_new_tokens=256):
    print(chunk.text, end="", flush=True)
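
To get a quick throughput number from a stream like the one above, a small wrapper can collect the text and time the loop. This is a minimal sketch: consume_stream and fake_stream are hypothetical names, and the stand-in stream only mimics the .text attribute used in the example above. Note that it counts chunks, which may not map one-to-one to tokens.

```python
import time
from types import SimpleNamespace

def consume_stream(chunks, clock=time.perf_counter):
    """Drain a chunk iterator; return (text, chunk count, chunks per second)."""
    start = clock()
    parts = []
    for chunk in chunks:
        parts.append(chunk.text)
    elapsed = clock() - start
    rate = len(parts) / elapsed if elapsed > 0 else float("inf")
    return "".join(parts), len(parts), rate

def fake_stream():
    # Stand-in for gen.stream_tokens(...): yields objects with a .text attribute.
    for piece in ["Hel", "lo", "!"]:
        yield SimpleNamespace(text=piece)

text, n_chunks, rate = consume_stream(fake_stream())
print(text, n_chunks)
```

With a real generator, passing gen.stream_tokens("Hello!", max_new_tokens=256) in place of fake_stream() should work the same way, assuming each chunk exposes .text as shown above.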

FP8 KV Cache

from exllamav3_opt.fp8_cache import CacheLayer_fp8

cache = Cache(model, max_num_tokens=8192, layer_type=CacheLayer_fp8)
gen = SlimGenerator(model, cache, tokenizer)
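
The roughly 50% saving follows from storing one byte per element instead of two. The arithmetic can be sketched as below; the layer count, KV head count, and head dimension are placeholder values for illustration only (not the real model's shape), and the estimate ignores any scale metadata the FP8 layer may keep.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to store keys and values for num_tokens positions."""
    # Factor 2: one key vector and one value vector per head, per layer.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return num_tokens * elements_per_token * bytes_per_element

# Placeholder dimensions (assumption, not the real model configuration).
dims = dict(num_layers=48, num_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(8192, bytes_per_element=2, **dims)
fp8 = kv_cache_bytes(8192, bytes_per_element=1, **dims)
print(f"FP16: {fp16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB")
# FP16: 1536 MiB, FP8: 768 MiB
```

The same VRAM budget therefore holds about twice as many cached tokens, which is why the example above can raise max_num_tokens from 4096 to 8192.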

Fused Sampling

gen = SlimGenerator(
    model, cache, tokenizer,
    use_fused_sampling=True,
    fused_temperature=0.6,
    fused_top_k=20,
)
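
The temperature + top-k + Gumbel noise + argmax sequence is the Gumbel-max trick: adding Gumbel(0, 1) noise to temperature-scaled logits and taking the argmax draws a sample from the softmax distribution over the kept candidates. A pure-Python sketch of that idea follows; sample_top_k_gumbel is a hypothetical name and this is not the package's kernel code.

```python
import math
import random

def sample_top_k_gumbel(logits, temperature=0.6, top_k=20, rng=random):
    """Sample a token id: temperature scaling, top-k filter, Gumbel noise, argmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep the indices of the top_k largest logits.
    k = min(top_k, len(logits))
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    best_id, best_score = -1, -math.inf
    for i in candidates:
        # Gumbel(0, 1) noise is -log(-log(u)) for u drawn uniformly from (0, 1).
        u = rng.random()
        while u == 0.0:  # random() can return exactly 0.0; avoid log(0)
            u = rng.random()
        score = logits[i] / temperature - math.log(-math.log(u))
        if score > best_score:
            best_id, best_score = i, score
    return best_id

logits = [0.1, 5.0, 2.0, -1.0]
print(sample_top_k_gumbel(logits, top_k=1))  # 1 (only the largest logit survives)
print(sample_top_k_gumbel(logits, temperature=0.6, top_k=2))
```

Because the sample is a single argmax, no softmax normalization or cumulative-sum pass is needed, which is what makes the whole step a natural fit for one kernel.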

Benchmark

python test_with_model.py                  # default: fused norm ON
python test_with_model.py --no-fused-norm  # compare without fusion
python test_with_model.py --fp8-cache      # test FP8 VRAM savings

RTX 5090, Qwen3.5-VL-27B @ 3.0bpw EXL3:

Configuration               tok/s          vs upstream
Upstream Generator          46.8           baseline
SlimGenerator + fused norm  48.3 (decode)  +3.2%
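
The relative figure can be checked from the two throughput numbers, and the same numbers give the per-token latency. A short worked calculation (the helper name is made up for this example):

```python
def relative_speedup_percent(baseline_tps, candidate_tps):
    """Percent change in throughput relative to the baseline."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0

speedup = relative_speedup_percent(46.8, 48.3)
print(f"{speedup:+.1f}%")  # +3.2%

# Equivalent per-token latency in milliseconds (about 21.4 vs 20.7).
baseline_ms = 1000.0 / 46.8
candidate_ms = 1000.0 / 48.3
print(f"{baseline_ms:.1f} ms -> {candidate_ms:.1f} ms per token")
```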

Project Structure

src/exllamav3_opt/
  _ext/                     # CUDA kernels
    fused_rmsnorm_residual.cu  # fused residual add + RMSNorm
    fused_sampling.cu          # fused temperature + top-k + Gumbel + argmax
    fp8_cache_kernels.cu       # FP8 E4M3FN quant/dequant
    bindings.cpp               # pybind11 bindings
  generator.py              # SlimGenerator + fused norm monkey-patch
  fp8_cache.py              # CacheLayer_fp8 (CacheLayer ABC)
  prefix_cache.py           # PrefixCache
  tensor_pool.py            # Pre-allocated tensor pool
  integration.py            # OptimizedLLM (kohai-v2 drop-in)
  compile.py                # torch.compile wrappers

Source Distribution

exllamav3_inference-0.3.0.tar.gz (20.0 kB view details)

Uploaded Source

File details

Details for the file exllamav3_inference-0.3.0.tar.gz.

File metadata

  • Download URL: exllamav3_inference-0.3.0.tar.gz
  • Upload date:
  • Size: 20.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exllamav3_inference-0.3.0.tar.gz
Algorithm Hash digest
SHA256 a421651b86d1e07be69b9f0789512a019988e046b62b9f0f2c434c093a742c33
MD5 e348efb6b080e5fdec5165f5c2ae8882
BLAKE2b-256 1ffc5f64a1d377c18a4025db44de8c39d902712d5e0ae6d7cc4d9a891f3e8eca

Provenance

The following attestation bundles were made for exllamav3_inference-0.3.0.tar.gz:

Publisher: release.yml on nicokim/exllamav3-inference
