turboquant-vllm
TurboQuant+ compression for vLLM. Three features from one algorithm:
- Weight compression (4.3-4.6x) via 3-bit TQ3. Any BF16 checkpoint, compressed in 9 seconds, zero calibration. Faster than uncompressed serving.
- KV cache compression (3.7x) for more concurrent conversations on the same GPU
- Expert pruning via REAP saliency scoring for MoE models
Gemma 4 26B: ~52 GB checkpoint → ~12 GB runtime VRAM with TQ3. Scores 4.79/5 on our 20-scenario benchmark, comparable to Qwen3-235B AWQ (4.75/5) at 2.6x lower GPU cost. +33% throughput over BF16 at 16 concurrent requests (364 vs 275 tok/s on H100).
Qwen3-30B: 61 GB → 13 GB (4.6x). On MLA models, TQ+ KV cache works where vLLM's FP8 is broken.
from turboquant_vllm import enable_weight_quantization, patch_vllm_attention
enable_weight_quantization(bits=3) # 52 GB → 12 GB in 9 seconds
patch_vllm_attention(k_bits=4, v_bits=3, norm_correction=True) # 3.7x KV compression
# Then start vLLM as usual
Why this exists
vLLM offers FP8 KV cache (2x compression). For large MoE models at production context lengths, the KV cache is the memory bottleneck, not the weights. TurboQuant+ gives 3.7-4.7x compression with minimal quality loss:
| KV cache type | Compression | Per-vector overhead | Quality impact |
|---|---|---|---|
| FP16 (default) | 1x | 512 bytes | baseline |
| FP8 (vLLM built-in) | 2x | 256 bytes | negligible on standard attention; broken on MLA |
| TQ+ turbo4 | 3.7x | 140 bytes (K: 72 + V: 68) | +0.23% PPL |
| TQ+ turbo3 | 4.7x* | 108 bytes (K: 56 + V: 52) | +1.06% PPL |
| TQ+ asymmetric K4/V3 | ~4.0x | 124 bytes (K: 72 + V: 52) | K precision preserved |
*turbo3 with 3-bit sub-byte packing (8 indices per 3 bytes). Implemented in v0.3.0.
Norm storage is already optimal: one fp32 norm per 128-element vector (head_dim = block_size), matching the turboquant_plus finding that block_size=128 eliminates redundant norm storage for free.
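Those per-vector byte counts follow from the algorithm (K: (b-1)-bit PolarQuant codes plus 1-bit QJL signs plus per-vector metadata; V: b-bit codes plus one fp32 norm). A back-of-envelope check for turbo4 at head_dim=128, treating the last 4 bytes of the K budget as a QJL residual scale (our reading, not a documented field):
head_dim = 128
k_bytes = head_dim * 3 // 8 + head_dim // 8 + 4 + 4  # 48 B codes + 16 B sign bits + norm + assumed residual scale
v_bytes = head_dim * 4 // 8 + 4                      # 64 B codes + 4 B norm
print(k_bytes, v_bytes)            # 72 68, matching the "K: 72 + V: 68" row
print(512 / (k_bytes + v_bytes))   # ~3.66x, the table's 3.7x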
Benchmark results
10 configs tested on H100 80GB and A100 80GB on Verda (Helsinki, Finland). 20 multi-turn conversation scenarios (product inquiry, technical support, safety, reasoning, multilingual) scored by Llama-3.3-70B judge:
| Model | KV Cache | Avg Score | Latency |
|---|---|---|---|
| Qwen3-235B AWQ | TQ+ asymmetric K4/V3 | 4.75 | 28537ms |
| Qwen3-235B AWQ | TQ+ turbo4 | 4.74 | 29063ms |
| Qwen3-235B AWQ | FP16 (baseline) | 4.74 | 29415ms |
| Qwen3-235B AWQ | FP8 | 4.71 | 29971ms |
| Qwen3-30B FP16 | FP16 (baseline) | 4.73 | 4396ms |
| Qwen3-30B AWQ | FP16 | 4.67 | 3721ms |
| GLM-4.7-Flash BF16 | TQ+ turbo3 | 4.63 | 5998ms |
| GLM-4.7-Flash BF16 | FP16 (baseline) | 4.61 | 6042ms |
| GLM-4.7-Flash BF16 | TQ+ turbo4 | 4.58 | 5998ms |
| GLM-4.7-Flash BF16 | FP8 | 1.07 | 6299ms |
Key findings:
- TQ+ matches or beats baseline everywhere. Qwen3-235B: asymmetric K4/V3 (4.75) >= baseline (4.74). GLM-Flash: TQ+ turbo3 (4.63) > baseline (4.61). No scenario degraded across any config.
- Asymmetric K4/V3 is the winner. Highest score among all Qwen3-235B configs with better compression than symmetric turbo4. Confirms the turboquant_plus research that K precision dominates quality.
- TQ+ works on MLA models. First validated benchmark of TurboQuant+ on Multi-head Latent Attention (GLM-4.7-Flash). The patch compresses MLA's latent vectors correctly.
- FP8 KV cache is broken on MLA models. vLLM's FP8 KV on GLM-Flash scores 1.07/5. Single-turn responses are coherent, but multi-turn conversations degrade to garbage. Root cause: the FLASHMLA backend applies FP8 without proper per-tensor scaling, and quantization error compounds with context length. FP8 works fine on standard attention (Qwen3-235B: 4.71). TQ+ does not have this problem because PolarQuant normalizes each vector independently before quantization.
GLM-4.7 355B and DeepSeek-V3 671B benchmarks pending (require larger disk provisioning).
Tested models and known issues
| Model family | Attention | KV cache TQ | Notes |
|---|---|---|---|
| Qwen3 (0.6B-235B) | GQA | Works | Tested extensively, including 235B AWQ |
| Qwen3-8B | GQA | Works | Native vLLM backend confirmed on A100 |
| GLM-4.7-Flash | MLA | Works | TQ+ handles MLA correctly (FP8 does not) |
| DeepSeek-V3 | MLA | Works | Via MLACommonImpl patch |
| Qwen3.5 (hybrid) | GatedDeltaNet + GQA | Untested | Hybrid architecture, may need layer-specific handling |
| gpt-oss-20b | Alternating full/sliding window + sinks | Not yet | Returns empty output. Sliding window + attention sinks need pass-through support |
Standard GQA/MHA models work. MLA models work via the monkey-patch library. Models with non-standard attention (sliding window, attention sinks, hybrid recurrent) are not yet supported in the native backend.
GPU compatibility: Tested on A100 (SM80), RTX 6000 Ada (SM89), H100 (SM90). RTX PRO 6000 Blackwell (SM120) lacks FlashAttention-4 hardware support, which the native TQ backend currently depends on for prefill.
Install
pip install turboquant-plus-vllm@git+https://github.com/varjoranta/turboquant-vllm.git
For vLLM integration:
pip install "turboquant-plus-vllm[vllm] @ git+https://github.com/varjoranta/turboquant-vllm.git"
PyPI release coming soon.
How it works
Inspired by TurboQuant (Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026). After a random rotation, vector coordinates become approximately Gaussian, so a fixed codebook fits them without any data-dependent tuning. Our implementation uses a Gaussian Lloyd-Max codebook as an approximation. No calibration data is needed.
Extended by turboquant_plus for KV cache:
- K cache: PolarQuant at (b-1) bits + QJL at 1 bit = b bits total. QJL corrects inner-product bias, which matters because attention scores are inner products (Q @ K^T); a sketch of this path follows the list.
- V cache: PolarQuant MSE-only at full b bits. No QJL needed: V is used in a weighted sum, not in inner products.
- Asymmetric K/V: K precision dominates quality (controls softmax routing). V can be compressed more aggressively. K4/V3 gives better compression AND better quality than symmetric turbo3.
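A minimal NumPy sketch of the K path above, with a dense random rotation standing in for the WHT, a toy codebook standing in for the Gaussian Lloyd-Max codebook, and a crude sign-bit reconstruction standing in for the actual QJL inner-product estimator; it shows the structure, not the library's kernels:
import numpy as np

rng = np.random.default_rng(0)
d = 128
rot = np.linalg.qr(rng.standard_normal((d, d)))[0]      # orthogonal stand-in for the WHT
codebook = np.linspace(-2.5, 2.5, 8) / np.sqrt(d)       # toy 3-bit codebook, scaled to ~N(0, 1/d) coordinates
proj = rng.standard_normal((d, d))                      # QJL projection (128x128, as in the kernels)

def compress_k(k):
    norm = np.linalg.norm(k)
    z = rot @ (k / norm)                                 # normalize, then rotate
    idx = np.abs(z[:, None] - codebook).argmin(axis=1)   # PolarQuant at (b-1) = 3 bits
    resid = z - codebook[idx]
    signs = np.sign(proj @ resid)                        # QJL: 1 sign bit per dimension
    return norm, idx, signs, np.linalg.norm(resid)       # residual norm kept as side info (our simplification)

def decompress_k(norm, idx, signs, resid_norm):
    direction = proj.T @ signs                           # sign bits point roughly along the residual
    z_hat = codebook[idx] + resid_norm * direction / np.linalg.norm(direction)
    return norm * (rot.T @ z_hat)                        # un-rotate, restore magnitude

k = rng.standard_normal(d)
k_hat = decompress_k(*compress_k(k))
print(np.dot(k, k), np.dot(k, k_hat))                    # inner products land close to each other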
Usage
Patch vLLM (simplest)
from turboquant_vllm import patch_vllm_attention
# Symmetric 4-bit
patch_vllm_attention(k_bits=4, v_bits=4)
# Asymmetric K4/V3 with Phase 2 features (recommended)
patch_vllm_attention(k_bits=4, v_bits=3, norm_correction=True,
sink_tokens=4, boundary_layers=5)
Phase 2 features: norm correction fixes magnitude shrinkage at low bit widths. Sink tokens keep the first 4 positions at FP16 (attention sinks get universal attention). Boundary layers give the first/last 5 layers K=8-bit precision (they carry more signal through the residual stream). Validated on Gemma 4 26B: token-for-token identical output to FP16 baseline at temperature=0.
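As a mental model, the Phase 2 features amount to a per-token, per-layer precision policy. The helper below is purely illustrative (our names, not the patch internals):
def k_bits_for(layer_idx, num_layers, token_pos,
               k_bits=4, sink_tokens=4, boundary_layers=5):
    # Illustrative K-cache precision policy, not the library's internals.
    if token_pos < sink_tokens:
        return 16      # attention-sink tokens stay at FP16
    if layer_idx < boundary_layers or layer_idx >= num_layers - boundary_layers:
        return 8       # first/last layers keep higher-precision keys
    return k_bits      # everything else uses the configured width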
The patch covers both standard FlashAttention (Qwen3, Llama, Mistral) and MLA attention (GLM-4.7-Flash, DeepSeek-V3) via MLACommonImpl.
Standalone compression (without vLLM)
from turboquant_vllm import KVCacheCompressorTorch
compressor = KVCacheCompressorTorch(
head_dim=128, k_bits=4, v_bits=4, device="cuda"
)
# Compress
ck = compressor.compress_k(key_vectors) # (num_tokens, head_dim) → CompressedKV
cv = compressor.compress_v(value_vectors)
# Decompress
k_restored = compressor.decompress_k(ck) # → (num_tokens, head_dim) float32
v_restored = compressor.decompress_v(cv)
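A quick sanity check with the same compressor (random vectors, illustrative only; the exact error depends on the configured bit widths):
import torch

keys = torch.randn(4096, 128, device="cuda")
restored = compressor.decompress_k(compressor.compress_k(keys))
rel_err = (restored - keys).norm() / keys.norm()
print(f"relative K reconstruction error: {rel_err:.3%}")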
CUDA kernel compilation
CUDA kernels are JIT-compiled on first use (requires nvcc):
# Verify CUDA kernels compile and run
python -m turboquant_vllm.build
If CUDA compilation fails, the system automatically falls back to PyTorch ops (slower but functionally identical).
The CUDA kernels
KV cache kernels in csrc/turbo_quant.cu:
| Kernel | Purpose | Key operation |
|---|---|---|
| reshape_and_cache_kernel | Write path | Fused: norm → normalize → WHT rotate → searchsorted → pack 4-bit |
| dequant_paged_kernel | Read path | Fused: unpack → centroid lookup → inverse WHT → rescale |
| qjl_quantize_residual_kernel | K cache QJL | PolarQuant residual → 128×128 projection → pack sign bits |
| qjl_dequantize_and_add_kernel | K cache QJL | Reconstruct QJL contribution, add to PolarQuant output |
Weight dequant kernel in csrc/tq_weight_dequant.cu:
| Kernel | Purpose | Key operation |
|---|---|---|
| tq_weight_dequant_kernel | Weight decompression (CUDA) | Unpack indices → codebook lookup → warp-shuffle + shared memory WHT butterfly → rescale |
Triton fused dequant-GEMM in turboquant_vllm/triton_ops.py:
| Kernel | Purpose | Key operation |
|---|---|---|
| _polar_fused_gemm_kernel | FWHT-on-input GEMM (fastest) | Rotate input once → codebook lookup + norm scale + dot product (no weight decompression) |
| _tq_fused_gemm_kernel | Fused weight dequant + matmul | Unpack → codebook → pre-computed rotation matrix → scale → GEMM accumulate |
The fused Triton kernel is 10.5x faster than separate dequant + cuBLAS GEMM (0.57ms vs 5.9ms for 4096×4096 on A100). It eliminates the intermediate decompressed weight buffer entirely. Uses a pre-computed rotation matrix (128×128 = 64 KB, computed once) instead of the WHT butterfly, turning the inverse rotation into a small matmul that Triton optimizes well.
The CUDA kernel (5x faster than PyTorch) serves as a fallback when Triton is unavailable. It uses warp-shuffle operations for intra-warp butterfly stages and shared memory only for cross-warp stages.
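The FWHT-on-input tier works because the rotation is orthogonal: if a weight row is stored as codes q of its rotated version, so that w ≈ R^T q, then w·x = q·(R x), and rotating the activation once per layer replaces decompressing every row. A small PyTorch sketch of that identity (dense rotation and toy codebook standing in for the WHT and Lloyd-Max codebook; per-group norms omitted):
import torch

torch.manual_seed(0)
d = 128
rot = torch.linalg.qr(torch.randn(d, d))[0]       # orthogonal stand-in for the WHT
codebook = torch.linspace(-2.5, 2.5, 8)           # toy 3-bit codebook

w = torch.randn(d)                                # one weight row
idx = (rot @ w).unsqueeze(1).sub(codebook).abs().argmin(dim=1)
q = codebook[idx]                                 # codes of the rotated row (norms omitted)

x = torch.randn(d)
y_dequant = (rot.T @ q) @ x                       # decompress the row, then dot with x
y_fused = q @ (rot @ x)                           # rotate x once, dot with the codes directly
print(torch.allclose(y_dequant, y_fused))         # True: same result, no decompressed weights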
Design choices:
- Four-tier dispatch: Triton FWHT-on-input (rotates input, no weight decompression) → Triton fused dequant-GEMM → CUDA dequant + cuBLAS → PyTorch fallback. Heuristic selects FWHT-on-input for large layers (>4K output features), dequant-GEMM for small. All tiers support TQ2/TQ3/TQ4 including 3-bit sub-byte packing.
- Walsh-Hadamard Transform over dense rotation: O(d log d) vs O(d²). 896 FLOPs vs 16,384 for d=128.
- Separate K/V codebooks in constant memory for asymmetric bit widths.
- Constant memory caching: codebook and sign vectors only re-uploaded when config changes.
- 4-bit packing: two indices per byte, halves cache bandwidth.
- Targets A100 (sm_80), L40S/RTX4090 (sm_89), H100 (sm_90).
Bandwidth argument
At 32K context with 32 layers, 32 KV heads, head_dim=128 (typical for Qwen3-235B, Llama-70B class models):
| | FP16 | Turbo4 |
|---|---|---|
| KV cache size | 17.2 GB | 4.6 GB |
| Read time at 2TB/s (A100) | 8.6 ms | 2.3 ms |
| Dequant overhead | 0 | ~0.2 ms |
| Net per decode step | 8.6 ms | 2.5 ms |
71% reduction in KV cache access time. Models with fewer KV heads (GQA) have proportionally smaller caches, but the compression ratio holds.
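The 17.2 GB figure is straight shape arithmetic (decimal GB), and the Turbo4 figure follows from the per-vector byte counts in the overhead table above:
tokens, layers, kv_heads, head_dim = 32_768, 32, 32, 128
fp16_bytes = tokens * layers * kv_heads * head_dim * 2 * 2      # K and V at 2 bytes per element
turbo4_bytes = tokens * layers * kv_heads * (72 + 68)           # per-vector K + V bytes from the table
print(fp16_bytes / 1e9, turbo4_bytes / 1e9)                     # ~17.2 GB and ~4.7 GB (table rounds to 4.6)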
Compatibility
| Model family | Attention type | TQ+ support | FP8 KV safe? |
|---|---|---|---|
| Qwen3, Llama, Mistral | FlashAttention (GQA/MHA) | Yes | Yes |
| GLM-4.7-Flash, DeepSeek-V3 | Multi-head Latent Attention (MLA) | Yes | No (broken) |
MLA models store a compressed latent vector (kv_c_normed) plus positional encoding (k_pe) instead of standard K/V. The patch compresses kv_c_normed with PolarQuant MSE-only and passes k_pe through uncompressed. Validated on GLM-4.7-Flash across 20 scenarios.
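Schematically, the write and read paths split along those two tensors. The sketch below is illustrative (our names and structure, not the actual MLACommonImpl patch), reusing the MSE-only V-path compressor for the latent:
def write_mla_kv(kv_c_normed, k_pe, compressor, entry):
    entry.latent = compressor.compress_v(kv_c_normed)   # latent: PolarQuant MSE-only, no QJL
    entry.rope = k_pe                                   # positional part passes through uncompressed

def read_mla_kv(entry, compressor):
    kv_c = compressor.decompress_v(entry.latent)        # restore the latent for attention
    return kv_c, entry.rope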
Weight compression
The same WHT rotation + codebook math compresses model weights. Load any BF16 checkpoint, compress at startup, serve. No calibration data.
from turboquant_vllm import enable_weight_quantization
enable_weight_quantization(bits=3) # TQ3: best compression
# or bits=4 for TQ4 (more conservative)
# then: vllm serve google/gemma-4-26B-A4B-it
Works with vLLM V1 engine (multiprocessing spawn) via the vllm.general_plugins entry point. The hook is automatically re-applied in spawned subprocesses.
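vLLM discovers such plugins through the vllm.general_plugins entry-point group, which is also why the hook survives the spawned subprocesses. To confirm the hook is visible to vLLM (entry-point names depend on what you have installed):
from importlib.metadata import entry_points

for ep in entry_points(group="vllm.general_plugins"):
    print(ep.name, "->", ep.value)   # the turboquant hook should appear in this list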
Weight compression uses the same TurboQuant rotation and Gaussian Lloyd-Max codebook described under "How it works" (Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026). The application to weights was inspired by @coffeecup2020's TQ3_1S proof-of-concept for llama.cpp.
Results
| Model | BF16 | TQ4 (4-bit) | TQ3 (3-bit) |
|---|---|---|---|
| Gemma 4 26B | 52 GB | 15 GB (3.4x) | 12 GB (4.3x) |
| Qwen3-30B | 61 GB | 17 GB (3.6x) | 13 GB (4.6x) |
Gemma 4 TQ3 quality: 4.79/5 on 20 multi-turn conversation scenarios (scored by Llama-3.3-70B judge). Matches Qwen3-235B AWQ (4.75/5) at 2.6x lower GPU cost.
Throughput (H100 80GB, vLLM 0.19.0, Gemma 4 26B)
| Concurrency | BF16 tok/s | TQ3 Weight | TQ3 W+KV (K4/V3) | vs BF16 |
|---|---|---|---|---|
| 1 | 27.4 | 28.3 | 28.2 | +3% |
| 4 | 103.3 | 103.4 | 102.9 | 0% |
| 8 | 183.5 | 190.4 | 185.5 | +4% |
| 16 | 274.5 | 357.2 | 364.2 | +33% |
TQ3 matches or beats BF16 at every concurrency level. The gap widens under load because smaller weights need less HBM bandwidth, leaving more room for batching.
3-bit sub-byte packing: 8 indices per 3 bytes. Norm correction: stores original_norm / reconstruction_norm ratio per group to fix 5-10% magnitude shrinkage at 3-bit.
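The 8-indices-per-3-bytes packing is plain bit twiddling. A NumPy sketch of one possible layout (the CUDA kernels' actual bit order may differ):
import numpy as np

def pack_tq3(indices):
    # 3-bit values 0..7; len(indices) must be a multiple of 8
    bits = np.unpackbits(indices.astype(np.uint8)[:, None], axis=1)[:, -3:]  # low 3 bits of each index
    return np.packbits(bits.reshape(-1, 24))                                 # 24 bits -> 3 bytes

def unpack_tq3(packed, n):
    bits = np.unpackbits(packed).reshape(-1, 3)[:n]
    return (np.packbits(bits, axis=1) >> 5).ravel()                          # back to values 0..7

idx = np.random.randint(0, 8, size=4096).astype(np.uint8)
assert np.array_equal(unpack_tq3(pack_tq3(idx), idx.size), idx)
assert pack_tq3(idx).nbytes == idx.size * 3 // 8                             # 1536 bytes for 4096 indices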
Native TQ3 checkpoint (small GPUs)
Runtime compression (enable_weight_quantization) needs the full BF16 checkpoint in GPU memory during loading. For GPUs that can't fit the original checkpoint (e.g., L40S 48GB for a 52 GB model), use a native TQ3 checkpoint instead:
from turboquant_vllm import load_tq3_model
# 12 GB checkpoint → 13.7 GB GPU peak, tested on L40S 48GB and H100 80GB
model, tokenizer = load_tq3_model("varjosoft/gemma-4-26B-A4B-it-TQ3-native")
output = model.generate(...)
The native checkpoint stores packed 3-bit indices directly (12 GB on disk). The loader creates the model on a meta device (zero memory), loads packed weights to GPU, and decompresses on-the-fly during each forward pass.
Create your own native checkpoint:
from turboquant_vllm.checkpoint import save_tq3_checkpoint
save_tq3_checkpoint("google/gemma-4-26B-A4B-it", "./gemma4-tq3-native")
# CPU only, ~60 GB RAM, ~2 minutes
Important: Gemma 4 requires transformers >= 5.5.0 (the gemma4 model type was added in that version). vLLM 0.19.0 pins transformers < 5, so Gemma 4 loading requires a manual override: pip install 'transformers>=5.5'.
Limitations
- V100 16GB: model loads (12 GB) but not enough room for KV cache. Minimum practical is 24 GB.
- TQ2 (2-bit) destroys quality. 4 centroids too few for MLP weight distributions.
- Native TQ3 inference speed is slower than runtime compression due to per-forward-pass decompression overhead.
Expert pruning (REAP)
Integrated REAP (Cerebras, ICLR 2026) saliency scoring for MoE expert pruning. Measures actual expert contribution during inference, not just weight magnitude.
from turboquant_vllm.expert_pruning import reap_prune
reap_prune(model, tokenizer, prune_fraction=0.2, num_samples=512)
20% pruning preserves quality on Qwen3-30B. 50% pruning degrades quality at this model size (the REAP paper reports aggressive pruning works on larger models).
AWQ export
Export TQ-compressed weights to AWQ format for Marlin serving speed:
from turboquant_vllm.export import compress_and_export
compress_and_export("google/gemma-4-26B-A4B-it", "./gemma4-awq", bits=4)
# ~2 minutes total (TQ compress + AutoAWQ pack)
# Serve with: vllm serve ./gemma4-awq --quantization awq
Requires AutoAWQ. Replaces hours of AWQ calibration with ~2 minutes.
Combined with KV cache compression
Both features work together:
from turboquant_vllm import enable_weight_quantization, patch_vllm_attention
enable_weight_quantization(bits=4, group_size=128) # 59.7 GB → 16.8 GB model
patch_vllm_attention(k_bits=4, v_bits=3) # 3.7x smaller KV cache
Same math, same CUDA kernels. Weight compression reduces the hardware you need. KV cache compression increases how many users you can serve on it.
Contributions and testing on different models welcome. Write-up: varjosoft.com/weight-compression.html
Native vLLM fork
For production use without monkey-patching, we maintain a vLLM fork with TurboQuant built in as a native attention backend:
# Install from fork
pip install git+https://github.com/varjoranta/vllm-1.git@turboquant-integration
# Use directly — no patching needed
vllm serve Qwen/Qwen3-8B --kv-cache-dtype tq3
The fork includes a standalone TurboQuantAttentionBackend with Triton/CUDA kernels, FP8 value storage for quality preservation, and asymmetric K/V support (--kv-cache-dtype tq_k4v3). Based on vllm-project/vllm#38479 with quality fixes.
This library (monkey-patch approach) remains useful for quick testing with any existing vLLM install, weight quantization, and models not yet supported by the native backend.
Fork: varjoranta/vllm-1, branch turboquant-integration
Serverless deployment
Deploy models to Verda GPU cloud (Helsinki) with scale-to-zero billing.
python containers/deploy.py deploy gemma4-26b-it # A100, best quality/cost ratio
python containers/deploy.py deploy gpt-oss-20b # L40S, cheapest good model
python containers/deploy.py deploy qwen3-235b-awq # H200, highest quality
python containers/deploy.py pause gemma4-26b-it # stop billing
Measured results (April 2026)
| Model | Active params | GPU | Cost/hr | Cold start | Quality | Per session |
|---|---|---|---|---|---|---|
| Gemma 4 26B MoE | 3.8B | A100 80GB | $1.29 | 3 min | Excellent (#6 Arena AI) | ~$0.22 |
| gpt-oss-20b | 3.6B | L40S 48GB | $0.90 | 3.3 min | Very good | ~$0.15 |
| Qwen3-8B | 8B | L40S 48GB | $0.90 | 3 min | Good | ~$0.15 |
| Qwen3-235B AWQ | 22B | H200 141GB | $3.39 | 5.5 min | Best (4.75/5 benchmark) | ~$0.57 |
Cold start = time from zero replicas to first token (model cached on persistent volume). Billing per 10-minute block.
Gemma 4 setup: Requires custom vLLM image with transformers>=5.5.0 and python3-dev. Use the instruction-tuned variant google/gemma-4-26B-A4B-it. Released April 2, 2026. Apache 2.0.
For real-time chat, always-warm is required — cold starts of 2-5 minutes are too slow for interactive use. Monthly cost (8hr/day): Gemma 4 on A100 ~$310, gpt-oss-20b on L40S ~$216. Serverless scale-to-zero works for batch/async workloads.
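The monthly figures are straight multiplication of the hourly rates above:
print(1.29 * 8 * 30)   # ~$310/month: Gemma 4 on A100 at 8 hr/day
print(0.90 * 8 * 30)   # $216/month: gpt-oss-20b on L40S at 8 hr/day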
Code: containers/deploy.py
Related projects
- turboquant-vllm on PyPI — A separate, independent implementation of TurboQuant for vLLM by Alberto-Codes. Uses Triton kernels and HuggingFace DynamicCache, targeting consumer GPUs (RTX 4090). This project differs: fused CUDA kernels for production A100/H100, asymmetric K/V bit widths (required for quantized weight models), and vLLM paged cache integration. The PyPI package for this project will be published as turboquant-plus-vllm to avoid confusion.
- turbo-quant-lite — Numpy-only TurboQuant for embedding compression in databases. Same math, different codebook and use case.
- turboquant_plus — Research implementation of the KV cache algorithm. This package builds production CUDA kernels on top of that work.
- TQ3_1S for llama.cpp — @coffeecup2020's proof-of-concept applying TurboQuant to model weights (not just KV cache). Achieved near-Q4_0 quality at 3.5-bit. Inspired the weight quantization feature in this package.
- TurboQuant paper — Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026. The underlying algorithm.
- REAP — Cerebras, ICLR 2026. Router-weighted expert pruning for MoE compression.
- SpinQuant — Facebook Research, ICLR 2025. Learned rotation optimization (up to 45% improvement over fixed Hadamard). Our learned_rotation.py implements a simplified version.
- SqueezeLLM — ICML 2024. Sensitivity-weighted codebooks and sparse outlier extraction. Influenced our research direction.
- AutoAWQ — AWQ quantization and packing library. Used in our AWQ export pipeline.
Development process
This library was developed with the help of Spegling, a personal knowledge system built at Varjosoft. Spegling maintains a persistent wiki compiled from research papers and production systems, integrates with coding agents via MCP, and governs autonomous research with documented provenance. The research for v0.3.0 (TQ3 compression, REAP pruning, fused kernels) was conducted through Spegling analyzing relevant papers, implementing approaches, running benchmarks on Verda GPU instances, and iterating based on results. Total GPU cost: ~$18.
License
MIT, Varjosoft Oy