llama.cpp + TurboQuant — Hadamard-rotation preprocessor for LLM weights, plus a unified CLI on top of llama-cpp-python

These details have not been verified by PyPI

Project links

Project description

turbocpp

llama.cpp + TurboQuant. Every llama.cpp feature, plus an offline Hadamard-rotation preprocessor that meaningfully improves the quality of any quantization (Q4_0 / Q4_K_M / Q6_K / …) at zero inference cost.


🚀 Live demo	huggingface.co/spaces/AIencoder/turboquant-visualizer
📦 Python package	`pip install turbocpp` (PyPI) — `[runtime]` extra adds llama-cpp-python
🐳 Docker images	`docker pull ghcr.io/ary5272/turbocpp:cpu` (also `:server`, `:turboquant`)
🔧 Wheel mirror	datasets/AIencoder/TurboCpp_Wheels — prebuilt llama-cpp-python for every CPU feature combo

Install

# From PyPI (recommended — pulls llama-cpp-python source build via [runtime]):
pip install 'turbocpp[runtime]'

# If your CPU/OS lacks a build toolchain, skip [runtime] and install
# llama-cpp-python from a prebuilt wheel matched to this host:
pip install turbocpp
pip install $(turbocpp pick-wheel)

# From the GitHub Release (always points at the latest tag — useful in
# environments where PyPI is mirrored / blocked):
pip install https://github.com/Ary5272/turbocpp/releases/latest/download/turbocpp-py3-none-any.whl

After install, turbocpp doctor reports what's wired (color-coded PASS / WARN / FAIL with one line per check):

$ turbocpp doctor
turbocpp 0.20.0 doctor - linux
  [PASS]  python ≥ 3.10                          3.11.9
  [PASS]  cpu feature variant                    basic_avx2_fma_f16c
  [PASS]  llama-cpp-python                       0.3.16
  [PASS]  llama-cpp-python GPU offload           yes
  [PASS]  docker on PATH                         /usr/bin/docker
  [PASS]  image ghcr.io/ggml-org/llama.cpp:full  cached locally
  [PASS]  GPU                                    nvidia (nvidia-smi)
  [PASS]  torch (rotate)                         2.4.0 (cuda)
  [PASS]  HF wheel URL reachable                 https://huggingface.co/...

Pipe-friendly: turbocpp doctor --no-color strips ANSI escapes; turbocpp doctor --no-network skips the wheel HEAD probe.

After install you get a turbocpp CLI:

turbocpp rotate      ./Llama-3-8B  ./Llama-3-8B-tq        # offline Hadamard rotation
turbocpp generate    -m model.gguf -p "Hello" -n 64        # one-shot inference
turbocpp serve       -m model.gguf --host 0.0.0.0 --port 8080
turbocpp speculative -m target.gguf -d draft.gguf -p "..." # 1.5-3× faster decode
turbocpp pick-wheel                                         # auto-pick fastest wheel
turbocpp pick-wheel  --gpu cuda12                           # GPU variant URL
turbocpp bench                                              # rotation/quant MSE microbench

# `-m` accepts: a local GGUF, a config alias, or a HuggingFace ref. The
# ref is downloaded on first use into ~/.cache/turbocpp/models/ and cached.
turbocpp generate -m TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hi"
turbocpp generate -m hf://TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hi"

# every llama.cpp tool, no submodule, no compile — pulls ggml-org/llama.cpp:full
turbocpp convert    /models/Llama-3-8B    --outfile /models/m.gguf
turbocpp quantize   /models/m.gguf  /models/m-Q4_K_M.gguf  Q4_K_M
turbocpp perplexity -m /models/m-Q4_K_M.gguf -f /data/wiki.test.raw
turbocpp imatrix    -m /models/m.gguf -f /data/calib.txt -o imatrix.dat
turbocpp llama-cli  -m /models/m.gguf -p "Hello"
turbocpp llama-bench -m /models/m.gguf
turbocpp llama       <any-tool>           # raw passthrough

Security

All .github/workflows/*.yml actions are pinned to commit SHAs (Dependabot keeps them current).
Wheels and Docker images carry SLSA build provenance attestations — verify with gh attestation verify <file> --owner Ary5272.
Weekly gitleaks + CodeQL scans on main.
See SECURITY.md for vulnerability reporting.

Get actual speedups, not just better quality

# (1) Auto-install the fastest llama-cpp-python wheel for your CPU
#     (AVX-512 / VNNI / AMX automatically chosen):
pip install $(turbocpp pick-wheel)

# (2) Speculative decoding — biggest single decode win, no kernels needed.
#     Smaller draft proposes K tokens; bigger target verifies in one pass.
turbocpp speculative \
    -m  Llama-3-8B-tq-Q4_K_M.gguf      \
    -d  Llama-3-8B-tq-Q2_K.gguf        \
    -p  "Explain quantization." -n 256 -k 4

# (3) End-to-end head-to-head benchmark (4-way matrix):
./scripts/bench_speculative.sh /path/to/HF/Llama-3-8B

The CPU-tier auto-pick alone gives ~10-30% over the AVX2 default on Sapphire Rapids / Zen4. Speculative decoding stacks another 1.5-3× on top. Together: 2-4× over a stock pip install llama-cpp-python flow.

Docker

Three images on GHCR. All three install llama-cpp-python from a prebuilt wheel at AIencoder/TurboCpp_Wheels — no source compile, ~30s image build instead of ~10 min, runs on any x86_64 host with AVX2 + FMA + F16C.

image	what's inside	size
`ghcr.io/ary5272/turbocpp:cpu`	turbocpp CLI + llama-cpp-python (prebuilt wheel)	~500 MB
`ghcr.io/ary5272/turbocpp:server`	inherits `:cpu`, ENTRYPOINT = `turbocpp serve` on `:8080`	~500 MB
`ghcr.io/ary5272/turbocpp:turboquant`	inherits `:cpu`, adds CPU-only PyTorch for `rotate`	~2.0 GB

# Inference runtime + unified CLI
docker run --rm -v ~/models:/models ghcr.io/ary5272/turbocpp:cpu \
       generate -m /models/m.gguf -p "Hello" -n 64

# OpenAI-compatible HTTP server on :8080
docker run --rm -p 8080:8080 -v ~/models:/models ghcr.io/ary5272/turbocpp:server \
       -m /models/m.gguf

# Offline Hadamard rotation
docker run --rm -v ~/models:/models ghcr.io/ary5272/turbocpp:turboquant \
       rotate /models/Llama-3-8B /models/Llama-3-8B-tq

A new image is pushed to GHCR on every main commit and every v* tag (docker.yml). Build locally pointing at a different prebuilt wheel (e.g. AVX-512 / VNNI / Sapphire Rapids):

docker build --target cpu \
    --build-arg LLAMA_CPP_WHEEL_URL="$(turbocpp pick-wheel --gpu cuda12)" \
    -t turbocpp:cpu-cuda12 .

   ┌───────────────────────────────────────────────────────────────┐
   │ HF model ──► turboquant rotate ──► llama.cpp convert+quantize │
   │                                                ▼              │
   │                            standard GGUF, runs anywhere       │
   │                            llama.cpp does — every backend,    │
   │                            every architecture, every sampler  │
   └───────────────────────────────────────────────────────────────┘

Layout

path	purpose
`ghcr.io/ggml-org/llama.cpp:full`	upstream ggml-org/llama.cpp, pulled at runtime via Docker — the inference engine, every quantization format, every GPU backend (CUDA / Metal / Vulkan / SYCL / ROCm), HTTP server, samplers, grammars, ~50 model architectures. We stopped vendoring llama.cpp as a git submodule in 0.5.0 so you always get whatever ggml-org's latest stable image is, without us pinning a stale commit. The `turbocpp llama <tool>` and `turbocpp convert / quantize / perplexity / imatrix / llama-cli / llama-bench` subcommands all forward into this image.
`turboquant/`	the differentiator — Python package that applies Walsh-Hadamard rotation to a HuggingFace model before quantization. Output is a standard rotated HF checkpoint that you feed to `convert_hf_to_gguf.py` unmodified
`extras/standalone/`	a parallel from-scratch C++17 implementation written earlier in the project. Pure CPU, AVX2/AVX-512, K-quants, GQA, YaRN, mirostat, beam search, GBNF subset, OpenAI-compat HTTP server. Useful as a study reference and a lighter-weight runtime when you don't need llama.cpp's full footprint

Why "llama.cpp + TurboQuant"

llama.cpp already ships:

Architectures: LLaMA 1/2/3, Mistral, Mixtral (MoE), Qwen 1/2/2.5, Phi 1/2/3, Gemma 1/2, Falcon, MPT, BLOOM, GPT-2, GPT-NeoX, StableLM, Baichuan, Yi, RWKV, Mamba, …
Quantization: Q2_K, Q3_K_S/M/L, Q4_0/1, Q4_K_S/M, Q5_0/1, Q5_K_S/M, Q6_K, Q8_0, Q8_K, IQ1_S/M, IQ2_XXS/XS/S/M, IQ3_XXS/S/M, IQ4_XS/NL, BF16, F16, F32
Backends: CPU (AVX/AVX2/AVX-512/NEON/AMX), CUDA, Metal, Vulkan, SYCL, ROCm, Kompute, OpenCL, RPC, BLAS
Sampling: greedy, temperature, top-k, top-p, min-p, typical-p, tail-free, locally-typical, dynatemp, mirostat v1+v2, repetition penalty, frequency penalty, presence penalty, logit bias, GBNF grammar, JSON mode, classifier-free guidance, beam search, speculative decoding, lookahead decoding
Runtime: continuous batching, parallel sequences, prompt caching, KV-cache shifting/defrag, embeddings, reranking, LoRA hotswap, multi-modal (LLaVA, Phi-3-vision, MiniCPM-V), tools/function-calling, chat templates for every major model
Server: llama-server (OpenAI-compatible HTTP API: completions, chat, embeddings, tools), web UI

TurboQuant adds: a 2 KB Python module that rotates the model's weight matrices in-place using Walsh-Hadamard transforms. The rotation cancels through the residual-stream linear pieces (it's orthogonal) so the model is fp32-bit-identical, but the per-weight-block max-abs that drives Q4 / Q4_K rounding error drops 3-5×, which translates to 0.3-0.5 perplexity improvement at Q4_K_M on LLaMA-2-7B (and bigger gains at lower bit-widths).

Does this actually run faster than stock llama.cpp?

It's the right question and the honest answer has two parts:

Same bit-width: NO

Quantizing a TurboQuant-rotated model at Q4_K_M and running it on stock llama.cpp gives the exact same tokens/sec as a non-rotated Q4_K_M of the same model. Same bytes per weight, same kernels, same memory layout. What you get is better quality at the same speed — about 0.3-0.5 perplexity points back at Q4_K_M on LLaMA-2-7B.

Drop a bit-width tier: YES

The real speed win is using the recovered quality budget to drop one quantization tier:

recipe	bytes/weight	quality	wall-clock decode
baseline Q4_K_M (no rotation)	4.625	reference	reference
TurboQuant Q4_K_M	4.625	better	same
TurboQuant Q3_K_M	3.5	≈ baseline Q4_K_M	~1.20-1.30× faster on memory-bound CPUs
TurboQuant Q2_K (aggressive)	2.6	usable for some tasks	~1.5× faster

The speedup comes from memory bandwidth: decoding is bandwidth-bound on nearly all consumer CPUs (and on Sapphire Rapids when the workload doesn't fit AMX tiles, which is most of them at long context). Fewer bytes per weight read each step = fewer cycles waiting on DRAM.

KV cache: also YES (long context)

turboquant.kvcache.rotate_kv_for_cache_quant() Hadamard-rotates the attention output projection so K and V live in a Gaussianized frame inside the KV cache. Combine with llama.cpp's --cache-type-k q4_0 --cache-type-v q4_0 and you get usable quality at half the KV bandwidth — meaningful at 8K+ context where KV reads dominate.

Reproduce the numbers

# Synthetic micro (1 second, no model needed):
python -m turboquant.bench

# End-to-end on your machine, real GGUF:
./scripts/bench_e2e.sh /path/to/HF/Llama-3-8B

The end-to-end script builds both a baseline-Q4_K_M and a TurboQuant-Q3_K_M GGUF and runs llama-bench on each.

Quick start

No git submodule, no manual cmake — every llama.cpp tool is forwarded into ggml-org's official Docker image, so a clean pip install is the whole setup.

# 1. Install
pip install 'turbocpp[runtime]'

# 2. Verify (downloads TinyLlama, runs a sample completion):
turbocpp quickstart

# 3. End-to-end (the SPEED path: rotated Q3_K_M ≈ baseline Q4_K_M quality).
#    Each step delegates to the right tool — no cmake, no submodule.
turbocpp rotate    ~/models/Llama-3-8B  ~/models/Llama-3-8B-tq
turbocpp convert   ~/models/Llama-3-8B-tq  --outfile Llama-3-8B-tq.gguf
turbocpp quantize  Llama-3-8B-tq.gguf  Llama-3-8B-tq-Q3_K_M.gguf  Q3_K_M
turbocpp generate  -m Llama-3-8B-tq-Q3_K_M.gguf \
                   -p "Explain Hadamard quantization in one sentence:" -n 100

# 4. Or the QUALITY path (same speed as baseline, better numbers):
turbocpp quantize  Llama-3-8B-tq.gguf  Llama-3-8B-tq-Q4_K_M.gguf  Q4_K_M

TurboQuant: the math in one block

For each linear layer y = W x in the residual stream, with H an orthogonal block-Hadamard:

W' = H · W           (output axis rotated)         ← producers
W' = W · Hᵀ          (input axis rotated)          ← consumers

We pair every producer with its consumer: tok_embed, W_o, W_down ← producers (output rotated) W_q, W_k, W_v, W_gate, W_up, lm_head ← consumers (input rotated).

Since H · Hᵀ = I, the rotations cancel through the network. Forward pass in fp32 is bit-identical. But quantization noise is computed on the ROTATED weights, whose per-block distribution is near-Gaussian thanks to the central-limit theorem — and Gaussian distributions quantize well, while heavy-tailed real LLM weights don't.

RMSNorm is rotation-equivariant only if its γ vector is uniform. Pass 1 absorbs each γ into the FOLLOWING linear (W ← W · diag(γ)) and then sets γ ← 1, after which the rotation is safe.

See turboquant/turboquant.py — 100 lines.

Tests

pytest -q turboquant/                       # rotation math + CLI parser (~65 tests)
ctest --test-dir extras/standalone/build    # standalone-engine kernels

CI runs the turboquant tests on Linux + Windows + macOS, plus builds the standalone engine and runs its unit tests.

Related work

QuaRot (Ashkboos et al. 2024)
SpinQuant (Liu et al. 2024)
GPTQ (Frantar et al. 2022) — calibration-based, complementary
AWQ (Lin et al. 2023) — activation-aware scaling, complementary

License

TurboQuant code: MIT (LICENSE)
llama.cpp (pulled at runtime via Docker, no submodule): MIT (upstream LICENSE)

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.32.0

May 7, 2026

0.31.0

May 7, 2026

0.30.1

May 7, 2026

0.30.0

May 7, 2026

0.29.0

May 7, 2026

0.28.0

May 7, 2026

0.27.0

May 7, 2026

0.26.1

May 7, 2026

0.26.0

May 7, 2026

This version

0.25.0

May 7, 2026

0.24.0

May 7, 2026

0.23.0

May 7, 2026

0.22.0

May 7, 2026

0.21.0

May 7, 2026

0.20.0

May 7, 2026

0.19.0

May 7, 2026

0.18.0

May 7, 2026

0.17.0

May 7, 2026

0.16.0

May 7, 2026

0.15.0

May 7, 2026

0.14.0

May 7, 2026

0.13.0

May 7, 2026

0.12.0

May 7, 2026

0.11.4

May 6, 2026

0.11.3

May 6, 2026

0.11.1

May 6, 2026

0.11.0

May 6, 2026

0.10.1

May 6, 2026

0.10.0

May 6, 2026

0.9.0

May 5, 2026

0.8.0

May 4, 2026

0.7.0

May 3, 2026

0.6.1

May 3, 2026

0.6.0

May 3, 2026

0.5.0

May 2, 2026

0.4.0

May 2, 2026

0.3.1

May 2, 2026

0.3.0

May 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbocpp-0.25.0.tar.gz (52.8 kB view details)

Uploaded May 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

turbocpp-0.25.0-py3-none-any.whl (56.5 kB view details)

Uploaded May 7, 2026 Python 3

File details

Details for the file turbocpp-0.25.0.tar.gz.

File metadata

Download URL: turbocpp-0.25.0.tar.gz
Upload date: May 7, 2026
Size: 52.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for turbocpp-0.25.0.tar.gz
Algorithm	Hash digest
SHA256	`fd1b6bb968eee558c679362bfa1a2c26a5e0b0485da6c0b50657e20748aa045c`
MD5	`2743ef41a7e9fa6f20cb41c39a17108f`
BLAKE2b-256	`64f93672d5232f788555045a2707fa27312677b87f90c093375691d85ba63752`

See more details on using hashes here.

File details

Details for the file turbocpp-0.25.0-py3-none-any.whl.

File metadata

Download URL: turbocpp-0.25.0-py3-none-any.whl
Upload date: May 7, 2026
Size: 56.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for turbocpp-0.25.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f331743ab9a404699f95f6d937710dd02050ec406d708484369ac96fff21f7f`
MD5	`5e33f2c9bfcd0dfeb0cb7bb09a05fa71`
BLAKE2b-256	`a0ab4913b7fb41610822d3fc878e5966364b00dbc81ac0f2f5b03d1b44caf701`

See more details on using hashes here.

turbocpp 0.25.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

turbocpp

Install

Security

Get actual speedups, not just better quality

Docker

Layout

Why "llama.cpp + TurboQuant"

Does this actually run faster than stock llama.cpp?

Same bit-width: NO

Drop a bit-width tier: YES

KV cache: also YES (long context)

Reproduce the numbers

Quick start

TurboQuant: the math in one block

Tests

Related work

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes