fitz-tec
Temporal Expert Caching (TEC) — run Mixture-of-Experts models that overflow your GPU's VRAM up to 6× faster on consumer hardware.
```shell
pip install fitz-tec
fitz bench path/to/model.gguf
```
fitz is a one-command CLI that wraps a patched llama.cpp build
implementing TEC: a per-layer GPU cache keyed on the router's own top-K
decisions. It turns the --n-cpu-moe overflow path from a streaming
bottleneck into a resident cache — for free, with byte-identical output on
most models.
Quick start
Two commands. No build tools, no cmake, no patch-apply, no CUDA toolkit
install. fitz downloads a prebuilt patched llama.cpp on first run.
```shell
# 1. Install fitz (Python ≥ 3.10, ~1 MB):
pip install fitz-tec

# 2. Run a benchmark on any GGUF MoE model:
fitz bench path/to/your-model.gguf
```
The first invocation will:
- Read the GGUF header to auto-detect architecture, expert count, and top-K routing, and auto-pick a sensible cache capacity (C ≈ 0.75 × n_expert).
- Download the prebuilt patched `llama.cpp` archive for your platform from GitHub Releases (~200 MB Linux / ~770 MB Windows), verify its SHA256, and cache it under `~/.fitz-tec/cache/`.
- Run `llama-bench` twice (once with TEC disabled for the baseline, once with TEC enabled) and print a colored report with the bars, hardware info, and a big speedup panel.

Subsequent runs reuse the cached binary and start in seconds.
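The default capacity heuristic mentioned above (C ≈ 0.75 × n_expert) is simple enough to sketch. `pick_capacity` is an illustrative name, not fitz-tec's actual API:

```python
# Sketch of the auto-pick rule described above: cache capacity defaults to
# roughly 75% of the per-layer expert count read from the GGUF header.
# `pick_capacity` is a hypothetical helper for illustration only.

def pick_capacity(n_expert: int, fraction: float = 0.75) -> int:
    """Default TEC cache capacity: ~75% of the per-layer expert count."""
    return max(1, round(fraction * n_expert))

print(pick_capacity(128))  # 96 slots per layer
```

Passing `-c` explicitly (as in the recipes below) overrides this default.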
A few recipes
```shell
# Paper headline — Qwen3.5-35B-A3B Q8 at the capacity that peaks:
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 224 --reps 30

# Long context (64k decode, flash attention on by default):
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 192 --reps 10 --extra "-d 64000"

# TEC unified memory — model larger than host RAM, fits RAM + VRAM.
# Step 1: calibrate a frequency-pinned hot set (~1 min, once per arch):
fitz calibrate big-model.gguf        # writes big-model.pinlist

# Step 2: run the bench with the pin list (auto-enables unified mode):
fitz bench big-model.gguf --pin-list big-model.pinlist \
    -c 125 -P 85 -b 512 -ub 256

# Mixtral (negative control — TEC's preconditions don't hold here):
fitz bench Mixtral-8x7B-Instruct-v0.1.Q6_K.gguf
```
Requirements
- Python ≥ 3.10
- An NVIDIA GPU (Ampere / Ada / Blackwell — compute capability 80, 86, 89, or 120) with a driver supporting CUDA 12.4 or newer
- Linux x86_64 (glibc ≥ 2.35) or Windows x86_64
- A GGUF MoE model that fits in host RAM, or, with `--unified`, fits in host RAM + GPU VRAM combined
Troubleshooting
- "Patched llama.cpp binary not found": the release download failed. Check your network, or build from source (see `docs/building-llama-cpp.md`) and set `FITZ_TEC_BINARY=/path/to/your/llama-bench` to skip the download.
- TEC throughput collapses to ~10 t/s: you pushed `-c` past the VRAM cliff. Back off by 5–10%; see `docs/capacity-tuning.md`.
- "Preconditions not met" warning on Mixtral-class models: expected, not a bug. TEC needs ≥32 experts per layer to deliver a speedup.
- Any other failure: the full troubleshooting guide is at `docs/troubleshooting.md`.
Why it exists
Modern MoE models (Qwen3.5-A3B, Qwen3-Coder-Next, GLM-4.7, Nemotron, gpt-oss,
Mixtral, …) activate only a few billion parameters per token, but every
expert must still be reachable because any of them can be selected on the
next token. On a 32 GB consumer GPU, a 34–80 GiB MoE model does not fit —
and llama.cpp's standard workaround, --n-cpu-moe, streams the activated
experts over PCIe on every decoded token. That path is bandwidth-stalled,
not compute-stalled: an RTX 5090 paired with 48 GiB of DDR5-6000 host RAM
gets ~25–30 tok/sec on a 34 GiB MoE it could otherwise run at 200+ tok/sec.
The key observation: MoE routers do not route tokens independently. At a fixed layer, the top-K experts selected at token t+1 overlap heavily with the experts selected at tokens t, t−1, …, t−W. On Qwen3.5-35B-A3B we measured 84% overlap at W=8 — an order of magnitude above the 3% expected for uniformly random routing. A small per-layer GPU cache that keeps the experts recently used at each layer eliminates most of the CPU→GPU DMAs, collapsing the bandwidth wall.
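The rolling-window overlap statistic above can be sketched directly. The function below computes it from a router trace (one top-K expert set per decoded token at a fixed layer); the trace generator is purely synthetic, built to mimic the hot-subset locality that real routers exhibit, so the printed number is illustrative rather than a reproduction of the 84% figure:

```python
# Sketch of the W-token rolling-window overlap measurement described above,
# run on a synthetic router trace. The biased trace generator is an
# illustration of temporal locality, not real router data.
import random

def window_overlap(trace: list[set[int]], W: int) -> float:
    """Mean fraction of token t's top-K experts already routed in t-1..t-W."""
    hits, total = 0, 0
    for t in range(W, len(trace)):
        recent = set().union(*trace[t - W:t])   # experts seen in the window
        hits += len(trace[t] & recent)
        total += len(trace[t])
    return hits / total

# Synthetic trace: 128 experts, top-8 routing, biased toward a small "hot"
# subset (real routers are far from uniform, which is TEC's whole premise).
random.seed(0)
hot = list(range(16))
trace = [set(random.sample(hot, 6) + random.sample(range(16, 128), 2))
         for _ in range(512)]
print(f"overlap at W=8: {window_overlap(trace, 8):.0%}")
```

For uniformly random routing the same function returns a number near the small-overlap baseline the text quotes; the gap between the two is the temporal signal the cache exploits.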
Across 11 MoE models spanning 7 architectures and 6 providers (Qwen, Zhipu, NVIDIA, Google, OpenAI, Mistral), TEC delivers 1.85×–6.44× peak speedup over the best non-TEC baseline on the same hardware, with byte-identical greedy output on 9 of 11.
Two demos that bracket the domain
Qwen3-Coder-Next-80B IQ4_XS (42 GiB, overflows 32 GB VRAM):
110 tok/sec on a single RTX 5090: autocomplete-grade throughput on a flagship 80B-parameter code model. 4.34× vs the best non-TEC baseline. No unified memory needed; the model fits in 48 GiB host RAM.
Qwen3.5-122B-A10B IQ4_XS (56 GiB, overflows even 48 GiB host RAM):
37 tok/sec via TEC unified memory, which pools host RAM and GPU VRAM into a single 80 GiB addressable expert pool. 4.29× vs the disk-spilled baseline. Without TEC this deployment requires either a dual-GPU machine or a ≥96 GiB server.
Both demonstrations use the same rolling-window cache mechanism — only the memory partition changes.
The optimization chain
Qwen3.5-35B-A3B Q8 (34 GiB) on a single RTX 5090 + 48 GiB DDR5-6000.
The locked baseline of 24 tok/sec is the naive -ncmoe path. Each bar
adds one optimization; by the final C=224 configuration, every
decoded token hits a 99.5%-warm per-layer cache with no DMA stalls —
6.21× peak vs baseline.
Results across 10 MoE models
All measurements on a single RTX 5090 + 48 GiB DDR5-6000 workstation
at -n 128 -fa 1 (128-token generation, flash attention on). "Baseline"
is the best non-TEC configuration on the same hardware — max of
optimal partial -ngl and -ncmoe 99. "TEC" is peak tg128. Speedup
is TEC / Baseline. See the paper for methodology and the full
generalization study.
| Model (quant) | Arch | Size | Baseline (t/s) | TEC (t/s) | Speedup | Hit % |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B · Q2_K_XL | qwen35moe | 39 GiB | 10.8 | 69.4 | 6.44× | 97.1 |
| Qwen3.5-35B-A3B · Q8_0 | qwen35moe | 34 GiB | 30.8 | 136.0 | 4.42× | 99.5 |
| Qwen3-Coder-Next-80B · IQ4_XS | qwen3next | 46 GiB | 27.4 | 109.9 | 4.34× | 98.9 |
| GPT-OSS-120B-REAP-58B · Q4_K_S (°) | gpt-oss | 39 GiB | 46.7 | 126.6 | 2.71× | 99.2 |
| GLM-4.7-Flash · Q8_K_XL | deepseek2 | 34 GiB | 34.2 | 91.3 | 2.67× | 99.2 |
| gemma-4-26B-A4B · BF16 (¶) | gemma4 | 47 GiB | 16.2 | 37.4 | 2.31× | 95.8 |
| Nemotron-Cascade-2-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 99.5 | 196.0 | 1.97× | 99.8 |
| Nemotron-3-Nano-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 103.5 | 191.4 | 1.85× | 99.8 |
| Mixtral-8x7B · Q6_K (#) | llama | 38 GiB | 19.1 | 12.0 | 0.63× | 94.6 |
| TEC unified memory (◇) | | | | | | |
| Qwen3.5-122B-A10B · IQ4_XS (◇) | qwen35moe | 56 GiB | 8.6 | 36.8 | 4.29× | 89.0 |
- (¶) Byte-divergent on a small fraction of decode steps from CUDA `mul_mat_id` FP non-associativity; output is semantically equivalent. The technique itself is bit-exact; this is a kernel-variance artifact at low-precision K-quants.
- (°) Same non-associativity class as (¶); also required a per-expert bias-cache extension to the cache registration API. Before that fix, gpt-oss exhibited a 12× throughput collapse and wrong output; with the fix, OpenAI's flagship open-weight MoE joins the "works" column at 126.6 t/s / 2.71× speedup.
- (#) Negative control: Mixtral violates TEC's preconditions (only 8 experts per layer with K=2), so the cache degenerates to "pin everything" and there is no temporal signal to exploit. TEC is strictly worse on few-large-expert MoEs; this is the paper's clean limitation.
- (◇) The 56 GiB file exceeds host RAM alone; TEC unified memory pools RAM + VRAM into an effective 80 GiB budget. The non-TEC baseline disk-spills at 8.6 t/s.
Byte-identical output on 9 of the 11 rows. The two exceptions (gemma-4 and gpt-oss) produce semantically equivalent greedy continuations under CUDA FP non-associativity rather than any wrong-math bug.
Example output
```
────────────────────────────────────────────────────────────────────────────────
 Temporal Expert Caching - Benchmark
────────────────────────────────────────────────────────────────────────────────
 model      Qwen3-Coder-Next-UD-IQ4_XS · 38.4 GB
 gpu        32GB VRAM, RTX 5090 @1792GB/s
 memory     48GB RAM, DDR5 @6000MT
 cpu        Ryzen 5 9600
 cache      C=400

 baseline    ███████····························· 19.9 t/s · avg 18.9
 + TEC v1.1  ████████████████████████████████████ 95.7 t/s · avg 81.9

                          ╔═══════════════════╗
                          ║                   ║
                          ║       4.81×       ║
                          ║      faster       ║
                          ║                   ║
                          ╚═══════════════════╝

           same weights · byte-identical output · software only
────────────────────────────────────────────────────────────────────────────────
 github.com/yafitzdev/fitz-tec · v0.1.1
────────────────────────────────────────────────────────────────────────────────
```
CLI reference
Full `fitz bench` flag list:

| flag | purpose |
|---|---|
| `-c` / `--capacity` | TEC cache capacity (experts per layer). Default: 75% × N |
| `-n` / `--n-gen` | Tokens to generate per bench rep (default: 128) |
| `-r` / `--reps` | Number of repetitions per condition (default: 5) |
| `-b` / `--batch-size` | Logical batch size; reduce to 512 on large models |
| `-ub` / `--ubatch-size` | Physical batch size; reduce to 256 on large models |
| `--pin-list PATH` | Path to a frequency-pinned hot-set file. Auto-enables TEC unified memory so the cache spans host RAM + VRAM, letting you run models larger than host RAM alone. Generate with `python -m fitz_tec.tools.build_pin_list` from a router trace. |
| `-P` / `--pinned` | Per-layer slots pinned permanently in VRAM (default: C/2). Only takes effect with `--pin-list`. |
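The frequency-pinning idea behind `--pin-list` is easy to sketch. The real tool is `python -m fitz_tec.tools.build_pin_list`; the trace format below (layer → list of per-token top-K expert IDs) and the function name are assumptions for illustration, not fitz-tec's actual on-disk format or API:

```python
# Minimal sketch of frequency-based pinning: for each layer, count how often
# each expert appears in the router's top-K decisions and pin the most
# frequent ones. Trace format and function name are hypothetical.
from collections import Counter

def build_pin_list(trace: dict[int, list[list[int]]],
                   pinned_per_layer: int) -> dict[int, list[int]]:
    """Per layer, return the `pinned_per_layer` most frequently routed experts."""
    pins = {}
    for layer, topk_per_token in trace.items():
        counts = Counter(e for topk in topk_per_token for e in topk)
        pins[layer] = [e for e, _ in counts.most_common(pinned_per_layer)]
    return pins

# Toy trace: one layer, expert 3 dominates the routing decisions.
trace = {0: [[3, 7], [3, 1], [3, 7], [2, 3]]}
print(build_pin_list(trace, 2))  # {0: [3, 7]}
```

Pinned experts never get evicted, which is what makes the unified-memory path viable: the hottest experts stay in VRAM while the long tail lives in host RAM.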
Hardware labels on the bench report auto-probe your GPU, VRAM capacity,
RAM capacity, and DDR generation. If your BIOS reports a different clock
than your kit's rated speed, override with FITZ_RAM_LABEL:
```shell
setx FITZ_RAM_LABEL "DDR5 @CL36 @6000MT"            # Windows
export FITZ_RAM_LABEL="DDR5 @CL36 @6000MT"          # Linux / macOS
```
Other env vars:
- `FITZ_TEC_BINARY`: absolute path to a locally-built patched `llama-bench` (skips the auto-download; useful for development)
- `FITZ_TEC_RELEASE`: pin a specific release tag to download from (default: `binaries-v0.1.1`)
- `FITZ_GPU_LABEL`, `FITZ_RAM_LABEL`: override auto-probed hardware strings
- `NO_COLOR` / `FITZ_NO_COLOR`: disable ANSI colors in the bench report
The CLI's --pin-list flag is a thin wrapper over the EXPERT_CACHE_PIN_LIST
environment variable that the patched llama.cpp reads at load time. If you
invoke llama-bench directly, set
EXPERT_CACHE_ENABLE=1 EXPERT_CACHE_CAPACITY=N EXPERT_CACHE_PIN_LIST=<path>;
unified mode activates automatically.
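The direct-invocation path above can be sketched as a shell snippet. Only the environment variable names come from the text; the capacity value and the commented `llama-bench` invocation are illustrative:

```shell
# Environment variables read by the patched llama.cpp at load time,
# as described above. Capacity value is illustrative.
export EXPERT_CACHE_ENABLE=1
export EXPERT_CACHE_CAPACITY=224
export EXPERT_CACHE_PIN_LIST=./big-model.pinlist   # optional; auto-enables unified mode
# Then invoke your patched build directly, e.g.:
# ./llama-bench -m big-model.gguf
```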
Status — alpha preview
The CLI, benchmark pipeline, and automatic binary download are all
wired end-to-end. pip install fitz-tec && fitz bench <model.gguf>
is the supported happy path. The Python package is a thin wrapper
around a patched llama.cpp binary that's downloaded on first run
from the GitHub Releases page — no local build required.
Two things still flagged as alpha:
- GPU coverage is currently consumer Ampere/Ada/Blackwell only (`sm_80`, `sm_86`, `sm_89`, `sm_120`). Turing and Hopper users should build from source; see `docs/building-llama-cpp.md`.
- Linux coverage is x86_64 only and assumes a distro with glibc ≥ 2.35 (Ubuntu 22.04+, Debian 12+, Fedora 37+). Older distros can build from source.
If the auto-download fails or you want to iterate on the patched
binary, set FITZ_TEC_BINARY to a local build and fitz skips the
download entirely.
How TEC works
```mermaid
flowchart LR
  Model[("GGUF model<br/>on disk")] --> RAM[Host RAM<br/>full expert pool]
  RAM -.->|"cold miss<br/>~0.5% of tokens"| Cache
  Router[[MoE router<br/>top-K per layer]] --> LRU{{Host LRU<br/>expert_id → slot_id}}
  LRU -->|hit| Cache[GPU VRAM<br/>per-layer cache<br/>C slots]
  Cache ==>|"99.5%<br/>on-chip"| MatMul[mul_mat_id<br/>slot-indexed]
  MatMul --> Tokens((generated<br/>tokens))
```
Three-file patch to llama.cpp:
- Per-layer GPU cache buffer. For each MoE layer, allocate a dense `[C × expert_shape]` tensor in VRAM that holds the C most-recently-used experts. A host-side LRU maps expert IDs to cache slot IDs.
- Router hook. After each layer's top-K decision, a small `GGML_OP_EXPERT_PREFETCH` op reads the top-K indices, updates the host LRU, and issues H2D DMAs for any cache-miss experts. The op sits outside CUDA graph capture via a scheduler range-break, so graph acceleration is retained for the ~40 non-prefetch ranges per forward pass.
- Matmul substitution. `build_moe_ffn`'s `mul_mat_id(expert_weights, x, top_k)` is rewritten to `mul_mat_id(cache_buffer, x, slot_ids)` with remapped slot indices. Single call site in the generic MoE path, so TEC works on any model routing through `build_moe_ffn` without per-architecture code.
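The host-side LRU at the heart of the first two steps can be sketched in a few lines. This is an illustrative Python model of the mechanism, not the patch's C++ code; the H2D DMA is stubbed as a counter:

```python
# Sketch of the host-side LRU described above: a per-layer map from
# expert_id -> slot_id over C VRAM slots. On a miss, the least-recently-used
# slot is evicted and its slot reused; the real patch would issue an H2D DMA
# at that point (stubbed here). Illustrative model, not the actual patch.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = OrderedDict()   # expert_id -> slot_id, in LRU order
        self.h2d_copies = 0          # stands in for real DMA traffic

    def lookup(self, expert_id: int) -> int:
        """Return the VRAM slot holding an expert, filling it on a miss."""
        if expert_id in self.slots:              # hit: refresh recency
            self.slots.move_to_end(expert_id)
            return self.slots[expert_id]
        if len(self.slots) < self.capacity:      # cold fill: next free slot
            slot = len(self.slots)
        else:                                    # evict LRU, reuse its slot
            _, slot = self.slots.popitem(last=False)
        self.h2d_copies += 1                     # would issue the H2D DMA here
        self.slots[expert_id] = slot
        return slot

cache = ExpertCache(capacity=2)
slots = [cache.lookup(e) for e in [5, 9, 5, 7, 5]]
print(slots, cache.h2d_copies)  # [0, 1, 0, 1, 0] 3
```

The remapped `slots` list is what the substituted `mul_mat_id(cache_buffer, x, slot_ids)` call consumes: expert IDs never reach the kernel, only cache slot indices do.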
Pre-warm at model load copies experts 0..C−1 into the cache synchronously
so sample-1 cold-start cost is absorbed at load time, not at inference
time. On a 34 GiB Q8 model this adds ~6 seconds to model load.
For the full technique, the rolling-window locality measurements, the buffer-depth formula, and the generalization study, see the paper:
Temporal Expert Caching: Enabling Productive Inference on MoE Models That Overflow GPU VRAM. Yan Fitzner, 2026. Preprint coming soon.
Development
```shell
git clone https://github.com/yafitzdev/fitz-tec
cd fitz-tec
pip install -e '.[dev]'
pytest
```
The CLI is fitz_tec/cli.py, the llama.cpp wrapper is runner.py, and
the screenshot-optimized report layout is display.py. Tests don't
require a GPU.
Supporting the project
Research and open-source development take time. If fitz saves you a GPU
upgrade, or TEC unlocks a model you couldn't otherwise run, consider
sponsoring the project —
it funds further research on MoE inference efficiency and keeps this work
independent.
License
Apache License 2.0 — see LICENSE.