fitz-tec

Temporal Expert Caching (TEC) — run Mixture-of-Experts models that overflow your GPU's VRAM up to 6× faster on consumer hardware.

pip install fitz-tec
fitz bench path/to/model.gguf

fitz is a one-command CLI that wraps a patched llama.cpp build implementing TEC: a per-layer GPU cache keyed on the router's own top-K decisions. It turns the --n-cpu-moe overflow path from a streaming bottleneck into a resident cache — for free, with byte-identical output on most models.

Quick start

Two commands. No build tools, no cmake, no patch-apply, no CUDA toolkit install. fitz downloads a prebuilt patched llama.cpp on first run.

# 1. Install fitz (Python ≥ 3.10, ~1 MB):
pip install fitz-tec

# 2. Run a benchmark on any GGUF MoE model:
fitz bench path/to/your-model.gguf

The first invocation will:

  1. Read the GGUF header to auto-detect architecture, expert count, and top-K routing, and auto-pick a sensible cache capacity (C ≈ 0.75 × n_expert).
  2. Download the prebuilt patched llama.cpp archive for your platform from GitHub Releases (~200 MB Linux / ~770 MB Windows), verify its SHA256, and cache it under ~/.fitz-tec/cache/.
  3. Run llama-bench twice — once with TEC disabled for the baseline, once with TEC enabled — and print a colored report with throughput bars, hardware info, and a speedup panel.

Subsequent runs reuse the cached binary and take seconds to start.
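The auto-pick in step 1 boils down to the documented C ≈ 0.75 × n_expert rule. A minimal sketch (the function name and the floor-at-one clamp are illustrative, not fitz's actual code):

```python
def pick_capacity(n_expert: int, frac: float = 0.75) -> int:
    """Default TEC cache capacity: ~75% of the routed experts per layer,
    floored at 1. Mirrors the documented C ~ 0.75 * n_expert heuristic."""
    return max(1, int(n_expert * frac))

# e.g. a layer with 128 routed experts gets a 96-slot cache:
print(pick_capacity(128))  # → 96
```

You can always override the pick with `-c` / `--capacity`, as the recipes below do.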

A few recipes

# Paper headline — Qwen3.5-35B-A3B Q8 at its peak-throughput capacity:
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 224 --reps 30

# Long context (64k decode, flash attention on by default):
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 192 --reps 10 --extra "-d 64000"

# TEC unified memory — model larger than host RAM alone, but fits in RAM + VRAM combined.
# Step 1: calibrate a frequency-pinned hot set (~1 min, once per arch):
fitz calibrate big-model.gguf               # writes big-model.pinlist

# Step 2: run the bench with the pin list (auto-enables unified mode):
fitz bench big-model.gguf --pin-list big-model.pinlist \
    -c 125 -P 85 -b 512 -ub 256

# Mixtral (negative control — TEC's preconditions don't hold here):
fitz bench Mixtral-8x7B-Instruct-v0.1.Q6_K.gguf

Requirements

  • Python ≥ 3.10
  • An NVIDIA GPU (Ampere / Ada / Blackwell — compute capability 80, 86, 89, or 120) with a driver supporting CUDA 12.4 or newer
  • Linux x86_64 (glibc ≥ 2.35) or Windows x86_64
  • A GGUF MoE model that fits in host RAM, or — with --unified — fits in host RAM + GPU VRAM combined

Troubleshooting

  • "Patched llama.cpp binary not found" — the release download failed. Check your network, or build from source (see docs/building-llama-cpp.md) and set FITZ_TEC_BINARY=/path/to/your/llama-bench to skip the download.
  • TEC throughput collapses to ~10 t/s — you pushed -c past the VRAM cliff. Back off by 5–10%, see docs/capacity-tuning.md.
  • "Preconditions not met" warning on Mixtral-class models — expected, not a bug. TEC needs ≥32 experts per layer to deliver a speedup.
  • Any other failure — the full troubleshooting guide is at docs/troubleshooting.md.
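The Mixtral-class "preconditions not met" warning reduces to a one-line check on the GGUF metadata (a sketch of the documented ≥32-expert rule; the function name is illustrative):

```python
def tec_preconditions_ok(n_expert: int) -> bool:
    """The documented TEC precondition: at least 32 routed experts per
    layer. Few-large-expert MoEs like Mixtral (8 experts) fail it, so
    TEC warns and is expected to be slower there, not faster."""
    return n_expert >= 32

print(tec_preconditions_ok(128))  # True  — many-small-expert MoE
print(tec_preconditions_ok(8))    # False — Mixtral-class
```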

Why it exists

Modern MoE models (Qwen3.5-A3B, Qwen3-Coder-Next, GLM-4.7, Nemotron, gpt-oss, Mixtral, …) activate only a few billion parameters per token, but every expert must still be reachable because any of them can be selected on the next token. On a 32 GB consumer GPU, a 34–80 GiB MoE model does not fit — and llama.cpp's standard workaround, --n-cpu-moe, streams the activated experts over PCIe on every decoded token. That path is bandwidth-stalled, not compute-stalled: an RTX 5090 backed by 48 GiB of DDR5-6000 host RAM gets ~25–30 tok/sec on a 34 GiB MoE it could otherwise run at 200+ tok/sec.

The key observation: MoE routers do not route tokens independently. At a fixed layer, the top-K experts selected at token t+1 overlap heavily with the experts selected at tokens t, t−1, …, t−W. On Qwen3.5-35B-A3B we measured 84% overlap at W=8 — an order of magnitude above the 3% expected for uniformly random routing. A small per-layer GPU cache that keeps the experts recently used at each layer eliminates most of the CPU→GPU DMAs, collapsing the bandwidth wall.
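The overlap statistic is easy to reproduce on a routing trace. A simplified sketch — it measures pairwise top-K overlap at a fixed lag rather than the paper's full rolling-window statistic, and the expert/top-K counts (256 experts, top-8) are illustrative, not Qwen3.5's actual configuration:

```python
import random

def mean_topk_overlap(trace, lag=1):
    """Average fraction of token t's top-K experts that also appeared
    in the top-K at token t-lag, for one layer's routing trace
    (a list of per-token top-K expert-ID sets)."""
    fracs = [len(trace[t] & trace[t - lag]) / len(trace[t])
             for t in range(lag, len(trace))]
    return sum(fracs) / len(fracs)

# Uniformly random routing baseline: expected overlap is K/N, i.e.
# about 3% for 256 experts with top-8 routing (illustrative numbers).
random.seed(0)
trace = [frozenset(random.sample(range(256), 8)) for _ in range(4000)]
print(f"{mean_topk_overlap(trace):.3f}")
```

Real router traces sit an order of magnitude above this random baseline, which is the temporal signal the cache exploits.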

Across 11 MoE models spanning 7 architectures and 6 providers (Qwen, Zhipu, NVIDIA, Google, OpenAI, Mistral), TEC delivers 1.85×–6.44× peak speedup over the best non-TEC baseline on the same hardware, with byte-identical greedy output on 9 of 11.

Two demos that bracket the domain

Qwen3-Coder-Next-80B IQ4_XS (42 GiB, overflows 32 GB VRAM):

110 tok/sec on a single RTX 5090 — autocomplete-grade throughput on a flagship 80B code model. 4.34× vs the best non-TEC baseline. No unified memory needed; the model fits in 48 GiB host RAM.

Qwen3.5-122B-A10B IQ4_XS (56 GiB, overflows even 48 GiB host RAM):

37 tok/sec via TEC unified memory, which pools host RAM and GPU VRAM into a single 80 GiB addressable expert pool. 4.29× vs the disk-spilled baseline. Without TEC this deployment requires either a dual-GPU machine or a ≥96 GiB server.

Both demonstrations use the same rolling-window cache mechanism — only the memory partition changes.
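The memory-partition arithmetic behind the two demos amounts to a simple feasibility check. A back-of-the-envelope sketch (mode names are illustrative, and KV-cache and runtime overhead are ignored):

```python
def deployment_mode(model_gib: float, ram_gib: float, vram_gib: float) -> str:
    """Which path a GGUF MoE model can take on a given box, per the
    requirements above: RAM alone for standard TEC, RAM + VRAM pooled
    for TEC unified memory."""
    if model_gib <= vram_gib:
        return "fully resident in VRAM (TEC unnecessary)"
    if model_gib <= ram_gib:
        return "standard TEC (experts in RAM, cache in VRAM)"
    if model_gib <= ram_gib + vram_gib:
        return "TEC unified memory (pooled RAM + VRAM)"
    return "does not fit (disk spill)"

# The two demos on a 32 GB RTX 5090 + 48 GiB RAM workstation:
print(deployment_mode(42, 48, 32))  # Coder-Next 80B: standard TEC
print(deployment_mode(56, 48, 32))  # Qwen3.5-122B: unified memory
```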

The optimization chain

TEC headline speedup across optimization phases

Qwen3.5-35B-A3B Q8 (34 GiB) on a single RTX 5090 + 48 GiB DDR5-6000. The locked baseline of 24 tok/sec is the naive -ncmoe path. Each bar adds one optimization; by the final C=224 configuration, every decoded token hits a 99.5%-warm per-layer cache with no DMA stalls — 6.21× peak vs baseline.

Results across 10 MoE models

All measurements on a single RTX 5090 + 48 GiB DDR5-6000 workstation at -n 128 -fa 1 (128-token generation, flash attention on). "Baseline" is the best non-TEC configuration on the same hardware — max of optimal partial -ngl and -ncmoe 99. "TEC" is peak tg128. Speedup is TEC / Baseline. See the paper for methodology and the full generalization study.

| Model (quant) | Arch | Size | Baseline (t/s) | TEC (t/s) | Speedup | Hit % |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-122B-A10B · Q2_K_XL | qwen35moe | 39 GiB | 10.8 | 69.4 | 6.44× | 97.1 |
| Qwen3.5-35B-A3B · Q8_0 | qwen35moe | 34 GiB | 30.8 | 136.0 | 4.42× | 99.5 |
| Qwen3-Coder-Next-80B · IQ4_XS | qwen3next | 46 GiB | 27.4 | 109.9 | 4.34× | 98.9 |
| GPT-OSS-120B-REAP-58B · Q4_K_S (°) | gpt-oss | 39 GiB | 46.7 | 126.6 | 2.71× | 99.2 |
| GLM-4.7-Flash · Q8_K_XL | deepseek2 | 34 GiB | 34.2 | 91.3 | 2.67× | 99.2 |
| gemma-4-26B-A4B · BF16 (¶) | gemma4 | 47 GiB | 16.2 | 37.4 | 2.31× | 95.8 |
| Nemotron-Cascade-2-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 99.5 | 196.0 | 1.97× | 99.8 |
| Nemotron-3-Nano-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 103.5 | 191.4 | 1.85× | 99.8 |
| Mixtral-8x7B · Q6_K (#) | llama | 38 GiB | 19.1 | 12.0 | 0.63× | 94.6 |
| *TEC unified memory (◇)* | | | | | | |
| Qwen3.5-122B-A10B · IQ4_XS (◇) | qwen35moe | 56 GiB | 8.6 | 36.8 | 4.29× | 89.0 |

  • (¶) Byte-divergent on a small fraction of decode steps from CUDA mul_mat_id FP non-associativity — output is semantically equivalent (the technique itself is bit-exact; this is a kernel-variance artifact at low-precision K-quants).
  • (°) Same non-associativity class as (¶); also required a per-expert bias-cache extension to the cache registration API. Before that fix, gpt-oss exhibited a 12× throughput collapse and wrong output — with the fix, OpenAI's flagship open-weight MoE joins the "works" column at 126.6 t/s / 2.71× speedup.
  • (#) Negative control: Mixtral violates TEC's preconditions (only 8 experts per layer with K=2), so the cache degenerates to "pin everything" and there's no temporal signal to exploit. TEC is strictly worse on few-large-expert MoEs — this is the paper's clean limitation.
  • (◇) The 56 GiB file exceeds host RAM alone; TEC unified memory pools RAM + VRAM into an effective 80 GiB budget. The non-TEC baseline disk-spills at 8.6 t/s.

Byte-identical output on 9 of the 11 rows. The two exceptions (gemma-4 and gpt-oss) produce semantically equivalent greedy continuations under CUDA FP non-associativity rather than any wrong-math bug.

Example output

────────────────────────────────────────────────────────────────────────────────

                      Temporal Expert Caching - Benchmark

────────────────────────────────────────────────────────────────────────────────

  model       Qwen3-Coder-Next-UD-IQ4_XS  ·  38.4 GB
  gpu         32GB VRAM, RTX 5090 @1792GB/s
  memory      48GB RAM, DDR5 @6000MT
  cpu         Ryzen 5 9600
  cache       C=400

  baseline    ███████·····························    19.9 t/s  · avg  18.9
  + TEC v1.1  ████████████████████████████████████    95.7 t/s  · avg  81.9

                             ╔═══════════════════╗
                             ║                   ║
                             ║       4.81×       ║
                             ║      faster       ║
                             ║                   ║
                             ╚═══════════════════╝

            same weights  ·  byte-identical output  ·  software only

────────────────────────────────────────────────────────────────────────────────

                    github.com/yafitzdev/fitz-tec  ·  v0.1.1

────────────────────────────────────────────────────────────────────────────────

CLI reference

Full fitz bench flag list:

| flag | purpose |
| --- | --- |
| -c / --capacity | TEC cache capacity (experts per layer). Default: 75% × N |
| -n / --n-gen | Tokens to generate per bench rep (default: 128) |
| -r / --reps | Number of repetitions per condition (default: 5) |
| -b / --batch-size | Logical batch size — reduce to 512 on large models |
| -ub / --ubatch-size | Physical batch size — reduce to 256 on large models |
| --pin-list PATH | Path to a frequency-pinned hot-set file. Auto-enables TEC unified memory so the cache spans host RAM + VRAM, letting you run models larger than host RAM alone. Generate with python -m fitz_tec.tools.build_pin_list from a router trace. |
| -P / --pinned | Per-layer slots pinned permanently in VRAM (default: C/2). Only takes effect with --pin-list. |

Hardware labels on the bench report auto-probe your GPU, VRAM capacity, RAM capacity, and DDR generation. If your BIOS reports a different clock than your kit's rated speed, override with FITZ_RAM_LABEL:

setx FITZ_RAM_LABEL "DDR5 @CL36 @6000MT"   # Windows
export FITZ_RAM_LABEL="DDR5 @CL36 @6000MT" # Linux / macOS

Other env vars:

  • FITZ_TEC_BINARY — absolute path to a locally-built patched llama-bench (skips the auto-download; useful for development)
  • FITZ_TEC_RELEASE — pin a specific release tag to download from (default: binaries-v0.1.1)
  • FITZ_GPU_LABEL, FITZ_RAM_LABEL — override auto-probed hardware strings
  • NO_COLOR / FITZ_NO_COLOR — disable ANSI colors in the bench report

The CLI's --pin-list flag is a thin wrapper over the EXPERT_CACHE_PIN_LIST environment variable that the patched llama.cpp reads at load time. If you invoke llama-bench directly, set EXPERT_CACHE_ENABLE=1 EXPERT_CACHE_CAPACITY=N EXPERT_CACHE_PIN_LIST=<path>; unified mode activates automatically.
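For scripted runs, the same environment plumbing can be set up from Python. A sketch using only the EXPERT_CACHE_* variables documented above (the helper names and the extra llama-bench flags are illustrative):

```python
import os
import subprocess

def tec_env(capacity, pin_list=None):
    """Environment for invoking the patched llama-bench directly,
    bypassing the fitz CLI."""
    env = dict(os.environ,
               EXPERT_CACHE_ENABLE="1",
               EXPERT_CACHE_CAPACITY=str(capacity))
    if pin_list:
        # Presence of a pin list auto-enables unified mode.
        env["EXPERT_CACHE_PIN_LIST"] = pin_list
    return env

def bench(binary, model, capacity, pin_list=None):
    # Sketch only — add whatever llama-bench flags your run needs.
    return subprocess.run([binary, "-m", model, "-fa", "1"],
                          env=tec_env(capacity, pin_list))
```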

Status — alpha preview

The CLI, benchmark pipeline, and automatic binary download are all wired end-to-end. pip install fitz-tec && fitz bench <model.gguf> is the supported happy path. The Python package is a thin wrapper around a patched llama.cpp binary that's downloaded on first run from the GitHub Releases page — no local build required.

Two things still flagged as alpha:

  • GPU coverage is currently consumer Ampere/Ada/Blackwell only (sm_80, sm_86, sm_89, sm_120). Turing and Hopper users should build from source — see docs/building-llama-cpp.md.
  • Linux coverage is x86_64 only and assumes a distro with glibc ≥ 2.35 (Ubuntu 22.04+, Debian 12+, Fedora 37+). Older distros can build from source.

If the auto-download fails or you want to iterate on the patched binary, set FITZ_TEC_BINARY to a local build and fitz skips the download entirely.

How TEC works

flowchart LR
    Model[("GGUF model<br/>on disk")] --> RAM[Host RAM<br/>full expert pool]
    RAM -.->|"cold miss<br/>~0.5% of tokens"| Cache
    Router[[MoE router<br/>top-K per layer]] --> LRU{{Host LRU<br/>expert_id → slot_id}}
    LRU -->|hit| Cache[GPU VRAM<br/>per-layer cache<br/>C slots]
    Cache ==>|"99.5%<br/>on-chip"| MatMul[mul_mat_id<br/>slot-indexed]
    MatMul --> Tokens((generated<br/>tokens))

Three-file patch to llama.cpp:

  1. Per-layer GPU cache buffer. For each MoE layer, allocate a dense [C × expert_shape] tensor in VRAM that holds the C most-recently-used experts. A host-side LRU maps expert IDs to cache slot IDs.
  2. Router hook. After each layer's top-K decision, a small GGML_OP_EXPERT_PREFETCH op reads the top-K indices, updates the host LRU, and issues H2D DMAs for any cache-miss experts. The op sits outside CUDA graph capture via a scheduler range-break, so graph acceleration is retained for the ~40 non-prefetch ranges per forward pass.
  3. Matmul substitution. build_moe_ffn's mul_mat_id(expert_weights, x, top_k) is rewritten to mul_mat_id(cache_buffer, x, slot_ids) with remapped slot indices. Single call site in the generic MoE path — TEC works on any model routing through build_moe_ffn without per-architecture code.
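The host-side bookkeeping in steps 1–2 reduces to one small LRU per layer. A Python sketch of the idea — the real implementation lives in the C++/CUDA patch, and this class is illustrative, not fitz's code:

```python
from collections import OrderedDict

class LayerExpertCache:
    """Host-side LRU mapping expert_id → VRAM slot_id for one MoE layer.

    lookup() returns (slot_ids, misses); each miss is an
    (expert_id, slot_id) pair whose weights must be DMA'd host→GPU
    into that slot before the slot-indexed mul_mat_id runs.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # expert_id → slot_id, LRU order

    def lookup(self, top_k_ids):
        slot_ids, misses = [], []
        for eid in top_k_ids:
            if eid in self.slots:
                self.slots.move_to_end(eid)               # refresh recency
            else:
                if len(self.slots) < self.capacity:
                    slot = len(self.slots)                # fill an empty slot
                else:
                    _, slot = self.slots.popitem(last=False)  # evict LRU, reuse slot
                self.slots[eid] = slot
                misses.append((eid, slot))
            slot_ids.append(self.slots[eid])
        return slot_ids, misses

cache = LayerExpertCache(capacity=4)
print(cache.lookup([0, 1, 2]))  # all cold misses: slots 0, 1, 2
print(cache.lookup([1, 2, 3]))  # 1 and 2 hit; 3 misses into slot 3
print(cache.lookup([4]))        # cache full: evicts LRU expert 0, reuses slot 0
```

Eviction is safe as long as capacity ≥ K: experts touched earlier in the same top-K are refreshed to the most-recent end before any eviction happens.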

Pre-warm at model load copies experts 0..C−1 into the cache synchronously, so the first sample's cold-start cost is absorbed at load time rather than at inference time. On a 34 GiB Q8 model this adds ~6 seconds to model load.

For the full technique, the rolling-window locality measurements, the buffer-depth formula, and the generalization study, see the paper:

Temporal Expert Caching: Enabling Productive Inference on MoE Models That Overflow GPU VRAM. Yan Fitzner, 2026. Preprint coming soon.

Development

git clone https://github.com/yafitzdev/fitz-tec
cd fitz-tec
pip install -e '.[dev]'
pytest

The CLI is fitz_tec/cli.py, the llama.cpp wrapper is runner.py, and the screenshot-optimized report layout is display.py. Tests don't require a GPU.

Supporting the project

Research and open-source development take time. If fitz saves you a GPU upgrade, or TEC unlocks a model you couldn't otherwise run, consider sponsoring the project — it funds further research on MoE inference efficiency and keeps this work independent.

License

Apache License 2.0 — see LICENSE.
