fitz-tec

Temporal Expert Caching (TEC) — run Mixture-of-Experts models that overflow your GPU's VRAM up to 6× faster on consumer hardware.

pip install fitz-tec
fitz bench path/to/model.gguf

fitz is a one-command CLI that wraps a patched llama.cpp build implementing TEC: a per-layer GPU cache keyed on the router's own top-K decisions. It turns the --n-cpu-moe overflow path from a streaming bottleneck into a resident cache — for free, with byte-identical output on most models.

Quick start

Two commands. No build tools, no cmake, no patch-apply, no CUDA toolkit install. fitz downloads a prebuilt patched llama.cpp on first run.

# 1. Install fitz (Python ≥ 3.10, ~1 MB):
pip install fitz-tec

# 2. Run a benchmark on any GGUF MoE model:
fitz bench path/to/your-model.gguf

The first invocation will:

  1. Read the GGUF header to auto-detect architecture, expert count, and top-K routing, and auto-pick a sensible cache capacity (C ≈ 0.75 × n_expert).
  2. Download the prebuilt patched llama.cpp archive for your platform from GitHub Releases (~200 MB Linux / ~770 MB Windows), verify its SHA256, and cache it under ~/.fitz-tec/cache/.
  3. Run llama-bench twice — once with TEC disabled for the baseline, once with TEC enabled — and print a colored report with throughput bars, hardware info, and a speedup panel.

Subsequent runs reuse the cached binary and take seconds to start.
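The auto-pick in step 1 boils down to the documented C ≈ 0.75 × n_expert rule. A minimal sketch (the function name and the floor-at-one clamp are illustrative, not fitz's actual code):

```python
def pick_capacity(n_expert: int, frac: float = 0.75) -> int:
    """Default TEC cache capacity: ~75% of the routed experts per layer,
    floored at 1. Mirrors the documented C ~ 0.75 * n_expert heuristic."""
    return max(1, int(n_expert * frac))

# e.g. a layer with 128 routed experts gets a 96-slot cache:
print(pick_capacity(128))  # → 96
```

You can always override the pick with `-c` / `--capacity`, as the recipes below do.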

A few recipes

# Paper headline — Qwen3.5-35B-A3B Q8 at its peak-throughput capacity:
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 224 --reps 30

# Long context (64k decode, flash attention on by default):
fitz bench Qwen3.5-35B-A3B-Q8_0.gguf -c 192 --reps 10 --extra "-d 64000"

# TEC unified memory — model larger than host RAM alone, but fits in RAM + VRAM combined.
# Step 1: calibrate a frequency-pinned hot set (~1 min, once per arch):
fitz calibrate big-model.gguf               # writes big-model.pinlist

# Step 2: run the bench with the pin list (auto-enables unified mode):
fitz bench big-model.gguf --pin-list big-model.pinlist \
    -c 125 -P 85 -b 512 -ub 256

# Mixtral (negative control — TEC's preconditions don't hold here):
fitz bench Mixtral-8x7B-Instruct-v0.1.Q6_K.gguf

Requirements

  • Python ≥ 3.10
  • An NVIDIA GPU (Ampere / Ada / Blackwell — compute capability 80, 86, 89, or 120) with a driver supporting CUDA 12.4 or newer
  • Linux x86_64 (glibc ≥ 2.35) or Windows x86_64
  • A GGUF MoE model that fits in host RAM, or — with --unified — fits in host RAM + GPU VRAM combined

Troubleshooting

  • "Patched llama.cpp binary not found" — the release download failed. Check your network, or build from source (see docs/building-llama-cpp.md) and set FITZ_TEC_BINARY=/path/to/your/llama-bench to skip the download.
  • TEC throughput collapses to ~10 t/s — you pushed -c past the VRAM cliff. Back off by 5–10%, see docs/capacity-tuning.md.
  • "Preconditions not met" warning on Mixtral-class models — expected, not a bug. TEC needs ≥32 experts per layer to deliver a speedup.
  • Any other failure — the full troubleshooting guide is at docs/troubleshooting.md.
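The Mixtral-class "preconditions not met" warning reduces to a one-line check on the GGUF metadata (a sketch of the documented ≥32-expert rule; the function name is illustrative):

```python
def tec_preconditions_ok(n_expert: int) -> bool:
    """The documented TEC precondition: at least 32 routed experts per
    layer. Few-large-expert MoEs like Mixtral (8 experts) fail it, so
    TEC warns and is expected to be slower there, not faster."""
    return n_expert >= 32

print(tec_preconditions_ok(128))  # True  — many-small-expert MoE
print(tec_preconditions_ok(8))    # False — Mixtral-class
```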

Why it exists

Modern MoE models (Qwen3.5-A3B, Qwen3-Coder-Next, GLM-4.7, Nemotron, gpt-oss, Mixtral, …) activate only a few billion parameters per token, but every expert must still be reachable because any of them can be selected on the next token. On a 32 GB consumer GPU, a 34–80 GiB MoE model does not fit — and llama.cpp's standard workaround, --n-cpu-moe, streams the activated experts over PCIe on every decoded token. That path is bandwidth-stalled, not compute-stalled: an RTX 5090 backed by 48 GiB of DDR5-6000 host RAM gets ~25–30 tok/sec on a 34 GiB MoE it could otherwise run at 200+ tok/sec.

The key observation: MoE routers do not route tokens independently. At a fixed layer, the top-K experts selected at token t+1 overlap heavily with the experts selected at tokens t, t−1, …, t−W. On Qwen3.5-35B-A3B we measured 84% overlap at W=8 — an order of magnitude above the 3% expected for uniformly random routing. A small per-layer GPU cache that keeps the experts recently used at each layer eliminates most of the CPU→GPU DMAs, collapsing the bandwidth wall.
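The overlap statistic is easy to reproduce on a routing trace. A simplified sketch — it measures pairwise top-K overlap at a fixed lag rather than the paper's full rolling-window statistic, and the expert/top-K counts (256 experts, top-8) are illustrative, not Qwen3.5's actual configuration:

```python
import random

def mean_topk_overlap(trace, lag=1):
    """Average fraction of token t's top-K experts that also appeared
    in the top-K at token t-lag, for one layer's routing trace
    (a list of per-token top-K expert-ID sets)."""
    fracs = [len(trace[t] & trace[t - lag]) / len(trace[t])
             for t in range(lag, len(trace))]
    return sum(fracs) / len(fracs)

# Uniformly random routing baseline: expected overlap is K/N, i.e.
# about 3% for 256 experts with top-8 routing (illustrative numbers).
random.seed(0)
trace = [frozenset(random.sample(range(256), 8)) for _ in range(4000)]
print(f"{mean_topk_overlap(trace):.3f}")
```

Real router traces sit an order of magnitude above this random baseline, which is the temporal signal the cache exploits.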

Across 11 MoE models spanning 7 architectures and 6 providers (Qwen, Zhipu, NVIDIA, Google, OpenAI, Mistral), TEC delivers 1.85×–6.44× peak speedup over the best non-TEC baseline on the same hardware, with byte-identical greedy output on 9 of 11.

Two demos that bracket the domain

Qwen3-Coder-Next-80B IQ4_XS (42 GiB, overflows 32 GB VRAM):

110 tok/sec on a single RTX 5090 — autocomplete-grade throughput on a flagship 80B code model. 4.34× vs the best non-TEC baseline. No unified memory needed; the model fits in 48 GiB host RAM.

Qwen3.5-122B-A10B IQ4_XS (56 GiB, overflows even 48 GiB host RAM):

37 tok/sec via TEC unified memory, which pools host RAM and GPU VRAM into a single 80 GiB addressable expert pool. 4.29× vs the disk-spilled baseline. Without TEC this deployment requires either a dual-GPU machine or a ≥96 GiB server.

Both demonstrations use the same rolling-window cache mechanism — only the memory partition changes.
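The memory-partition arithmetic behind the two demos amounts to a simple feasibility check. A back-of-the-envelope sketch (mode names are illustrative, and KV-cache and runtime overhead are ignored):

```python
def deployment_mode(model_gib: float, ram_gib: float, vram_gib: float) -> str:
    """Which path a GGUF MoE model can take on a given box, per the
    requirements above: RAM alone for standard TEC, RAM + VRAM pooled
    for TEC unified memory."""
    if model_gib <= vram_gib:
        return "fully resident in VRAM (TEC unnecessary)"
    if model_gib <= ram_gib:
        return "standard TEC (experts in RAM, cache in VRAM)"
    if model_gib <= ram_gib + vram_gib:
        return "TEC unified memory (pooled RAM + VRAM)"
    return "does not fit (disk spill)"

# The two demos on a 32 GB RTX 5090 + 48 GiB RAM workstation:
print(deployment_mode(42, 48, 32))  # Coder-Next 80B: standard TEC
print(deployment_mode(56, 48, 32))  # Qwen3.5-122B: unified memory
```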

The optimization chain

TEC headline speedup across optimization phases

Qwen3.5-35B-A3B Q8 (34 GiB) on a single RTX 5090 + 48 GiB DDR5-6000. The locked baseline of 24 tok/sec is the naive -ncmoe path. Each bar adds one optimization; by the final C=224 configuration, every decoded token hits a 99.5%-warm per-layer cache with no DMA stalls — 6.21× peak vs baseline.

Results across 10 MoE models

All measurements on a single RTX 5090 + 48 GiB DDR5-6000 workstation at -n 128 -fa 1 (128-token generation, flash attention on). "Baseline" is the best non-TEC configuration on the same hardware — max of optimal partial -ngl and -ncmoe 99. "TEC" is peak tg128. Speedup is TEC / Baseline. See the paper for methodology and the full generalization study.

| Model (quant) | Arch | Size | Baseline (t/s) | TEC (t/s) | Speedup | Hit % |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-122B-A10B · Q2_K_XL | qwen35moe | 39 GiB | 10.8 | 69.4 | 6.44× | 97.1 |
| Qwen3.5-35B-A3B · Q8_0 | qwen35moe | 34 GiB | 30.8 | 136.0 | 4.42× | 99.5 |
| Qwen3-Coder-Next-80B · IQ4_XS | qwen3next | 46 GiB | 27.4 | 109.9 | 4.34× | 98.9 |
| GPT-OSS-120B-REAP-58B · Q4_K_S (°) | gpt-oss | 39 GiB | 46.7 | 126.6 | 2.71× | 99.2 |
| GLM-4.7-Flash · Q8_K_XL | deepseek2 | 34 GiB | 34.2 | 91.3 | 2.67× | 99.2 |
| gemma-4-26B-A4B · BF16 (¶) | gemma4 | 47 GiB | 16.2 | 37.4 | 2.31× | 95.8 |
| Nemotron-Cascade-2-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 99.5 | 196.0 | 1.97× | 99.8 |
| Nemotron-3-Nano-30B-A3B · Q8_0 | nemotron_h | 32 GiB | 103.5 | 191.4 | 1.85× | 99.8 |
| Mixtral-8x7B · Q6_K (#) | llama | 38 GiB | 19.1 | 12.0 | 0.63× | 94.6 |
| *TEC unified memory (◇)* | | | | | | |
| Qwen3.5-122B-A10B · IQ4_XS (◇) | qwen35moe | 56 GiB | 8.6 | 36.8 | 4.29× | 89.0 |

  • (¶) Byte-divergent on a small fraction of decode steps from CUDA mul_mat_id FP non-associativity — output is semantically equivalent (the technique itself is bit-exact; this is a kernel-variance artifact at low-precision K-quants).
  • (°) Same non-associativity class as (¶); also required a per-expert bias-cache extension to the cache registration API. Before that fix, gpt-oss exhibited a 12× throughput collapse and wrong output — with the fix, OpenAI's flagship open-weight MoE joins the "works" column at 126.6 t/s / 2.71× speedup.
  • (#) Negative control: Mixtral violates TEC's preconditions (only 8 experts per layer with K=2), so the cache degenerates to "pin everything" and there's no temporal signal to exploit. TEC is strictly worse on few-large-expert MoEs — this is the paper's clean limitation.
  • (◇) The 56 GiB file exceeds host RAM alone; TEC unified memory pools RAM + VRAM into an effective 80 GiB budget. The non-TEC baseline disk-spills at 8.6 t/s.

Byte-identical output on 9 of the 11 rows. The two exceptions (gemma-4 and gpt-oss) produce semantically equivalent greedy continuations under CUDA FP non-associativity rather than any wrong-math bug.

Example output

────────────────────────────────────────────────────────────────────────────────

                      Temporal Expert Caching - Benchmark

────────────────────────────────────────────────────────────────────────────────

  model       Qwen3-Coder-Next-UD-IQ4_XS  ·  38.4 GB
  gpu         32GB VRAM, RTX 5090 @1792GB/s
  memory      48GB RAM, DDR5 @6000MT
  cpu         Ryzen 5 9600
  cache       C=400

  baseline    ███████·····························    19.9 t/s  · avg  18.9
  + TEC v1.1  ████████████████████████████████████    95.7 t/s  · avg  81.9

                             ╔═══════════════════╗
                             ║                   ║
                             ║       4.81×       ║
                             ║      faster       ║
                             ║                   ║
                             ╚═══════════════════╝

            same weights  ·  byte-identical output  ·  software only

────────────────────────────────────────────────────────────────────────────────

                    github.com/yafitzdev/fitz-tec  ·  v0.1.1

────────────────────────────────────────────────────────────────────────────────

CLI reference

Full fitz bench flag list:

| flag | purpose |
| --- | --- |
| -c / --capacity | TEC cache capacity (experts per layer). Default: 75% × N |
| -n / --n-gen | Tokens to generate per bench rep (default: 128) |
| -r / --reps | Number of repetitions per condition (default: 5) |
| -b / --batch-size | Logical batch size — reduce to 512 on large models |
| -ub / --ubatch-size | Physical batch size — reduce to 256 on large models |
| --pin-list PATH | Path to a frequency-pinned hot-set file. Auto-enables TEC unified memory so the cache spans host RAM + VRAM, letting you run models larger than host RAM alone. Generate with python -m fitz_tec.tools.build_pin_list from a router trace. |
| -P / --pinned | Per-layer slots pinned permanently in VRAM (default: C/2). Only takes effect with --pin-list. |

Hardware labels on the bench report auto-probe your GPU, VRAM capacity, RAM capacity, and DDR generation. If your BIOS reports a different clock than your kit's rated speed, override with FITZ_RAM_LABEL:

setx FITZ_RAM_LABEL "DDR5 @CL36 @6000MT"   # Windows
export FITZ_RAM_LABEL="DDR5 @CL36 @6000MT" # Linux / macOS

Other env vars:

  • FITZ_TEC_BINARY — absolute path to a locally-built patched llama-bench (skips the auto-download; useful for development)
  • FITZ_TEC_RELEASE — pin a specific release tag to download from (default: binaries-v0.1.1)
  • FITZ_GPU_LABEL, FITZ_RAM_LABEL — override auto-probed hardware strings
  • NO_COLOR / FITZ_NO_COLOR — disable ANSI colors in the bench report

The CLI's --pin-list flag is a thin wrapper over the EXPERT_CACHE_PIN_LIST environment variable that the patched llama.cpp reads at load time. If you invoke llama-bench directly, set EXPERT_CACHE_ENABLE=1 EXPERT_CACHE_CAPACITY=N EXPERT_CACHE_PIN_LIST=<path>; unified mode activates automatically.
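For scripted runs, the same environment plumbing can be set up from Python. A sketch using only the EXPERT_CACHE_* variables documented above (the helper names and the extra llama-bench flags are illustrative):

```python
import os
import subprocess

def tec_env(capacity, pin_list=None):
    """Environment for invoking the patched llama-bench directly,
    bypassing the fitz CLI."""
    env = dict(os.environ,
               EXPERT_CACHE_ENABLE="1",
               EXPERT_CACHE_CAPACITY=str(capacity))
    if pin_list:
        # Presence of a pin list auto-enables unified mode.
        env["EXPERT_CACHE_PIN_LIST"] = pin_list
    return env

def bench(binary, model, capacity, pin_list=None):
    # Sketch only — add whatever llama-bench flags your run needs.
    return subprocess.run([binary, "-m", model, "-fa", "1"],
                          env=tec_env(capacity, pin_list))
```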

Status — alpha preview

The CLI, benchmark pipeline, and automatic binary download are all wired end-to-end. pip install fitz-tec && fitz bench <model.gguf> is the supported happy path. The Python package is a thin wrapper around a patched llama.cpp binary that's downloaded on first run from the GitHub Releases page — no local build required.

Two things still flagged as alpha:

  • GPU coverage is currently consumer Ampere/Ada/Blackwell only (sm_80, sm_86, sm_89, sm_120). Turing and Hopper users should build from source — see docs/building-llama-cpp.md.
  • Linux coverage is x86_64 only and assumes a distro with glibc ≥ 2.35 (Ubuntu 22.04+, Debian 12+, Fedora 37+). Older distros can build from source.

If the auto-download fails or you want to iterate on the patched binary, set FITZ_TEC_BINARY to a local build and fitz skips the download entirely.

How TEC works

flowchart LR
    Model[("GGUF model<br/>on disk")] --> RAM[Host RAM<br/>full expert pool]
    RAM -.->|"cold miss<br/>~0.5% of tokens"| Cache
    Router[[MoE router<br/>top-K per layer]] --> LRU{{Host LRU<br/>expert_id → slot_id}}
    LRU -->|hit| Cache[GPU VRAM<br/>per-layer cache<br/>C slots]
    Cache ==>|"99.5%<br/>on-chip"| MatMul[mul_mat_id<br/>slot-indexed]
    MatMul --> Tokens((generated<br/>tokens))

Three-file patch to llama.cpp:

  1. Per-layer GPU cache buffer. For each MoE layer, allocate a dense [C × expert_shape] tensor in VRAM that holds the C most-recently-used experts. A host-side LRU maps expert IDs to cache slot IDs.
  2. Router hook. After each layer's top-K decision, a small GGML_OP_EXPERT_PREFETCH op reads the top-K indices, updates the host LRU, and issues H2D DMAs for any cache-miss experts. The op sits outside CUDA graph capture via a scheduler range-break, so graph acceleration is retained for the ~40 non-prefetch ranges per forward pass.
  3. Matmul substitution. build_moe_ffn's mul_mat_id(expert_weights, x, top_k) is rewritten to mul_mat_id(cache_buffer, x, slot_ids) with remapped slot indices. Single call site in the generic MoE path — TEC works on any model routing through build_moe_ffn without per-architecture code.
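The host-side bookkeeping in steps 1–2 reduces to one small LRU per layer. A Python sketch of the idea — the real implementation lives in the C++/CUDA patch, and this class is illustrative, not fitz's code:

```python
from collections import OrderedDict

class LayerExpertCache:
    """Host-side LRU mapping expert_id → VRAM slot_id for one MoE layer.

    lookup() returns (slot_ids, misses); each miss is an
    (expert_id, slot_id) pair whose weights must be DMA'd host→GPU
    into that slot before the slot-indexed mul_mat_id runs.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # expert_id → slot_id, LRU order

    def lookup(self, top_k_ids):
        slot_ids, misses = [], []
        for eid in top_k_ids:
            if eid in self.slots:
                self.slots.move_to_end(eid)               # refresh recency
            else:
                if len(self.slots) < self.capacity:
                    slot = len(self.slots)                # fill an empty slot
                else:
                    _, slot = self.slots.popitem(last=False)  # evict LRU, reuse slot
                self.slots[eid] = slot
                misses.append((eid, slot))
            slot_ids.append(self.slots[eid])
        return slot_ids, misses

cache = LayerExpertCache(capacity=4)
print(cache.lookup([0, 1, 2]))  # all cold misses: slots 0, 1, 2
print(cache.lookup([1, 2, 3]))  # 1 and 2 hit; 3 misses into slot 3
print(cache.lookup([4]))        # cache full: evicts LRU expert 0, reuses slot 0
```

Eviction is safe as long as capacity ≥ K: experts touched earlier in the same top-K are refreshed to the most-recent end before any eviction happens.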

Pre-warm at model load copies experts 0..C−1 into the cache synchronously, so the first sample's cold-start cost is absorbed at load time rather than at inference time. On a 34 GiB Q8 model this adds ~6 seconds to model load.

For the full technique, the rolling-window locality measurements, the buffer-depth formula, and the generalization study, see the paper:

Temporal Expert Caching: Enabling Productive Inference on MoE Models That Overflow GPU VRAM. Yan Fitzner, 2026. Preprint coming soon.

Development

git clone https://github.com/yafitzdev/fitz-tec
cd fitz-tec
pip install -e '.[dev]'
pytest

The CLI is fitz_tec/cli.py, the llama.cpp wrapper is runner.py, and the screenshot-optimized report layout is display.py. Tests don't require a GPU.

Supporting the project

Research and open-source development take time. If fitz saves you a GPU upgrade, or TEC unlocks a model you couldn't otherwise run, consider sponsoring the project — it funds further research on MoE inference efficiency and keeps this work independent.

License

Apache License 2.0 — see LICENSE.
