Skip to main content

A domain-specific language for Mixture-of-Experts scheduling policies

Project description

MoE-PolicyLang

A scheduling language for Mixture-of-Experts models.

Author: Jesse Pokora · License: MIT


What Is This?

Large language models like Mixtral, DeepSeek, and Qwen use Mixture-of-Experts (MoE) — instead of one giant network, they have dozens of smaller "expert" networks and a router that picks which ones to use for each token. By design, only a fraction of experts are active at any time, so the rest are offloaded to CPU memory — this is intentional, not a limitation.

But managing that offloading is complex. Which experts to keep on GPU? When to prefetch the next ones? Where to run cache misses — wait for the GPU transfer, or fall back to CPU? And how to adapt as the workload shifts?

Every existing system hardcodes these decisions in hundreds of lines of C++/CUDA. MoE-PolicyLang replaces all of that with a small, declarative language.

Throughput and hit rate comparison across policies on consumer GPU


The Language

A MoE-PolicyLang policy is a .moe file with four composable blocks:

policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}
Block Controls Options
cache Which experts stay on GPU LRU, LFU, score-based, frequency-threshold
prefetch Proactive loading History, affinity, lookahead
schedule Where to run cache misses GPU-only, CPU-fallback, hybrid
adapt Runtime self-tuning Conditional rules that hot-swap components

Switching from LRU to LFU? Change one word. Adding prefetching? Two lines.


Two Lines to Attach

import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from your model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions

Or write a policy explicitly:

mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")

Or load a .moe file:

mgr = moe_policylang.attach(model, open("my_policy.moe").read())

Why a Language?

Python dicts could configure this. The DSL adds three things they can't:

  1. Static validation — 20 semantic rules catch bad policies at parse time, not mid-inference
  2. Portability.moe files are shareable, diffable, and tool-agnostic
  3. Constraint — you can't write arbitrary code in a scheduling policy; the grammar limits you to what makes sense

Results

The abstraction is effectively free

Dispatch overhead with 95% confidence intervals

All policies add < 3.2% overhead on A100 (6–47 µs/layer vs. 1,459 µs for MoE forward pass).

14–40× less code than published systems

System Their LOC MoE-PolicyLang Reduction
Fiddler 280 7 lines 40×
HybriMoE ~500 14 lines 36×
MoE-Infinity 520 16 lines 33×
vLLM 300 12 lines 25×
ExpertFlow ~400 16 lines 25×
FineMoE ~350 25 lines 14×

Bold LOC counts are measured from open-source repos (primary expert-management function or module — e.g., Fiddler's set_expert_loc() in src/fiddler/mixtral.py); others estimated from paper descriptions.

Policy selection produces measurable differences

Cache hit rate vs capacity for Mixtral and DeepSeek

Different policies → different real performance. Mixtral saturates quickly (8 experts); DeepSeek (64 experts) needs smarter strategies.

EPCB: Per-layer cache budgeting (with an honest negative result)

Not all layers see the same routing pattern — some concentrate on a few experts, others spread across many. Empirical Per-layer Cache Budgeting (EPCB) has two parts:

  1. Per-layer cache structure is the load-bearing lever: replacing a single shared cache with per-layer caches at the same total budget yields +14.7pp hit rate on DeepSeek-V2-Lite in trace replay, and +540% wall-clock on A100 end-to-end (1.60 → 10.22 tok/s, eliminating all CPU↔GPU transfers in steady state). Bit-identical output verified against fully-resident baseline.

Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer

  1. The allocation signal does not matter. We tested six signals (Shannon entropy, inverse top-k mass, inverse variance, inverse KL, inverse Gini, uniform) and none differentiates from uniform allocation by more than 2.5pp in hit rate at any budget, and all six collapse to within noise of uniform in wall-clock on two models tested end-to-end. We retain Shannon entropy as the default allocator for principled reasons, but recommend uniform allocation for simplicity in practice.

Per-layer entropy and capacity allocation

Strategy Hit Rate Δ vs shared Wall-clock (A100)
Shared cache (32 slots) 48.6% baseline 1.60 tok/s
Per-layer uniform (864 slots) 63.3% +14.7pp 10.22 tok/s (+540%)
Per-layer Shannon entropy (864 slots) 65.5% +16.9pp 10.17 tok/s (≈ uniform)

Regime caveat: per-layer caching only helps when the per-layer budget covers each layer's active working set. On 16 GB consumer hardware where the budget is tight (Qwen on RTX 5080), per-layer caching hurts throughput by 16%; flat shared caching is the better default in that regime.

When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region

Live inference on consumer GPU

OLMoE-1B-7B on RTX 5080 Laptop (16 GB VRAM):

Hit rate with bootstrap confidence intervals

Policy Cap Hit Rate tok/s
Vanilla (no hooks) 39.2
Naive LRU 4 2.4% 34.6
LRU 16 26.3% 34.7
LFU+History 16 27.1% 33.8
EPCB 16 47.3% 33.6

Installation

From PyPI:

pip install moe-policylang           # DSL only (no GPU deps)
pip install moe-policylang[gpu]      # + torch, transformers, accelerate
pip install moe-policylang[all]      # everything

From source (development):

git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"

Cython fast path (< 10 µs/layer):

pip install moe-policylang[cython]
python setup_cython.py build_ext --inplace

Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace model — no model-specific code required. We have evaluated on:

Model Experts × Layers Routing Hardware
Mixtral-8×7B-Instruct 8 × 32 top-2 A100-80 GB
DeepSeek-V2-Lite 64 × 27 top-6 A100-80 GB
Qwen1.5-MoE-A2.7B 60 × 24 top-4 RTX 5080 (16 GB)
OLMoE-1B-7B 64 × 16 top-8 RTX 5080 (16 GB)

Project Structure

moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # EPCB — entropy-proportional caching
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers

Running Experiments

# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_epcb_sensitivity.py  # EPCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure

Tests

python -m pytest tests/ -q

398+ tests covering parsing, validation, compilation, runtime dispatch, adaptive policies, per-layer EPCB, and integration hooks.


Documentation

See docs/MANUAL.md for the full language reference, runtime API, and policy authoring guide.


Citation

@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}

License

MIT License — Copyright (c) 2026 Jesse Pokora

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

moe_policylang-1.1.2.tar.gz (78.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

moe_policylang-1.1.2-py3-none-any.whl (91.1 kB view details)

Uploaded Python 3

File details

Details for the file moe_policylang-1.1.2.tar.gz.

File metadata

  • Download URL: moe_policylang-1.1.2.tar.gz
  • Upload date:
  • Size: 78.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for moe_policylang-1.1.2.tar.gz
Algorithm Hash digest
SHA256 6acc06c4d7db09b351fc80e6673099d09716183edf3eae1ecd50956c9a0dea71
MD5 e24059d842da8d3e1f60b76626072c85
BLAKE2b-256 61cdaa092e941d116261642a2ad0de6606d32f1a6d15edf6cbd4b1f6a3e6450f

See more details on using hashes here.

File details

Details for the file moe_policylang-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: moe_policylang-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 91.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for moe_policylang-1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 8417eca0cd620801df0e968443add4eb1638d8ae83d2782d3579c3d4b82880ad
MD5 281d1874dfd695cd67d951ccc5d6d73c
BLAKE2b-256 116a98cbb36506fe347e3d6a4434952cfde4394f104e77b05e9ac094c2213156

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page