A domain-specific language for Mixture-of-Experts scheduling policies

These details have not been verified by PyPI

Project links

Project description

MoE-PolicyLang

A scheduling language for Mixture-of-Experts models.

Author: Jesse Pokora · License: MIT

TL;DR

MoE models pack a lot of weights into "experts" but only fire a few per token. The rest can live in CPU RAM and get pulled to the GPU on demand. Which experts to keep, when to prefetch them, and what to do on a miss is a policy question, and every existing MoE serving system hardcodes its answer inside the runtime.

MoE-PolicyLang is a small declarative language for that policy. Swapping LRU for LFU is a one-word change; nothing else has to move.

On an RTX 5080 (16 GB), it runs Qwen1.5-MoE (28.6 GB fp16) at 4.61 tok/s — 8.1x faster than HuggingFace's device_map="auto" — with bit-identical output.

The problem

A Mixture-of-Experts layer replaces a single feed-forward block with N "experts" plus a small router. For each token the router picks the top-k experts (typically k=2 to k=8) and only those fire. Mixtral uses 8 experts per layer (top-2), Qwen1.5-MoE uses 60 (top-4), DeepSeek-V2-Lite uses 64 (top-6). Most experts sit idle on any given token, but the weights still have to be reachable in case the router picks them.

For a 28 GB MoE model on a 16 GB GPU, that means offloading: keep some expert weights on GPU, the rest in CPU RAM, and move weights across the PCIe bus when the router asks for one that isn't resident. PCIe is much slower than reading from GPU memory, so if every router pick is a miss you're stuck — HuggingFace's device_map="auto" runs Qwen1.5-MoE on a 5080 at 0.57 tok/s.

There are four interlocking decisions to make:

Decision	Question	Example strategies
Cache	Which experts stay on GPU?	LRU (drop least-recently-used), LFU (drop least-frequently-used), score-based
Prefetch	Which to load before they're requested?	Affinity (layer L → L+1 patterns), history, lookahead
Schedule	What to do on a cache miss?	Wait for the GPU transfer, run on CPU, decide per-call
Adapt	When to change strategy?	Conditional rules on runtime metrics

Every existing MoE serving system (ExpertFlow, Fiddler, MoE-Infinity, HybriMoE, ProMoE, FineMoE) hardcodes these four decisions inside its runtime. Changing any one strategy means reading and modifying the system's expert-management module: roughly 200 to 2,000 LOC depending on the system.

MoE-PolicyLang lets you write the policy as a short .moe file and attach it to a HuggingFace or vLLM model. The runtime hooks that consume the policy stay the same; only the policy text changes.

Throughput and hit rate comparison across policies on consumer GPU

The Language

A MoE-PolicyLang policy is a .moe file with four composable blocks:

policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}

Block	Controls	Strategies
cache	Which experts stay on GPU	LRU drops the least-recently-used; LFU drops the least-frequently-used with decay; score ranks by router gate value; freq-threshold keeps anything above a frequency cutoff
prefetch	Experts loaded before they're requested	History uses a running co-occurrence matrix; affinity uses layer L → L+1 patterns; lookahead peeks ahead in the router output
schedule	What happens on a cache miss	gpu-only waits for the transfer; cpu-fallback runs the missed expert on CPU; hybrid decides per-call based on estimated latency
adapt	Self-tuning at runtime	Conditional rules of the form `when <metric> <op> <value> for <window> { <override> }`

Switching from LRU to LFU is a one-word change. Adding prefetching is two lines.

Attaching to a model

import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from the model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions

Or write a policy explicitly:

mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")

Or load a .moe file:

mgr = moe_policylang.attach(model, open("my_policy.moe").read())

attach() parses the policy, runs the validator, compiles it to a PolicyHook, and registers forward hooks on every MoE layer of the model. From that point on, normal model.generate() calls trigger the hooks.

How a policy runs at runtime

On each MoE layer, after the router picks its top-k experts:

Look up each picked expert in the GPU cache (hit or miss).
For misses, decide whether to wait for a CPU→GPU transfer or run that expert on CPU. The schedule block picks the policy.
Insert newly loaded experts into the cache. If full, the eviction rule drops something.
Prefetch experts the next few layers are likely to want.
Run memory-pressure and TTL eviction triggers.

The hook is plain Python. It adds 6 µs/layer (simple LRU) to 47 µs/layer (composed policy with triggers), against an MoE forward-pass baseline of about 1,500 µs on A100 — under 3.2% of layer time.

The five steps above are the always-on layer of dynamism: cache contents change on every dispatch. PLCB adds a second, opt-in layer — set rebalance_interval > 0 in a per_layer block and the per-layer budget itself drifts at runtime based on observed routing entropy. adapt blocks add a third, opt-in layer — a threshold like when hit_rate < 0.4 rewrites the policy in place (deep-copy IR, validate, recompile, hot-swap the hook) without pausing inference. You pick how much dynamism you want; uniform allocation with no adapt rules is the static default.

Why a Language, Not YAML?

The cache, prefetch, and schedule blocks are key-value config and you could handle them with a JSON schema. The reason the grammar pays off is PLCB and the adapt block.

PLCB describes a per-layer cache layout, not a single cache:

per_layer {
    allocation         = uniform
    total_budget       = 864
    rebalance_interval = 500
    min_capacity       = 4
    max_capacity       = 48
}

"27 separate caches at total budget 864 with optional entropy allocation" is awkward to write as flat key-value pairs. The DSL gives it a block.

adapt is the other one. Hot-swap rules monitor metrics and rewrite the policy at runtime:

adapt {
    when hit_rate < 0.4 for 100 accesses { eviction = lru }
}

That's a conditional, not a config value. The grammar constrains what you can write (no arbitrary code in a scheduling policy), and 20 semantic rules catch bad policies at parse time instead of mid-inference.

A Python eDSL (@sched.policy decorator) and an auto_attach API are also available for programmatic construction and zero-config deployment.

Results

Live inference on consumer GPU

When the model doesn't fit: Qwen1.5-MoE-A2.7B (~28.6 GB fp16) on RTX 5080 (16 GB VRAM). Without MoE-PolicyLang, the only option is device_map="auto" at 0.57 tok/s. With a 4-line DSL policy:

Config	Strategy	Cap	VRAM	tok/s	95% CI
Baseline (`auto`)	—	—	12.0 GB	0.57±0.00	—
Skeleton	LRU (cap=1)	1	4.7 GB	4.23±0.22	[4.04, 4.42]
Aggressive	LRU	2	5.2 GB	4.17±0.03	[4.14, 4.20]
Balanced	LFU+hist.	4	7.3 GB	4.35±0.06	[4.30, 4.40]
Generous	LFU+hist.	8	10.1 GB	4.61±0.08	[4.54, 4.68]

Cost-performance: 4.61 tok/s on a $1k consumer GPU is comparable in absolute throughput to Fiddler's 4.17 tok/s on a $15k A100, though the models are different and the numbers are not directly comparable.

Decomposition: roughly 92% of the 8.1x speedup comes from expert-aware loading (skeleton on GPU, experts on CPU). Even a capacity-1 "every dispatch is a miss" config reaches 4.23 tok/s (7.4x). Caching adds the remaining +0.38 tok/s. The DSL's contribution is not the loading mechanism (any system could implement that) but the remaining 8%: the policy layer that decides what to cache, evict, and prefetch, accessible without runtime modification and adaptable at runtime via adapt rules that no static config expresses. On models with more experts, the policy layer's share grows. On DeepSeek (A100), matched-budget per-layer allocation gains +14.7pp hit rate over flat caching at the same total slot count, a pure policy-structure effect with no capacity confound.

n=5, bootstrap 95% CIs. For output correctness: greedy decoding (do_sample=False) produces bit-identical token sequences across all policy configs vs device_map="auto" baseline (4 prompts x 3 policies = 12 comparisons); perplexity on wikitext-2 matches within 0.024%.

When the model fits (overhead measurement): OLMoE-1B-7B (~14 GB) fits entirely on 16 GB VRAM. There vanilla (no hooks) is fastest at 39.2 tok/s, with the policy hooks adding 12-14% overhead. This is not the target scenario; it measures overhead when there is nothing to offload. MoE-PolicyLang is for models that don't fit.

Hit rate with bootstrap confidence intervals

PLCB: Per-Layer Cache Budgeting (with a negative result)

A flat cache holds, say, 32 expert weights total, shared across all 27 MoE layers of DeepSeek-V2-Lite. If layer 0 is hot, LFU keeps its experts, and layers that haven't appeared recently get evicted. In steady state, a flat cap=32 cache on DeepSeek covers only ~11 of the 27 layers — the other 16 have zero experts cached and miss every dispatch.

A per-layer cache splits the same total budget (864 = 27×32 slots) into one cache per layer. Each layer keeps its own hot experts.

There's a positive result and a caveat.

The caveat first: per-layer caching only helps when each layer's budget covers that layer's working set. On 16 GB consumer hardware the per-layer budgets are too small for that, and the aggregated cache pushes the CUDA allocator to the VRAM ceiling — throughput drops by about 16%. Flat shared caching is the default for memory-constrained deployments. Per-layer wins with VRAM headroom and high expert counts (DeepSeek-V2-Lite on A100, below).

When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region

When the regime permits, per-layer cache structure is what matters. At matched total budget on DeepSeek-V2-Lite (A100-80GB), replacing a shared cache with per-layer caches gives +14.7pp hit rate in offline trace replay and eliminates all CPU/GPU transfers in steady state. Output is bit-identical to the fully-resident baseline.

The headline throughput gain (1.60 to 10.22 tok/s, +540%) compares shared-32 to per-layer-864, which is 27x more total slots. The matched-budget +14.7pp hit rate and transfer elimination are the core findings; the 540% wall-clock number folds in the capacity expansion.

Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer

This structural difference maps directly onto MoE-aware baselines. Fiddler's expert placement is a hardcoded global popularity ranking, which is structurally equivalent to the flat cache on the left. On an A100-80GB where 85% of Mixtral's experts fit on-device (217/256), the ranking barely matters because almost everything is resident. On a constrained GPU where only a fraction of experts fit, a global ranking starves cold layers (left heatmap), while a per-layer policy maintains coverage at every layer (right heatmap). Per-layer caching enables topologies that a flat global ranking cannot express. The mechanical payoff is PCIe stall elimination: expert offloading is memory-bandwidth-bound, so every cache miss costs a CPU-to-GPU transfer. When per-layer caches cover each layer's working set, steady-state misses drop to zero, which is why a hit-rate improvement turns into a 6.4x wall-clock gain (10.22 vs 1.60 tok/s on DeepSeek-V2-Lite at matched total budget).

The allocation signal does not matter. We tested six signals (Shannon entropy, inverse top-k mass, inverse variance, inverse KL, inverse Gini, uniform). None differentiates from uniform by more than 2.5pp in hit rate, and all six collapse to within noise of uniform in wall-clock on two models. Uniform is the default. Shannon entropy is opt-in for models with high inter-layer entropy spread (ΔH around 1 nat or more), but it was within noise of uniform on every model tested end-to-end.

Per-layer entropy and capacity allocation

Strategy	Total slots	Hit Rate	Δ vs shared	Wall-clock (A100)
Shared cache	32	48.6%	baseline	1.60 tok/s
Per-layer uniform	864 (27x)	63.3%	+14.7pp	10.22 tok/s
Per-layer entropy	864 (27x)	65.5%	+16.9pp	10.17 tok/s (~ uniform)

Mechanism vs policy: Fiddler head-to-head (A100-80GB, Mixtral-8x7B)

Fiddler and MoE-PolicyLang on the same hardware, model, prompt, and methodology (n=5, greedy decoding, 64 tokens):

Config	tok/s (±σ)	95% CI	GPU Peak	Hit Rate	Transfers
Fiddler	4.17 ± 0.02	[4.16, 4.18]	80.6 GB	88.3%	—
MPL fiddler_equiv (cap=2)	0.18 ± 0.00	[0.18, 0.18]	6.4 GB	19.5%	4,283
MPL balanced (cap=4)	0.29 ± 0.00	[0.29, 0.29]	39.5 GB	46.4%	2,665
MPL generous (cap=6)	0.45 ± 0.00	[0.45, 0.45]	61.7 GB	71.0%	1,726

All MPL configs produce bit-identical output. Fiddler is 9-23x faster.

The gap is mechanism, not policy. Fiddler uses an optimized C++/CUDA transfer pipeline with pre-allocated GPU memory slots and direct DMA. MoE-PolicyLang dispatches through Python-level Tensor.to() calls in the HuggingFace forward pass. At Fiddler's 85% GPU residency (217/256 experts on-device), placement strategy isn't doing the work; the model mostly fits.

MoE-PolicyLang is a policy specification layer, not a serving system. It specifies which experts to cache, evict, and prefetch, but does not implement the physical mechanism that moves expert tensors between devices.

vLLM backend: same policy, different mechanism

A second backend integrates with vLLM for production-grade quantized MoE inference. VLLMPolicyRunner instruments vLLM's router layers to capture expert routing decisions and feeds them through the same DSL, compiler, and hooks that the HuggingFace backend uses.

from moe_policylang.integrations.vllm_backend import VLLMPolicyRunner

runner = VLLMPolicyRunner(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    policy_dsl='''
        policy demo {
            cache { capacity = 8  eviction = lru }
            prefetch { strategy = lookahead  lookahead = 1 }
            schedule { mode = gpu_only }
        }
    ''',
    quantization="gptq",
)

results = runner.generate(["What is expert routing?"], max_tokens=30)
print(results["text"])          # generated text
print(results["policy_stats"])  # cache hits, prefetch accuracy, etc.

Verified on RTX 5080 (16 GB), vLLM 0.21, GPTQ-Int4 quantization. Captures 744 routing events across 24 layers × 60 experts, with a 14.7% cache hit rate and 72% prefetch accuracy from a minimal 8-slot LRU policy.

The same .moe policy text runs on both backends; only the mechanism layer (HF eager Python vs vLLM optimized kernels) differs. Combined with the Fiddler head-to-head above, this is the empirical case for the policy/mechanism separation — the abstraction holds across two production mechanism layers, not just structurally.

Cache hit rates: capacity sweeps

Cache hit rate vs capacity for Mixtral and DeepSeek

Capacity sweeps on offline traces:

Mixtral-8x7B (8 experts, top-2) saturates at cap=8 with around 100% hit rate, since all experts fit. Policy choice barely matters here.
DeepSeek-V2-Lite (64 experts, top-6) reaches only 51% hit rate at cap=32 (half the experts). LFU consistently beats LRU across budgets because DeepSeek has significant frequency skew (some experts activated 3-5x more often). This is the regime where policy selection and per-layer budgeting make a measurable difference.

Policy authoring effort

To add a new policy variant to one of these systems, a developer needs to read and modify the system's expert-management module. MoE-PolicyLang replaces that with a short .moe file. The 14–40x numbers below count lines a user writes to express a policy; they do not include MoE-PolicyLang's own runtime, which is around 4,300 LOC.

System	Expert-mgmt module	DSL equivalent	Authoring reduction
Fiddler	280 LOC	7 lines	40x
HybriMoE	~500 LOC	14 lines	36x
MoE-Infinity	520 LOC	16 lines	33x
vLLM	300 LOC	12 lines	25x
ExpertFlow	~400 LOC	16 lines	25x
FineMoE	~350 LOC	25 lines	14x

Methodology: non-blank, non-comment lines in the primary expert-management module. Measured sources: Fiddler from set_expert_loc() + execute_fiddler() in src/fiddler/mixtral.py (280 LOC); MoE-Infinity from expert_prefetcher.py + expert_cache.py (520 LOC); vLLM from MixtralMoE expert dispatch in vllm/model_executor/ (300 LOC). Counts marked ~ are estimated from paper descriptions of closed-source systems. Switching between strategies (LRU to LFU) is a one-word change in the DSL versus rewriting cache data structures in the hand-coded version.

Dispatch overhead

Dispatch overhead with 95% confidence intervals

Per-layer dispatch (the Python hook that decides cache/evict/prefetch) adds under 3.2% of MoE forward-pass time on A100: 6–47 µs/layer against a 1,459 µs baseline. This is the policy decision overhead; cache misses and weight transfers are accounted for separately and depend on the policy and workload.

Hit rate with bootstrap confidence intervals

Installation

From PyPI:

pip install moe-policylang           # DSL only (no GPU deps)
pip install moe-policylang[gpu]      # + torch, transformers, accelerate
pip install moe-policylang[vllm]     # + vLLM (GPTQ/AWQ quantized inference)
pip install moe-policylang[all]      # everything

For quantized models, use the [vllm] extra. vLLM handles GPTQ and AWQ quantization with optimized kernels; MoE-PolicyLang observes routing decisions and applies policy logic without managing the tensors directly.

For Blackwell GPUs (RTX 5080/5090), set these env vars before running vLLM:

export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2

From source (development):

git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"

Cython fast path (for complex policies):

pip install moe-policylang[cython]
python setup_cython.py build_ext --inplace

Python dispatch ranges from 6 µs/layer (simple LRU) to 47 µs/layer (composed policies with triggers). The Cython path targets the high end: freq_threshold and composed_full drop from 28-47 µs to under 10 µs/layer. Simple policies like lru_basic (6 µs) see no benefit.

Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace model with no model-specific code required. Evaluated on:

Model	Experts x Layers	Routing	Hardware	Backend
Mixtral-8x7B-Instruct	8 x 32	top-2	A100-80 GB	HF Transformers
DeepSeek-V2-Lite	64 x 27	top-6	A100-80 GB	HF Transformers
Qwen1.5-MoE-A2.7B	60 x 24	top-4	RTX 5080 (16 GB)	HF Transformers
Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4	60 x 24	top-4	RTX 5080 (16 GB)	vLLM
OLMoE-1B-7B	64 x 16	top-8	RTX 5080 (16 GB)	HF Transformers

Project Structure

moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # PLCB — per-layer cache budgeting
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── vllm_backend.py     # vLLM integration (routing trace + policy replay)
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers

Running Experiments

# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_plcb_sensitivity.py  # PLCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure

Tests

python -m pytest tests/ -q

453+ tests covering parsing, validation, compilation, runtime dispatch, adaptive policies, per-layer PLCB, and integration hooks.

Documentation

See docs/MANUAL.md for the full language reference, runtime API, and policy authoring guide.

Glossary

MoE (Mixture-of-Experts). A Transformer layer that replaces a single feed-forward block with N expert networks plus a small router that picks the top-k experts per token.
Expert. One feed-forward sub-network inside an MoE layer. Mixtral has 8 per layer, Qwen1.5-MoE has 60, DeepSeek-V2-Lite has 64.
Router / top-k routing. The small classifier inside each MoE layer that scores experts and picks the k highest per token.
Offloading. Keeping some weights in CPU RAM and moving them to GPU on demand. The cost is the PCIe transfer.
PCIe. The bus between CPU memory and the GPU. Roughly two orders of magnitude slower than reading from GPU HBM/GDDR, so cache misses are expensive.
Skeleton. Everything in the model that isn't an expert: embeddings, attention, layer norms, LM head. MoE-PolicyLang pins the skeleton on GPU (≈3.7 GB for Qwen1.5-MoE) and only the expert weights move.
Cache hit / miss. A hit means the expert the router picked is already on GPU. A miss means we have to fetch it (or run it on CPU, if schedule = cpu_fallback).
LRU / LFU. Least-Recently-Used and Least-Frequently-Used cache eviction. LRU drops whatever hasn't been touched lately; LFU drops whatever has the lowest activation count (with a decay factor so old hot experts age out).
fp16 / GPTQ / AWQ. Weight precisions. fp16 is the standard half-precision format used in this paper's experiments. GPTQ and AWQ are 4-bit quantization formats that vLLM consumes; they trade a small amount of perplexity for a large VRAM reduction.
KV-cache. The cache of attention keys/values from previous tokens during generation. It grows with sequence length and competes with expert weights for VRAM.
pp (percentage points). Used for hit-rate deltas: +14.7 pp means 48.6% → 63.3%, not a 14.7% relative change.
PLCB. Per-Layer Cache Budgeting. See the section above; the load-bearing part is the per-layer cache structure, not the entropy allocator that gave the technique its name.

Citation

@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}

License

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.4.2

May 21, 2026

1.4.1

May 21, 2026

1.4.0

May 21, 2026

1.3.2

May 21, 2026

1.3.1

May 21, 2026

1.3.0

May 21, 2026

1.2.5

May 19, 2026

1.2.4

May 18, 2026

1.2.3

May 18, 2026

1.2.2

May 18, 2026

1.2.1

May 18, 2026

1.2.0

May 18, 2026

1.1.2

May 18, 2026

1.1.1

May 18, 2026

1.1.0

May 18, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

moe_policylang-1.4.2.tar.gz (99.6 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

moe_policylang-1.4.2-py3-none-any.whl (107.4 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file moe_policylang-1.4.2.tar.gz.

File metadata

Download URL: moe_policylang-1.4.2.tar.gz
Upload date: May 21, 2026
Size: 99.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for moe_policylang-1.4.2.tar.gz
Algorithm	Hash digest
SHA256	`29aa69b13d0de2af16cc27b28cf5148249d124f0c5a6ccfa690da18bd9cc9c22`
MD5	`825b1a815d37216ebe54a09da4b14bfb`
BLAKE2b-256	`77be1ad6f2d39f8da06435f5f6d27e0addbb49c5cd343ea2c6fefa43a46e17b6`

See more details on using hashes here.

File details

Details for the file moe_policylang-1.4.2-py3-none-any.whl.

File metadata

Download URL: moe_policylang-1.4.2-py3-none-any.whl
Upload date: May 21, 2026
Size: 107.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for moe_policylang-1.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2df1fdef355f67ca6dea8422d175f58641bc7ffef090008e80f57cbd3e8f52c9`
MD5	`7fc2b4604d22df4a59cac0b89b11cc0d`
BLAKE2b-256	`0e3a63da68eaab23a495f27abb2ce0b1e5a0fe5fb45ae49ca9c5c5e2579ff798`

See more details on using hashes here.

moe-policylang 1.4.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

MoE-PolicyLang

TL;DR

The problem

The Language

Attaching to a model

How a policy runs at runtime

Why a Language, Not YAML?

Results

Live inference on consumer GPU

PLCB: Per-Layer Cache Budgeting (with a negative result)

Mechanism vs policy: Fiddler head-to-head (A100-80GB, Mixtral-8x7B)

vLLM backend: same policy, different mechanism

Cache hit rates: capacity sweeps

Policy authoring effort

Dispatch overhead

Installation

Tested Models

Project Structure

Running Experiments

Tests

Documentation

Glossary

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes