ZMLX: Metal-kernel toolkit and optimization lab for MLX on Apple Silicon. Fused MoE decode (+5-12% on LFM2-8B-A1B), custom GPU kernels in one line, 70+ kernel catalog.

ZMLX — Metal kernels and model patching for MLX on Apple Silicon

PyPI Python 3.10+ License: MIT Platform: macOS Apple Silicon

ZMLX extends MLX with a Python-first Metal kernel toolkit and model-aware patching for faster MoE decode on Apple Silicon.

What ZMLX does

  • Metal kernels from Python: write elementwise("x * tanh(log(1 + exp(x)))") to get a compiled, cached Metal kernel with autograd support. A catalog of 70+ ready-made kernels is included.
  • Model patching: patch(model) replaces MoE gating/combine/activation sequences with fused Metal kernels, reducing dispatch overhead during decode. Token-identical output; verify with python -m zmlx.validate.
  • Proven on stock MLX: LFM2-8B-A1B shows +5-12% decode on released MLX with no custom builds needed. These gains come from ZMLX's own Metal kernels for fused gating, combine, and SwiGLU activation.

Quick Start

Requirements: macOS 14+ (Apple Silicon), Python >= 3.10, mlx>=0.30.0
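
A quick way to check the host against the platform requirements above is sketched below. It is illustrative only and not part of ZMLX; the function name unmet_requirements is hypothetical, and it does not check the installed mlx version.

```python
import platform
import sys


def unmet_requirements(system, machine, macos_version, python_version):
    """Return a list of problems; an empty list means the host looks OK.

    Hypothetical helper for illustration, not a ZMLX API.
    """
    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("needs macOS on Apple Silicon (arm64)")
    else:
        # macos_version looks like "14.5"; only the major number matters here.
        major = int(macos_version.split(".")[0]) if macos_version else 0
        if major < 14:
            problems.append("needs macOS 14 or newer")
    if tuple(python_version[:2]) < (3, 10):
        problems.append("needs Python 3.10 or newer")
    return problems


if __name__ == "__main__":
    issues = unmet_requirements(
        platform.system(), platform.machine(), platform.mac_ver()[0], sys.version_info
    )
    print("OK" if not issues else "; ".join(issues))
```

Note that a Python interpreter running under Rosetta reports x86_64 from platform.machine() even on Apple Silicon hardware, so a failure here can also mean the interpreter itself is not a native arm64 build.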

  1. Install (patching examples use mlx-lm):
pip install "zmlx[train]"    # includes mlx-lm for model patching
# pip install zmlx            # kernel authoring only
  2. Patch a model and generate (no weight conversion; patches apply in-place):
import mlx_lm
from zmlx.patch import patch

model, tokenizer = mlx_lm.load("mlx-community/LFM2-8B-A1B-4bit")
patch(model)  # safe inference defaults for supported model families

print(
    mlx_lm.generate(
        model,
        tokenizer,
        prompt="Explain mixture-of-experts in one paragraph.",
        max_tokens=200,
    )
)
  3. Verify token fidelity + throughput on your hardware:
python -m zmlx.validate mlx-community/LFM2-8B-A1B-4bit --max-tokens 200 --runs 3

Tip: large model downloads use the Hugging Face cache; set HF_HOME to control its location.

What's Inside

  • Model patching: zmlx.patch.patch() (preset-based) and zmlx.patch.smart_patch() (auto-benchmark patterns).
  • Kernel authoring: zmlx.api.elementwise(), reduce(), map_reduce(), and @zmlx.jit.
  • Autograd support: optional custom VJP paths via MLX custom functions.
  • Benchmarking: zmlx.bench.compare() and python -m zmlx.bench.report (repro capsules in benchmarks/repro_capsules/).
  • Training CLI (optional): zmlx train.
  • Custom MLX primitive (opt-in): build a custom MLX with gather_qmm_swiglu (see docs/EXPERIMENTAL_MLX.md; patch lives in integrations/mlx_local_integration/).

exo Integration

ZMLX works with exo for faster GLM-4.7-Flash and Qwen3-30B-A3B decode in distributed inference clusters. Setup is automated:

git clone https://github.com/Hmbown/ZMLX.git
cd ZMLX
bash setup_zmlx.sh        # one-time setup (creates ./exo + ./exo/run_zmlx.sh)
bash exo/run_zmlx.sh      # launch exo with ZMLX

When GLM loads, ZMLX fuses all 46 MoE layers + 1 dense SwiGLU (~8% faster decode, token-identical) when the custom MLX primitive is available. See docs/EXO.md for the full guide.

Docs

| Doc | What's inside |
|---|---|
| docs/TOUR.md | Quick walkthrough and how to verify results |
| docs/QUICKSTART.md | 5-minute kernel authoring tutorial |
| docs/COOKBOOK.md | Recipes for common patterns |
| docs/KERNELS.md | Kernel catalog (by module/domain) |
| docs/BENCHMARKS.md | Benchmark methodology + raw data |
| docs/ARCHITECTURE.md | Design philosophy |
| docs/EXO.md | exo integration guide (GLM/Qwen3) |
| docs/EXPERIMENTAL_MLX.md | Custom MLX primitive details |
| UPSTREAM_PLAN.md | What belongs upstream in MLX |

Contributing / Development

See CONTRIBUTING.md for setup, testing, and conventions.

git clone https://github.com/Hmbown/ZMLX.git
cd ZMLX
pip install -e ".[dev]"
pytest

Benchmarks (stock MLX — works with pip install mlx)

These results use released MLX (pip install mlx). The speedup comes from ZMLX's own Python-level Metal kernels (fused gating, combine, SwiGLU activation) — no custom C++ or MLX fork required.

Full methodology and raw data: docs/BENCHMARKS.md.

| Model | Hardware | Decode (baseline -> patched) | Change | Fidelity | Capsule |
|---|---|---|---|---|---|
| LFM2-8B-A1B-4bit | M4 Max 36 GB | 223.5 tok/s -> 249.4 tok/s | +11.6% | token-identical | benchmarks/repro_capsules/lfm2_m4max_20260131.json |
| LFM2-8B-A1B-8bit | M4 Max 36 GB | 152.5 tok/s -> 164.3 tok/s | +7.7% | token-identical | benchmarks/repro_capsules/lfm2_m4max_20260131.json |
| LFM2-8B-A1B-4bit | M1 Pro 16 GB | 105.5 tok/s -> 115.3 tok/s | +9.3% | token-identical | benchmarks/repro_capsules/lfm2_m1pro_20260131.json |
| LFM2-8B-A1B-8bit | M1 Pro 16 GB | 72.8 tok/s -> 76.4 tok/s | +5.0% | token-identical | benchmarks/repro_capsules/lfm2_m1pro_20260131.json |
| GPT-OSS-20B-4bit | M4 Max 36 GB | 121.8 tok/s -> 122.9 tok/s | +1.0% | token-identical | |
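
The Change column is the relative difference between patched and baseline decode throughput. A minimal sketch of that arithmetic, using three rows from the table (the helper name decode_change_pct is hypothetical and is not part of zmlx.bench):

```python
def decode_change_pct(baseline_tok_s, patched_tok_s):
    """Relative decode throughput change, in percent."""
    return (patched_tok_s / baseline_tok_s - 1.0) * 100.0


# (baseline tok/s, patched tok/s) copied from the table above
rows = {
    "LFM2-8B-A1B-4bit, M4 Max": (223.5, 249.4),
    "LFM2-8B-A1B-8bit, M4 Max": (152.5, 164.3),
    "LFM2-8B-A1B-4bit, M1 Pro": (105.5, 115.3),
}

for name, (baseline, patched) in rows.items():
    print(f"{name}: {decode_change_pct(baseline, patched):+.1f}%")
```

Because the table shows throughputs rounded to one decimal, recomputing a percentage from the displayed values can differ from the published figure in the last digit.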

To print a report from a capsule:

python -m zmlx.bench.report benchmarks/repro_capsules/<capsule>.json

Benchmarks (custom MLX primitive — requires building mlx_local/)

GLM-4.7-Flash and Qwen3-30B-A3B gains come from gather_qmm_swiglu, a custom C++ Metal primitive we wrote (~800 lines of C++/Metal). It fuses gate projection + up projection + SwiGLU activation for quantized MoE experts into a single GPU dispatch. This primitive is not part of released MLX — build it by applying the patch described in docs/EXPERIMENTAL_MLX.md.

ZMLX provides the model-side integration: auto-detecting MoE architectures, rewiring forward passes to use the fused primitive, and a deterministic no-FMA combine kernel to preserve token fidelity on GLM.

On stock MLX (released 0.30.4/0.30.5), ZMLX auto-skips these models (0 modules patched, 0% change) to avoid regressions. patch() is always safe to call.

| Model | Hardware | Decode (baseline -> patched) | Change | Fidelity |
|---|---|---|---|---|
| GLM-4.7-Flash-4bit | M4 Max 36 GB | 85.8 tok/s -> 92.8 tok/s | +8.1% | 128/128 identical |
| Qwen3-30B-A3B-4bit | M4 Max 36 GB | 117 tok/s -> 123 tok/s | +5.5% | 128/128 identical |

See docs/EXPERIMENTAL_MLX.md for build instructions.

Model support summary

| Model | Stock MLX | + Custom primitive | What ZMLX does |
|---|---|---|---|
| LFM2-8B-A1B | +5-12% decode | same | ZMLX Metal kernels: fused MoE gating + combine + SwiGLU |
| GLM-4.7-Flash | 0% (auto-skipped) | +8% decode | ZMLX patching + custom gather_qmm_swiglu primitive |
| Qwen3-30B-A3B | 0% (auto-skipped) | +6% decode | ZMLX patching + custom gather_qmm_swiglu primitive |
| GPT-OSS-20B | ~+1% | same | ZMLX Metal kernel: fused SwiGLU activation |
| Other models | safe no-op | same | patch() returns unchanged if no patterns match |

All results are token-identical under greedy decoding. Verify on your hardware with python -m zmlx.validate <model>.
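
To make a figure such as "128/128 identical" concrete, the sketch below compares two greedy-decoded token ID sequences and reports how many leading tokens agree. It is an assumption about what such a check looks like, not the implementation of zmlx.validate; the name token_fidelity is hypothetical.

```python
def token_fidelity(baseline_ids, patched_ids):
    """Return (matched, total): length of the shared prefix and the longer length.

    Hypothetical helper for illustration, not a ZMLX API.
    """
    total = max(len(baseline_ids), len(patched_ids))
    matched = 0
    for a, b in zip(baseline_ids, patched_ids):
        if a != b:
            break  # tokens after the first mismatch are not comparable
        matched += 1
    return matched, total


matched, total = token_fidelity([5, 17, 42, 8], [5, 17, 42, 8])
print(f"{matched}/{total} identical")
```

Only the shared prefix is counted because, under greedy decoding, every token after the first mismatch is conditioned on a different context.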

Patching controls:

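import mlx.core as mx
import mlx_lm
from zmlx.patch import patch, smart_patch

# Load a model and tokenizer as in Quick Start
model, tokenizer = mlx_lm.load("mlx-community/LFM2-8B-A1B-4bit")

# Pick one of the following presets
patch(model)                      # inference defaults (auto-skips unsafe patterns)
patch(model, mode="training")     # training preset (adds norms/residual fusions)
patch(model, patterns=["moe_mlp"])  # override safety; validate first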

# Auto-benchmark: apply only patterns that actually help on your sample
sample = mx.array([tokenizer.encode("Hello")])
model = smart_patch(model, sample)

How patching works (MoE decode)

MoE decode is often dominated by Metal kernel dispatch overhead (many small ops per token).

ZMLX targets the multi-op sequences that show up during decode:

  • Gating: top-k softmax selection fused into one kernel (topk_gating_softmax).
  • Combine: weight-and-reduce across experts fused into one kernel (moe_combine).
  • Expert SwiGLU (when available): gate+up projection+SwiGLU fused into one dispatch via custom gather_qmm_swiglu primitive.
  • Guards: fused paths only activate at small sequence lengths (decode), keeping prefill throughput neutral.
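
The gating, combine, and guard steps listed above can be sketched in plain Python as a reference for what the fused kernels compute. This is not ZMLX's kernel code. It assumes the softmax is taken over the selected top-k logits (some models apply softmax over all experts first), and the threshold max_fused_seq_len is a hypothetical parameter; the real guard value is not stated here.

```python
import math


def topk_gating_softmax_ref(logits, k):
    """Pick the k largest router logits and softmax-normalize them.

    Assumption: softmax over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # shift for stability
    norm = sum(exps)
    return top, [e / norm for e in exps]


def moe_combine_ref(expert_outputs, weights):
    """Weighted sum of per-expert output vectors (one vector per selected expert)."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


def use_fused_path(seq_len, max_fused_seq_len=1):
    """Guard: take the fused path only for short (decode-sized) inputs."""
    return seq_len <= max_fused_seq_len


experts, weights = topk_gating_softmax_ref([0.1, 2.0, -1.0, 2.0], k=2)
combined = moe_combine_ref([[1.0, 2.0], [3.0, 4.0]], weights)
print(experts, weights, combined)
```

Unfused, each of these steps is several small array ops per token, which is where the dispatch overhead described above comes from.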

Kernel authoring (very short example)

ZMLX can compile small Python expressions into Metal kernels via MLX's mx.fast.metal_kernel:

from zmlx.api import elementwise
import mlx.core as mx

mish = elementwise("x * tanh(log(1 + exp(x)))", name="mish")
y = mish(mx.random.normal((1024,)))
mx.eval(y)
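
When checking a kernel like mish on your own machine, a scalar reference in plain Python is handy. The sketch below is not part of ZMLX; it evaluates the same expression, but uses a numerically stable softplus so the reference itself does not overflow for large inputs.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x)): avoids overflow in exp for large positive x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish_ref(x):
    """Scalar reference for x * tanh(log(1 + exp(x)))."""
    return x * math.tanh(softplus(x))


for value in (-2.0, 0.0, 1.0, 50.0):
    print(value, mish_ref(value))
```

The reference runs in float64, so compare it against kernel output (for example via y.tolist()) with a tolerance rather than exact equality.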

Troubleshooting

| Symptom | Fix |
|---|---|
| ModuleNotFoundError: No module named 'mlx' | Requires Apple Silicon macOS. ZMLX does not support Intel Macs or Linux. |
| ModuleNotFoundError: No module named 'mlx_lm' | Install with pip install "zmlx[train]" for model patching examples. |
| Model downloads fill disk | Set HF_HOME to a larger drive before running. |
| patch() shows 0 modules patched | The model may not match any patterns, or ZMLX auto-skipped them for safety. Run python -m zmlx.validate <model> to verify. |
| GLM/Qwen shows 0 modules patched | Expected on stock MLX. Requires building the custom gather_qmm_swiglu primitive in mlx_local/ (see docs). |

Precision note

Most kernels compute internally in float32 regardless of input dtype. The exception is moe_combine_exact, which accumulates in the input dtype to match MLX's bfloat16 semantics for Qwen3. GLM uses moe_combine_no_fma to disable FMA contraction and match MLX's non-fused multiply-then-sum reduction order.
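
Why FMA contraction can change results is easy to see with a small float32 example. The sketch below simulates float32 rounding with the standard struct module; it illustrates the arithmetic only and is not the moe_combine_no_fma kernel.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = f32(1.0 + 2.0**-12)
c = -f32(1.0 + 2.0**-11)

# Multiply-then-add: the product is rounded to float32 before the add.
unfused = f32(f32(a * a) + c)

# FMA-style: the unrounded product feeds the add, with one rounding at the end.
# a * a is exact in float64 here, so float64 stands in for the exact intermediate.
fused = f32(a * a + c)

print("multiply-then-add:", unfused)
print("fused multiply-add:", fused)
```

The two paths disagree because the low bit of the product is discarded by the intermediate rounding in the unfused path. A kernel that must reproduce a multiply-then-sum reference bit for bit therefore has to round in the same places.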


Acknowledgments

Built on MLX by Apple machine learning research. If you use ZMLX in your work, please also cite MLX:

@software{mlx2023,
  author = {Awni Hannun and Jagrit Digani and Angelos Katharopoulos and Ronan Collobert},
  title = {{MLX}: Efficient and flexible machine learning on Apple silicon},
  url = {https://github.com/ml-explore},
  version = {0.0},
  year = {2023},
}

License

MIT. See LICENSE.
