ZMLX: Metal-kernel toolkit and optimization lab for MLX on Apple Silicon. Fused MoE decode (+2-12% on LFM2/Qwen3.5), custom GPU kernels in one line, 70+ kernel catalog.

Maintainers

hmbown

ZMLX — Metal kernels and model patching for MLX on Apple Silicon

ZMLX extends MLX with a Python-first Metal kernel toolkit and model-aware patching for faster MoE decode on Apple Silicon.

What ZMLX does

Metal kernels from Python: write elementwise("x * tanh(log(1 + exp(x)))") and get a compiled Metal kernel with caching and autograd support, alongside the 70+ kernel catalog.
Model patching: patch(model) replaces MoE gating/combine/activation sequences with fused Metal kernels, reducing dispatch overhead during decode. Token-identical output; verify with python -m zmlx.validate.
Works with stock MLX: LFM2-8B (+12.8%) and LFM2-24B (+6.0%) show consistent decode gains with pip install mlx; no custom builds are required.
Qwen3.5-35B-A3B support (new): patch(model) auto-detects Qwen3.5's hybrid DeltaNet+Attention MoE architecture and applies fused MoE decode. ~+2% decode on M4 Max 36GB, token-identical. Your results may vary depending on hardware.
Optional custom primitive (GLM/Qwen3): build the custom gather_qmm_swiglu primitive to fuse quantized expert projections for GLM-4.7-Flash and Qwen3-30B-A3B. See docs/EXPERIMENTAL_MLX.md. On stock MLX these models auto-skip safely.

Measured Results

All numbers below are on M4 Max 36GB with greedy decoding. Your results will vary depending on hardware, thermal state, and prompt length. Verify on your machine with python -m zmlx.validate <model>.

Stock MLX (works with `pip install mlx`)

Model	Decode	Prefill	Fidelity
LFM2-8B-A1B-4bit	+12.8% (197.8 -> 223.2 tok/s)	neutral	token-identical
LFM2-24B-A2B-4bit	+6.0% (152.0 -> 161.1 tok/s)	neutral	token-identical
Qwen3.5-35B-A3B-4bit	~+2% (~36.2 -> ~36.8 tok/s)	~+4%	token-identical
GPT-OSS-20B-4bit	+1.0% (121.8 -> 122.9 tok/s)	neutral	token-identical
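
The percentages in the Decode column follow from the tok/s pairs by a plain relative-change calculation. A minimal sketch of that arithmetic, assuming nothing beyond the numbers in the table (the helper name pct_change is illustrative and not part of ZMLX):

```python
def pct_change(baseline_tok_s: float, patched_tok_s: float) -> float:
    """Relative decode throughput change, in percent."""
    return (patched_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# tok/s pairs copied from the table above (M4 Max 36GB, greedy decoding)
rows = {
    "LFM2-8B-A1B-4bit": (197.8, 223.2),
    "LFM2-24B-A2B-4bit": (152.0, 161.1),
}

for name, (base, patched) in rows.items():
    print(f"{name}: {pct_change(base, patched):+.1f}%")
```

For the two LFM2 rows this prints +12.8% and +6.0%. The same helper can be applied to the tok/s figures you measure on your own machine.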

Custom MLX primitive (requires building `mlx_local/`)

Model	Decode gain (by generated tokens)	Average gain	Fidelity
GLM-4.7-Flash-4bit	+6.2% (200 tok), +6.7% (1024 tok)	~+6.4%	PASS

See docs/EXPERIMENTAL_MLX.md for build instructions.

Full methodology, raw data, and repro capsules: docs/BENCHMARKS.md and benchmarks/repro_capsules/.

Quick Start

Requirements: macOS 14+ (Apple Silicon), Python >= 3.10, mlx>=0.30.0

Install (patching examples use mlx-lm):

pip install "zmlx[lm]"       # includes mlx-lm for model patching
# pip install zmlx            # kernel authoring only

Patch a model and generate (no weight conversion; patches apply in-place):

import mlx_lm
from zmlx.patch import patch

# Works with any supported model — just change the model ID
model, tokenizer = mlx_lm.load("LiquidAI/LFM2-24B-A2B-MLX-4bit")
patch(model)  # auto-detects model family, applies safe optimizations

print(
    mlx_lm.generate(
        model,
        tokenizer,
        prompt="Explain mixture-of-experts in one paragraph.",
        max_tokens=200,
    )
)

That's it. patch(model) handles everything automatically — model detection, kernel selection, and safety checks. No env vars or configuration needed.

Verify token fidelity + throughput on your hardware:

# LFM2-24B (+6% on M4 Max)
python -m zmlx.validate LiquidAI/LFM2-24B-A2B-MLX-4bit --max-tokens 200 --runs 3

# LFM2-8B (+12% on M4 Max)
python -m zmlx.validate mlx-community/LFM2-8B-A1B-4bit --max-tokens 200 --runs 3
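
To make "token-identical" concrete: under greedy decoding, the patched and unpatched models should emit exactly the same token IDs. The sketch below shows one way such a comparison could be expressed. It is not the implementation of zmlx.validate, and first_divergence is a hypothetical helper name.

```python
def first_divergence(baseline_ids, patched_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline_ids, patched_ids)):
        if a != b:
            return i
    # Same shared prefix but different lengths also counts as a divergence.
    if len(baseline_ids) != len(patched_ids):
        return min(len(baseline_ids), len(patched_ids))
    return None


baseline = [101, 7, 42, 9]   # made-up token IDs for illustration
patched = [101, 7, 42, 9]
print("token-identical" if first_divergence(baseline, patched) is None else "diverged")
```

Reporting the index of the first mismatch, rather than only a pass/fail flag, makes it easier to see whether a divergence happens early or only after a long generation.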

One-command smoke inference (loads model, applies zmlx.patch.patch(model), then generates):

source .venv/bin/activate && python examples/inference_smoke.py --model-id <model> --prompt "<prompt>" --max-tokens 64

Expected output shape:

[load] model=<model>
[patch] Applying zmlx.patch.patch(model) with safe defaults
[patch] Patched ...
[generate] prompt='...' max_tokens=64
[output] followed by generated text

Tip: large model downloads use the Hugging Face cache; set HF_HOME to control its location.

What's Inside

Model patching: zmlx.patch.patch() (preset-based) and zmlx.patch.smart_patch() (auto-benchmark patterns).
Kernel authoring: zmlx.api.elementwise(), reduce(), map_reduce(), and @zmlx.jit.
Autograd support: optional custom VJP paths via MLX custom functions.
Benchmarking: zmlx.bench.compare() and python -m zmlx.bench.report (repro capsules in benchmarks/repro_capsules/).
Custom MLX primitive (opt-in): build a custom MLX with gather_qmm_swiglu (see docs/EXPERIMENTAL_MLX.md; patch lives in integrations/mlx_local_integration/).

exo Integration

ZMLX works with exo for faster GLM-4.7-Flash and Qwen3-30B-A3B decode. No source patching needed.

From a ZMLX checkout (recommended; clones exo into ./exo and generates exo/run_zmlx.sh):

bash setup_zmlx.sh
bash exo/run_zmlx.sh

If exo is already installed in your environment:

pip install zmlx
zmlx-exo

For GLM/Qwen3 speedups, first build the optional custom MLX primitive (gather_qmm_swiglu) per docs/EXPERIMENTAL_MLX.md, then re-run bash setup_zmlx.sh so the exo venv picks it up.

ZMLX hooks into exo's model loading at runtime — when GLM/Qwen3 load with the custom MLX primitive, MoE expert dispatch is fused. Measured speedups vary by prompt/length; see docs/EXO.md and repro capsules in benchmarks/repro_capsules/.

Docs

Doc	What's inside
`docs/TOUR.md`	Quick walkthrough and how to verify results
`docs/QUICKSTART.md`	5-minute kernel authoring tutorial
`docs/COOKBOOK.md`	Recipes for common patterns
`docs/KERNELS.md`	Kernel catalog (by module/domain)
`docs/KNOWLEDGE_BASE.md`	Canonical KB schema, rebuild, and validation
`docs/FOUNDRY.md`	Kernel template evaluation, dataset generation, SFT export
`docs/kernel_discovery.md`	Hamiltonian-guided fused-boundary kernel discovery (`zmlx.kd`)
`docs/BENCHMARKS.md`	Benchmark methodology + raw data
`docs/ARCHITECTURE.md`	Design philosophy
`docs/EXO.md`	exo integration guide (GLM/Qwen3)
`docs/EXPERIMENTAL_MLX.md`	Custom MLX primitive details
`UPSTREAM_PLAN.md`	What belongs upstream in MLX

Contributing / Development

See CONTRIBUTING.md for setup, testing, and conventions.

git clone https://github.com/Hmbown/ZMLX.git
cd ZMLX
pip install -e ".[dev]"
pytest

Benchmarks (stock MLX — works with pip install mlx)

These results use released MLX (pip install mlx). The speedup comes from ZMLX's own Python-level Metal kernels (fused gating, combine, SwiGLU activation) — no custom C++ or MLX fork required.

Full methodology and raw data: docs/BENCHMARKS.md.

Model	Hardware	Decode (baseline -> patched)	Change	Fidelity	Capsule
LFM2-8B-A1B-4bit	M4 Max 36 GB	197.8 tok/s -> 223.2 tok/s	+12.8%	token-identical	`benchmarks/repro_capsules/lfm2_m4max_20260205_rerun_mlx0304dev2f324cc.json`
LFM2-8B-A1B-4bit	M1 Pro 16 GB	105.5 tok/s -> 115.3 tok/s	+9.3%	token-identical	`benchmarks/repro_capsules/lfm2_m1pro_20260131.json`
LFM2-24B-A2B-4bit	M4 Max 36 GB	152.0 tok/s -> 161.1 tok/s	+6.0%	token-identical (500 tok)	`benchmarks/repro_capsules/lfm2_24b_dsimd_gate_m4max_20260224.json`
GPT-OSS-20B-4bit	M4 Max 36 GB	121.8 tok/s -> 122.9 tok/s	+1.0%	token-identical	—

To print a report from a capsule:

python -m zmlx.bench.report benchmarks/repro_capsules/<capsule>.json

Benchmarks (custom MLX primitive — requires building mlx_local/)

Any GLM/Qwen3 improvements on custom MLX come from gather_qmm_swiglu, a custom C++ Metal primitive we wrote (~800 lines of C++/Metal). It fuses gate projection + up projection + SwiGLU activation for quantized MoE experts into a single GPU dispatch. This primitive is not part of released MLX — build it by applying the patch described in docs/EXPERIMENTAL_MLX.md.
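
As a reading aid, the computation that the primitive fuses can be written as an unfused, unquantized reference in plain Python. This is a sketch of the standard SwiGLU definition only: it ignores quantization and expert gathering, and the names silu, matvec, and swiglu_ref are illustrative rather than ZMLX or MLX APIs.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """Dense matrix-vector product; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def swiglu_ref(x, w_gate, w_up):
    """Gate projection, up projection, then SwiGLU: silu(gate) * up."""
    gate = matvec(w_gate, x)   # first projection
    up = matvec(w_up, x)       # second projection
    return [silu(g) * u for g, u in zip(gate, up)]


x = [1.0, 2.0]
w_gate = [[0.0, 0.0], [1.0, 0.0]]
w_up = [[1.0, 1.0], [0.0, 1.0]]
print(swiglu_ref(x, w_gate, w_up))
```

Written this way the three steps are separate operations; the point of the fused primitive described above is to perform them in a single GPU dispatch.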

ZMLX provides the model-side integration: auto-detecting MoE architectures, rewiring forward passes to use the fused primitive, and using native MLX combine ops on GLM/Qwen3 for fidelity and lower dispatch overhead.

On stock MLX (released 0.30.4/0.30.5), ZMLX auto-skips these models (0 modules patched, 0% change) to avoid regressions. patch() is always safe to call.

Model	Recommended config	Overall decode gain vs unpatched baseline	Fidelity	Evidence
GLM-4.7-Flash-4bit-mxfp4	`glm_combine_fp32_no_fma`	`+6.2%` (200), `+6.7%` (1024), `~+6.4%` average	PASS	`benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t200_r2_summary.json`, `benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t1024_r2_summary.json`, `benchmarks/repro_capsules/benchmark_vs_baseline_followup_20260211.json`

Qwen3-30B-A3B: no candidate is promoted yet; keep control baseline until a clear decode-positive variant is reproduced.

See docs/EXPERIMENTAL_MLX.md for build instructions. Repro capsules in benchmarks/repro_capsules/.

Model support summary

Model	Stock MLX	+ Custom primitive	What ZMLX does
LFM2-8B-A1B	+12% decode	same	Fused MoE gating + combine + SwiGLU activation
LFM2-24B-A2B	+6% decode	same	D-SIMD fused gating kernel (64 experts, K=4)
Qwen3.5-35B-A3B	~+2% decode	same	Fused MoE dispatch (256 experts, K=8, hybrid DeltaNet+Attention)
GLM-4.7-Flash	0% (auto-skipped)	~+6% decode	ZMLX patching + custom `gather_qmm_swiglu` primitive
Qwen3-30B-A3B	0% (auto-skipped)	no promoted config yet	ZMLX patching + custom `gather_qmm_swiglu` primitive
GPT-OSS-20B	+1% decode	same	ZMLX Metal kernel: fused SwiGLU activation
Other models	safe no-op	same	`patch()` returns unchanged if no patterns match

All results are token-identical under greedy decoding. Verify on your hardware with python -m zmlx.validate <model>.

Patching controls:

import mlx.core as mx
from zmlx.patch import patch, smart_patch

# Assumes `model` and `tokenizer` come from mlx_lm.load(...) as in Quick Start.

patch(model)                        # inference defaults (auto-skips unsafe patterns)
patch(model, patterns=["moe_mlp"])  # override safety; validate first

# Auto-benchmark: apply only patterns that actually help on your sample
sample = mx.array([tokenizer.encode("Hello")])
model = smart_patch(model, sample)
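
The idea of keeping only patterns that measurably help can be illustrated with a small, framework-free sketch. This is not how smart_patch is implemented; the function name, the greedy strategy, the 1% threshold, and the pattern name other_pattern are all assumptions made for illustration.

```python
def select_helpful_patterns(candidates, measure_tok_s, min_gain=0.01):
    """Greedily keep patterns whose measured throughput beats the current best.

    candidates:    pattern names to try, in order.
    measure_tok_s: callable taking the list of enabled patterns, returning tok/s.
    min_gain:      required relative improvement (hypothetical default of 1%).
    """
    enabled = []
    best = measure_tok_s(enabled)  # unpatched baseline
    for name in candidates:
        trial = enabled + [name]
        speed = measure_tok_s(trial)
        if speed >= best * (1.0 + min_gain):
            enabled, best = trial, speed
    return enabled, best


# Fake measurements so the example is deterministic (numbers are made up).
fake_tok_s = {
    (): 100.0,
    ("moe_mlp",): 110.0,
    ("moe_mlp", "other_pattern"): 109.0,
}
enabled, best = select_helpful_patterns(
    ["moe_mlp", "other_pattern"], lambda pats: fake_tok_s[tuple(pats)]
)
print(enabled, best)
```

In a real measurement the callable would time generation on your sample input; injecting it as a parameter keeps the selection logic testable without a GPU.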

How patching works (MoE decode)

MoE decode is often dominated by Metal kernel dispatch overhead (many small ops per token).

ZMLX targets the multi-op sequences that show up during decode:

Gating: top-k softmax selection fused into one kernel (topk_gating_softmax).
Combine: weight-and-reduce across experts fused into one kernel (moe_combine).
Expert SwiGLU (when available): gate+up projection+SwiGLU fused into one dispatch via custom gather_qmm_swiglu primitive.
Guards: fused paths only activate at small sequence lengths (decode), keeping prefill throughput neutral.
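
For readers who want to see what the gating and combine steps compute, here is an unfused reference in plain Python. It is a sketch under assumptions: softmax over all router logits, then top-k selection with renormalization. Model families differ in the order of softmax and top-k and in whether weights are renormalized, and the function names below are illustrative, not the ZMLX kernels themselves.

```python
import math


def topk_gating_softmax_ref(logits, k):
    """Softmax over router logits, keep the k largest, renormalize to sum to 1."""
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def moe_combine_ref(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


router_logits = [0.1, 2.0, -1.0, 1.5]              # one token, four experts
idx, w = topk_gating_softmax_ref(router_logits, k=2)
outs = [[1.0, 0.0], [0.0, 1.0]]                    # stand-ins for expert outputs
print(idx, w, moe_combine_ref(outs, w))
```

Each list comprehension here corresponds to one or more separate GPU operations in an unfused implementation, which is the per-token dispatch overhead the fused kernels are meant to reduce.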

Deeper dives:

Walkthrough: docs/TOUR.md
Design notes: docs/ARCHITECTURE.md

Kernel authoring (very short example)

ZMLX can compile small Python expressions into Metal kernels via MLX's mx.fast.metal_kernel:

from zmlx.api import elementwise
import mlx.core as mx

mish = elementwise("x * tanh(log(1 + exp(x)))", name="mish")
y = mish(mx.random.normal((1024,)))
mx.eval(y)
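
When checking a kernel like this, a scalar reference is handy. The sketch below evaluates the same expression, x * tanh(log(1 + exp(x))), in plain Python, using a numerically stable softplus so that large inputs do not overflow. The names softplus and mish_ref are illustrative and not part of ZMLX.

```python
import math


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)): avoids overflow for large positive x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish_ref(x: float) -> float:
    """Reference Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-2.0, 0.0, 1.0, 1000.0):
    print(value, mish_ref(value))
```

On a machine with MLX, outputs of the mish kernel above can be compared element by element against this reference within a float32 tolerance.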

Next steps:

5-minute tutorial: docs/QUICKSTART.md
Recipes: docs/COOKBOOK.md
Catalog: docs/KERNELS.md

Troubleshooting

Symptom	Fix
`ModuleNotFoundError: No module named 'mlx'`	MLX is not installed. ZMLX requires macOS on Apple Silicon; Intel Macs and Linux are not supported.
`ModuleNotFoundError: No module named 'mlx_lm'`	Install with `pip install "zmlx[lm]"` for model patching examples.
Model downloads fill disk	Set `HF_HOME` to a larger drive before running.
`patch()` shows 0 modules patched	The model may not match any patterns, or ZMLX auto-skipped them for safety. Run `python -m zmlx.validate <model>` to verify.
GLM/Qwen shows 0 modules patched	Expected on stock MLX. Requires building the custom `gather_qmm_swiglu` primitive in `mlx_local/` (see docs).

Precision note

Most kernels compute internally in float32 regardless of input dtype. The exception is moe_combine_exact, which accumulates in the input dtype to match MLX's bfloat16 semantics. GLM and Qwen3 use native MLX ops for the combine step ((y * scores[..., None]).sum(axis=-2)) to match the original model code exactly and avoid custom-kernel dispatch overhead.
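
Why the accumulation dtype matters can be shown with a small emulation. The sketch below rounds values to bfloat16 in pure Python and compares accumulating in bfloat16 at every step against accumulating in a wider type and rounding once at the end. It is an illustration only: Python floats stand in for the wider accumulator, and to_bf16 and the two combine helpers are hypothetical names, not ZMLX kernels.

```python
import struct


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even on bit 16
    bits &= 0xFFFF0000                    # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def combine_bf16_accum(terms):
    """Accumulate in bfloat16: round after every addition."""
    acc = 0.0
    for t in terms:
        acc = to_bf16(acc + to_bf16(t))
    return acc


def combine_wide_accum(terms):
    """Accumulate in a wider type, round to bfloat16 once at the end."""
    acc = 0.0
    for t in terms:
        acc += to_bf16(t)
    return to_bf16(acc)


terms = [1.0] + [2.0 ** -9] * 4
print(combine_bf16_accum(terms), combine_wide_accum(terms))
```

With these inputs the bfloat16 accumulator stays at 1.0 because each small term is lost to rounding, while the wide accumulator yields 1.0078125. A fused kernel that accumulates differently from the original model code can therefore change low-order bits, which is the motivation for matching the original accumulation semantics.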

Acknowledgments

Built on MLX by Apple machine learning research. If you use ZMLX in your work, please also cite MLX:

@software{mlx2023,
  author = {Awni Hannun and Jagrit Digani and Angelos Katharopoulos and Ronan Collobert},
  title = {{MLX}: Efficient and flexible machine learning on Apple silicon},
  url = {https://github.com/ml-explore},
  version = {0.0},
  year = {2023},
}

License

MIT. See LICENSE.

Release history

Version	Released
0.10.0 (this version)	Mar 3, 2026
0.9.2	Feb 25, 2026
0.9.0	Feb 24, 2026
0.8.4	Feb 10, 2026
0.8.3	Feb 8, 2026
0.8.2	Feb 7, 2026
0.8.0	Feb 4, 2026
0.7.12	Feb 1, 2026
0.7.11	Feb 1, 2026
0.7.1	Jan 31, 2026
0.7.0	Jan 31, 2026
0.6.1	Jan 30, 2026
0.6.0	Jan 30, 2026
0.4.2	Jan 30, 2026
0.2.1	Jan 30, 2026
0.2.0	Jan 29, 2026

Download files

Source Distribution

zmlx-0.10.0.tar.gz (738.0 kB)

Uploaded Mar 3, 2026 Source

Built Distribution

zmlx-0.10.0-py3-none-any.whl (447.9 kB)

Uploaded Mar 3, 2026 Python 3

File details

Details for the file zmlx-0.10.0.tar.gz.

File metadata

Download URL: zmlx-0.10.0.tar.gz
Upload date: Mar 3, 2026
Size: 738.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for zmlx-0.10.0.tar.gz
Algorithm	Hash digest
SHA256	`0512b913e9618e806e1eb2074776b22a85e61b24240875a2aa53991c85fae6ab`
MD5	`12d958cf401ce4d9c37d58c1b54697b6`
BLAKE2b-256	`8a2e075e9da8df1b80fa5356aac3c6a8b2438927285b297a97e5a0e7cdef96af`
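
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do it; the helper names are illustrative, and the local file path is whatever location you saved the file to.

```python
import hashlib

# SHA256 digest for zmlx-0.10.0.tar.gz, copied from the table above
EXPECTED_SHA256 = "0512b913e9618e806e1eb2074776b22a85e61b24240875a2aa53991c85fae6ab"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For an intact download, matches_expected("zmlx-0.10.0.tar.gz") should return True; a False result means the file differs from the one that was published.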

Provenance

The following attestation bundles were made for zmlx-0.10.0.tar.gz:

Publisher: release.yml on Hmbown/ZMLX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: zmlx-0.10.0.tar.gz
- Subject digest: 0512b913e9618e806e1eb2074776b22a85e61b24240875a2aa53991c85fae6ab
- Sigstore transparency entry: 1019795017
- Sigstore integration time: Mar 3, 2026
Source repository:
- Permalink: Hmbown/ZMLX@0a76ccc5953c4c738793fbcc1ebdbeae323f626d
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/Hmbown
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0a76ccc5953c4c738793fbcc1ebdbeae323f626d
- Trigger Event: release

File details

Details for the file zmlx-0.10.0-py3-none-any.whl.

File metadata

Download URL: zmlx-0.10.0-py3-none-any.whl
Upload date: Mar 3, 2026
Size: 447.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for zmlx-0.10.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0e1fddf04e7781c4239c88ccc36a7d46efc61ffd61617ea578628baee119a63`
MD5	`96ee0229273e0944264572b820cbf2e9`
BLAKE2b-256	`9106ab0dffaba0a372de8d05dc8f61486bbd1b7a42cae77ba8d2fa9c03c77727`

Provenance

The following attestation bundles were made for zmlx-0.10.0-py3-none-any.whl:

Publisher: release.yml on Hmbown/ZMLX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: zmlx-0.10.0-py3-none-any.whl
- Subject digest: e0e1fddf04e7781c4239c88ccc36a7d46efc61ffd61617ea578628baee119a63
- Sigstore transparency entry: 1019795108
- Sigstore integration time: Mar 3, 2026
Source repository:
- Permalink: Hmbown/ZMLX@0a76ccc5953c4c738793fbcc1ebdbeae323f626d
- Branch / Tag: refs/tags/v0.10.0
- Owner: https://github.com/Hmbown
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0a76ccc5953c4c738793fbcc1ebdbeae323f626d
- Trigger Event: release
