Operation-level profiler for Apple Silicon / MLX

mlx-profiler

Operation-level profiler for Apple Silicon / MLX. The missing torch.profiler equivalent for the MLX ecosystem.

What it gives you

Op-level timing — every matmul, softmax, rms_norm, layer call, etc., measured individually
FLOPs estimation — multiply-add counts for matmuls, convolutions
Arithmetic intensity — FLOPs/byte ratio per op (roofline model X-axis)
Memory bandwidth estimate — tensor traffic through each operation
Device attribution — GPU / CPU / Neural Engine breakdown
Terminal report — human-readable ANSI table
Interactive HTML dashboard — flame timeline, roofline scatter, searchable op table, category donut
Trace persistence — save/load JSON, render HTML from CLI
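
As a rough illustration of how the arithmetic intensity figure feeds a roofline plot, the sketch below classifies an op as memory-bound or compute-bound. The function names, the peak throughput and bandwidth figures, and the example intensities are all illustrative assumptions; they are not values reported by mlx-profiler or measured on any particular chip.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bandwidth)

# Hypothetical hardware numbers, chosen only for illustration.
PEAK_FLOPS = 10e12                      # 10 TFLOP/s
PEAK_BANDWIDTH = 200e9                  # 200 GB/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH     # intensity where the two roofs meet

def bound(intensity):
    """Ops left of the ridge point are limited by memory traffic."""
    return "memory-bound" if intensity < RIDGE else "compute-bound"

# Hypothetical per-op intensities in FLOPs/byte.
for name, intensity in [("softmax", 1.5), ("matmul", 400.0)]:
    ceiling = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{name}: {bound(intensity)}, ceiling {ceiling / 1e12:.2f} TFLOP/s")
```

Ops far to the left of the ridge point gain little from faster kernels and more from reducing tensor traffic, which is why the intensity column is useful alongside raw timing.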

Install

pip install mlx-profiler          # without MLX (manual recording only)
pip install "mlx-profiler[mlx]"   # with MLX for automatic interception

Quick start

import mlx.core as mx
import mlx.nn as nn
import mlx_profiler as mp

model = MyTransformer()    # placeholder: substitute your own nn.Module
x = mx.random.normal([1, 512])

with mp.profile("my_model") as prof:
    out = model(x)
    mx.eval(out)

prof.report()              # terminal output
prof.html("report.html")   # interactive dashboard
prof.save("trace.json")    # save for later

CLI

# Generate a demo trace + report
mlx-profiler demo

# View a saved trace in the terminal
mlx-profiler view trace.json

# Render HTML from a saved trace
mlx-profiler html trace.json -o report.html

Manual op recording (no MLX required)

from mlx_profiler.trace import Trace
from mlx_profiler.profiler import op_timer
import mlx_profiler as mp

trace = Trace("my_trace")

with op_timer(trace, "matmul",
              input_shapes=[[512, 4096], [4096, 4096]],
              output_shapes=[[512, 4096]],
              dtype="float16"):
    result = my_matmul(a, b)   # placeholders: your own function and inputs

mp.print_report(trace)

What's measured

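Op	Formula	Notes
`matmul` / `linear`	FLOPs = 2·M·K·N	(M×K) times (K×N); multiply-add counted as 2 ops
`conv2d`	FLOPs = 2·Ho·Wo·Co·Ci·Kh·Kw	per batch element; Ho×Wo is the output size, Co/Ci the output/input channels, Kh×Kw the kernel size
All ops	bytes = input + output tensor bytes	memory bandwidth estimate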
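
To make the formulas concrete, here is a minimal sketch that applies them to the matmul shapes from the manual recording example. The helper names and the dtype size table are assumptions made for illustration; they are not part of the mlx-profiler API.

```python
from math import prod

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}

def matmul_flops(m, k, n):
    """(m x k) @ (k x n): each multiply-add counts as 2 ops."""
    return 2 * m * k * n

def conv2d_flops(h_out, w_out, c_out, c_in, k_h, k_w):
    """FLOPs for one batch element of a 2D convolution."""
    return 2 * h_out * w_out * c_out * c_in * k_h * k_w

def tensor_bytes(shapes, dtype):
    """Total bytes across all tensors with the given shapes."""
    return sum(prod(shape) for shape in shapes) * DTYPE_BYTES[dtype]

inputs = [[512, 4096], [4096, 4096]]
outputs = [[512, 4096]]

flops = matmul_flops(512, 4096, 4096)
traffic = tensor_bytes(inputs + outputs, "float16")
intensity = flops / traffic          # FLOPs per byte

print(f"FLOPs: {flops}, bytes: {traffic}, intensity: {intensity:.1f}")
```

For these shapes the estimate is about 17.2 GFLOPs over about 41.9 MB of tensor traffic, or roughly 410 FLOPs/byte. The byte count assumes each tensor is read or written exactly once, so it is a lower bound on real memory traffic.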

Design notes

Why wall-clock, not Metal GPU timers?

Metal's MTLCommandBuffer GPUStartTime and GPUEndTime properties give true on-chip execution time, but using them requires instrumenting command buffers, which is not feasible when wrapping MLX's lazy evaluation model. Wall-clock timing with a forced mx.eval() boundary gives "latency as the model experiences it", which is the number that matters for interactive applications. A future metal_backend module could add real GPU timers via Metal's performance counter API once the graph is materialized.

Lazy evaluation and timing

MLX uses lazy evaluation: operations are not executed until mx.eval() is called. The profiler inserts mx.eval() calls at layer boundaries to force materialization and get accurate per-layer timing. These extra synchronization points add overhead, so profiled runs will be somewhat slower than unprofiled runs.
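
The idea of timing against a forced evaluation boundary can be sketched as follows. This is not the profiler's actual implementation: `time_call` and its `sync` parameter are hypothetical names, and the stand-in layers are plain Python functions so the sketch runs without MLX installed.

```python
import time

def time_call(fn, *args, sync=None):
    """Run fn(*args), force completion via sync(result), return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync(result)   # with MLX this could be mx.eval, so lazy work is included
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in "layers" (hypothetical) that operate on plain lists.
layers = [
    ("double", lambda v: [2 * x for x in v]),
    ("shift", lambda v: [x + 1 for x in v]),
]

timings = {}
value = [1, 2, 3]
for name, layer in layers:
    value, timings[name] = time_call(layer, value)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us")
```

With MLX, passing `sync=mx.eval` would force each layer's output to materialize before the clock stops. Without a sync step, the measurement would mostly capture graph construction rather than execution.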

Roadmap

Metal GPU timestamp counters (true on-chip time)
Neural Engine vs GPU attribution via os_signpost
Continuous-batching-aware profiling
Multi-pass aggregation (average over N forward passes)
FlashAttention kernel detection
TurboQuant / quantized op analysis
Comparison mode: profile A vs profile B

License

MIT

This version

0.2.0

Apr 2, 2026

Download files

Source Distribution

mlx_profiler-0.2.0.tar.gz (17.8 kB view details)

Uploaded Apr 2, 2026 Source

Built Distribution

mlx_profiler-0.2.0-py3-none-any.whl (17.5 kB view details)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file mlx_profiler-0.2.0.tar.gz.

File metadata

Download URL: mlx_profiler-0.2.0.tar.gz
Upload date: Apr 2, 2026
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for mlx_profiler-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`167c0821ac031c60e3042c71b64a699a81df65701dd20b8f6da99e40a97716ac`
MD5	`68a96bd9a64a791b1407dd6f974c7536`
BLAKE2b-256	`a947127ba394da64156894adf8ea08b7b3c5f46c8dd7f9b6562aceecb0a22f97`

File details

Details for the file mlx_profiler-0.2.0-py3-none-any.whl.

File metadata

Download URL: mlx_profiler-0.2.0-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for mlx_profiler-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad8462677ade7225342a4ebc85bc8e76501f11a7806fd7c7958888f14aa85d08`
MD5	`7228a800953d7ba733a03a9db11ff3b5`
BLAKE2b-256	`1657ca67a9a8944f6c4afe847d976b7e8ba5c40a89bb1aa3e10f9efa7aad3c07`
