Static analysis profiler for Metal compute shaders on Apple Silicon

metal-profiler

Static analysis profiler for Metal compute shaders. It compiles your kernel, extracts the native AGX GPU binary, disassembles it, and reports where its timing model places the bottleneck.

$ python -m metal_profiler.metal_profile kernel.metal -f matmul_naive

╔══════════════════════════════════════════════════════════════╗
║  metal-profiler: matmul_naive                               ║
╚══════════════════════════════════════════════════════════════╝

  ── Registers & Occupancy ──
  Peak live GPRs:  9
  Half-regs:       18 / 256
  Occupancy:       100%
                   [████████████████████████████████████████] (good)

  ── Loop 0 ──
  ALU/iter:        5 cy
  Loads/iter:      2
  Wait stall:      ~195 cy (2 loads before wait)
  Total/iter:      201 cy

  ── Suggestions ──
  🔴 2 global loads/iter with ~195cy stall. Tile into threadgroup memory.
  🟡 Only 5cy ALU between loads and wait. Unroll or interleave independent work.

No guessing. No Xcode required. Real GPU instructions, with cycle counts estimated from Mesa's timing model.

How it works

1. Compile .metal → .metallib (via xcrun metal)
2. Create MTLBinaryArchive → triggers Apple's GPU JIT compiler
3. Extract native AGX machine code from the archive (fat Mach-O → applegpu slice → __text section)
4. Disassemble using applegpu (Dougall Johnson's reverse-engineered ISA)
5. Analyze using instruction timing data from Mesa/Asahi (agx_performance.c)
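
To make the extraction step more concrete, here is a minimal sketch of listing the slices in a fat (universal) Mach-O header. It is not this package's code. It assumes the serialized archive starts with the standard 32-bit fat header (magic 0xCAFEBABE followed by big-endian fat_arch entries); list_fat_slices is a hypothetical helper name, and the cputype value that identifies the applegpu slice is not stated here, so choosing the slice is left to the caller.

```python
import struct

FAT_MAGIC = 0xCAFEBABE  # magic number of a 32-bit fat (universal) Mach-O header


def list_fat_slices(data: bytes):
    """Return (cputype, cpusubtype, offset, size) for every slice in a fat Mach-O.

    Hypothetical helper for illustration; assumes the 32-bit fat header layout.
    """
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a 32-bit fat Mach-O file")
    slices = []
    for i in range(count):
        # Each fat_arch entry holds five big-endian 32-bit fields:
        # cputype, cpusubtype, offset, size, align.
        cputype, cpusubtype, offset, size, _align = struct.unpack_from(
            ">5I", data, 8 + 20 * i
        )
        slices.append((cputype, cpusubtype, offset, size))
    return slices


def slice_bytes(data: bytes, offset: int, size: int) -> bytes:
    """Cut one slice out of the fat file; the result is a thin Mach-O image."""
    return data[offset:offset + size]
```

Each returned slice is itself a Mach-O image, so locating the __text section would be a second parsing pass over its load commands.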

What it reports

Analysis	Source
Per-instruction cycle cost	Mesa's `agx_performance.c` timing model
4-unit pipeline breakdown (F32, F16, SCIB, IC)	Mesa's execution unit model
Register liveness → occupancy	Linear scan over instruction defs/uses
RAW dependency penalties	Measured values from metal-benchmarks (+0.84cy FP32, +0.56cy FP16)
Memory stall estimation	Scoreboard model: async loads, wait blocks
Loop body cost per iteration	Combined ALU + memory + dependency analysis
Optimization suggestions	Pattern matching on identified bottlenecks
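
The register liveness row can be illustrated with a small backward scan over instruction defs and uses. This is a sketch of the general technique, not the package's implementation. It assumes straight-line code (no loop back-edges) and that each instruction is given as a (defs, uses) pair of register-name sets; peak_live_registers is a hypothetical name.

```python
def peak_live_registers(instructions):
    """Return the peak number of simultaneously live registers.

    instructions: list of (defs, uses) pairs in program order, where defs and
    uses are iterables of register names. Straight-line code is assumed.
    """
    live = set()  # registers live after the instruction being visited
    peak = 0
    for defs, uses in reversed(instructions):
        defs, uses = set(defs), set(uses)
        # Just after the instruction: everything live-out, plus its own
        # destinations (a written register occupies space even if never read).
        peak = max(peak, len(live | defs))
        # Just before the instruction: drop what it defines, add what it reads.
        live = (live - defs) | uses
        peak = max(peak, len(live))
    return peak


program = [
    ({"r0"}, set()),         # load a
    ({"r1"}, set()),         # load b
    ({"r2"}, {"r0", "r1"}),  # r2 = a * b
    (set(), {"r2"}),         # store r2
]
print(peak_live_registers(program))
```

In the sample report above, 9 peak GPRs are shown as 18 half-regs, so the half-register count is twice the GPR count. The mapping from that count to an occupancy percentage is specific to the tool and is not reproduced here.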

Python API

from metal_profiler import profile_metal_file, profile_metal_source

# Profile a .metal file
report, disasm = profile_metal_file("kernel.metal", "my_kernel")
print(report)

# Profile from source string
report, disasm = profile_metal_source(source_code, "my_kernel")
print(report)

# Lower-level access
from metal_profiler import parse_disassembly, analyze, format_report, occupancy_for_regs

instructions = parse_disassembly(disasm)
result = analyze(instructions)
print(f"Occupancy: {result.occupancy_pct}%")
print(f"Bottleneck: {result.bottleneck}")

Requirements

macOS with Metal (Apple Silicon)
Python 3.9+

applegpu, cloned next to this repo (the example assumes both live in ~/projects):

cd ~/projects
git clone https://github.com/dougallj/applegpu.git

Usage

# Profile a kernel
python -m metal_profiler.metal_profile kernel.metal -f my_kernel

# Just disassemble (no analysis)
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --disasm-only

# Show raw disassembly alongside profile
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --show-disasm

# Profile a pre-extracted GPU binary
python -m metal_profiler.metal_profile --binary gpu_code.bin -f my_kernel
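
When driving the CLI from a script, the invocations above can be assembled programmatically. The sketch below only uses the flags shown in this section; build_profile_argv and run_profile are hypothetical helper names, and whether --binary can be combined with --disasm-only or --show-disasm is an assumption, not documented behavior.

```python
import subprocess
import sys

# Extra flags per mode, taken from the usage examples above.
MODES = {
    "profile": [],
    "disasm-only": ["--disasm-only"],
    "show-disasm": ["--show-disasm"],
}


def build_profile_argv(path, function, mode="profile", binary=False):
    """Build the argument list for one metal_profiler.metal_profile run."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = [sys.executable, "-m", "metal_profiler.metal_profile"]
    # A pre-extracted GPU binary is passed via --binary instead of a .metal path.
    argv += ["--binary", path] if binary else [path]
    argv += ["-f", function]
    argv += MODES[mode]
    return argv


def run_profile(path, function, **kwargs):
    """Run the profiler and return its stdout (requires the package installed)."""
    argv = build_profile_argv(path, function, **kwargs)
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems with kernel paths that contain spaces.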

Example output (annotated disassembly)

  ── Annotated Disassembly ──
      a8: device_load        [MEM   1cy] r6, u0_u1, r8, unsigned      ◀ memory
      b0: device_load        [MEM   1cy] r7, u2_u3, r5, unsigned      ◀ RAW dep +1.0cy
      b8: wait               [         ] 0                             ◀◀◀ STALL ~200cy
      ba: iadd               [SCIB  1cy] r3.cache, 1, r3.discard
      c2: fmadd32            [F32   1cy] r1, r7, r6, r1
  │   ca: while_icmp         [      1cy] r0l, nseq, r3, u14, 2

Each instruction shows:

Execution unit (F32/F16/SCIB/IC/MEM)
Throughput cost in cycles
Dependency penalties (◀ RAW dep)
Memory stalls (◀◀◀ STALL)
Loop depth markers (│)
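
For post-processing, the annotated lines can be split back into fields. The parser below is a sketch inferred only from the example output above, so the layout it expects (hexadecimal offset, bracketed unit and cycle cost, ◀ before notes, │ for loop depth) is an assumption and may not hold for every line the tool prints. parse_annotated_line is a hypothetical name and not part of the package API.

```python
import re

LINE_RE = re.compile(
    r"^(?P<depth>[\s│]*)"                  # indentation and loop-depth bars
    r"(?P<offset>[0-9a-f]+):\s+"           # instruction offset (assumed hex)
    r"(?P<opcode>\S+)\s+"
    r"\[\s*(?P<unit>[A-Z][A-Z0-9]*)?\s*"   # execution unit, may be blank
    r"(?:(?P<cycles>\d+)cy)?\s*\]\s*"      # throughput cost, may be blank
    r"(?P<rest>.*)$"                       # operands, then optional note
)


def parse_annotated_line(line):
    """Parse one annotated disassembly line into a dict, or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    operands, _, note = m.group("rest").partition("◀")
    return {
        "loop_depth": m.group("depth").count("│"),
        "offset": int(m.group("offset"), 16),
        "opcode": m.group("opcode"),
        "unit": m.group("unit"),
        "cycles": int(m.group("cycles")) if m.group("cycles") else None,
        "operands": operands.strip(),
        "note": note.lstrip("◀ ").strip(),
    }


sample = "      b8: wait               [         ] 0                             ◀◀◀ STALL ~200cy"
print(parse_annotated_line(sample))
```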

Suggestions engine

The profiler generates actionable suggestions:

🔴 High: Tile global loads into threadgroup memory, reduce register pressure for occupancy
🟡 Medium: Break dependency chains, hoist expensive ops out of loops, unroll for latency hiding
🟢 Low: Consider FP16 for throughput, minor scheduling improvements
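
As a rough illustration of pattern matching on loop metrics, the sketch below produces the two suggestions from the sample report at the top of this page. The LoopStats fields and both thresholds are hypothetical values chosen for the example; the profiler's actual rules and cutoffs are not documented here.

```python
from dataclasses import dataclass


@dataclass
class LoopStats:
    alu_cycles: float    # ALU cycles per iteration
    global_loads: int    # device loads per iteration
    stall_cycles: float  # estimated wait stall per iteration


# Hypothetical thresholds, chosen only for this example.
HIGH_STALL_CY = 100.0
MIN_COVER_CY = 20.0


def suggest(loop):
    """Return a list of (severity, message) pairs for one loop."""
    out = []
    if loop.global_loads > 0 and loop.stall_cycles >= HIGH_STALL_CY:
        out.append((
            "high",
            f"{loop.global_loads} global loads/iter with "
            f"~{loop.stall_cycles:.0f}cy stall. Tile into threadgroup memory.",
        ))
    if loop.global_loads > 0 and loop.alu_cycles < MIN_COVER_CY:
        out.append((
            "medium",
            f"Only {loop.alu_cycles:.0f}cy ALU between loads and wait. "
            "Unroll or interleave independent work.",
        ))
    return out


for severity, message in suggest(LoopStats(5, 2, 195)):
    print(severity, message)
```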

Architecture data sources

This tool stands on the shoulders of:

Asahi Linux / Mesa — Alyssa Rosenzweig's reverse-engineered AGX compiler, ISA, and performance model
applegpu — Dougall Johnson's AGX instruction set disassembler and emulator
metal-benchmarks — Philip Turner's measured instruction latencies and cache hierarchy data

License

MIT

Release history

0.1.1 (this version): Apr 8, 2026
0.1.0: Apr 8, 2026

Download files

Source Distribution

metal_profiler-0.1.1.tar.gz (56.0 kB)

Uploaded Apr 8, 2026 Source

Built Distribution

metal_profiler-0.1.1-py3-none-any.whl (56.1 kB)

Uploaded Apr 8, 2026 Python 3

File details

Details for the file metal_profiler-0.1.1.tar.gz.

File metadata

Download URL: metal_profiler-0.1.1.tar.gz
Upload date: Apr 8, 2026
Size: 56.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for metal_profiler-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a481b69221e9c3c3cf36431f9ffa083f73cf006e95b327436cc5ce83a7a788fb`
MD5	`220816586e62037463519495d815bf20`
BLAKE2b-256	`88356b0769de5b9878a31a583c9df835b0107ea23295867016cecccc92701532`

File details

Details for the file metal_profiler-0.1.1-py3-none-any.whl.

File metadata

Download URL: metal_profiler-0.1.1-py3-none-any.whl
Upload date: Apr 8, 2026
Size: 56.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for metal_profiler-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2d165be6f4afda35406bbc4d3dee200845de733451efc3efa7e1cd506ded6a9b`
MD5	`f454446eabbde3eed5a81ff1144493a8`
BLAKE2b-256	`0b4d6a149af5f51becc98b52f32781cc13f30b54aa46003d7725bb1655638749`
