Static analysis profiler for Metal compute shaders on Apple Silicon
metal-profiler
Static analysis profiler for Metal compute shaders. Compiles your kernel, extracts the native AGX GPU binary, disassembles it, and tells you exactly where the bottleneck is.
$ python -m metal_profiler.metal_profile kernel.metal -f matmul_naive
╔══════════════════════════════════════════════════════════════╗
║ metal-profiler: matmul_naive ║
╚══════════════════════════════════════════════════════════════╝
── Registers & Occupancy ──
Peak live GPRs: 9
Half-regs: 18 / 256
Occupancy: 100%
[████████████████████████████████████████] (good)
── Loop 0 ──
ALU/iter: 5 cy
Loads/iter: 2
Wait stall: ~195 cy (2 loads before wait)
Total/iter: 201 cy
── Suggestions ──
🔴 2 global loads/iter with ~195cy stall. Tile into threadgroup memory.
🟡 Only 5cy ALU between loads and wait. Unroll or interleave independent work.
No guessing. No Xcode GPU debugger required (only the xcrun metal toolchain). Real GPU instructions, with cycle estimates from a static timing model.
How it works
- Compile .metal → .metallib (via xcrun metal)
- Create MTLBinaryArchive → triggers Apple's GPU JIT compiler
- Extract native AGX machine code from the archive (fat Mach-O → applegpu slice → __text section)
- Disassemble using applegpu (Dougall Johnson's reverse-engineered ISA)
- Analyze using instruction timing data from Mesa/Asahi (agx_performance.c)
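The compile step can be reproduced by hand. The sketch below builds the usual two-stage Apple toolchain invocation (`metal -c` to an `.air` file, then `metallib`); it is an assumption about how such a step could look, not necessarily the exact commands metal-profiler runs internally. The helper names `compile_commands` and `compile_metal` are hypothetical.

```python
import subprocess
from pathlib import Path


def compile_commands(source, sdk="macosx"):
    """Return the xcrun command lines that turn a .metal file into a .metallib.

    Hypothetical helper: the two-stage flow (.metal -> .air -> .metallib)
    is the standard Apple toolchain usage, assumed here for illustration.
    """
    src = Path(source)
    air = src.with_suffix(".air")
    lib = src.with_suffix(".metallib")
    return [
        ["xcrun", "-sdk", sdk, "metal", "-c", str(src), "-o", str(air)],
        ["xcrun", "-sdk", sdk, "metallib", str(air), "-o", str(lib)],
    ]


def compile_metal(source):
    """Run both stages; raises CalledProcessError if the toolchain fails."""
    for cmd in compile_commands(source):
        subprocess.run(cmd, check=True)
    return Path(source).with_suffix(".metallib")
```

Separating command construction from execution keeps the first function usable on machines without the Metal toolchain, for example when logging or dry-running a build.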
What it reports
| Analysis | Source |
|---|---|
| Per-instruction cycle cost | Mesa's agx_performance.c timing model |
| 4-unit pipeline breakdown (F32, F16, SCIB, IC) | Mesa's execution unit model |
| Register liveness → occupancy | Linear scan over instruction defs/uses |
| RAW dependency penalties | Metal-benchmarks measured values (+0.84cy FP32, +0.56cy FP16) |
| Memory stall estimation | Scoreboard model: async loads, wait blocks |
| Loop body cost per iteration | Combined ALU + memory + dependency analysis |
| Optimization suggestions | Pattern matching on identified bottlenecks |
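To illustrate the "register liveness" row, here is a minimal sketch of a linear scan over instruction defs and uses that reports the peak number of simultaneously live registers. It assumes straight-line code (no loops or branches) and a hypothetical `(defs, uses)` tuple representation; the profiler's own data structures and its mapping from register count to occupancy are not shown in this README and are not reproduced here.

```python
def peak_live_registers(instructions):
    """Peak number of live registers over straight-line code.

    instructions: list of (defs, uses) pairs of register names, in program
    order. This representation is an assumption made for illustration.
    """
    last_use = {}      # register -> index of the last instruction reading it
    defined = set()    # registers written so far
    live_in = set()    # registers read before any write (kernel inputs)
    for i, (defs, uses) in enumerate(instructions):
        for reg in uses:
            last_use[reg] = i
            if reg not in defined:
                live_in.add(reg)
        defined.update(defs)

    live = set(live_in)
    peak = len(live)
    for i, (defs, uses) in enumerate(instructions):
        # A source register dies at its last read.
        live -= {r for r in uses if last_use[r] == i}
        # A destination becomes live only if something reads it later.
        live |= {r for r in defs if last_use.get(r, -1) > i}
        peak = max(peak, len(live))
    return peak


# Two loads feeding a fused multiply-add that accumulates into r1.
program = [
    (["r6"], ["u0"]),
    (["r7"], ["u2"]),
    (["r1"], ["r7", "r6", "r1"]),
]
print(peak_live_registers(program))  # 3
```

A real kernel needs a fixed-point pass over the control-flow graph so that values carried around a loop stay live across the back edge; the single forward pass above is only correct for straight-line code.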
Python API
from metal_profiler import profile_metal_file, profile_metal_source
# Profile a .metal file
report, disasm = profile_metal_file("kernel.metal", "my_kernel")
print(report)
# Profile from source string
report, disasm = profile_metal_source(source_code, "my_kernel")
print(report)
# Lower-level access
from metal_profiler import parse_disassembly, analyze, format_report, occupancy_for_regs
instructions = parse_disassembly(disasm)
result = analyze(instructions)
print(f"Occupancy: {result.occupancy_pct}%")
print(f"Bottleneck: {result.bottleneck}")
Requirements
- macOS with Metal (Apple Silicon)
- Python 3.9+
- applegpu — clone it next to this repo:
cd ~/projects
git clone https://github.com/dougallj/applegpu.git
Usage
# Profile a kernel
python -m metal_profiler.metal_profile kernel.metal -f my_kernel
# Just disassemble (no analysis)
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --disasm-only
# Show raw disassembly alongside profile
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --show-disasm
# Profile a pre-extracted GPU binary
python -m metal_profiler.metal_profile --binary gpu_code.bin -f my_kernel
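For scripting several runs, the command lines above can be assembled programmatically. The sketch below only uses the flags shown in this section (`-f`, `--disasm-only`, `--show-disasm`, `--binary`); the helper name `profiler_argv` is hypothetical and not part of the package.

```python
import subprocess
import sys


def profiler_argv(path, function, *, binary=False,
                  disasm_only=False, show_disasm=False):
    """Build the argv for one profiler run (hypothetical helper)."""
    argv = [sys.executable, "-m", "metal_profiler.metal_profile"]
    # A pre-extracted GPU binary is passed via --binary instead of positionally.
    argv += ["--binary", path] if binary else [path]
    argv += ["-f", function]
    if disasm_only:
        argv.append("--disasm-only")
    if show_disasm:
        argv.append("--show-disasm")
    return argv


def run_profiler(path, function, **options):
    """Run the profiler and return its standard output as text."""
    result = subprocess.run(profiler_argv(path, function, **options),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Using `sys.executable` runs the profiler with the same interpreter (and virtual environment) as the calling script.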
Example output (annotated disassembly)
── Annotated Disassembly ──
a8: device_load [MEM 1cy] r6, u0_u1, r8, unsigned ◀ memory
b0: device_load [MEM 1cy] r7, u2_u3, r5, unsigned ◀ RAW dep +1.0cy
b8: wait [ ] 0 ◀◀◀ STALL ~200cy
ba: iadd [SCIB 1cy] r3.cache, 1, r3.discard
c2: fmadd32 [F32 1cy] r1, r7, r6, r1
│ ca: while_icmp [ 1cy] r0l, nseq, r3, u14, 2
Each instruction shows:
- Execution unit (F32/F16/SCIB/IC/MEM)
- Throughput cost in cycles
- Dependency penalties (◀ RAW dep)
- Memory stalls (◀◀◀ STALL)
- Loop depth markers (│)
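If the annotated listing needs to be post-processed, each line can be split into the fields listed above. The following is a rough sketch whose regular expression is inferred only from the sample output in this section, so the real format may have cases it misses; `parse_annotated_line` is a hypothetical name, unrelated to the package's `parse_disassembly`.

```python
import re

# Layout assumed from the sample: [│ ...] addr: op [UNIT Ncy] operands [◀ note]
LINE_RE = re.compile(
    r"^\s*(?P<depth>(?:│\s*)*)"
    r"(?P<addr>[0-9a-f]+):\s+"
    r"(?P<op>\S+)\s+"
    r"\[(?P<unit>[A-Z0-9]*)\s*(?:(?P<cycles>\d+)cy)?\s*\]\s*"
    r"(?P<rest>.*)$"
)


def parse_annotated_line(line):
    """Return a dict of fields, or None if the line is not an instruction."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Everything after the first ◀ is the annotation (dependency or stall).
    operands, _, note = m.group("rest").partition("◀")
    return {
        "address": int(m.group("addr"), 16),
        "op": m.group("op"),
        "unit": m.group("unit") or None,
        "cycles": int(m.group("cycles")) if m.group("cycles") else None,
        "loop_depth": m.group("depth").count("│"),
        "operands": operands.strip(),
        "note": note.lstrip("◀").strip() or None,
    }


print(parse_annotated_line("c2: fmadd32 [F32 1cy] r1, r7, r6, r1"))
```

Header lines such as the section banner do not start with a hexadecimal address followed by a colon, so they return `None` and can be skipped by the caller.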
Suggestions engine
The profiler generates actionable suggestions:
- 🔴 High: Tile global loads into threadgroup memory, reduce register pressure for occupancy
- 🟡 Medium: Break dependency chains, hoist expensive ops out of loops, unroll for latency hiding
- 🟢 Low: Consider FP16 for throughput, minor scheduling improvements
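As an illustration of how pattern-based suggestions of this kind can be produced, the sketch below applies a few rules to per-loop statistics. The `LoopStats` type, the rule set, and all thresholds are assumptions chosen for the example; the profiler's actual rules are not documented here.

```python
from dataclasses import dataclass

# Illustrative thresholds only; not taken from metal-profiler.
STALL_HIGH_CY = 100     # stall per iteration that counts as severe
OCCUPANCY_LOW_PCT = 50  # occupancy below this is flagged
HIDE_RATIO = 0.25       # ALU work should cover at least this share of the stall


@dataclass
class LoopStats:
    alu_cycles: float
    loads: int
    stall_cycles: float


def suggest(loop, occupancy_pct):
    """Return (severity, message) pairs for one loop body."""
    out = []
    if loop.loads > 0 and loop.stall_cycles >= STALL_HIGH_CY:
        out.append(("high",
                    f"{loop.loads} global loads/iter with "
                    f"~{loop.stall_cycles:.0f}cy stall. "
                    "Tile into threadgroup memory."))
    if occupancy_pct < OCCUPANCY_LOW_PCT:
        out.append(("high", "Reduce register pressure to raise occupancy."))
    if loop.loads > 0 and loop.alu_cycles < loop.stall_cycles * HIDE_RATIO:
        out.append(("medium",
                    f"Only {loop.alu_cycles:.0f}cy ALU between loads and wait. "
                    "Unroll or interleave independent work."))
    return out


# Numbers from the matmul_naive report at the top of this page.
for severity, message in suggest(LoopStats(5, 2, 195), occupancy_pct=100):
    print(severity, message)
```

With the `matmul_naive` numbers, the first and third rules fire, matching the two suggestions shown in the sample report.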
Architecture data sources
This tool stands on the shoulders of:
- Asahi Linux / Mesa — Alyssa Rosenzweig's reverse-engineered AGX compiler, ISA, and performance model
- applegpu — Dougall Johnson's AGX instruction set disassembler and emulator
- metal-benchmarks — Philip Turner's measured instruction latencies and cache hierarchy data
License
MIT
File details
Details for the file metal_profiler-0.1.1.tar.gz.
File metadata
- Download URL: metal_profiler-0.1.1.tar.gz
- Upload date:
- Size: 56.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a481b69221e9c3c3cf36431f9ffa083f73cf006e95b327436cc5ce83a7a788fb |
| MD5 | 220816586e62037463519495d815bf20 |
| BLAKE2b-256 | 88356b0769de5b9878a31a583c9df835b0107ea23295867016cecccc92701532 |
File details
Details for the file metal_profiler-0.1.1-py3-none-any.whl.
File metadata
- Download URL: metal_profiler-0.1.1-py3-none-any.whl
- Upload date:
- Size: 56.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2d165be6f4afda35406bbc4d3dee200845de733451efc3efa7e1cd506ded6a9b |
| MD5 | f454446eabbde3eed5a81ff1144493a8 |
| BLAKE2b-256 | 0b4d6a149af5f51becc98b52f32781cc13f30b54aa46003d7725bb1655638749 |