
metal-profiler

Static analysis profiler for Metal compute shaders. Compiles your kernel, extracts the native AGX GPU binary, disassembles it, and tells you exactly where the bottleneck is.

$ python -m metal_profiler.metal_profile kernel.metal -f matmul_naive

╔══════════════════════════════════════════════════════════════╗
║  metal-profiler: matmul_naive                               ║
╚══════════════════════════════════════════════════════════════╝

  ── Registers & Occupancy ──
  Peak live GPRs:  9
  Half-regs:       18 / 256
  Occupancy:       100%
                   [████████████████████████████████████████] (good)

  ── Loop 0 ──
  ALU/iter:        5 cy
  Loads/iter:      2
  Wait stall:      ~195 cy (2 loads before wait)
  Total/iter:      201 cy

  ── Suggestions ──
  🔴 2 global loads/iter with ~195cy stall. Tile into threadgroup memory.
  🟡 Only 5cy ALU between loads and wait. Unroll or interleave independent work.

No guessing. No Xcode required. Real GPU instructions, real cycle counts.

How it works

  1. Compile .metal → .metallib (via xcrun metal)
  2. Create MTLBinaryArchive → triggers Apple's GPU JIT compiler
  3. Extract native AGX machine code from the archive (fat Mach-O → applegpu slice → __text section)
  4. Disassemble using applegpu (Dougall Johnson's reverse-engineered ISA)
  5. Analyze using instruction timing data from Mesa/Asahi (agx_performance.c)
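Step 1 is an ordinary toolchain invocation and can be scripted directly. A minimal sketch, assuming the Metal toolchain is installed (`xcrun metal`); the helper name `compile_metallib` is ours for illustration, not part of the package:

```python
import subprocess

def compile_metallib(src: str, out: str) -> list:
    """Build the `xcrun metal` command for step 1 (.metal -> .metallib).
    Returned as a list so it can be inspected or passed to subprocess.run."""
    return ["xcrun", "-sdk", "macosx", "metal", src, "-o", out]

cmd = compile_metallib("kernel.metal", "kernel.metallib")
# On macOS with the Metal toolchain installed, this would run the compiler:
#   subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

Steps 2 and 3 happen inside the tool: the .metallib is loaded into an MTLBinaryArchive, which forces the driver to JIT-compile native AGX code that can then be carved out of the archive's Mach-O.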

What it reports

Each analysis and its data source:

  • Per-instruction cycle cost: Mesa's agx_performance.c timing model
  • 4-unit pipeline breakdown (F32, F16, SCIB, IC): Mesa's execution unit model
  • Register liveness → occupancy: linear scan over instruction defs/uses
  • RAW dependency penalties: metal-benchmarks measured values (+0.84cy FP32, +0.56cy FP16)
  • Memory stall estimation: scoreboard model of async loads and wait blocks
  • Loop body cost per iteration: combined ALU + memory + dependency analysis
  • Optimization suggestions: pattern matching on identified bottlenecks
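The register-pressure → occupancy step can be illustrated with a toy model. The `full_occupancy_budget` threshold below is an assumption for illustration only, not the tool's (or the hardware's) real occupancy table:

```python
def occupancy_pct(peak_half_regs: int, full_occupancy_budget: int = 128) -> int:
    """Toy occupancy model: full occupancy while each thread's peak live
    half-registers fit the budget; above that, occupancy falls off in
    proportion as fewer threads can stay resident."""
    used = max(peak_half_regs, 1)
    if used <= full_occupancy_budget:
        return 100
    return max(1, (100 * full_occupancy_budget) // used)

print(occupancy_pct(18))   # the matmul_naive report above: 18 half-regs, 100%
```

The real analysis derives `peak_half_regs` from a linear scan over each instruction's defs and uses, so shrinking live ranges (or spilling less state across a loop) is what actually moves this number.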

Python API

from metal_profiler import profile_metal_file, profile_metal_source

# Profile a .metal file
report, disasm = profile_metal_file("kernel.metal", "my_kernel")
print(report)

# Profile from source string
report, disasm = profile_metal_source(source_code, "my_kernel")
print(report)

# Lower-level access
from metal_profiler import parse_disassembly, analyze, format_report, occupancy_for_regs

instructions = parse_disassembly(disasm)
result = analyze(instructions)
print(f"Occupancy: {result.occupancy_pct}%")
print(f"Bottleneck: {result.bottleneck}")

Requirements

  • macOS with Metal (Apple Silicon)
  • Python 3.9+
  • applegpu — clone it next to this repo:
    cd ~/projects
    git clone https://github.com/dougallj/applegpu.git
    

Usage

# Profile a kernel
python -m metal_profiler.metal_profile kernel.metal -f my_kernel

# Just disassemble (no analysis)
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --disasm-only

# Show raw disassembly alongside profile
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --show-disasm

# Profile a pre-extracted GPU binary
python -m metal_profiler.metal_profile --binary gpu_code.bin -f my_kernel

Example output (annotated disassembly)

  ── Annotated Disassembly ──
      a8: device_load        [MEM   1cy] r6, u0_u1, r8, unsigned      ◀ memory
      b0: device_load        [MEM   1cy] r7, u2_u3, r5, unsigned      ◀ RAW dep +1.0cy
      b8: wait               [         ] 0                             ◀◀◀ STALL ~200cy
      ba: iadd               [SCIB  1cy] r3.cache, 1, r3.discard
      c2: fmadd32            [F32   1cy] r1, r7, r6, r1
  │   ca: while_icmp         [      1cy] r0l, nseq, r3, u14, 2

Each instruction shows:

  • Execution unit (F32/F16/SCIB/IC/MEM)
  • Throughput cost in cycles
  • Dependency penalties (◀ RAW dep)
  • Memory stalls (◀◀◀ STALL)
  • Loop depth markers (│)
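Each listing line is regular enough to parse mechanically. A hypothetical parser for the address/opcode/unit/cost fields — the regex and field names are ours, not the package's internal representation:

```python
import re

# Matches e.g. "c2: fmadd32   [F32   1cy] ..." and "b8: wait   [         ] 0".
LINE = re.compile(
    r"(?P<addr>[0-9a-f]+):\s+(?P<op>\w+)\s+"
    r"\[(?P<unit>\w*)\s*(?P<cost>\d+)?(?:cy)?\s*\]"
)

m = LINE.search("c2: fmadd32            [F32   1cy] r1, r7, r6, r1")
print(m.group("addr"), m.group("op"), m.group("unit"), m.group("cost"))
# -> c2 fmadd32 F32 1
```

Lines with an empty cost bracket (like `wait`) still match, with `unit` empty and `cost` of `None` — costs for waits come from the stall model, not the listing.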

Suggestions engine

The profiler generates actionable suggestions:

  • 🔴 High: Tile global loads into threadgroup memory, reduce register pressure for occupancy
  • 🟡 Medium: Break dependency chains, hoist expensive ops out of loops, unroll for latency hiding
  • 🟢 Low: Consider FP16 for throughput, minor scheduling improvements
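Conceptually this is a rule table keyed on the measured bottlenecks. A toy version, with thresholds and wording invented for illustration (the real engine's rules are richer):

```python
def suggest(loads_per_iter: int, stall_cy: int, alu_cy: int) -> list:
    """Map measured loop metrics to (severity, suggestion) pairs."""
    tips = []
    if loads_per_iter >= 2 and stall_cy > 100:
        tips.append(("high", "Tile global loads into threadgroup memory."))
    if alu_cy < stall_cy // 10:
        tips.append(("medium", "Unroll or interleave independent work to hide latency."))
    return tips

# The matmul_naive loop above: 2 loads, ~195cy stall, 5cy ALU -> both rules fire.
for severity, msg in suggest(loads_per_iter=2, stall_cy=195, alu_cy=5):
    print(severity, msg)
```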

Architecture data sources

This tool stands on the shoulders of:

  • Asahi Linux / Mesa — Alyssa Rosenzweig's reverse-engineered AGX compiler, ISA, and performance model
  • applegpu — Dougall Johnson's AGX instruction set disassembler and emulator
  • metal-benchmarks — Philip Turner's measured instruction latencies and cache hierarchy data

License

MIT


Download files


Source Distribution

metal_profiler-0.1.0.tar.gz (19.5 kB)

Built Distribution

metal_profiler-0.1.0-py3-none-any.whl (19.8 kB)

File details

Details for the file metal_profiler-0.1.0.tar.gz.

File metadata

  • Download URL: metal_profiler-0.1.0.tar.gz
  • Upload date:
  • Size: 19.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for metal_profiler-0.1.0.tar.gz:

  • SHA256: 7e4f6537f4047bb45610f890d12d8931cde29ed30438b23448d9f5776e9c5b6b
  • MD5: 1c49fe1ad599d0f5a2cf81e748b69095
  • BLAKE2b-256: 6256f11b8814b4ac80c519b2154d9c87f2b4d8b0bc8e01eca77a4e3bdd83932a


File details

Details for the file metal_profiler-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: metal_profiler-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for metal_profiler-0.1.0-py3-none-any.whl:

  • SHA256: 876f0e2f50c2c4170e5d684927d8951e048d5a3964665a68d14c8c03fb22174b
  • MD5: 6cbc3f9e440825ac2c676f533bf4d341
  • BLAKE2b-256: 234a712725ec491ec1a955a930e23482069db8e432e1446362648646e2f6ba1f

