Static analysis profiler for Metal compute shaders on Apple Silicon

metal-profiler

Static analysis profiler for Metal compute shaders. Compiles your kernel, extracts the native AGX GPU binary, disassembles it, and tells you exactly where the bottleneck is.

$ python -m metal_profiler.metal_profile kernel.metal -f matmul_naive

╔══════════════════════════════════════════════════════════════╗
║  metal-profiler: matmul_naive                               ║
╚══════════════════════════════════════════════════════════════╝

  ── Registers & Occupancy ──
  Peak live GPRs:  9
  Half-regs:       18 / 256
  Occupancy:       100%
                   [████████████████████████████████████████] (good)

  ── Loop 0 ──
  ALU/iter:        5 cy
  Loads/iter:      2
  Wait stall:      ~195 cy (2 loads before wait)
  Total/iter:      201 cy

  ── Suggestions ──
  🔴 2 global loads/iter with ~195cy stall. Tile into threadgroup memory.
  🟡 Only 5cy ALU between loads and wait. Unroll or interleave independent work.

No guessing. No Xcode required. Real GPU instructions, with cycle counts estimated from published timing data.

How it works

  1. Compile .metal → .metallib (via xcrun metal)
  2. Create MTLBinaryArchive → triggers Apple's GPU JIT compiler
  3. Extract native AGX machine code from the archive (fat Mach-O → applegpu slice → __text section)
  4. Disassemble using applegpu (Dougall Johnson's reverse-engineered ISA)
  5. Analyze using instruction timing data from Mesa/Asahi (agx_performance.c)
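Step 1 can be sketched as follows. This is a minimal sketch, not the profiler's own code: the helper name `build_compile_commands` is hypothetical, and it assumes Apple's usual two-command flow (`metal -c` to produce an intermediate .air file, then `metallib` to package it). The README does not say whether metal-profiler compiles in one step or two.

```python
from pathlib import Path


def build_compile_commands(source: str, sdk: str = "macosx") -> list[list[str]]:
    """Return the argv lists that compile a .metal file into a .metallib.

    Hypothetical helper: it only builds the commands, so it can be
    inspected on any machine. Running them requires macOS with the
    Metal command-line toolchain installed.
    """
    src = Path(source)
    air = src.with_suffix(".air")        # intermediate representation
    lib = src.with_suffix(".metallib")   # packaged shader library
    return [
        ["xcrun", "-sdk", sdk, "metal", "-c", str(src), "-o", str(air)],
        ["xcrun", "-sdk", sdk, "metallib", str(air), "-o", str(lib)],
    ]


if __name__ == "__main__":
    for cmd in build_compile_commands("kernel.metal"):
        print(" ".join(cmd))
```

On a Mac, each returned list can be passed to `subprocess.run(cmd, check=True)`. Steps 2 to 5 depend on the Metal runtime and on applegpu, so they are not sketched here.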

What it reports

| Analysis | Source |
|---|---|
| Per-instruction cycle cost | Mesa's agx_performance.c timing model |
| 4-unit pipeline breakdown (F32, F16, SCIB, IC) | Mesa's execution unit model |
| Register liveness → occupancy | Linear scan over instruction defs/uses |
| RAW dependency penalties | Metal-benchmarks measured values (+0.84cy FP32, +0.56cy FP16) |
| Memory stall estimation | Scoreboard model: async loads, wait blocks |
| Loop body cost per iteration | Combined ALU + memory + dependency analysis |
| Optimization suggestions | Pattern matching on identified bottlenecks |
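To illustrate the register liveness row, here is a minimal sketch of a backward scan over defs and uses that yields the peak number of simultaneously live registers. The `Instr` type and `peak_live_registers` function are hypothetical and are not part of metal_profiler. The sketch handles straight-line code only (a loop would need the scan repeated until the live sets stop changing) and treats every register as a full 32-bit GPR.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction record: registers written and read."""
    defs: frozenset
    uses: frozenset


def peak_live_registers(instrs, live_out=frozenset()):
    """Walk the instructions backwards and track the live set.

    A register is live from its definition until its last use.
    Returns the largest live-set size seen at any point.
    """
    live = set(live_out)
    peak = len(live)
    for ins in reversed(instrs):
        # A destination occupies a register here even if it is never read.
        peak = max(peak, len(live | ins.defs))
        live -= ins.defs   # values defined here are not live before it
        live |= ins.uses   # values read here must be live before it
        peak = max(peak, len(live))
    return peak


# Loosely modelled on the loop body in the annotated disassembly.
body = [
    Instr(frozenset({"r6"}), frozenset({"r8"})),              # device_load
    Instr(frozenset({"r7"}), frozenset({"r5"})),              # device_load
    Instr(frozenset({"r1"}), frozenset({"r7", "r6", "r1"})),  # fmadd32
]
peak = peak_live_registers(body, live_out=frozenset({"r1"}))
print(peak, "GPRs,", 2 * peak, "half-regs")
```

The factor of two mirrors the sample report, where 9 GPRs are shown as 18 half-regs. How the profiler maps half-regs to an occupancy percentage is not described here, so that step is left out.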

Python API

from metal_profiler import profile_metal_file, profile_metal_source

# Profile a .metal file
report, disasm = profile_metal_file("kernel.metal", "my_kernel")
print(report)

# Profile from source string
report, disasm = profile_metal_source(source_code, "my_kernel")
print(report)

# Lower-level access
from metal_profiler import parse_disassembly, analyze, format_report, occupancy_for_regs

instructions = parse_disassembly(disasm)
result = analyze(instructions)
print(f"Occupancy: {result.occupancy_pct}%")
print(f"Bottleneck: {result.bottleneck}")
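When one .metal file holds several kernels, the documented call can be wrapped in a loop. The helper below is a hypothetical convenience, not part of the package. It takes the profiling function as a parameter and assumes only what the example above shows: that the function accepts a path and a kernel name and returns a `(report, disasm)` pair.

```python
def profile_all(path, kernel_names, profile_fn):
    """Profile each named kernel in one file.

    profile_fn is expected to behave like profile_metal_file:
    profile_fn(path, name) -> (report, disasm).
    Returns a dict mapping kernel name to that pair.
    """
    results = {}
    for name in kernel_names:
        report, disasm = profile_fn(path, name)
        results[name] = (report, disasm)
    return results


# Intended use on a Mac with metal_profiler installed:
#   from metal_profiler import profile_metal_file
#   results = profile_all("kernel.metal", ["matmul_naive"], profile_metal_file)
#   print(results["matmul_naive"][0])
```

Passing the function in keeps the helper independent of the package, so it can be exercised with a stub on machines without Metal.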

Requirements

  • macOS with Metal (Apple Silicon)
  • Python 3.9+
  • applegpu — clone it next to this repo:
    cd ~/projects
    git clone https://github.com/dougallj/applegpu.git
    

Usage

# Profile a kernel
python -m metal_profiler.metal_profile kernel.metal -f my_kernel

# Just disassemble (no analysis)
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --disasm-only

# Show raw disassembly alongside profile
python -m metal_profiler.metal_profile kernel.metal -f my_kernel --show-disasm

# Profile a pre-extracted GPU binary
python -m metal_profiler.metal_profile --binary gpu_code.bin -f my_kernel

Example output (annotated disassembly)

  ── Annotated Disassembly ──
      a8: device_load        [MEM   1cy] r6, u0_u1, r8, unsigned      ◀ memory
      b0: device_load        [MEM   1cy] r7, u2_u3, r5, unsigned      ◀ RAW dep +1.0cy
      b8: wait               [         ] 0                             ◀◀◀ STALL ~200cy
      ba: iadd               [SCIB  1cy] r3.cache, 1, r3.discard
      c2: fmadd32            [F32   1cy] r1, r7, r6, r1
  │   ca: while_icmp         [      1cy] r0l, nseq, r3, u14, 2

Each instruction shows:

  • Execution unit (F32/F16/SCIB/IC/MEM)
  • Throughput cost in cycles
  • Dependency penalties (◀ RAW dep)
  • Memory stalls (◀◀◀ STALL)
  • Loop depth markers (│)
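If the annotated listing needs to be post-processed, each line can be split back into these fields. The parser below is a rough sketch written against the sample output shown above. The line format is inferred from that sample only, and `parse_annotated_line` is a hypothetical name, not a metal_profiler function.

```python
import re

# Layout inferred from the sample: [loop markers] addr: opcode [UNIT Ncy] operands ◀ note
LINE_RE = re.compile(
    r"^(?P<prefix>[\s│]*)"
    r"(?P<addr>[0-9a-f]+):\s+"
    r"(?P<op>\S+)\s+"
    r"\[\s*(?P<unit>[A-Z][A-Z0-9]*)?\s*(?:(?P<cycles>\d+)cy)?\s*\]\s*"
    r"(?P<rest>.*)$"
)


def parse_annotated_line(line):
    """Return a dict of fields, or None if the line is not an instruction."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Anything after the first run of ◀ markers is the annotation.
    parts = re.split(r"\s*◀+\s*", m["rest"].strip(), maxsplit=1)
    return {
        "addr": int(m["addr"], 16),
        "op": m["op"],
        "unit": m["unit"],                                   # None when blank
        "cycles": int(m["cycles"]) if m["cycles"] else None,
        "operands": parts[0],
        "note": parts[1] if len(parts) > 1 else "",
        "loop_depth": m["prefix"].count("│"),
    }


sample = "      a8: device_load        [MEM   1cy] r6, u0_u1, r8, unsigned      ◀ memory"
print(parse_annotated_line(sample))
```

For structured access, the `parse_disassembly` and `analyze` functions from the Python API are likely the better starting point. This regex is only useful when the text report is all that is available.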

Suggestions engine

The profiler generates actionable suggestions:

  • 🔴 High: Tile global loads into threadgroup memory, reduce register pressure for occupancy
  • 🟡 Medium: Break dependency chains, hoist expensive ops out of loops, unroll for latency hiding
  • 🟢 Low: Consider FP16 for throughput, minor scheduling improvements
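As an illustration of how such rules might look, the sketch below reproduces the two suggestions from the sample report using simple threshold checks. The function name, the parameters, and both thresholds (`stall_threshold`, `alu_cover_ratio`) are assumptions made for this example. The profiler's actual rules and cut-offs are not documented here.

```python
def suggest(loads_per_iter, stall_cycles, alu_cycles,
            stall_threshold=100, alu_cover_ratio=0.25):
    """Return (severity, message) pairs for one loop.

    Hypothetical rules:
      high   - the loop issues global loads and stalls for a long time
      medium - there is little ALU work available to hide that stall
    """
    out = []
    if loads_per_iter > 0 and stall_cycles >= stall_threshold:
        out.append((
            "high",
            f"{loads_per_iter} global loads/iter with ~{stall_cycles}cy stall. "
            "Tile into threadgroup memory.",
        ))
    if loads_per_iter > 0 and alu_cycles < alu_cover_ratio * stall_cycles:
        out.append((
            "medium",
            f"Only {alu_cycles}cy ALU between loads and wait. "
            "Unroll or interleave independent work.",
        ))
    return out


# Numbers taken from the Loop 0 section of the sample report.
for severity, message in suggest(loads_per_iter=2, stall_cycles=195, alu_cycles=5):
    print(severity, message)
```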

Architecture data sources

This tool stands on the shoulders of:

  • Asahi Linux / Mesa — Alyssa Rosenzweig's reverse-engineered AGX compiler, ISA, and performance model
  • applegpu — Dougall Johnson's AGX instruction set disassembler and emulator
  • metal-benchmarks — Philip Turner's measured instruction latencies and cache hierarchy data

License

MIT
