Python bindings for metal-cpp

PyMetal

Python bindings for Apple's Metal GPU API, enabling high-performance GPU computing and graphics programming from Python.

Overview

PyMetal provides Pythonic access to Apple's Metal API through metal-cpp and [nanobind](https://github.com/wjakob/nanobind), allowing you to:

Write and execute Metal compute shaders from Python
Build complete graphics pipelines with vertex/fragment shaders
Leverage GPU acceleration for custom algorithms
Integrate seamlessly with NumPy for zero-copy data transfer
Access advanced Metal features like events, binary archives, and capture scopes

Why PyMetal?

Direct Metal Access: Full control over GPU resources, not a high-level abstraction
Zero-Copy NumPy Integration: Efficient data transfer between Python and GPU
Complete API Coverage: Compute, graphics, advanced synchronization, and debugging
Educational: Clear examples showing GPU programming concepts
Performant: Properly releases GIL for multithreaded Python applications

Features

Core Capabilities

Phase 1: Compute Pipeline [x]

Device management and command queues
Buffer allocation and management
Shader compilation from Metal Shading Language source
Compute pipeline creation and execution
Thread group configuration and dispatch
Zero-copy NumPy buffer integration

Phase 2: Graphics Pipeline [x]

Core Graphics:
- Texture creation and management
- Render pipeline state with vertex/fragment shaders
- Render pass descriptors with color/depth attachments
- Sampler states for texture filtering
- Offscreen rendering
Advanced Graphics:
- Vertex descriptors and buffer layouts
- Depth/stencil testing
- Blit command encoder for memory operations
- Heap-based resource allocation
- Fence synchronization
- Metal layer integration for display

Phase 3: Advanced Features [x]

Event system for fine-grained synchronization
Shared events for cross-process coordination
Argument buffers for efficient resource binding
Indirect command buffers for GPU-driven rendering
Binary archives for pipeline caching
Capture scopes for Xcode GPU debugging integration

Phase 4: Ray Tracing (Planned)

Ray tracing acceleration structures
Ray tracing pipelines
Intersection function tables
Primitive acceleration structures

Note: Ray tracing support can be added on-demand. Current implementation focuses on compute and rasterization.

Performance Characteristics

Reported performance on Apple Silicon:

Operation	Performance	Notes
Image Blur	4-5× speedup	Over SciPy for large images (1024×1024+)
Matrix Multiply (Naive)	~100 GFLOPS	Educational baseline
Matrix Multiply (Optimized)	~220 GFLOPS	With tiling and optimizations
Graphics Rendering	Full speed	Complete pipeline with depth testing

Note: NumPy/SciPy may be faster for standard operations due to Apple's Accelerate framework and AMX coprocessor. PyMetal excels at custom algorithms where specialized hardware doesn't exist.
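
The GFLOPS figures in the table can be related to wall-clock time through the conventional operation count for dense matrix multiplication: multiplying two N×N matrices takes about 2·N³ floating-point operations. The sketch below is illustrative only. The function name `matmul_gflops` and the sample timing are assumptions; the matrix sizes and timing method behind the reported numbers are not stated here.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Throughput in GFLOPS for an n x n by n x n matrix multiply.

    Assumes the conventional count of 2 * n**3 floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (2 * n**3) / seconds / 1e9


# Hypothetical timing: a 1024 x 1024 multiply measured at 10 ms.
print(f"{matmul_gflops(1024, 0.010):.1f} GFLOPS")
```

Timing the same kernel at several matrix sizes and converting each run this way makes it easier to compare a custom kernel against the table.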

Installation

Requirements

macOS 11.0+ (Big Sur or later)
Python 3.12+
Xcode Command Line Tools
Metal-compatible GPU (all modern Macs)

Install from Source

git clone https://github.com/shakfu/pymetal-cpp.git
cd pymetal-cpp
pip install -e .

Dependencies

PyMetal automatically installs:

nanobind - C++/Python bindings
numpy - Array operations

For examples, you may also want:

pip install scipy  # For image blur example

Quick Start

Hello GPU: Vector Addition

import numpy as np
import pymetal as pm

# Initialize device
device = pm.create_system_default_device()
queue = device.new_command_queue()

# Create data
size = 1024
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)

# Compile shader
shader = """
#include <metal_stdlib>
using namespace metal;

kernel void vector_add(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* c [[buffer(2)]],
    uint id [[thread_position_in_grid]])
{
    c[id] = a[id] + b[id];
}
"""

library = device.new_library_with_source(shader)
function = library.new_function("vector_add")
pipeline = device.new_compute_pipeline_state(function)

# Create GPU buffers
a_buffer = device.new_buffer(a.nbytes, pm.ResourceStorageModeShared)
b_buffer = device.new_buffer(b.nbytes, pm.ResourceStorageModeShared)
c_buffer = device.new_buffer(a.nbytes, pm.ResourceStorageModeShared)

# Upload data by copying into zero-copy NumPy views of the shared buffers
np.copyto(np.frombuffer(a_buffer.contents(), dtype=np.float32), a)
np.copyto(np.frombuffer(b_buffer.contents(), dtype=np.float32), b)

# Execute on GPU
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.compute_command_encoder()
encoder.set_compute_pipeline_state(pipeline)
encoder.set_buffer(a_buffer, 0, 0)
encoder.set_buffer(b_buffer, 0, 1)
encoder.set_buffer(c_buffer, 0, 2)
encoder.dispatch_threadgroups(16, 1, 1, 64, 1, 1)
encoder.end_encoding()
cmd_buffer.commit()
cmd_buffer.wait_until_completed()

# Read result
result = np.frombuffer(c_buffer.contents(), dtype=np.float32, count=size)
print(f"First 5 results: {result[:5]}")
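
To make the dispatch arithmetic in the listing above concrete, the following pure-Python sketch emulates how `dispatch_threadgroups(16, 1, 1, 64, 1, 1)` covers the 1024 elements: 16 threadgroups of 64 threads give each thread a unique `thread_position_in_grid`. It runs on the CPU and does not call PyMetal; `emulate_vector_add` is a hypothetical helper introduced only for illustration.

```python
def emulate_vector_add(a, b, num_groups, threads_per_group):
    """CPU emulation of the 1D vector_add dispatch (illustration only)."""
    c = [0.0] * len(a)
    for group in range(num_groups):
        for local in range(threads_per_group):
            # Equivalent of thread_position_in_grid for a 1D grid.
            tid = group * threads_per_group + local
            c[tid] = a[tid] + b[tid]
    return c


size = 1024
a = [float(i) for i in range(size)]
b = [2.0 * i for i in range(size)]
c = emulate_vector_add(a, b, num_groups=16, threads_per_group=64)
print(c[:5])
```

Because 16 × 64 equals `size` exactly, the kernel needs no bounds check. With a size that is not a multiple of 64, the last threadgroup would index past the end of the buffers.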

Graphics: Render a Triangle

import pymetal as pm

device = pm.create_system_default_device()
queue = device.new_command_queue()

# Create render target
width, height = 512, 512
color_desc = pm.TextureDescriptor.texture2d_descriptor(
    pm.PixelFormat.RGBA8Unorm, width, height, False
)
color_texture = device.new_texture(color_desc)

# Vertex and fragment shaders
shader = """
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
    float4 color;
};

vertex VertexOut vertex_main(uint vertex_id [[vertex_id]]) {
    float2 positions[3] = {
        float2( 0.0,  0.7),
        float2(-0.7, -0.7),
        float2( 0.7, -0.7)
    };
    float4 colors[3] = {
        float4(1.0, 0.0, 0.0, 1.0),  // Red
        float4(0.0, 1.0, 0.0, 1.0),  // Green
        float4(0.0, 0.0, 1.0, 1.0)   // Blue
    };
    VertexOut out;
    out.position = float4(positions[vertex_id], 0.0, 1.0);
    out.color = colors[vertex_id];
    return out;
}

fragment float4 fragment_main(VertexOut in [[stage_in]]) {
    return in.color;
}
"""

library = device.new_library_with_source(shader)
vertex_func = library.new_function("vertex_main")
fragment_func = library.new_function("fragment_main")

# Create render pipeline
pipeline_desc = pm.RenderPipelineDescriptor.render_pipeline_descriptor()
pipeline_desc.vertex_function = vertex_func
pipeline_desc.fragment_function = fragment_func
pipeline_desc.color_attachment(0).pixel_format = pm.PixelFormat.RGBA8Unorm
pipeline = device.new_render_pipeline_state(pipeline_desc)

# Configure render pass
render_pass = pm.RenderPassDescriptor.render_pass_descriptor()
color_att = render_pass.color_attachment(0)
color_att.texture = color_texture
color_att.load_action = pm.LoadAction.Clear
color_att.store_action = pm.StoreAction.Store
color_att.clear_color = pm.ClearColor(0.0, 0.0, 0.0, 1.0)

# Render
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.render_command_encoder(render_pass)
encoder.set_render_pipeline_state(pipeline)
encoder.draw_primitives(pm.PrimitiveType.Triangle, 0, 3)
encoder.end_encoding()
cmd_buffer.commit()
cmd_buffer.wait_until_completed()
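
The vertex shader above emits positions in normalized device coordinates (NDC), where x and y range from -1 to 1 and y points up, whereas texture pixel coordinates start at the top-left corner with y pointing down. A small sketch of that mapping for the 512×512 render target follows. `ndc_to_pixel` is a hypothetical helper, not part of PyMetal, and it assumes the default viewport covering the whole texture.

```python
def ndc_to_pixel(x, y, width, height):
    """Map NDC (x, y in [-1, 1], y up) to pixel coordinates (origin top-left)."""
    px = (x + 1.0) * 0.5 * width
    py = (1.0 - y) * 0.5 * height
    return px, py


# The three vertex positions hard-coded in vertex_main.
triangle = [(0.0, 0.7), (-0.7, -0.7), (0.7, -0.7)]
for x, y in triangle:
    px, py = ndc_to_pixel(x, y, 512, 512)
    print(f"NDC ({x:+.1f}, {y:+.1f}) -> pixel ({px:.1f}, {py:.1f})")
```

This is handy when checking a rendered image: the top (red) vertex should appear near the top centre of the texture, not the bottom.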

API Guide

Device Management

# Get default GPU
device = pm.create_system_default_device()

# Device properties
print(device.name)
print(device.max_threads_per_threadgroup)

Memory Management

# Storage modes
pm.ResourceStorageModeShared      # CPU and GPU accessible
pm.ResourceStorageModePrivate     # GPU only (fastest)
pm.ResourceStorageModeManaged     # Explicit sync required
pm.ResourceStorageModeMemoryless  # Tile memory only

# Create buffer
buffer = device.new_buffer(size_in_bytes, pm.ResourceStorageModeShared)

# Access buffer from Python (zero-copy)
buffer_view = np.frombuffer(buffer.contents(), dtype=np.float32)

# Create texture
tex_desc = pm.TextureDescriptor.texture2d_descriptor(
    pm.PixelFormat.RGBA8Unorm,
    width,
    height,
    mipmapped=False
)
texture = device.new_texture(tex_desc)
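
"Zero-copy" above means the NumPy array is a view onto the buffer's memory rather than a separate copy, so writes through the view land directly in the memory the GPU reads. The standard-library sketch below illustrates the same view-versus-copy distinction, using a `bytearray` as a stand-in for the memory returned by `buffer.contents()`. It does not use PyMetal or a GPU.

```python
import struct

backing = bytearray(4 * 4)             # stand-in for a 16-byte shared buffer
view = memoryview(backing).cast("f")   # float32 view over the same memory
snapshot = bytes(backing)              # an independent copy

view[0] = 1.5                          # write through the view

# The write is visible in the backing store but not in the earlier copy.
print(struct.unpack_from("f", backing, 0)[0])
print(struct.unpack_from("f", snapshot, 0)[0])
```

The same reasoning explains why data placed in a shared buffer through `np.copyto` on such a view needs no further upload step.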

Shader Compilation

# Compile from source
library = device.new_library_with_source(shader_source_string)
function = library.new_function("kernel_name")

# Create compute pipeline
compute_pipeline = device.new_compute_pipeline_state(function)

# Create graphics pipeline
render_desc = pm.RenderPipelineDescriptor.render_pipeline_descriptor()
render_desc.vertex_function = vertex_function
render_desc.fragment_function = fragment_function
render_pipeline = device.new_render_pipeline_state(render_desc)

Command Execution

# Create command queue (once)
queue = device.new_command_queue()

# Execute commands
cmd_buffer = queue.command_buffer()

# For compute:
encoder = cmd_buffer.compute_command_encoder()
encoder.set_compute_pipeline_state(pipeline)
encoder.set_buffer(buffer, offset, index)
encoder.dispatch_threadgroups(
    grid_w, grid_h, grid_d,      # Number of threadgroups
    threads_w, threads_h, threads_d  # Threads per group
)
encoder.end_encoding()

# For graphics:
encoder = cmd_buffer.render_command_encoder(render_pass)
encoder.set_render_pipeline_state(pipeline)
encoder.draw_primitives(pm.PrimitiveType.Triangle, 0, vertex_count)
encoder.end_encoding()

# Submit and wait
cmd_buffer.commit()
cmd_buffer.wait_until_completed()  # Blocks (GIL is released)
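
Releasing the GIL during `wait_until_completed()` means other Python threads can keep running while one thread waits on the GPU. The sketch below shows the general pattern with `concurrent.futures`. `fake_gpu_wait` is a hypothetical stand-in that uses `time.sleep` (which also releases the GIL) instead of a real command buffer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_gpu_wait(job_id, seconds=0.05):
    """Stand-in for commit() + wait_until_completed() on one command buffer."""
    time.sleep(seconds)  # blocking call that releases the GIL
    return job_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(fake_gpu_wait, range(4)))
elapsed = time.perf_counter() - start

# The four waits overlap, so the total is close to one wait, not four.
print(finished, f"{elapsed:.2f}s")
```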

Thread Group Configuration

# Compute thread groups
threads_per_group = 256  # Must be ≤ max_threads_per_threadgroup
num_elements = 100000
num_groups = (num_elements + threads_per_group - 1) // threads_per_group

encoder.dispatch_threadgroups(
    num_groups, 1, 1,        # Grid size
    threads_per_group, 1, 1  # Threads per group
)

# 2D/3D grids
grid_w = (width + 16 - 1) // 16
grid_h = (height + 16 - 1) // 16
encoder.dispatch_threadgroups(
    grid_w, grid_h, 1,
    16, 16, 1  # 16×16 thread groups
)
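
The rounding-up division used above guarantees enough threads to cover every element, but it usually launches a few surplus threads in the last group, so kernels dispatched this way generally need a bounds check (for example, comparing the thread index against the element count). The helper below is a sketch of that arithmetic; `threadgroups_for` is a hypothetical name, not a PyMetal function.

```python
def threadgroups_for(num_elements, threads_per_group):
    """Return (groups, padding), where padding counts the surplus threads."""
    groups = (num_elements + threads_per_group - 1) // threads_per_group
    padding = groups * threads_per_group - num_elements
    return groups, padding


print(threadgroups_for(100000, 256))  # (391, 96)
print(threadgroups_for(1024, 64))     # (16, 0)
```

For the 100000-element case, 96 threads in the final group have no element to process and must return early in the kernel.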

Synchronization

# Simple: wait for completion
cmd_buffer.wait_until_completed()

# Advanced: use fences
fence = device.new_fence()
encoder.update_fence(fence)
# ... later, typically from a subsequent encoder ...
encoder.wait_for_fence(fence)

# Events (Phase 3)
event = device.new_event()
shared_event = device.new_shared_event()
shared_event.signaled_value = 42

Debugging

# Enable Metal validation (set these before creating the device)
import os
os.environ['METAL_DEVICE_WRAPPER_TYPE'] = '1'
os.environ['MTL_DEBUG_LAYER'] = '1'

# Use capture scopes with Xcode
manager = pm.shared_capture_manager()
scope = manager.new_capture_scope_with_command_queue(queue)
scope.label = "My Debug Capture"
scope.begin_scope()
# ... GPU work ...
scope.end_scope()
# Capture in Xcode: Product > Perform Action > Capture GPU Frame

# Add labels for debugging
buffer.label = "Input Data"
cmd_buffer.label = "Main Rendering Pass"

Examples

See examples/README.md for detailed examples:

01_image_blur.py - Gaussian blur compute shader
02_matrix_multiply_naive.py - Simple matrix multiplication (educational)
02_matrix_multiply_tiled.py - Optimized with shared memory tiling
02_matrix_multiply_optimized.py - Advanced optimizations
03_triangle_rendering.py - Complete graphics pipeline
04_advanced_features.py - Events, capture scopes, and more

Run any example:

python examples/01_image_blur.py

When to Use PyMetal vs Alternatives

Use PyMetal When

You need custom GPU algorithms not available in libraries
You want full control over GPU resources
You're doing image processing, simulations, or custom compute
You need to fuse operations for efficiency
You want to learn GPU programming on Apple Silicon
You need rasterization or compute pipelines (ray tracing coming in Phase 4)

Use NumPy/SciPy When

Standard operations (matrix multiply, FFT, convolution)
Prototyping and development speed matters
Small datasets where GPU overhead dominates
Apple's Accelerate framework provides optimizations
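
The point about small datasets can be made concrete with a simple cost model: if a GPU dispatch carries a fixed overhead but processes elements faster than the CPU, the GPU only wins above a break-even size. The sketch below solves overhead + n / gpu_rate = n / cpu_rate for n. All numbers are hypothetical placeholders, not measurements from PyMetal.

```python
def break_even_elements(overhead_s, cpu_rate, gpu_rate):
    """Element count n at which overhead_s + n / gpu_rate equals n / cpu_rate.

    Rates are in elements per second; returns None if the GPU is not faster.
    """
    if gpu_rate <= cpu_rate:
        return None
    return overhead_s / (1.0 / cpu_rate - 1.0 / gpu_rate)


# Hypothetical: 100 microseconds of overhead, CPU at 1e9 and GPU at 5e9 elements/s.
n = break_even_elements(100e-6, 1e9, 5e9)
print(f"GPU pays off above roughly {n:,.0f} elements")
```

Measuring the actual overhead and rates on the target machine, then plugging them in, gives a rough threshold for when to hand work to the GPU.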

Hybrid Approach

Most applications use both:

NumPy for standard linear algebra
PyMetal for custom kernels and GPU-specific operations
Example: NumPy for matrix ops, PyMetal for custom activation functions

Performance Tips

Use Shared Storage Mode for CPU-GPU data transfer
Batch operations - submit multiple dispatches per command buffer
Optimize thread group size - typically 64-256 threads per group
Use shared/threadgroup memory for data reuse
Profile with Instruments - Xcode's GPU profiling tools work great
Release GIL - PyMetal properly releases GIL during blocking operations

Project Structure

pymetal-cpp/
├── src/
│   ├── _pymetal.cpp           # Main C++ bindings
│   └── pymetal/
│       └── __init__.py        # Python module exports
├── examples/                  # 6 practical examples
│   ├── 01_image_blur.py
│   ├── 02_matrix_multiply_*.py
│   ├── 03_triangle_rendering.py
│   └── 04_advanced_features.py
├── tests/                     # 41 unit tests
│   ├── test_phase1_compute.py
│   ├── test_phase2_graphics.py
│   ├── test_phase2_advanced.py
│   └── test_phase3_advanced.py
├── thirdparty/
│   └── metal-cpp/             # Apple's Metal C++ headers
├── CMakeLists.txt             # Build configuration
├── pyproject.toml             # Python package metadata
└── README.md                  # This file

Testing

Run the test suite:

make test
# or
pytest

The 41 tests cover:

Device and buffer management
Compute pipeline execution
Graphics pipeline rendering
Advanced features (events, capture scopes, etc.)
Memory management and synchronization

Roadmap / Future Work

Potential Phase 4 Features (On-Demand)

Ray Tracing Support:

Acceleration structure creation and management
Ray tracing pipeline descriptors
Intersection function tables
Ray/primitive intersection queries

Additional Features:

Resource heaps with placement
Sparse textures
Indirect argument buffers
Metal Performance Shaders (MPS) integration
Async compute and graphics overlap
Multi-GPU support

Tooling:

Shader debugging utilities
Performance profiling helpers
Memory leak detection
Automatic optimization suggestions

Language Bindings:

Type stubs for better IDE support
Documentation generator from C++ comments
Additional high-level abstractions

These features can be implemented as needed. Contributions welcome!

Contributing

Contributions welcome! Areas of interest:

Ray tracing support (most requested)
Additional examples and tutorials
Performance optimizations
API coverage improvements
Documentation enhancements
Bug fixes and testing

Acknowledgments

Built on Apple's metal-cpp
Uses nanobind for Python bindings
Inspired by the need for low-level GPU access from Python on macOS
Claude Code from Anthropic

Support

Issues: Open an issue on GitHub
Examples: See examples/ directory
Tests: See tests/ directory for API usage patterns

Note: PyMetal is designed for educational and research purposes. For production graphics applications, consider using established game engines or frameworks.

Project details

Release history

0.1.5 (Dec 17, 2025)
0.1.3 (Dec 17, 2025, this version)

Download files

Source Distribution

pymetal_cpp-0.1.3.tar.gz (267.6 kB)

Uploaded Dec 17, 2025 Source

Built Distributions

pymetal_cpp-0.1.3-cp312-abi3-macosx_15_0_arm64.whl (165.7 kB)

Uploaded Dec 17, 2025 CPython 3.12+, macOS 15.0+ ARM64

pymetal_cpp-0.1.3-cp311-cp311-macosx_15_0_arm64.whl (166.8 kB)

Uploaded Dec 17, 2025 CPython 3.11, macOS 15.0+ ARM64

pymetal_cpp-0.1.3-cp310-cp310-macosx_15_0_arm64.whl (166.5 kB)

Uploaded Dec 17, 2025 CPython 3.10, macOS 15.0+ ARM64

pymetal_cpp-0.1.3-cp39-cp39-macosx_15_0_arm64.whl (166.7 kB)

Uploaded Dec 17, 2025 CPython 3.9, macOS 15.0+ ARM64

File details

Details for the file pymetal_cpp-0.1.3.tar.gz.

File metadata

Download URL: pymetal_cpp-0.1.3.tar.gz
Upload date: Dec 17, 2025
Size: 267.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pymetal_cpp-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`3c362ca8538e38dd35f23072c1c320eaad1f9dfe0b4a874c62e5efaab85c49ff`
MD5	`bb11790a656f844734221c18cf40cda0`
BLAKE2b-256	`f4c443ec6a0c60ac13d90d468f35c6a312b0d26f32b6e3a319b5c997dc771757`

File details

Details for the file pymetal_cpp-0.1.3-cp312-abi3-macosx_15_0_arm64.whl.

File metadata

Download URL: pymetal_cpp-0.1.3-cp312-abi3-macosx_15_0_arm64.whl
Upload date: Dec 17, 2025
Size: 165.7 kB
Tags: CPython 3.12+, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pymetal_cpp-0.1.3-cp312-abi3-macosx_15_0_arm64.whl
Algorithm	Hash digest
SHA256	`eb82a971cadf39dd4464b6b0ba8d751053d11abc89e96d9241d2f3a78b76a351`
MD5	`3d511987c441f0bdaec9e4117434c723`
BLAKE2b-256	`5d37a22d4918f70212b83db4dc9134ab53e9c011b148890c3fa9fe95f6de122b`

File details

Details for the file pymetal_cpp-0.1.3-cp311-cp311-macosx_15_0_arm64.whl.

File metadata

Download URL: pymetal_cpp-0.1.3-cp311-cp311-macosx_15_0_arm64.whl
Upload date: Dec 17, 2025
Size: 166.8 kB
Tags: CPython 3.11, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pymetal_cpp-0.1.3-cp311-cp311-macosx_15_0_arm64.whl
Algorithm	Hash digest
SHA256	`c5ee8a8345a4ac483864a38bee9af0256a1db32a527be0a500610c08a274b95f`
MD5	`c7270e80f5fb6375360308a309ed9f57`
BLAKE2b-256	`354338a6c9b415f668613066080cb69a9c534423a55222ce4ce370cdf150c21f`

File details

Details for the file pymetal_cpp-0.1.3-cp310-cp310-macosx_15_0_arm64.whl.

File metadata

Download URL: pymetal_cpp-0.1.3-cp310-cp310-macosx_15_0_arm64.whl
Upload date: Dec 17, 2025
Size: 166.5 kB
Tags: CPython 3.10, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pymetal_cpp-0.1.3-cp310-cp310-macosx_15_0_arm64.whl
Algorithm	Hash digest
SHA256	`a4fe9e9db4c750b1f3375405d1fe34d7cd95f97379f41fd518f88fd2e00b8026`
MD5	`a5896a347bdb025f11d0406a546642df`
BLAKE2b-256	`188d224c16604b654c000ba9f0718c7b6116d89a1f00762b716845c3332594c2`

File details

Details for the file pymetal_cpp-0.1.3-cp39-cp39-macosx_15_0_arm64.whl.

File metadata

Download URL: pymetal_cpp-0.1.3-cp39-cp39-macosx_15_0_arm64.whl
Upload date: Dec 17, 2025
Size: 166.7 kB
Tags: CPython 3.9, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for pymetal_cpp-0.1.3-cp39-cp39-macosx_15_0_arm64.whl
Algorithm	Hash digest
SHA256	`c498723735947d4295ad47ca34a6c65f9dfbf57abd5568617b21fcea6e54985d`
MD5	`30fc7e384f928d92d9dccc1d771de6aa`
BLAKE2b-256	`9c17fe13d4da0cbf982ea59a78a6f37977a990b6c024b1f6d0c5821bd9ec7830`
