Python bindings for metal-cpp
PyMetal
Python bindings for Apple's Metal GPU API, enabling high-performance GPU computing and graphics programming from Python.
Overview
PyMetal provides Pythonic access to Apple's Metal API through metal-cpp and [nanobind](https://github.com/wjakob/nanobind), allowing you to:
- Write and execute Metal compute shaders from Python
- Build complete graphics pipelines with vertex/fragment shaders
- Leverage GPU acceleration for custom algorithms
- Integrate seamlessly with NumPy for zero-copy data transfer
- Access advanced Metal features like events, binary archives, and capture scopes
Why PyMetal?
- Direct Metal Access: Full control over GPU resources, not a high-level abstraction
- Zero-Copy NumPy Integration: Efficient data transfer between Python and GPU
- Complete API Coverage: Compute, graphics, advanced synchronization, and debugging
- Multi-Device Support: Enumerate and select from all available GPUs
- Shader Preprocessing: #include, #define, and templates for shader development
- Educational: Clear examples showing GPU programming concepts
- Performant: Properly releases GIL for multithreaded Python applications
- Type Hints: Full .pyi stub file for IDE support
Features
Core Capabilities
Compute Pipeline
- Device management and command queues
- Buffer allocation and management
- Shader compilation from Metal Shading Language source
- Compute pipeline creation and execution
- Thread group configuration and dispatch
- Zero-copy NumPy buffer integration
Graphics Pipeline
- Core Graphics:
  - Texture creation and management
  - Render pipeline state with vertex/fragment shaders
  - Render pass descriptors with color/depth attachments
  - Sampler states for texture filtering
  - Offscreen rendering
- Advanced Graphics:
  - Vertex descriptors and buffer layouts
  - Depth/stencil testing
  - Blit command encoder for memory operations
  - Heap-based resource allocation
  - Fence synchronization
  - Metal layer integration for display
Advanced Features
- Event system for fine-grained synchronization
- Shared events for cross-process coordination
- Argument buffers for efficient resource binding
- Indirect command buffers for GPU-driven rendering
- Binary archives for pipeline caching
- Capture scopes for Xcode GPU debugging integration
Note: ray tracing support may be added in the future.
Performance Characteristics
Indicative performance figures on Apple Silicon:
| Operation | Performance | Notes |
|---|---|---|
| Image Blur | 4-5× speedup | Over SciPy for large images (1024×1024+) |
| Matrix Multiply (Naive) | ~100 GFLOPS | Educational baseline |
| Matrix Multiply (Optimized) | ~220 GFLOPS | With tiling and optimizations |
| Graphics Rendering | Full speed | Complete pipeline with depth testing |
Note: NumPy/SciPy may be faster for standard operations due to Apple's Accelerate framework and AMX coprocessor. PyMetal excels at custom algorithms where specialized hardware doesn't exist.
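For reference, GFLOPS figures for matrix multiplication are conventionally derived by counting 2·M·N·K floating-point operations (one multiply and one add per inner-product term) and dividing by the elapsed time. The sketch below assumes that convention, which the table above does not state explicitly; the helper name `matmul_gflops` and the timing value are hypothetical.

```python
def matmul_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return GFLOPS for an (m x k) @ (k x n) product that took `seconds`.

    Assumes the conventional count of 2*m*n*k floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e9


# Hypothetical timing: a 1024 x 1024 x 1024 multiply finishing in 10 ms.
print(f"{matmul_gflops(1024, 1024, 1024, 0.010):.1f} GFLOPS")
```

Timing only the GPU work (and excluding shader compilation and buffer setup) is what makes such a number comparable across runs.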
Installation
Requirements
- macOS 11.0+ (Big Sur or later)
- Python 3.9+
- Xcode Command Line Tools
- Metal-compatible GPU (all modern Macs)
Install from PyPI
pip install pymetal-cpp
Install from Source
git clone https://github.com/shakfu/pymetal-cpp.git
cd pymetal-cpp
pip install -e .
Soft Dependencies
pymetal-cpp recommends installing:
- numpy - Array operations
- scipy - For the image blur example
Quick Start
Hello GPU: Vector Addition
import numpy as np
import pymetal as pm
# Initialize device
device = pm.create_system_default_device()
queue = device.new_command_queue()
# Create data
size = 1024
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
# Compile shader
shader = """
#include <metal_stdlib>
using namespace metal;
kernel void vector_add(
device const float* a [[buffer(0)]],
device const float* b [[buffer(1)]],
device float* c [[buffer(2)]],
uint id [[thread_position_in_grid]])
{
c[id] = a[id] + b[id];
}
"""
library = device.new_library_with_source(shader)
function = library.new_function("vector_add")
pipeline = device.new_compute_pipeline_state(function)
# Create GPU buffers
a_buffer = device.new_buffer(a.nbytes, pm.ResourceStorageModeShared)
b_buffer = device.new_buffer(b.nbytes, pm.ResourceStorageModeShared)
c_buffer = device.new_buffer(a.nbytes, pm.ResourceStorageModeShared)
# Upload data by copying into zero-copy views of the shared buffers
np.copyto(np.frombuffer(a_buffer.contents(), dtype=np.float32), a)
np.copyto(np.frombuffer(b_buffer.contents(), dtype=np.float32), b)
# Execute on GPU
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.compute_command_encoder()
encoder.set_compute_pipeline_state(pipeline)
encoder.set_buffer(a_buffer, 0, 0)
encoder.set_buffer(b_buffer, 0, 1)
encoder.set_buffer(c_buffer, 0, 2)
encoder.dispatch_threadgroups(16, 1, 1, 64, 1, 1)
encoder.end_encoding()
cmd_buffer.commit()
cmd_buffer.wait_until_completed()
# Read result
result = np.frombuffer(c_buffer.contents(), dtype=np.float32, count=size)
print(f"First 5 results: {result[:5]}")
Graphics: Render a Triangle
import pymetal as pm
device = pm.create_system_default_device()
queue = device.new_command_queue()
# Create render target
width, height = 512, 512
color_desc = pm.TextureDescriptor.texture2d_descriptor(
pm.PixelFormat.RGBA8Unorm, width, height, False
)
color_texture = device.new_texture(color_desc)
# Vertex and fragment shaders
shader = """
#include <metal_stdlib>
using namespace metal;
struct VertexOut {
float4 position [[position]];
float4 color;
};
vertex VertexOut vertex_main(uint vertex_id [[vertex_id]]) {
float2 positions[3] = {
float2( 0.0, 0.7),
float2(-0.7, -0.7),
float2( 0.7, -0.7)
};
float4 colors[3] = {
float4(1.0, 0.0, 0.0, 1.0), // Red
float4(0.0, 1.0, 0.0, 1.0), // Green
float4(0.0, 0.0, 1.0, 1.0) // Blue
};
VertexOut out;
out.position = float4(positions[vertex_id], 0.0, 1.0);
out.color = colors[vertex_id];
return out;
}
fragment float4 fragment_main(VertexOut in [[stage_in]]) {
return in.color;
}
"""
library = device.new_library_with_source(shader)
vertex_func = library.new_function("vertex_main")
fragment_func = library.new_function("fragment_main")
# Create render pipeline
pipeline_desc = pm.RenderPipelineDescriptor.render_pipeline_descriptor()
pipeline_desc.vertex_function = vertex_func
pipeline_desc.fragment_function = fragment_func
pipeline_desc.color_attachment(0).pixel_format = pm.PixelFormat.RGBA8Unorm
pipeline = device.new_render_pipeline_state(pipeline_desc)
# Configure render pass
render_pass = pm.RenderPassDescriptor.render_pass_descriptor()
color_att = render_pass.color_attachment(0)
color_att.texture = color_texture
color_att.load_action = pm.LoadAction.Clear
color_att.store_action = pm.StoreAction.Store
color_att.clear_color = pm.ClearColor(0.0, 0.0, 0.0, 1.0)
# Render
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.render_command_encoder(render_pass)
encoder.set_render_pipeline_state(pipeline)
encoder.draw_primitives(pm.PrimitiveType.Triangle, 0, 3)
encoder.end_encoding()
cmd_buffer.commit()
cmd_buffer.wait_until_completed()
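To see where the triangle above lands in the 512×512 render target, the vertex positions can be mapped from normalized device coordinates (NDC) to pixels. This is a minimal sketch that assumes the default viewport covering the whole texture, with NDC y pointing up and pixel y measured down from the top-left corner; `ndc_to_pixel` is a hypothetical helper, not part of PyMetal.

```python
def ndc_to_pixel(x: float, y: float, width: int, height: int):
    """Map normalized device coordinates to pixel coordinates.

    Assumes a full-target viewport: NDC x in [-1, 1] maps to [0, width],
    and NDC y in [-1, 1] maps to [height, 0] (y is flipped).
    """
    px = (x + 1.0) * 0.5 * width
    py = (1.0 - y) * 0.5 * height
    return px, py


# The three vertex positions hard-coded in vertex_main.
triangle = [(0.0, 0.7), (-0.7, -0.7), (0.7, -0.7)]
for x, y in triangle:
    px, py = ndc_to_pixel(x, y, 512, 512)
    print(f"NDC ({x:+.1f}, {y:+.1f}) -> pixel ({px:.1f}, {py:.1f})")
```

Under these assumptions the red vertex sits near the top center of the image and the green and blue vertices near the bottom-left and bottom-right.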
API Guide
Device Management
# Get default GPU
device = pm.create_system_default_device()
# Device properties
print(device.name)
print(device.max_threads_per_threadgroup)
# Enumerate all GPUs (multi-device support)
devices = pm.copy_all_devices()
for d in devices:
    print(f"{d.name}: low_power={d.is_low_power}, unified={d.has_unified_memory}")
# Select a specific GPU (e.g., discrete GPU for heavy workloads)
discrete_gpus = [d for d in devices if not d.is_low_power]
device = discrete_gpus[0] if discrete_gpus else devices[0]
Memory Management
# Storage modes
pm.ResourceStorageModeShared # CPU and GPU accessible
pm.ResourceStorageModePrivate # GPU only (fastest)
pm.ResourceStorageModeManaged # Explicit sync required
pm.ResourceStorageModeMemoryless # Tile memory only
# Create buffer
buffer = device.new_buffer(size_in_bytes, pm.ResourceStorageModeShared)
# Access buffer from Python (zero-copy)
buffer_view = np.frombuffer(buffer.contents(), dtype=np.float32)
# Create texture
tex_desc = pm.TextureDescriptor.texture2d_descriptor(
pm.PixelFormat.RGBA8Unorm,
width,
height,
mipmapped=False
)
texture = device.new_texture(tex_desc)
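To illustrate what "zero-copy" means for the buffer view above, the following sketch uses a plain `bytearray` as a stand-in for the memory returned by `buffer.contents()` (an assumption made so the example runs without a GPU). A typed view over that memory shares storage with it, so writes through the view appear in the underlying bytes without any copy. `np.frombuffer` creates the same kind of shared view.

```python
import struct

# Stand-in for GPU-visible shared memory holding four float32 values.
backing = bytearray(4 * struct.calcsize("f"))

# Typed view over the same bytes; no data is copied.
view = memoryview(backing).cast("f")
view[0] = 1.5
view[3] = -2.0

# The writes landed directly in the backing storage.
first = struct.unpack_from("f", backing, 0)[0]
last = struct.unpack_from("f", backing, 12)[0]
print(first, last)
```

The practical consequence is that a view must not outlive the buffer it refers to, and CPU writes should be finished before the command buffer that reads them is committed.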
Shader Compilation
# Compile from source
library = device.new_library_with_source(shader_source_string)
function = library.new_function("kernel_name")
# Create compute pipeline
compute_pipeline = device.new_compute_pipeline_state(function)
# Create graphics pipeline
render_desc = pm.RenderPipelineDescriptor.render_pipeline_descriptor()
render_desc.vertex_function = vertex_function
render_desc.fragment_function = fragment_function
render_pipeline = device.new_render_pipeline_state(render_desc)
Shader Preprocessing
from pymetal.shader import ShaderPreprocessor, ShaderTemplate, create_compute_kernel
# Preprocessor with #define and #include support
preprocessor = ShaderPreprocessor()
preprocessor.add_include_path("./shaders")
preprocessor.define("BLOCK_SIZE", "256")
preprocessor.define("USE_FAST_MATH")
source = preprocessor.process('''
#include "common.metal"
#ifdef USE_FAST_MATH
// Fast math enabled
#endif
kernel void my_kernel(...) {
int size = BLOCK_SIZE; // Becomes 256
}
''')
# Templates for parameterized shaders
template = ShaderTemplate('''
kernel void {name}(device {dtype}* data [[buffer(0)]],
uint idx [[thread_position_in_grid]]) {{
data[idx] = data[idx] {operation};
}}
''')
source = template.render(name="double_values", dtype="float", operation="* 2.0")
# Quick kernel generation helper
source = create_compute_kernel(
name="vector_add",
body="c[idx] = a[idx] + b[idx];",
buffers=[("a", "float", "read"), ("b", "float", "read"), ("c", "float", "write")]
)
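The placeholder syntax in the `ShaderTemplate` example (named fields in single braces, literal braces doubled) matches Python's built-in `str.format`. Whether `ShaderTemplate` is built on `str.format` is an assumption; the sketch below only shows how the same template text expands under those rules.

```python
# Same template text as above; {{ and }} are escapes for literal braces.
SHADER_TEMPLATE = """\
kernel void {name}(device {dtype}* data [[buffer(0)]],
                   uint idx [[thread_position_in_grid]]) {{
    data[idx] = data[idx] {operation};
}}
"""

# Fill in the named placeholders to produce Metal Shading Language source.
source = SHADER_TEMPLATE.format(
    name="double_values",
    dtype="float",
    operation="* 2.0",
)
print(source)
```

Square brackets such as `[[buffer(0)]]` need no escaping because only braces are special in format strings.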
Command Execution
# Create command queue (once)
queue = device.new_command_queue()
# Execute commands
cmd_buffer = queue.command_buffer()
# For compute:
encoder = cmd_buffer.compute_command_encoder()
encoder.set_compute_pipeline_state(pipeline)
encoder.set_buffer(buffer, offset, index)
encoder.dispatch_threadgroups(
grid_w, grid_h, grid_d, # Number of threadgroups
threads_w, threads_h, threads_d # Threads per group
)
encoder.end_encoding()
# For graphics:
encoder = cmd_buffer.render_command_encoder(render_pass)
encoder.set_render_pipeline_state(pipeline)
encoder.draw_primitives(pm.PrimitiveType.Triangle, 0, vertex_count)
encoder.end_encoding()
# Submit and wait
cmd_buffer.commit()
cmd_buffer.wait_until_completed() # Blocks (GIL is released)
Thread Group Configuration
# Compute thread groups
threads_per_group = 256 # Must be ≤ max_threads_per_threadgroup
num_elements = 100000
num_groups = (num_elements + threads_per_group - 1) // threads_per_group
encoder.dispatch_threadgroups(
num_groups, 1, 1, # Grid size
threads_per_group, 1, 1 # Threads per group
)
# 2D/3D grids
grid_w = (width + 16 - 1) // 16
grid_h = (height + 16 - 1) // 16
encoder.dispatch_threadgroups(
grid_w, grid_h, 1,
16, 16, 1 # 16×16 thread groups
)
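The rounding-up division above usually launches more threads than there are elements. A small sketch of that arithmetic follows; `threadgroup_grid` is a hypothetical helper, not a PyMetal function.

```python
def threadgroup_grid(num_elements: int, threads_per_group: int):
    """Return (num_groups, excess_threads) for a 1D dispatch.

    num_groups is the ceiling of num_elements / threads_per_group;
    excess_threads counts launched threads that have no element to process.
    """
    if num_elements <= 0 or threads_per_group <= 0:
        raise ValueError("both arguments must be positive")
    num_groups = (num_elements + threads_per_group - 1) // threads_per_group
    excess_threads = num_groups * threads_per_group - num_elements
    return num_groups, excess_threads


groups, excess = threadgroup_grid(100_000, 256)
print(f"{groups} groups, {excess} excess threads")
```

With 100,000 elements and 256 threads per group, the last group is only partly used. A kernel dispatched this way typically needs a guard such as `if (id >= count) return;`, with `count` passed in as an extra argument, so the excess threads do not access memory past the end of the buffer.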
Synchronization
# Simple: wait for completion
cmd_buffer.wait_until_completed()
# Advanced: use fences
fence = device.new_fence()
encoder.update_fence(fence)
# ... later ...
encoder.wait_for_fence(fence)
# Events (Phase 3)
event = device.new_event()
shared_event = device.new_shared_event()
shared_event.signaled_value = 42
Debugging
# Enable Metal validation
import os
os.environ['METAL_DEVICE_WRAPPER_TYPE'] = '1'
os.environ['MTL_DEBUG_LAYER'] = '1'
# Use capture scopes with Xcode
manager = pm.shared_capture_manager()
scope = manager.new_capture_scope_with_command_queue(queue)
scope.label = "My Debug Capture"
scope.begin_scope()
# ... GPU work ...
scope.end_scope()
# Capture in Xcode: Product > Perform Action > Capture GPU Frame
# Add labels for debugging
buffer.label = "Input Data"
cmd_buffer.label = "Main Rendering Pass"
Examples
See examples/README.md for detailed examples:
- 01_image_blur.py - Gaussian blur compute shader
- 02_matrix_multiply_naive.py - Simple matrix multiplication (educational)
- 02_matrix_multiply_tiled.py - Optimized with shared memory tiling
- 02_matrix_multiply_optimized.py - Advanced optimizations
- 03_triangle_rendering.py - Complete graphics pipeline
- 04_advanced_features.py - Events, capture scopes, and more
Run any example:
python examples/01_image_blur.py
When to Use PyMetal vs Alternatives
Use PyMetal When
- You need custom GPU algorithms not available in libraries
- You want full control over GPU resources
- You're doing image processing, simulations, or custom compute
- You need to fuse operations for efficiency
- You want to learn GPU programming on Apple Silicon
- You need rasterization or compute pipelines (ray tracing is not yet supported)
Use NumPy/SciPy When
- You need standard operations (matrix multiply, FFT, convolution)
- Prototyping and development speed matter
- Datasets are small enough that GPU overhead dominates
- Apple's Accelerate framework already provides an optimized implementation
Hybrid Approach
Most applications use both:
- NumPy for standard linear algebra
- PyMetal for custom kernels and GPU-specific operations
- Example: NumPy for matrix ops, PyMetal for custom activation functions
Performance Tips
- Use Shared Storage Mode for CPU-GPU data transfer
- Batch operations - submit multiple dispatches per command buffer
- Optimize thread group size - typically 64-256 threads per group
- Use shared/threadgroup memory for data reuse
- Profile with Instruments - Xcode's GPU profiling tools work great
- Release GIL - PyMetal properly releases GIL during blocking operations
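The last tip matters when GPU waits run on a worker thread: because the blocking call releases the GIL, other Python threads can keep working in the meantime. The sketch below uses `time.sleep` as a stand-in for `cmd_buffer.wait_until_completed()` (an assumption so it runs without a GPU); `blocking_wait` is a hypothetical name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_wait(seconds: float) -> str:
    """Stand-in for a blocking GPU wait; time.sleep releases the GIL."""
    time.sleep(seconds)
    return "completed"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the blocking wait on a worker thread.
    future = pool.submit(blocking_wait, 0.05)
    # CPU-side work proceeds on the main thread while the worker is blocked.
    cpu_total = sum(i * i for i in range(10_000))
    status = future.result()

print(status, cpu_total)
```

Which PyMetal objects can safely be shared between threads is a separate question; the project lists docs/THREAD_SAFETY.md for that.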
Project Structure
pymetal-cpp/
├── src/
│ ├── _pymetal.cpp # Main C++ bindings
│ └── pymetal/
│ ├── __init__.py # Python module exports
│ ├── __init__.pyi # Type stubs for IDE support
│ ├── exceptions.py # Custom exception hierarchy
│ ├── enums.py # Enumeration submodule
│ ├── types.py # Utility types submodule
│ ├── compute.py # Compute pipeline submodule
│ ├── graphics.py # Graphics pipeline submodule
│ ├── advanced.py # Advanced features submodule
│ └── shader.py # Shader preprocessing utilities
├── docs/
│ └── THREAD_SAFETY.md # Thread safety documentation
├── examples/ # 6 practical examples
│ ├── 01_image_blur.py
│ ├── 02_matrix_multiply_*.py
│ ├── 03_triangle_rendering.py
│ └── 04_advanced_features.py
├── tests/ # 110 unit tests
│ ├── test_phase1_compute.py
│ ├── test_phase2_graphics.py
│ ├── test_phase2_advanced.py
│ ├── test_phase3_advanced.py
│ ├── test_validation.py # Exception and validation tests
│ ├── test_benchmarks.py # Performance regression tests
│ ├── test_edge_cases.py # Boundary condition tests
│ └── test_new_features.py # Submodules, multi-device, shader tests
├── thirdparty/
│ └── metal-cpp/ # Apple's Metal C++ headers
├── CMakeLists.txt # Build configuration
├── pyproject.toml # Python package metadata
└── README.md # This file
Testing
Run the test suite:
make test
# or
pytest
All 110 tests cover:
- Device and buffer management
- Compute pipeline execution
- Graphics pipeline rendering
- Advanced features (events, capture scopes, etc.)
- Memory management and synchronization
- Custom exception hierarchy and validation
- Performance regression benchmarks
- Edge cases and boundary conditions
- Multi-device enumeration
- Shader preprocessing utilities
Roadmap / Future Work
Potential Features (On-Demand)
Ray Tracing Support:
- Acceleration structure creation and management
- Ray tracing pipeline descriptors
- Intersection function tables
- Ray/primitive intersection queries
Additional Features:
- Resource heaps with placement
- Sparse textures
- Indirect argument buffers
- Metal Performance Shaders (MPS) integration
- Async compute and graphics overlap
- Multi-GPU support (copy_all_devices, device selection properties) (implemented)
Tooling:
- Shader preprocessing utilities (ShaderPreprocessor, ShaderTemplate) (implemented)
- Performance profiling helpers
- Memory leak detection
- Automatic optimization suggestions
Language Bindings:
- Type stubs for better IDE support (pymetal/__init__.pyi)
- Documentation generator from C++ comments
- Organized namespace (pymetal.enums, pymetal.compute, etc.)
These features can be implemented as needed. Contributions welcome!
Contributing
Contributions welcome! Areas of interest:
- Ray tracing support (most requested)
- Additional examples and tutorials
- Performance optimizations
- API coverage improvements
- Documentation enhancements
- Bug fixes and testing
Acknowledgments
- Built on Apple's metal-cpp
- Uses nanobind for Python bindings
- Inspired by the need for low-level GPU access from Python on macOS
- Claude Code from Anthropic
Resources
- Metal Shading Language Specification
- Metal Best Practices Guide
- Metal Programming Guide
- PyMetal Examples
Support
- Issues: Open an issue on GitHub
- Examples: See the examples/ directory
- Tests: See the tests/ directory for API usage patterns
Note: PyMetal is designed for educational and research purposes. For production graphics applications, consider using established game engines or frameworks.