
  __  __            __  __      _        _ ____
 |  \/  | __ _  ___|  \/  | ___| |_ __ _| |  _ \ _   _
 | |\/| |/ _` |/ __| |\/| |/ _ \ __/ _` | | |_) | | | |
 | |  | | (_| | (__| |  | |  __/ || (_| | |  __/| |_| |
 |_|  |_|\__,_|\___|_|  |_|\___|\__\__,_|_|_|    \__, |
                                                  |___/

MacMetalPy

Shred data on Apple Silicon. No CUDA required.

A CuPy-compatible GPU array library that rips through computation on Apple Silicon using the Metal backend. Drop it into your existing CuPy code, swap the import, and let your M-series chip absolutely shred.

Heads up: Metal GPUs operate in float32 — there is no hardware float64. MacMetalPy auto-downcasts float64 → float32 by default (with warnings), or can fall back to CPU. See Float Precision for details.

import macmetalpy as cp

a = cp.random.randn(4096, 4096)
b = cp.random.randn(4096, 4096)
c = a @ b  # 🔥 Metal GPU goes brrr

The Setlist

  • Drop-in CuPy replacement — import macmetalpy as cp and your existing code just works
  • 200+ NumPy-compatible functions — creation, math, linalg, FFT, random, indexing, sorting, reductions, and more
  • Async Metal dispatch — operations fire off to the GPU and don't wait around
  • RawKernel — write your own Metal Shading Language kernels when the built-in riffs aren't enough
  • 17,000+ passing tests — battle-tested across 10 dtypes and every edge case we could throw at it
  • Zero CUDA dependency — pure Apple Silicon, pure Metal
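The drop-in claim suggests a portable import shim. The three-way fallback below is an illustration, not MacMetalPy API: prefer CuPy where CUDA is available, then MacMetalPy on Apple Silicon, then NumPy on CPU.

```python
# Illustrative backend-selection shim: any of the three modules can
# serve as `xp` because they share the NumPy-style namespace.
try:
    import cupy as xp            # NVIDIA GPU via CUDA
except ImportError:
    try:
        import macmetalpy as xp  # Apple Silicon GPU via Metal
    except ImportError:
        import numpy as xp       # CPU fallback

a = xp.zeros((4, 4))             # same call, whichever backend loaded
```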

Plug In & Play

pip install macmetalpy

Requirements:

  • macOS (Apple Silicon — M1/M2/M3/M4)
  • Python >= 3.10
  • numpy >= 2.0
  • metalgpu >= 0.0.5

Soundcheck

Create arrays on the GPU:

import macmetalpy as cp

a = cp.zeros((1000, 1000))
b = cp.ones((1000, 1000))
c = cp.arange(0, 100, dtype=cp.int32)    # explicit int dtype
d = cp.linspace(0, 1, 256, dtype=cp.float16)  # half precision

Rip through math:

import macmetalpy as cp

x = cp.random.randn(10000)

# Elementwise operations — all on the GPU
y = cp.sqrt(cp.abs(x)) + cp.exp(-x ** 2)

# Reductions
total = cp.sum(y)
avg = cp.mean(y)

Linear algebra:

import macmetalpy as cp

A = cp.random.randn(512, 512)
b = cp.random.randn(512)

x = cp.linalg.solve(A, b)          # Solve Ax = b
U, S, Vt = cp.linalg.svd(A)        # SVD
eigenvalues = cp.linalg.eigvalsh(A @ A.T)  # Eigenvalues

Pull results back to CPU:

gpu_result = cp.sum(cp.random.randn(1000000))
numpy_array = gpu_result.get()  # Transfer to NumPy

Benchmarks — When Does the GPU Shred?

MacMetalPy vs NumPy on an M4 Mac Mini, float32. Small arrays use optimized CPU paths (NumPy SIMD is hard to beat below 100K elements), while the GPU shines on large compute-heavy workloads and specialized operations.

The Scaling Story

Operation       1K      100K    1M
a + b           0.40x   0.93x   0.98x
sin(a)          0.69x   0.98x   2.41x
exp(a)          0.72x   0.96x   2.12x
cumsum(a)       0.59x   1.40x   1.39x
floor_divide    0.79x   3.04x   11.12x
mod(a, b)       0.76x   2.35x   7.57x
randn(a)        2.88x   3.71x   3.69x
normal(a)       2.02x   2.48x   2.47x
sort(a)         0.84x   0.93x   1.57x
searchsorted    0.01x   4.12x   32.50x

Values are speedup vs NumPy (higher = faster; anything above 1x is a MacMetalPy win).

Where MacMetalPy Shreds

Category         Avg Speedup   Highlights
Creation (f64)   10.45x        array() 109x at 1M — skips float64 intermediates
Creation         2.74x         array() 55x at 1M
Sorting          3.35x         searchsorted 32.5x at 1M, sort 1.6x at 1M
Trig             1.88x         arccos 8.3x, arcsin 8.3x at 1M — GPU shines at scale
Random           1.86x         randn 3.7x, normal 2.5x — native float32 generation
Ufuncs           1.82x         fabs 10x, logaddexp 11.5x at 1M
Math             1.69x         floor_divide 11x, mod 7.6x at 1M

By Category at 100K / 1M Elements

Category      100K    1M      Notes
Random        2.03x   2.04x   Native float32 via Generator API
Sorting       1.55x   7.87x   searchsorted 32.5x at 1M
Ufuncs        1.48x   3.36x   GPU dominates at scale
Creation      1.44x   6.33x   Dtype conversion bypass at scale
Math          1.19x   3.30x   floor_divide 11x at 1M
Trig          1.03x   3.88x   GPU wins decisively at 1M
Reductions    0.94x   0.88x   cumsum 1.39x
Comparisons   0.90x   0.97x   Near-parity

The Rule of Thumb

Array Size    Who Wins        Why
< 10K         NumPy           Python dispatch overhead dominates
10K – 100K    Roughly even    CPU SIMD paths match NumPy
100K – 1M     GPU wins many   Trig, math, sorting, ufuncs all >1x; random/creation dominate
1M+           GPU shreds      Metal dispatch amortized, massive parallelism wins
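If you want to act on this rule of thumb in code, one option is size-gated dispatch. A minimal sketch — the threshold and helper name here are hypothetical, not part of MacMetalPy:

```python
import numpy as np

GPU_THRESHOLD = 100_000  # below this, NumPy's CPU SIMD paths tend to win

def pick_backend(n_elements):
    """Choose an array module for a workload of n_elements elements."""
    if n_elements < GPU_THRESHOLD:
        return np
    try:
        import macmetalpy
        return macmetalpy          # large workload: use the Metal GPU
    except ImportError:
        return np                  # no Metal backend available; stay on CPU

xp = pick_backend(1_000)           # small workload resolves to NumPy
```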

Run the benchmarks yourself: python benchmarks/bench_vs_numpy.py --numpy-cache


The Lineup

Module            Functions   What it shreds
Creation          25          zeros, ones, arange, linspace, eye, meshgrid, ...
Math              94          sqrt, exp, log, sin, cos, dot, where, clip, ...
Reductions        21          sum, mean, std, var, argmax, cumsum, median, ...
Linalg            25          solve, inv, svd, eigh, qr, det, norm, einsum, ...
Manipulation      33          reshape, transpose, concatenate, stack, pad, tile, ...
Indexing          23          take, put, nonzero, argwhere, fill_diagonal, ...
Sorting           9           sort, argsort, unique, searchsorted, partition, ...
FFT               19          fft, ifft, rfft, fft2, fftn, fftfreq, ...
Random            40+         randn, uniform, normal, poisson, choice, shuffle, ...
Logic & Bitwise   30          logical_and, greater, bitwise_xor, gcd, lcm, ...
NaN Ops           27          nansum, nanmean, histogram, corrcoef, gradient, ...
Set Ops           7           union1d, intersect1d, setdiff1d, isin, ...

Custom Riffs

When the built-in operations don't cut it, write your own Metal Shading Language kernels with RawKernel:

from macmetalpy import RawKernel
import macmetalpy as cp

# Write a custom Metal kernel
kernel_source = """
#include <metal_stdlib>
using namespace metal;

kernel void saxpy(device float *x [[buffer(0)]],
                  device float *y [[buffer(1)]],
                  device float *out [[buffer(2)]],
                  uint id [[thread_position_in_grid]]) {
    float alpha = 2.5f;
    out[id] = alpha * x[id] + y[id];
}
"""

saxpy = RawKernel(kernel_source, 'saxpy')

N = 1_000_000
x = cp.random.randn(N)
y = cp.random.randn(N)
out = cp.empty(N)

saxpy(N, (x, y, out))  # Launch N GPU threads

result = out.get()

Grid sizes can be 1D, 2D, or 3D:

kernel(N, args)              # 1D — N threads
kernel((W, H), args)         # 2D grid
kernel((W, H, D), args)      # 3D grid
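When writing custom kernels, it helps to keep a CPU reference alongside the Metal source. A NumPy check for the saxpy kernel above (alpha is hard-coded to 2.5 in the Metal code; the helper name is ours, not MacMetalPy API):

```python
import numpy as np

def saxpy_ref(x, y, alpha=2.5):
    """CPU reference for the Metal saxpy kernel: out = alpha * x + y."""
    return (alpha * x + y).astype(np.float32)

x = np.random.randn(1024).astype(np.float32)
y = np.random.randn(1024).astype(np.float32)
expected = saxpy_ref(x, y)

# Compare against the GPU result with a float32-appropriate tolerance:
# np.testing.assert_allclose(out.get(), expected, rtol=1e-5)
```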

Float Precision & The float64 Question

This is the biggest difference between MacMetalPy and NumPy/CuPy.

Apple's Metal GPU has no native float64 (double) support. All GPU computation runs in float32 (single precision) or float16 (half precision). This is a hardware limitation — not a software one.

What this means in practice

Scenario                         What happens
cp.array([1.0, 2.0])             Created as float32 (NumPy would default to float64)
cp.zeros(10, dtype=np.float64)   Downcast to float32 with a warning (by default)
cp.linalg.solve(A, b)            Runs in float32 — ~7 decimal digits of precision
cp.sum(x, dtype=np.float64)      Accumulates in float32
complex128 input                 Downcast to complex64 (two float32 values)

When float32 is fine (most cases)

  • Machine learning / deep learning (models train in float16/float32 anyway)
  • Image and signal processing
  • General scientific computing where ~7 digits of precision is sufficient
  • Data analysis and statistics on reasonably-scaled data
  • FFT, random number generation, sorting, indexing

When you might need float64

  • Numerical methods sensitive to rounding (e.g., ill-conditioned linear systems)
  • Financial calculations requiring exact decimal precision
  • Accumulating very large sums (billions of elements) where error compounds
  • Algorithms that rely on the full 15-16 digits of float64 precision
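The practical gap is easy to demonstrate with NumPy alone — float32's ~7 significant digits swallow small increments that float64 keeps:

```python
import numpy as np

# Adding 1 to 1e8 exceeds float32's ~7 digits of precision,
# so the increment is lost entirely; float64 retains it.
a32 = np.float32(1e8) + np.float32(1.0)
a64 = np.float64(1e8) + np.float64(1.0)
print(a32 == np.float32(1e8))  # True  — the +1 vanished
print(a64 == np.float64(1e8))  # False — float64 keeps it
```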

Configuring float64 behavior

from macmetalpy import set_config

# DEFAULT: Downcast float64 → float32, emit a warning
set_config(float64_behavior="downcast", warn_on_downcast=True)

# Silence the warnings if you know what you're doing
set_config(float64_behavior="downcast", warn_on_downcast=False)

# Fall back to CPU (NumPy) for any float64 operation
set_config(float64_behavior="cpu_fallback")

# Set the default float dtype for creation functions
set_config(default_float_dtype="float32")

Comparison with NumPy and CuPy

                   NumPy (CPU)   CuPy (CUDA)   MacMetalPy (Metal)
Default float      float64       float64       float32
float64 support    Native        Native        Downcast or CPU fallback
float16 support    Software      Native        Native
complex128         Native        Native        Downcast to complex64
int8 / uint8       Native        Native        Not supported
Precision digits   ~15-16        ~15-16        ~7 (float32)
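The default-float row is the one most likely to surprise ported code. NumPy's side of it can be checked directly (MacMetalPy, per the table, creates float32 instead):

```python
import numpy as np

# NumPy promotes Python float literals to float64 by default.
a = np.array([1.0, 2.0])
print(a.dtype)  # float64
```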

Supported Amps

Dtype       Metal Type      Notes
float32     float           Default float — full GPU support
float16     half            Half precision — fastest for large arrays
int32       int             Default int type
int64       long            64-bit integer
int16       short           16-bit integer
uint32      uint            Unsigned 32-bit
uint64      uint64_t        Unsigned 64-bit
uint16      uint16_t        Unsigned 16-bit
bool        bool            Boolean
complex64   float32 pairs   Stored as real/imag float32

Not supported by Metal: float64, complex128, int8, uint8, longdouble, str_, bytes_, object_
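The complex64 layout from the table — two float32s per element — can likewise be inspected with NumPy:

```python
import numpy as np

z = np.array([1 + 2j], dtype=np.complex64)
print(z.itemsize)    # 8 bytes: real + imag, each a 4-byte float32
print(z.real.dtype)  # float32
```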


Acknowledgments

MacMetalPy stands on the shoulders of giants:

  • NumPy — The foundation. MacMetalPy's API is modeled after NumPy's, because they got it right the first time.
  • CuPy — The blueprint for GPU array libraries. CuPy proved that a drop-in NumPy replacement on the GPU is both possible and practical.
  • metalgpu — The engine under the hood. Without metalgpu's Python-to-Metal bridge, MacMetalPy wouldn't exist.

The Crew

License: MIT

Contributing: Issues and PRs welcome. If you find a bug or want to add a new function, open an issue or submit a pull request.

Built by @grantkl
