CUDA-compatible GPU runtime for Apple Silicon that transparently maps CUDA kernels and memory operations to Metal, enabling Numba and Python GPU workloads with stream overlap and tier-aware memory management.
MetaXuda
MetaXuda is an experimental CUDA-compatible runtime shim for Apple Silicon, written in Rust, that allows Numba CUDA kernels to run unmodified by transparently mapping CUDA runtime calls to Apple Metal.
It is designed as a drop-in replacement for core CUDA runtime libraries, enabling GPU-accelerated Python workflows on macOS without requiring the NVIDIA CUDA Toolkit or NVIDIA hardware.
✨ Features
- Drop-in replacement for libcudart.dylib and libcuda.dylib
- Run Numba CUDA kernels (@cuda.jit) directly on Apple Metal
- Metal-backed implementations of core CUDA APIs:
  - cudaMalloc / cudaFree
  - cudaMemcpy / cudaMemcpyAsync
  - cudaLaunchKernel
- Asynchronous execution with stream-style overlap (copy / compute / copy)
- Tier-aware memory management (GPU-first execution)
- Ships with:
  - Stubbed libdevice.bc for Numba compatibility
  - Precompiled Metal .metallib shaders for fused math operations
  - cuda_pipeline.so, exposing a low-level execution API that allows Numba and other callers to bypass the CUDA runtime shim and dispatch operations directly
- No CUDA Toolkit, NVIDIA drivers, or NVIDIA GPU required
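The copy / compute / copy overlap listed above can be sketched with Numba's standard stream API (cuda.stream, cuda.to_device, copy_to_host). This is a minimal sketch, assuming the MetaXuda shim honors these calls the way the CUDA runtime does; the helper names chunk_bounds and scale_chunked are hypothetical and not part of MetaXuda.

```python
def chunk_bounds(n, chunk):
    """Return half-open (start, stop) ranges of at most `chunk` items covering range(n)."""
    if n < 0 or chunk <= 0:
        raise ValueError("n must be >= 0 and chunk must be > 0")
    return [(start, min(start + chunk, n)) for start in range(0, n, chunk)]


def scale_chunked(host_in, factor, chunk=1 << 16, threads=128):
    """Multiply a 1-D float32 array by `factor`, issuing each chunk on its own stream.

    Hypothetical helper: whether chunks actually overlap depends on the runtime.
    """
    # Imported lazily so chunk_bounds stays usable without Numba installed.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, k, out):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = x[i] * k

    host_out = np.empty_like(host_in)
    streams = []
    for start, stop in chunk_bounds(host_in.size, chunk):
        stream = cuda.stream()
        d_in = cuda.to_device(host_in[start:stop], stream=stream)  # copy in
        d_out = cuda.device_array_like(d_in, stream=stream)
        blocks = (stop - start + threads - 1) // threads
        scale[blocks, threads, stream](d_in, factor, d_out)        # compute
        d_out.copy_to_host(host_out[start:stop], stream=stream)    # copy out
        streams.append(stream)

    # Wait for every chunk before handing the result back.
    for stream in streams:
        stream.synchronize()
    return host_out
```

Issuing each chunk on a separate stream lets the runtime start copying chunk k+1 while chunk k is still computing, which is the overlap the feature list describes.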
⚠️ Project Status
Alpha / Research Prototype
MetaXuda is under active development and currently targets:
- Numba CUDA kernels
- Single-GPU execution on Apple Silicon
Not all CUDA APIs are implemented, and behavior may differ from NVIDIA CUDA in edge cases.
⚙️ Installation
Requirements
- macOS 13+
- Python >= 3.10
- NumPy >= 1.23
- Numba >= 0.59
Install (Editable / Dev)
# Clone the repository
git clone https://github.com/perinban/MetaXuda.git
cd MetaXuda
# Install in editable mode
pip install -e .
The installation places the required shim libraries (libcudart.dylib, libcuda.dylib, and libdevice.bc) inside the package so they can be discovered by Numba at runtime.
📂 Package Layout
MetaXuda ships demos and helper modules inside the Python package so they are available in editable and installed modes:
metaxuda/
├── buffers/ # GPU, managed, and tiered buffer abstractions
├── execution/ # Direct and pooled execution backends
├── streams/ # Stream and async execution helpers (Numba-compatible)
├── demos/ # End-to-end demos and debug examples
├── native/ # Native shims and pipelines
│ ├── libcudart.dylib
│ ├── libcuda.dylib
│ ├── libnvvm.dylib
│ ├── libdevice.bc
│ └── cuda_pipeline.so
├── env.py # Environment detection and setup
├── patch.py # Numba / runtime patching hooks
└── __init__.py
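To confirm that an installation contains the native files shown in the layout above, a small check like the following can help. It is a sketch only: missing_native_files is a hypothetical helper, not part of MetaXuda, and the file list is copied from the tree above.

```python
from pathlib import Path

# File names taken from the native/ directory in the package layout.
EXPECTED_NATIVE_FILES = (
    "libcudart.dylib",
    "libcuda.dylib",
    "libnvvm.dylib",
    "libdevice.bc",
    "cuda_pipeline.so",
)


def missing_native_files(package_dir):
    """Return the expected native files that are absent from <package_dir>/native."""
    native_dir = Path(package_dir) / "native"
    return [name for name in EXPECTED_NATIVE_FILES if not (native_dir / name).is_file()]
```

With the package installed, package_dir would typically be Path(metaxuda.__file__).parent; an empty list means every file in the layout was found.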
The demos/ directory contains runnable examples covering kernel execution, buffers, streams, disk tiering, and the direct math pipeline.
You can run them directly once the package is installed:
python -m metaxuda.demos.add
python -m metaxuda.demos.pipeline
🚀 Usage
Once installed, existing Numba CUDA code should run without modification:
from numba import cuda
import numpy as np

@cuda.jit
def add(a, b, out):
    i = cuda.grid(1)
    if i < out.size:  # guard threads that fall past the end of the array
        out[i] = a[i] + b[i]

n = 1024
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)

# 32 blocks x 32 threads per block = 1024 threads, one per element
add[32, 32](a, b, out)
print(out[:5])
Execution is transparently dispatched to Metal via the MetaXuda runtime.
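The launch configuration in the example is hard-coded for n = 1024. For other sizes, the grid is usually derived by ceiling division so that every element gets a thread. The helper below is an illustrative sketch; launch_config is a hypothetical name, not a MetaXuda or Numba function.

```python
def launch_config(n, threads_per_block=32):
    """Return (blocks_per_grid, threads_per_block) covering n elements."""
    if n <= 0 or threads_per_block <= 0:
        raise ValueError("n and threads_per_block must be positive")
    # Ceiling division: round up so a partial final block is still launched.
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


print(launch_config(1024))  # (32, 32), the configuration used in the example
print(launch_config(1000))  # (32, 32): 24 surplus threads are masked by the bounds check
print(launch_config(1025))  # (33, 32)
```

The result can be passed straight to the kernel, for example add[launch_config(n)](a, b, out), and the i < out.size check in the kernel keeps surplus threads from writing out of bounds.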
🗜️ Quantization, Compression, and Disk Tiering
MetaXuda supports quantized and compressed data storage for non-resident buffers and intermediate results. These behaviors are controlled via environment variables and handled by the runtime initialization logic in env.py.
This is primarily used for Tier‑3 (disk-backed) storage, allowing large workloads to exceed GPU memory limits while minimizing I/O and storage overhead.
Environment Configuration
The shim reads the following environment variables at startup:
- MX_ENABLE_DATASTORE_COMPRESSION (default: 1): Enable or disable compression for spilled data blocks.
- MX_DATASTORE_COMPRESSION_TYPE (default: lz4): Compression algorithm to use (e.g. lz4).
- MX_DATASTORE_COMPRESSION_LEVEL (default: 3): Compression level passed to the backend compressor.
- MX_DISK_PARALLELISM_LEVEL (default: auto): Controls parallel read/write behavior for disk operations.
- MX_DISK_SPILL_ENABLED (default: 0): Enable spilling GPU buffers to disk when memory pressure occurs.
- MX_TIER3_STRATEGY (default: prefer_external): Strategy for selecting Tier‑3 storage locations.
- MX_TIER3_INTERNAL_PATH (default: block_store): Directory used for internal Tier‑3 storage.
- MX_TIER3_EXTERNAL_DEVICES (format: id:path,id:path): Comma‑separated list of external devices or paths for Tier‑3 storage.
- MX_DEBUG (options: memory): Enable debug logging for specific subsystems.
These settings allow fine‑grained control over compression, quantization, disk spill behavior, and debugging without changing application code.
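Because the variables are read at startup, they need to be in the environment before the shim is loaded. One way to do that from Python is sketched below. The helper configure_metaxuda is hypothetical, the "1"/"0" values mirror the documented defaults, and the device id and path in the usage note are placeholders.

```python
import os


def configure_metaxuda(spill=True, compression="lz4", level=3,
                       external_devices=None, debug=None):
    """Set the MX_* variables documented above and return what was set.

    Hypothetical helper; call it before importing numba so the shim
    sees the values when it initializes.
    """
    env = {
        "MX_DISK_SPILL_ENABLED": "1" if spill else "0",
        "MX_ENABLE_DATASTORE_COMPRESSION": "1" if compression else "0",
    }
    if compression:
        env["MX_DATASTORE_COMPRESSION_TYPE"] = compression
        env["MX_DATASTORE_COMPRESSION_LEVEL"] = str(level)
    if external_devices:
        # Documented format: id:path,id:path
        env["MX_TIER3_EXTERNAL_DEVICES"] = ",".join(
            f"{dev_id}:{path}" for dev_id, path in external_devices.items()
        )
    if debug:
        env["MX_DEBUG"] = debug
    os.environ.update(env)
    return env
```

For example, configure_metaxuda(external_devices={"ssd0": "/Volumes/Scratch"}, debug="memory") enables disk spill with lz4 compression at level 3 and sets MX_TIER3_EXTERNAL_DEVICES to ssd0:/Volumes/Scratch. The same variables can equally be exported in the shell before launching Python.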
🧮 Operation Coverage
MetaXuda includes a precompiled Metal math pipeline (cuda_pipeline.so) implementing a broad set of scalar and elementwise operations that can be invoked directly by Numba or higher-level tooling.
- 230+ operations covering:
  - Arithmetic, comparison, and logical ops
  - Trigonometric and hyperbolic functions
  - Exponentials, logarithms, and powers
  - Reductions and distance metrics
  - Activation functions (ReLU, GELU, SiLU, Mish, etc.)
  - Probability distributions and loss functions
  - Signal, interpolation, and utility math
- Each operation is mapped to a corresponding Metal expression
- Selected ops support fast-math variants where numerically safe
This allows many Numba-generated kernels to execute without requiring full PTX → Metal translation, significantly reducing overhead.
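When checking GPU results for the activation functions named above, a scalar CPU reference is a convenient baseline, particularly for ops with fast-math variants. The definitions below are the standard textbook formulas written with Python's math module; they are not taken from MetaXuda's shaders. Which GELU form the pipeline uses (exact erf or tanh approximation) is not stated here, so the tanh form is an assumption.

```python
import math


def relu(x):
    """max(x, 0)"""
    return x if x > 0.0 else 0.0


def silu(x):
    """x * sigmoid(x)"""
    return x / (1.0 + math.exp(-x))


def mish(x):
    """x * tanh(softplus(x))"""
    return x * math.tanh(math.log1p(math.exp(x)))


def gelu_tanh(x):
    """Tanh approximation of GELU (assumed variant; the exact form uses erf)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_abs_error(reference, inputs, outputs):
    """Largest |reference(x) - y| over paired inputs and outputs."""
    return max(abs(reference(x) - y) for x, y in zip(inputs, outputs))
```

These are naive formulas: math.exp overflows for arguments beyond roughly 709, so keep test inputs in a moderate range. Comparing max_abs_error against a tolerance suited to float32 is usually more informative than testing for exact equality.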
🧠 Architecture Overview
- Rust-based CUDA shim implementing core CUDA runtime APIs
- Metal compute pipelines for kernel execution
- Stubbed NVVM / libdevice layer for Numba compilation compatibility
- Python package acts as a loader and distribution mechanism for native libraries
License
MetaXuda is free for students and personal use. Commercial use requires a license.
- 🎓 Students: Free with valid educational email
- 👤 Personal: Free for non-commercial projects
- 🏢 Commercial: Contact p.perinban@gmail.com
See LICENSE for full terms.
🙏 Disclaimer
MetaXuda is not affiliated with NVIDIA. CUDA is a trademark of NVIDIA Corporation. This project is an independent compatibility layer intended for research and development purposes.