CUDA-compatible GPU runtime for Apple Silicon that transparently maps CUDA kernels and memory operations to Metal, enabling Numba and Python GPU workloads with stream overlap and tier-aware memory management.

MetaXuda

MetaXuda is an experimental CUDA-compatible runtime shim for Apple Silicon, written in Rust, that allows Numba CUDA kernels to run unmodified by transparently mapping CUDA runtime calls to Apple Metal.

It is designed as a drop-in replacement for core CUDA runtime libraries, enabling GPU-accelerated Python workflows on macOS without requiring the NVIDIA CUDA Toolkit or NVIDIA hardware.


✨ Features

  • Drop-in replacement for libcudart.dylib and libcuda.dylib

  • Run Numba CUDA kernels (@cuda.jit) directly on Apple Metal

  • Metal-backed implementations of core CUDA APIs:

    • cudaMalloc / cudaFree
    • cudaMemcpy / cudaMemcpyAsync
    • cudaLaunchKernel
  • Asynchronous execution with stream-style overlap (copy / compute / copy)

  • Tier-aware memory management (GPU-first execution)

  • Ships with:

    • Stubbed libdevice.bc for Numba compatibility
    • Precompiled Metal .metallib shaders for fused math operations
    • cuda_pipeline.so, exposing a low-level execution API that allows Numba and other callers to bypass the CUDA runtime shim and dispatch operations directly
  • No CUDA Toolkit, NVIDIA drivers, or NVIDIA GPU required
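
The copy / compute / copy overlap listed above corresponds to the usual Numba streams pattern. The following is a minimal sketch, assuming MetaXuda honors Numba's standard stream arguments (this README does not confirm the exact behavior). The names `chunk_bounds` and `scale_chunks` are hypothetical and exist only for illustration.

```python
def chunk_bounds(n, n_chunks):
    """Split range(n) into at most n_chunks contiguous (start, stop) pairs."""
    step = max(1, -(-n // n_chunks))  # ceiling division, never zero
    return [(start, min(start + step, n)) for start in range(0, n, step)]


def scale_chunks(host_in, host_out, factor, n_chunks=4, threads=128):
    """Multiply host_in by factor chunk by chunk, one Numba stream per chunk."""
    # Imported lazily so chunk_bounds stays usable without a GPU runtime.
    from numba import cuda

    @cuda.jit
    def scale(src, dst, k):
        i = cuda.grid(1)
        if i < dst.size:
            dst[i] = src[i] * k

    pending = []
    for start, stop in chunk_bounds(host_in.size, n_chunks):
        stream = cuda.stream()
        # Host-to-device copy queued on this chunk's stream.
        d_src = cuda.to_device(host_in[start:stop], stream=stream)
        d_dst = cuda.device_array_like(d_src, stream=stream)
        blocks = -(-(stop - start) // threads)
        scale[blocks, threads, stream](d_src, d_dst, factor)
        # Device-to-host copy queued behind the kernel on the same stream.
        d_dst.copy_to_host(host_out[start:stop], stream=stream)
        pending.append(stream)

    # Wait for every chunk before the caller reads host_out.
    for stream in pending:
        stream.synchronize()
```

On NVIDIA hardware, real overlap additionally needs pinned host memory (for example arrays from `cuda.pinned_array`); whether the same holds under MetaXuda is not stated here.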


⚠️ Project Status

Alpha / Research Prototype

MetaXuda is under active development and currently targets:

  • Numba CUDA kernels
  • Single-GPU execution on Apple Silicon

Not all CUDA APIs are implemented, and behavior may differ from NVIDIA CUDA in edge cases.


⚙️ Installation

Requirements

  • macOS 13+
  • Python >= 3.10
  • NumPy >= 1.23
  • Numba >= 0.59

Install (Editable / Dev)

# Clone the repository
git clone https://github.com/perinban/MetaXuda.git
cd MetaXuda

# Install in editable mode
pip install -e .

The installation places the shim libraries (libcudart.dylib and libcuda.dylib) and the stubbed libdevice.bc inside the package, where Numba can discover them at runtime.


📂 Package Layout

MetaXuda ships demos and helper modules inside the Python package, so they are available in both editable and regular installs:

metaxuda/
├── buffers/        # GPU, managed, and tiered buffer abstractions
├── execution/      # Direct and pooled execution backends
├── streams/        # Stream and async execution helpers (Numba-compatible)
├── demos/          # End-to-end demos and debug examples
├── native/         # Native shims and pipelines
│   ├── libcudart.dylib
│   ├── libcuda.dylib
│   ├── libnvvm.dylib
│   ├── libdevice.bc
│   └── cuda_pipeline.so
├── env.py          # Environment detection and setup
├── patch.py        # Numba / runtime patching hooks
└── __init__.py
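
One way to confirm that an install matches this layout is to check the native/ directory for the files listed in the tree. This is a small sketch, not part of MetaXuda: `missing_native_files` is a hypothetical helper, and it assumes the files sit directly under `native/` as shown above.

```python
import importlib.util
from pathlib import Path

# File names taken from the package layout tree above.
EXPECTED_NATIVE_FILES = (
    "libcudart.dylib",
    "libcuda.dylib",
    "libnvvm.dylib",
    "libdevice.bc",
    "cuda_pipeline.so",
)


def missing_native_files(package_dir):
    """Return the expected native files that are absent under package_dir/native."""
    native_dir = Path(package_dir) / "native"
    return [name for name in EXPECTED_NATIVE_FILES
            if not (native_dir / name).is_file()]


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("metaxuda")
    if spec is None or not spec.submodule_search_locations:
        print("metaxuda is not installed in this environment")
    else:
        package_dir = list(spec.submodule_search_locations)[0]
        print("missing:", missing_native_files(package_dir) or "nothing")
```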

The demos/ directory contains runnable examples covering kernel execution, buffers, streams, disk tiering, and the direct math pipeline.

You can run them directly once the package is installed:

python -m metaxuda.demos.add
python -m metaxuda.demos.pipeline

🚀 Usage

Once installed, existing Numba CUDA code should run without modification:

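from numba import cuda
import numpy as np

@cuda.jit
def add(a, b, out):
    # Global 1D thread index; the guard skips threads past the end of the array.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1024
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)

# 32 blocks x 32 threads = 1024 threads, one per element.
add[32, 32](a, b, out)
print(out[:5])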

Execution is transparently dispatched to Metal via the MetaXuda runtime.
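
The launch configuration `add[32, 32]` covers exactly 1024 elements. For other sizes, the block count is usually derived with a ceiling division, as in this small sketch. `launch_config` is a hypothetical helper and the default of 128 threads per block is an arbitrary choice, not a MetaXuda recommendation.

```python
def launch_config(n, threads_per_block=128):
    """Return (blocks, threads) so that blocks * threads >= n."""
    if n < 0 or threads_per_block <= 0:
        raise ValueError("n must be >= 0 and threads_per_block must be > 0")
    # Ceiling division; launch at least one block even for empty input.
    blocks = max(1, -(-n // threads_per_block))
    return blocks, threads_per_block


# Usage with the kernel above (the bounds check inside `add` handles the
# surplus threads in the last block):
#   blocks, threads = launch_config(out.size)
#   add[blocks, threads](a, b, out)
```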


🗜️ Quantization, Compression, and Disk Tiering

MetaXuda supports quantized and compressed data storage for non-resident buffers and intermediate results. These behaviors are controlled via environment variables and handled by the runtime initialization logic in env.py.

This is primarily used for Tier‑3 (disk-backed) storage, allowing large workloads to exceed GPU memory limits while minimizing I/O and storage overhead.
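
To illustrate why compressing spilled blocks reduces disk I/O, here is a toy sketch of a compressed spill and reload. It is not MetaXuda's on-disk format: `spill_block`, `load_block`, and the `.blk` file suffix are hypothetical, and zlib stands in for lz4 because lz4 is not in the Python standard library.

```python
import zlib
from pathlib import Path


def spill_block(data, directory, block_id, level=3):
    """Compress a bytes block and write it to directory/<block_id>.blk."""
    path = Path(directory) / f"{block_id}.blk"
    path.write_bytes(zlib.compress(data, level))
    return path


def load_block(path):
    """Read a spilled block back and return the original bytes."""
    return zlib.decompress(Path(path).read_bytes())


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        block = bytes(1_000_000)  # highly compressible: all zeros
        stored = spill_block(block, tmp, "demo")
        print("raw bytes:", len(block), "on disk:", stored.stat().st_size)
        assert load_block(stored) == block
```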

Environment Configuration

The shim reads the following environment variables at startup:

  • MX_ENABLE_DATASTORE_COMPRESSION (default: 1): Enable or disable compression for spilled data blocks.

  • MX_DATASTORE_COMPRESSION_TYPE (default: lz4): Compression algorithm to use (e.g. lz4).

  • MX_DATASTORE_COMPRESSION_LEVEL (default: 3): Compression level passed to the backend compressor.

  • MX_DISK_PARALLELISM_LEVEL (default: auto): Controls parallel read/write behavior for disk operations.

  • MX_DISK_SPILL_ENABLED (default: 0): Enable spilling GPU buffers to disk when memory pressure occurs.

  • MX_TIER3_STRATEGY (default: prefer_external): Strategy for selecting Tier‑3 storage locations.

  • MX_TIER3_INTERNAL_PATH (default: block_store): Directory used for internal Tier‑3 storage.

  • MX_TIER3_EXTERNAL_DEVICES (format: id:path,id:path): Comma‑separated list of external devices or paths for Tier‑3 storage.

  • MX_DEBUG (options: memory): Enable debug logging for specific subsystems.

These settings allow fine‑grained control over compression, quantization, disk spill behavior, and debugging without changing application code.
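
Because the shim reads these variables at startup, they need to be set before the runtime is loaded, which in practice likely means before importing numba. The sketch below shows one way to do that from Python. `apply_settings` and `format_external_devices` are hypothetical helpers, and the device id and path in the usage comment are made-up examples; only the variable names and defaults come from the list above.

```python
import os

# Defaults as documented in the list above.
DEFAULTS = {
    "MX_ENABLE_DATASTORE_COMPRESSION": "1",
    "MX_DATASTORE_COMPRESSION_TYPE": "lz4",
    "MX_DATASTORE_COMPRESSION_LEVEL": "3",
    "MX_DISK_PARALLELISM_LEVEL": "auto",
    "MX_DISK_SPILL_ENABLED": "0",
    "MX_TIER3_STRATEGY": "prefer_external",
    "MX_TIER3_INTERNAL_PATH": "block_store",
}
# Variables that are documented without a default value.
KNOWN = set(DEFAULTS) | {"MX_TIER3_EXTERNAL_DEVICES", "MX_DEBUG"}


def format_external_devices(devices):
    """Render {id: path} in the documented id:path,id:path format."""
    return ",".join(f"{dev_id}:{path}" for dev_id, path in devices.items())


def apply_settings(overrides, environ=None):
    """Write overrides into environ and return the effective settings."""
    environ = os.environ if environ is None else environ
    unknown = sorted(set(overrides) - KNOWN)
    if unknown:
        raise KeyError(f"unknown MetaXuda variables: {unknown}")
    for name, value in overrides.items():
        environ[name] = str(value)
    # Unset variables fall back to the documented default, or None.
    return {name: environ.get(name, DEFAULTS.get(name)) for name in sorted(KNOWN)}


# Usage (hypothetical device id and path), before importing numba:
#   apply_settings({
#       "MX_DISK_SPILL_ENABLED": 1,
#       "MX_TIER3_EXTERNAL_DEVICES": format_external_devices({"ssd0": "/Volumes/fast"}),
#   })
```

Rejecting unknown names guards against typos, which would otherwise be silently ignored by anything that reads the environment.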


🧮 Operation Coverage

MetaXuda includes a precompiled Metal math pipeline (cuda_pipeline.so) implementing a broad set of scalar and elementwise operations that can be invoked directly by Numba or higher-level tooling.

  • 230+ operations covering:

    • Arithmetic, comparison, and logical ops
    • Trigonometric and hyperbolic functions
    • Exponentials, logarithms, and powers
    • Reductions and distance metrics
    • Activation functions (ReLU, GELU, SiLU, Mish, etc.)
    • Probability distributions and loss functions
    • Signal, interpolation, and utility math
  • Each operation is mapped to a corresponding Metal expression

  • Selected ops support fast-math variants where numerically safe

This allows many Numba-generated kernels to execute without full PTX → Metal translation, which avoids the overhead of that translation step.
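
The idea of mapping each operation to a Metal expression can be sketched as a lookup table that renders a small elementwise kernel. Everything here is illustrative: the `OPS` table, `metal_kernel_source`, and the generated kernel layout are assumptions, not MetaXuda's shipped shaders or the API of cuda_pipeline.so. The Python lambdas serve only as CPU references for the same formulas.

```python
import math

# Hypothetical op table: name -> (Metal expression in x, Python reference).
OPS = {
    "relu": ("max(x, 0.0f)", lambda x: max(x, 0.0)),
    "silu": ("x / (1.0f + exp(-x))", lambda x: x / (1.0 + math.exp(-x))),
    "mish": ("x * tanh(log(1.0f + exp(x)))",
             lambda x: x * math.tanh(math.log1p(math.exp(x)))),
}


def metal_kernel_source(op_name):
    """Render Metal Shading Language source for one elementwise op."""
    expr, _reference = OPS[op_name]
    return "\n".join([
        "#include <metal_stdlib>",
        "using namespace metal;",
        f"kernel void {op_name}(device const float* src [[buffer(0)]],",
        "                 device float* dst [[buffer(1)]],",
        "                 uint i [[thread_position_in_grid]]) {",
        "    float x = src[i];",
        f"    dst[i] = {expr};",
        "}",
    ])


if __name__ == "__main__":
    print(metal_kernel_source("silu"))
```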


🧠 Architecture Overview

  • Rust-based CUDA shim implementing core CUDA runtime APIs
  • Metal compute pipelines for kernel execution
  • Stubbed NVVM / libdevice layer for Numba compilation compatibility
  • Python package acts as a loader and distribution mechanism for native libraries

License

MetaXuda is free for students and personal use. Commercial use requires a license.

  • 🎓 Students: Free with valid educational email
  • 👤 Personal: Free for non-commercial projects
  • 🏢 Commercial: Contact p.perinban@gmail.com

See LICENSE for full terms.


🙏 Disclaimer

MetaXuda is not affiliated with NVIDIA. CUDA is a trademark of NVIDIA Corporation. This project is an independent compatibility layer intended for research and development purposes.
