SAPPHIRE: High-Performance Compute Acceleration Framework for Apple Silicon
SAPPHIRE: The NVIDIA CUDA Killer for Apple Silicon
SAPPHIRE is a complete CUDA replacement that extracts about 1.6 TFLOPS from Apple Silicon's AMX accelerator. Train and run AI models on a Mac Mini that costs about 50x less ($599 vs. $30,000) and draws about 23x less power (30 W vs. 700 W) than an NVIDIA H100.
Performance
| Operation | SAPPHIRE | NVIDIA H100* |
|---|---|---|
| SGEMM | 1.56 TFLOPS | 60 TFLOPS |
| Flash Attention | 943 GFLOPS | ~20 TFLOPS |
| Conv2D | 1.57 TFLOPS | ~30 TFLOPS |
| INT8 Quantize | 6.3 B elem/s | ~50 B elem/s |
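*H100 costs $30,000 and draws 700 W. A Mac Mini costs $599 and draws 30 W.
Price/Performance: the Mac Mini costs 50x less, and SGEMM throughput per dollar favors SAPPHIRE by about 1.3x (0.0026 vs. 0.002 TFLOPS/$).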
Installation
pip install sapphire-compute
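Because SAPPHIRE targets Apple Silicon, it can be worth checking the host before installing. The sketch below uses only the standard library; `is_apple_silicon` is a hypothetical helper for illustration and is not part of SAPPHIRE.

```python
import platform

def is_apple_silicon(system=None, machine=None):
    """Return True when the host looks like an Apple Silicon Mac.

    Hypothetical helper, not part of SAPPHIRE. The optional arguments exist
    only so the check can be exercised on other machines.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # macOS reports "Darwin"; Apple Silicon reports "arm64", Intel Macs "x86_64".
    return system == "Darwin" and machine == "arm64"

print("Apple Silicon host:", is_apple_silicon())
```

A Python interpreter running under Rosetta reports `x86_64`, so a `False` result on an Apple Silicon Mac may just mean the interpreter is an Intel build.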
Quick Start
import sapphire
import numpy as np
# Matrix multiplication at 1.6 TFLOPS
A = np.random.randn(4096, 4096).astype(np.float32)
B = np.random.randn(4096, 4096).astype(np.float32)
C = sapphire.matmul(A, B) # Uses AMX!
# Flash Attention V5
Q = np.random.randn(2, 16, 512, 64).astype(np.float32)
K = np.random.randn(2, 16, 512, 64).astype(np.float32)
V = np.random.randn(2, 16, 512, 64).astype(np.float32)
out = sapphire.flash_attention(Q, K, V)
# CUDA compatibility (drop-in replacement!)
cuda = sapphire.cuda
cuda.is_available() # True on Mac!
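This README does not describe how `sapphire.flash_attention` is implemented. As background only, the sketch below shows the general online-softmax idea that FlashAttention-style kernels rely on: keys and values are visited one block at a time while a running maximum and running sum are maintained, so the full row of attention scores is never stored. It is pure Python for a single query vector, and both function names are hypothetical.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q . k / sqrt(d)) weighted sum of values, one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def streaming_attention(q, keys, values, block=2):
    """Same result as naive_attention, computed block by block."""
    d = len(q)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -1.2, 0.5]
keys = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0], [3.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(streaming_attention(q, keys, values, block=2))
```

The two printed vectors should agree up to floating-point rounding, whatever block size is chosen.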
LLM Inference
from sapphire.llm import LlamaInference
# Load and run Llama on Mac Mini
model = LlamaInference("meta-llama/Llama-2-7b")
output = model.generate("The future of AI is", max_tokens=100)
print(output)
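Whether a model fits in memory depends largely on the precision of its weights. The arithmetic below is a rough sketch that assumes a nominal 7 billion parameters for a "7B" model and counts weights only; the KV cache and activations need additional memory. `weight_gigabytes` is a hypothetical helper, not a SAPPHIRE function.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gigabytes(n_params, dtype):
    """Approximate weight storage in GB (10**9 bytes) for a given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

N_PARAMS = 7e9  # assumption: nominal parameter count of a "7B" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gigabytes(N_PARAMS, dtype):.1f} GB")
```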
S-Fabric Clustering
Connect multiple Mac Minis for distributed compute:
from sapphire.sfabric import Cluster
# Create cluster over Thunderbolt 5
cluster = Cluster(["mac1:9999", "mac2:9999", "mac3:9999"])
cluster.connect()
# Distributed training (`gradients` is the array produced by your own training step)
cluster.allreduce(gradients)
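The README does not say which algorithm `cluster.allreduce` uses. To illustrate what an all-reduce does, the sketch below simulates a ring all-reduce in a single process: a reduce-scatter phase followed by an all-gather phase, after which every worker holds the element-wise sum. The ring layout and the `ring_allreduce` name are assumptions for illustration only.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers.

    Returns one list per worker; every list equals the element-wise sum.
    Single-process simulation, not SAPPHIRE's implementation.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    # Split the index range into n contiguous chunks, one per ring position.
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    data = [list(g) for g in worker_grads]

    # Phase 1, reduce-scatter: at each step worker r sends chunk (r - step)
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            lo, hi = bounds[(r - step) % n]
            sends.append(((r + 1) % n, lo, data[r][lo:hi]))
        for dst, lo, payload in sends:
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            lo, hi = bounds[(r + 1 - step) % n]
            sends.append(((r + 1) % n, lo, hi, data[r][lo:hi]))
        for dst, lo, hi, payload in sends:
            data[dst][lo:hi] = payload
    return data

grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
for worker, summed in enumerate(ring_allreduce(grads)):
    print(worker, summed)
```

In data-parallel training the summed gradients are usually divided by the number of workers afterwards to obtain the average.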
Architecture
SAPPHIRE Stack
├── Python API (numpy-compatible)
├── Native Library (159 C functions)
│   ├── SGEMM (cblas → AMX)
│   ├── Flash Attention V5
│   ├── Conv2D (cuDNN replacement)
│   ├── Quantization (INT8/INT4)
│   └── cuSOLVER (LU, QR, SVD, Cholesky)
├── Lariat Transpiler (CUDA → Sapphire)
└── S-Fabric RDMA (Multi-Mac clustering)
Benchmarks
Run the full benchmark suite:
python -m sapphire.benchmark
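How `sapphire.benchmark` takes its measurements is not documented here. A common way to obtain a GFLOPS figure for a matrix product is to divide the operation count (2·m·n·k) by the best wall-clock time over several runs. The sketch below shows that calculation with a small pure-Python workload as a stand-in; all function names are hypothetical.

```python
import time

def gemm_flops(m, n, k):
    """Operations in an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k

def best_time(fn, repeats=5):
    """Smallest wall-clock time in seconds over several runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def gflops(flops, seconds):
    """Throughput in billions of floating-point operations per second."""
    return flops / seconds / 1e9

def tiny_matmul(n=32):
    """Pure-Python stand-in workload; replace with the call you want to time."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 32
elapsed = best_time(lambda: tiny_matmul(n))
print(f"{n}x{n} matmul: {gflops(gemm_flops(n, n, n), elapsed):.4f} GFLOPS")
```

To time the Quick Start example instead, pass `lambda: sapphire.matmul(A, B)` to `best_time` and use `gemm_flops(4096, 4096, 4096)`. That product is about 137.4 GFLOP, so the 1.56 TFLOPS quoted above corresponds to roughly 88 ms per call.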
Key Features
- 159 Native Functions: Complete ML/AI operation coverage
- Flash Attention V5: Memory-efficient attention at 943 GFLOPS
- Zero-Copy UMA: Unified Memory Architecture exploitation
- Lariat CUDA Transpiler: Run CUDA code unchanged
- S-Fabric RDMA: Thunderbolt 5 multi-Mac clustering
- INT8 Quantization: 6.3 billion elements/second
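The list above gives only a throughput figure for INT8 quantization, not the scheme used. As one common possibility, the sketch below shows symmetric per-tensor quantization: values are divided by a single scale so that the largest magnitude maps to 127, then rounded to integers. The scheme and both function names are assumptions, not SAPPHIRE's documented behavior.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization. Returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # An all-zero (or empty) input gets scale 1.0 so division stays defined.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

weights = [0.82, -1.0, 0.031, 0.0, 0.4]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes)
print([round(r, 4) for r in restored])
```

Each restored value differs from the original by at most about half the scale, which is the rounding error of this scheme.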
NVIDIA Comparison
| Metric | Mac Mini + Sapphire | NVIDIA H100 |
|---|---|---|
| Cost | $599 | $30,000 |
| Power | 30W | 700W |
| TFLOPS/$ | 0.0026 | 0.002 |
| TFLOPS/W | 0.052 | 0.086 |
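Conclusion: On SGEMM throughput per dollar, Sapphire on a Mac Mini comes out ahead (0.0026 vs. 0.002 TFLOPS/$, about 1.3x), while the H100 leads on throughput per watt (0.086 vs. 0.052 TFLOPS/W).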
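The efficiency rows in the table follow directly from the SGEMM, cost, and power figures quoted in this README. The short calculation below reproduces them; the dictionary layout and the `efficiency` helper are illustrative only.

```python
# Figures as quoted in this README (SGEMM throughput, list price, power draw).
SYSTEMS = {
    "Mac Mini + Sapphire": {"tflops": 1.56, "cost_usd": 599, "watts": 30},
    "NVIDIA H100": {"tflops": 60.0, "cost_usd": 30_000, "watts": 700},
}

def efficiency(spec):
    """Return (TFLOPS per dollar, TFLOPS per watt) for one system."""
    return spec["tflops"] / spec["cost_usd"], spec["tflops"] / spec["watts"]

for name, spec in SYSTEMS.items():
    per_dollar, per_watt = efficiency(spec)
    print(f"{name}: {per_dollar:.4f} TFLOPS/$, {per_watt:.3f} TFLOPS/W")
```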
License
MIT License - Use freely, no NVIDIA required!
Credits
Built by Svector Corporation - Making AI accessible to everyone.
NVIDIA's monopoly is over. The future runs on Apple Silicon.
Built Distribution
File details
Details for the file sapphire_compute-1.0.0-py3-none-any.whl.
File metadata
- Download URL: sapphire_compute-1.0.0-py3-none-any.whl
- Upload date:
- Size: 212.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 02e261fe6c78a74b1190be3714d862a3de40773d1b7781371a167f349646168d |
| MD5 | e79fca6f3ed295001de4af025ef57f14 |
| BLAKE2b-256 | 81ddbc590b61b37e0c8303acf338683785e30f9d180dfecc3ef939751a825978 |
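A downloaded wheel can be checked against the SHA256 digest listed above before installing it. The sketch below uses only the standard library `hashlib` module; `sha256_of` and `matches` are hypothetical helper names, and the file path in the usage note assumes the wheel was saved to the current directory.

```python
import hashlib

# SHA256 digest as listed in the file hashes table.
EXPECTED_SHA256 = "02e261fe6c78a74b1190be3714d862a3de40773d1b7781371a167f349646168d"

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("sapphire_compute-1.0.0-py3-none-any.whl", EXPECTED_SHA256)` should return `True` for an intact download.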