Foundational Metal Linear Algebra Primitives for PyTorch
metalcore
High-performance Metal-accelerated linear algebra and training operations for PyTorch on Apple Silicon.
Overview
metalcore provides optimized custom Metal kernels for PyTorch on macOS, bypassing generic MPS fallbacks for significantly faster computation.
Installation
pip install metalcore
Key Features
Linear Algebra
- SVD: Jacobi algorithm, 25x faster for LLM weight matrices
- QR: Blocked Householder, 20x faster batched
- Eigh: Symmetric eigendecomposition, 3.5x faster
- Cholesky: MAGMA-style, 33x faster batched
- Solve: LU-based, 10x faster batched (fp16/bf16 supported)
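The SVD entry above names the Jacobi algorithm. As a rough illustration of the idea only (this is not metalcore's Metal kernel), the pure-Python sketch below runs one-sided Jacobi sweeps: it rotates pairs of columns until they are mutually orthogonal, at which point the column norms are the singular values. The function name `jacobi_singular_values` and its parameters are hypothetical.

```python
import math

def jacobi_singular_values(a, sweeps=30, tol=1e-12):
    """Singular values of a (list of rows) via one-sided Jacobi rotations."""
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # work on a copy
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[i][p] ** 2 for i in range(m))
                beta = sum(u[i][q] ** 2 for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                # Columns p and q are already orthogonal enough: skip.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break  # a full sweep with no rotations means convergence
    norms = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)
```

Each column pair can be rotated independently of disjoint pairs, which is one reason Jacobi-style methods are often considered a good fit for GPUs.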
Training Ops
- RMSNorm (MetalRMSNorm): ~1.5x faster than PyTorch
- AdamW (MetalAdamW): 2.4x faster optimizer
- SiLU (metal_silu): 1.1x faster
- EmbeddingBag: 6x faster (avoids CPU fallback)
- LayerNorm, Softmax: Fused implementations
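For readers who want to check outputs against a known definition, here is a minimal pure-Python sketch of what RMSNorm and a single AdamW step compute. It follows the standard formulas (the same ones PyTorch documents) and says nothing about how metalcore's kernels are implemented. `rms_norm` and `adamw_step` are hypothetical helper names.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i for one feature vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=1e-2):
    """One AdamW update for a scalar parameter p with gradient g.

    m and v are the running first and second moments, t is the 1-based
    step count. Returns the updated (p, m, v).
    """
    b1, b2 = betas
    p = p * (1.0 - lr * weight_decay)   # decoupled weight decay
    m = b1 * m + (1.0 - b1) * g         # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g     # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)         # bias correction
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

Comparing a few vectors against these references is a quick sanity check when swapping in an accelerated op.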
RoPE (NEW in v0.1.14)
- apply_rotary_pos_emb: Metal-accelerated rotary embeddings (3.4x faster)
- RotaryEmbedding: Drop-in HuggingFace replacement module
- patch_transformers_rope: Auto-patches Llama/Mistral/Qwen models
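As a reference for what a rotary embedding computes, the sketch below applies RoPE to a single vector in pure Python, assuming the "rotate half" convention used by HuggingFace Llama-style models. It is illustrative only; `apply_rope_reference` is a hypothetical name and is not part of metalcore.

```python
import math

def rotate_half(x):
    """Map [x1, x2] (two halves) to [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def apply_rope_reference(x, position, base=10000.0):
    """Rotate one head vector x (even length) for the given token position."""
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    # Frequency i rotates the pair (x[i], x[i + half]).
    angles = [position * base ** (-2.0 * i / d) for i in range(half)]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rot, cos, sin)]
```

Because each pair is only rotated, the vector's norm is unchanged, and position 0 leaves the input as it is.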
INT4 Quantization
- Hybrid approach (recommended): Int4Linear.from_float(linear, dequant_on_load=True)
  - Store as INT4 (7x disk compression), dequant to FP16 at load → 0.6ms matmul
- GGML block_q4_0 (llama.cpp compatible): quantize_ggml_q4_0, matmul_ggml_q4_0
  - Ported from llama.cpp using simdgroup_multiply_accumulate
  - 4-15x overhead vs FP16 (36x faster than naive)
- Enables larger models: 7B→3.5GB, 70B→35GB
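To make the block format concrete, here is a pure-Python sketch of Q4_0-style quantization for one block of 32 weights, modeled on llama.cpp's reference layout as I understand it: a 2-byte FP16 scale followed by 16 bytes of packed 4-bit codes. Treat it as an assumption-laden illustration, not metalcore's `quantize_ggml_q4_0`; the helper names are hypothetical.

```python
import struct

QK = 32  # weights per block

def quantize_q4_0_block(values):
    """Pack 32 floats into 18 bytes: FP16 scale + 16 bytes of 4-bit codes."""
    assert len(values) == QK
    vmax = max(values, key=abs)   # signed value with the largest magnitude
    d = vmax / -8.0               # scale chosen so vmax maps to code 0
    inv = 1.0 / d if d else 0.0
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    half = QK // 2
    # Byte j holds weight j in the low nibble and weight j + 16 in the high one.
    packed = bytes(codes[j] | (codes[j + half] << 4) for j in range(half))
    return struct.pack("<e", d) + packed

def dequantize_q4_0_block(block):
    """Inverse of quantize_q4_0_block; returns 32 floats."""
    (d,) = struct.unpack("<e", block[:2])
    qs = block[2:]
    low = [((b & 0x0F) - 8) * d for b in qs]
    high = [((b >> 4) - 8) * d for b in qs]
    return low + high
```

At 18 bytes per 32 weights this layout costs 4.5 bits per weight, which is where the bulk of the size reduction relative to FP16 comes from.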
PyTorch Integration
import metalcore
# Automatically accelerate F.silu, F.gelu, F.embedding_bag, torch.linalg.svd/qr
metalcore.enable_pytorch_overrides()
# Works seamlessly with HuggingFace models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...", device_map="mps")
# Optional: Also patch RMSNorm and RoPE modules
metalcore.patch_transformers_rmsnorm(model)
metalcore.patch_transformers_rope(model)
Quick Start
import torch
import metalcore
device = 'mps'
# SVD
A = torch.randn(100, 50, device=device)
U, S, V = metalcore.svd(A)
# Batched QR
B = torch.randn(100, 16, 16, device=device)
Q, R = metalcore.qr(B)
# Linear Solve (fp16/bf16 supported)
A = torch.randn(100, 32, 32, device=device)
b = torch.randn(100, 32, device=device)
x = metalcore.solve(A, b)
# Training Ops
from metalcore import MetalRMSNorm, MetalAdamW, metal_gelu
norm = MetalRMSNorm(512).to(device)
x = torch.randn(32, 128, 512, device=device)
y = norm(x)
model = torch.nn.Linear(512, 256).to(device)
optimizer = MetalAdamW(model.parameters(), lr=1e-3)
y = metal_gelu(x)
Performance Highlights
| Operation | Speedup |
|---|---|
| RMSNorm | ~1.5x |
| EmbeddingBag | 6x (vs CPU fallback) |
| AdamW | 2.4x |
| RoPE | 3.4x |
| SiLU | 1.1x |
| QR Batched | up to 20x |
| SVD (large) | up to 12x |
| Fused MLP Bwd | 5-6x (vs Autograd) |
| Fused Attn Bwd | Parity with FP16 |
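Speedups like those in the table depend on hardware, shapes, and dtypes, so it can be worth measuring them locally. GPU work is queued asynchronously, so a fair timing needs a synchronization point before the clock is read. The helper below is a minimal sketch using only the standard library; `time_op` is a hypothetical name, and the commented usage assumes a machine with MPS available.

```python
import time

def time_op(fn, sync=None, warmup=3, iters=20):
    """Return mean seconds per call of fn().

    sync, if given, is called to wait for queued device work to finish
    (for example torch.mps.synchronize on Apple Silicon).
    """
    for _ in range(warmup):   # warm caches and compile kernels first
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()                # make sure all timed work has completed
    return (time.perf_counter() - start) / iters

# Hypothetical usage on an MPS machine:
#   import torch, metalcore
#   B = torch.randn(100, 16, 16, device="mps")
#   base = time_op(lambda: torch.linalg.qr(B), sync=torch.mps.synchronize)
#   fast = time_op(lambda: metalcore.qr(B), sync=torch.mps.synchronize)
#   print(f"speedup: {base / fast:.1f}x")
```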
Requirements
- macOS 12.0+ with Apple Silicon (M1/M2/M3/M4)
- Python 3.9 - 3.14
- PyTorch 2.0+
Note: M3/M4 chips recommended for best bf16 performance. The library gracefully falls back to FP32 on older hardware.
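Code that also has to run on machines outside these requirements (CI runners, Linux servers) can guard the import. The sketch below is one possible pattern using standard-library checks plus `torch.backends.mps.is_available()`; the helper names `metalcore_usable` and `enable_if_supported` are hypothetical.

```python
import importlib.util
import platform

def metalcore_usable():
    """True if this looks like Apple Silicon macOS with torch and metalcore installed."""
    on_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    have_pkgs = all(importlib.util.find_spec(name) is not None
                    for name in ("torch", "metalcore"))
    return on_apple_silicon and have_pkgs

def enable_if_supported():
    """Enable metalcore's PyTorch overrides when possible; report whether it happened."""
    if not metalcore_usable():
        return False
    import torch
    import metalcore
    if not torch.backends.mps.is_available():
        return False
    metalcore.enable_pytorch_overrides()
    return True
```

Calling `enable_if_supported()` once at startup leaves stock PyTorch behavior untouched wherever the accelerated path is unavailable.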
License
MIT