Foundational Metal Linear Algebra Primitives for PyTorch

metalcore

High-performance Metal-accelerated linear algebra and training operations for PyTorch on Apple Silicon.

Overview

metalcore provides optimized custom Metal kernels for PyTorch on macOS, bypassing generic MPS fallbacks for significantly faster computation.

Installation

pip install metalcore

Key Features

Linear Algebra

  • SVD: Jacobi algorithm, 25x faster for LLM weight matrices
  • QR: Blocked Householder, 20x faster batched
  • Eigh: Symmetric eigendecomposition, 3.5x faster
  • Cholesky: MAGMA-style, 33x faster batched
  • Solve: LU-based, 10x faster batched (fp16/bf16 supported)
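
For readers unfamiliar with the Jacobi approach named in the SVD entry, the following is a minimal pure-Python sketch of a one-sided Jacobi SVD. It only illustrates the algorithm family and is not metalcore's Metal kernel; the function name `jacobi_svd` and its parameters `tol` and `max_sweeps` are assumptions made for this example.

```python
import math

def jacobi_svd(a, tol=1e-12, max_sweeps=60):
    """One-sided Jacobi SVD of an m x n matrix (list of rows, m >= n).

    Returns (U, S, V) such that a ~= U @ diag(S) @ V^T.
    Illustrative reference only, not the library's implementation.
    """
    m, n = len(a), len(a[0])
    # Store columns so each rotation touches two plain lists.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    v = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the data columns and to V.
                for mat in (cols, v):
                    cp, cq = mat[p], mat[q]
                    mat[p] = [c * x - s * y for x, y in zip(cp, cq)]
                    mat[q] = [s * x + c * y for x, y in zip(cp, cq)]
        if not rotated:
            break

    # Column norms are the singular values; sort them in descending order.
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    order = sorted(range(n), key=lambda j: -norms[j])
    S = [norms[j] for j in order]
    U = [[cols[j][i] / norms[j] if norms[j] > 0 else 0.0 for j in order]
         for i in range(m)]
    V = [[v[j][i] for j in order] for i in range(n)]
    return U, S, V

U, S, V = jacobi_svd([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each column pair can be rotated independently of the others within a sweep, which is why Jacobi methods tend to map well onto GPUs.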

Training Ops

  • RMSNorm (MetalRMSNorm): ~1.5x faster than PyTorch
  • AdamW (MetalAdamW): 2.4x faster optimizer
  • SiLU (metal_silu): 1.1x faster
  • EmbeddingBag: 6x faster (avoids CPU fallback)
  • LayerNorm, Softmax: Fused implementations
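
As a reference for what the RMSNorm and AdamW ops compute, here is a small pure-Python sketch of the standard formulas. It assumes the usual definitions (RMSNorm without mean subtraction, AdamW with decoupled weight decay); the helper names `rms_norm` and `adamw_step` are hypothetical and say nothing about how the Metal kernels are written.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def adamw_step(p, g, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p with gradient g.

    m and v are the running first and second moments; step starts at 1.
    """
    b1, b2 = betas
    p = p * (1.0 - lr * weight_decay)      # decoupled weight decay
    m = b1 * m + (1.0 - b1) * g            # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g        # second-moment estimate
    m_hat = m / (1.0 - b1 ** step)         # bias correction
    v_hat = v / (1.0 - b2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

y = rms_norm([3.0, 4.0], [1.0, 1.0])
p, m, v = adamw_step(1.0, 1.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.0)
```

Both ops are elementwise apart from one reduction, so a fused kernel can avoid several passes over memory.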

RoPE (NEW in v0.1.14)

  • apply_rotary_pos_emb: Metal-accelerated rotary embeddings (3.4x faster)
  • RotaryEmbedding: Drop-in HuggingFace replacement module
  • patch_transformers_rope: Auto-patches Llama/Mistral/Qwen models
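
The sketch below shows the rotary embedding math in plain Python for a single head vector. It assumes the "rotate-half" layout used by HuggingFace Llama-style models (dimension i is paired with dimension i + d/2) and a base of 10000; the function name `rope` is hypothetical and this is not the signature of `apply_rotary_pos_emb`.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one vector x at position pos.

    Uses the rotate-half pairing: element i rotates with element i + d/2.
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out

def dot(a, b):
    return sum(u * w for u, w in zip(a, b))

q = [0.1, -0.4, 0.7, 0.2]
k = [0.3, 0.5, -0.2, 0.9]
# Attention scores depend only on the relative offset between positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))
```

Because each pair is a pure rotation, vector norms are preserved and the query-key dot product depends only on the position difference.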

INT4 Quantization

  • Hybrid approach (recommended): Int4Linear.from_float(linear, dequant_on_load=True)
    • Store as INT4 (7x disk compression), dequant to FP16 at load → 0.6ms matmul
  • GGML block_q4_0 (llama.cpp compatible): quantize_ggml_q4_0, matmul_ggml_q4_0
    • Ported from llama.cpp using simdgroup_multiply_accumulate
    • 4-15x overhead vs FP16 (36x faster than naive)
    • Enables larger models: 7B→3.5GB, 70B→35GB
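
To make the block_q4_0 layout concrete, here is a pure-Python sketch of quantizing and dequantizing one 32-value block in the way llama.cpp's reference code does: one fp16 scale followed by 16 bytes holding two 4-bit values each. The helper names below are hypothetical and are not the `quantize_ggml_q4_0` API; treat this as an illustration of the format only.

```python
import struct

QK4_0 = 32  # values per block

def quantize_block_q4_0(x):
    """Pack 32 floats into 18 bytes: fp16 scale + 16 bytes of nibbles."""
    assert len(x) == QK4_0
    vmax = max(x, key=abs)            # largest-magnitude value, sign kept
    d = vmax / -8.0                   # scale so vmax maps to quant level 0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    qs = bytearray(QK4_0 // 2)
    for j in range(QK4_0 // 2):
        # First half of the block goes to low nibbles, second half to high.
        q0 = min(15, int(x[j] * inv_d + 8.5))
        q1 = min(15, int(x[j + QK4_0 // 2] * inv_d + 8.5))
        qs[j] = q0 | (q1 << 4)
    return struct.pack("<e", d) + bytes(qs)

def dequantize_block_q4_0(block):
    """Inverse of quantize_block_q4_0 (lossy)."""
    (d,) = struct.unpack_from("<e", block, 0)
    qs = block[2:]
    out = [0.0] * QK4_0
    for j in range(QK4_0 // 2):
        out[j] = ((qs[j] & 0x0F) - 8) * d
        out[j + QK4_0 // 2] = ((qs[j] >> 4) - 8) * d
    return out

BITS_PER_WEIGHT = len(quantize_block_q4_0([0.0] * QK4_0)) * 8 / QK4_0  # 18 bytes per 32 weights
```

With this layout the per-block scale adds half a bit per weight on top of the 4-bit payload, so sizes computed from it come out somewhat above a pure 4-bit estimate.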

PyTorch Integration

import metalcore

# Automatically accelerate F.silu, F.embedding_bag, torch.linalg.svd/qr
# Also replaces torch.optim.AdamW -> MetalAdamW, torch.nn.RMSNorm -> MetalRMSNorm
metalcore.enable_pytorch_overrides()

# Works seamlessly with HuggingFace models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...", device_map="mps")

# Optional: Also patch RMSNorm and RoPE modules
metalcore.patch_transformers_rmsnorm(model)
metalcore.patch_transformers_rope(model)
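
For context, function-level overrides of this kind are commonly implemented by swapping an attribute on a module for a dispatcher that falls back to the original. The sketch below shows that general pattern on a stand-in namespace; `install_override`, `can_accelerate`, and `fake_functional` are hypothetical names, and this is an assumption about the pattern, not a description of what `enable_pytorch_overrides()` does internally.

```python
import types

def install_override(namespace, name, fast_impl, can_accelerate):
    """Replace namespace.<name> with a dispatcher that keeps a fallback."""
    original = getattr(namespace, name)

    def dispatcher(*args, **kwargs):
        # Use the fast path only for inputs it supports.
        if can_accelerate(*args, **kwargs):
            return fast_impl(*args, **kwargs)
        return original(*args, **kwargs)

    dispatcher.__wrapped__ = original   # keep a handle for restoring later
    setattr(namespace, name, dispatcher)
    return original

# Stand-in for a module such as torch.nn.functional (demo only).
fake_functional = types.SimpleNamespace(double=lambda x: x + x)
install_override(
    fake_functional,
    "double",
    fast_impl=lambda x: 2 * x,
    can_accelerate=lambda x: isinstance(x, int),
)
```

Keeping the original callable reachable makes it possible to undo the patch or to compare results against the stock implementation.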

Quick Start

import torch
import metalcore

device = 'mps'

# SVD
A = torch.randn(100, 50, device=device)
U, S, V = metalcore.svd(A)

# Batched QR
B = torch.randn(100, 16, 16, device=device)
Q, R = metalcore.qr(B)

# Linear Solve (fp16/bf16 supported)
A = torch.randn(100, 32, 32, device=device)
b = torch.randn(100, 32, device=device)
x = metalcore.solve(A, b)

# Training Ops
from metalcore import MetalRMSNorm, MetalAdamW, metal_gelu

norm = MetalRMSNorm(512).to(device)
x = torch.randn(32, 128, 512, device=device)
y = norm(x)

model = torch.nn.Linear(512, 256).to(device)
optimizer = MetalAdamW(model.parameters(), lr=1e-3)

y = metal_gelu(x)

Performance Highlights

Operation        Speedup
RMSNorm          ~1.5x
EmbeddingBag     6x (vs CPU fallback)
AdamW            2.4x
RoPE             3.4x
SiLU             1.1x
QR Batched       up to 20x
SVD (large)      up to 12x
Fused MLP Bwd    5-6x (vs Autograd)
Fused Attn Bwd   Parity with FP16
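
Speedups like these depend on hardware, shapes, and dtypes, so it can be worth measuring on your own machine. The helper below is a minimal timing sketch using only the standard library; the names `benchmark` and `speedup` are hypothetical. GPU work is asynchronous, so when timing MPS code pass a synchronization function such as `torch.mps.synchronize` as `sync`.

```python
import statistics
import time

def benchmark(fn, sync=lambda: None, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in seconds.

    sync is called after each run so queued GPU work is included in the
    timing (for MPS, pass torch.mps.synchronize).
    """
    for _ in range(warmup):
        fn()
    sync()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

elapsed = benchmark(lambda: sum(range(1000)))
```

Using the median rather than the mean reduces the influence of one-off stalls such as shader compilation on the first few calls.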

Requirements

  • macOS 12.0+ with Apple Silicon (M1/M2/M3/M4); note that the prebuilt 0.1.15 wheels are tagged macosx_15_0 (macOS 15.0+)
  • Python 3.9 - 3.14
  • PyTorch 2.0+

Note: M3/M4 chips are recommended for best bf16 performance. On older hardware the library falls back to FP32.
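
A quick way to check a machine against the requirements above is sketched below. The function name `check_environment` is hypothetical; the PyTorch check uses `torch.backends.mps.is_available()` and is skipped with a message when PyTorch is not installed.

```python
import platform
import sys

def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if platform.system() != "Darwin":
        problems.append("not running on macOS")
    else:
        release = platform.mac_ver()[0]
        major = int(release.split(".")[0]) if release else 0
        if major < 12:
            problems.append(f"macOS 12.0+ required, found {release or 'unknown'}")
    if platform.machine() != "arm64":
        problems.append("not an Apple Silicon (arm64) interpreter")
    if not (3, 9) <= sys.version_info[:2] <= (3, 14):
        problems.append("Python 3.9 - 3.14 required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not torch.backends.mps.is_available():
            problems.append("PyTorch MPS backend is not available")
    return problems

problems = check_environment()
```

An x86_64 Python running under Rosetta reports a non-arm64 machine type, which this check flags as a problem.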

Author

Kris Bailey

License

MIT

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • metalcore-0.1.15-cp314-cp314-macosx_15_0_arm64.whl (1.2 MB): CPython 3.14, macOS 15.0+ ARM64
  • metalcore-0.1.15-cp313-cp313-macosx_15_0_arm64.whl (1.2 MB): CPython 3.13, macOS 15.0+ ARM64
  • metalcore-0.1.15-cp312-cp312-macosx_15_0_arm64.whl (1.2 MB): CPython 3.12, macOS 15.0+ ARM64
  • metalcore-0.1.15-cp311-cp311-macosx_15_0_arm64.whl (1.2 MB): CPython 3.11, macOS 15.0+ ARM64
  • metalcore-0.1.15-cp310-cp310-macosx_15_0_arm64.whl (1.2 MB): CPython 3.10, macOS 15.0+ ARM64
  • metalcore-0.1.15-cp39-cp39-macosx_15_0_arm64.whl (1.2 MB): CPython 3.9, macOS 15.0+ ARM64

File details

Details for the file metalcore-0.1.15-cp314-cp314-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp314-cp314-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 0c22eb105ab1e90b6875c32205034ff157737cdd92edfa4e3b5ea16d0f92fe40
MD5 56df796e6b6119070153868a0db4058f
BLAKE2b-256 a3c3d73ddffb05c66040a31e04197a0cd4c163f16984e569e1aeaa98f57b817b

File details

Details for the file metalcore-0.1.15-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 6c243258c441e51911b530178bfcd37ef45574b95e1ba3f3ce1f8602119bd4dd
MD5 82747fcbee74b52e513d4a60894112f2
BLAKE2b-256 11a837bc4632e4f168fae179a76f2bbed2d8fe1885b0fdf3246ccc6c9cfb64a8

File details

Details for the file metalcore-0.1.15-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 aebeab519e5115e0ee55430a903103f6e7d595ef27cc8829318c78b2d0b08a6e
MD5 3abf11634259f28535dc648d1d1f5359
BLAKE2b-256 445b9ed252c69a9439111a5eb878ac4dad684b1a888fa63919502a131170ed42

File details

Details for the file metalcore-0.1.15-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 ffc9d61623f83daae7a6806b3a934a93a1c6b5c91c6be76daeba7b8b1b273a9e
MD5 348d86e63ae5ea0ad3c4a8f75b7e0152
BLAKE2b-256 83fa9ee27e72d4f96066e4f9af6ec9ed3f9316d14aa4ce740aac6751c3901511

File details

Details for the file metalcore-0.1.15-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 760402e4f921f68d6c1a69edfc683a9db2e1d366f37ad987abdce4e24236e702
MD5 6249f8bdb30990a0b0ac23e755383758
BLAKE2b-256 24545c8a9df21610f7bfe1276768b963649d397c2477154dbec77bfabb9792d7

File details

Details for the file metalcore-0.1.15-cp39-cp39-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.15-cp39-cp39-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 480cb1553811e7d1834caa9f98df652f38639ba9c75c6aa4e60fd15530a4c0e5
MD5 b3ea4b38541400d7c965eae85fde609f
BLAKE2b-256 ded1eee0dcebb9012a1ec16ad421bc5316a6a614ea5d863b72d9f2163266e953
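
The SHA256 digests listed above can be checked against a downloaded wheel with the standard library. This is a minimal sketch; the function names `sha256_of` and `verify_wheel` are hypothetical, and the path you pass is wherever you saved the file.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_wheel(path, expected_sha256):
    """True if the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For example, `verify_wheel(path_to_cp312_wheel, "aebeab519e5115e0ee55430a903103f6e7d595ef27cc8829318c78b2d0b08a6e")` should return True for an intact copy of the CPython 3.12 wheel.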
