Foundational Metal Linear Algebra Primitives for PyTorch

metalcore

High-performance Metal-accelerated linear algebra and training operations for PyTorch on Apple Silicon.

Overview

metalcore provides optimized custom Metal kernels for PyTorch on macOS, bypassing generic MPS fallbacks for significantly faster computation.

Installation

pip install metalcore

Key Features

Linear Algebra

  • SVD: Jacobi algorithm, 25x faster for LLM weight matrices
  • QR: Blocked Householder, 20x faster batched
  • Eigh: Symmetric eigendecomposition, 3.5x faster
  • Cholesky: MAGMA-style, 33x faster batched
  • Solve: LU-based, 10x faster batched (fp16/bf16 supported)
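To give a feel for the Jacobi approach named in the SVD bullet, here is a minimal pure-Python sketch of the textbook one-sided Jacobi method for singular values. It is not metalcore's Metal kernel; the function name `jacobi_singular_values` and its parameters are hypothetical and chosen only for illustration.

```python
import math

def jacobi_singular_values(a, sweeps=30, tol=1e-12):
    """Singular values of a small dense matrix via one-sided Jacobi rotations."""
    a = [row[:] for row in a]  # work on a copy
    m, n = len(a), len(a[0])
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Squared norms and dot product of columns p and q.
                alpha = sum(a[k][p] ** 2 for k in range(m))
                beta = sum(a[k][q] ** 2 for k in range(m))
                gamma = sum(a[k][p] * a[k][q] for k in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns are already orthogonal
                rotated = True
                # Rotation angle that makes the two columns orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for k in range(m):
                    x, y = a[k][p], a[k][q]
                    a[k][p] = c * x - s * y
                    a[k][q] = s * x + c * y
        if not rotated:
            break  # converged: every column pair is orthogonal
    # Once the columns are orthogonal, their norms are the singular values.
    norms = (math.sqrt(sum(a[k][j] ** 2 for k in range(m))) for j in range(n))
    return sorted(norms, reverse=True)
```

Each column pair is processed independently within a sweep, which is one reason Jacobi-style methods are often considered a good fit for GPUs.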

Training Ops

  • RMSNorm (MetalRMSNorm): ~1.5x faster than PyTorch
  • AdamW (MetalAdamW): 2.4x faster optimizer
  • SiLU (metal_silu): 1.1x faster
  • EmbeddingBag: 6x faster (avoids CPU fallback)
  • LayerNorm, Softmax: Fused implementations
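As a reference for what an RMSNorm layer computes, the following pure-Python sketch applies the standard formula to a single vector. It is an illustration of the math only, not the fused Metal implementation; the name `rms_norm` and the default `eps` are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With a weight of all ones and `eps=0`, the output has a mean square of exactly one, which is a convenient sanity check when comparing implementations.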

RoPE (NEW in v0.1.14)

  • apply_rotary_pos_emb: Metal-accelerated rotary embeddings (3.4x faster)
  • RotaryEmbedding: Drop-in HuggingFace replacement module
  • patch_transformers_rope: Auto-patches Llama/Mistral/Qwen models
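For readers new to rotary embeddings, this is a small pure-Python sketch of the "rotate half" formulation commonly used in HuggingFace-style models, applied to one vector at one position. The function name `rope_rotate_half` and the `base` default are illustrative assumptions and say nothing about metalcore's actual signature.

```python
import math

def rope_rotate_half(x, position, base=10000.0):
    """Rotate pairs (x[i], x[i + d/2]) by position-dependent angles."""
    d = len(x)
    half = d // 2
    inv_freq = [base ** (-2.0 * i / d) for i in range(half)]
    angles = [position * f for f in inv_freq] * 2   # same angles for both halves
    rotated = [-v for v in x[half:]] + x[:half]     # rotate_half(x)
    return [xi * math.cos(a) + ri * math.sin(a)
            for xi, ri, a in zip(x, rotated, angles)]
```

Because each pair undergoes a plain 2D rotation, the vector norm is unchanged and position 0 leaves the input as it was.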

INT4 Quantization

  • Hybrid approach (recommended): Int4Linear.from_float(linear, dequant_on_load=True)
    • Stores weights as INT4 (7x disk compression) and dequantizes to FP16 at load time → 0.6ms matmul
  • GGML block_q4_0 (llama.cpp compatible): quantize_ggml_q4_0, matmul_ggml_q4_0
    • Ported from llama.cpp using simdgroup_multiply_accumulate
    • 4-15x overhead vs FP16 matmul (36x faster than a naive implementation)
    • Enables larger models: 7B→3.5GB, 70B→35GB
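The block format mentioned above can be sketched in pure Python. This follows the q4_0 layout as commonly described for llama.cpp (32 weights per block, one fp16 scale, sixteen bytes of packed 4-bit codes); treat the details as an assumption rather than a statement about metalcore's kernels. The helper names are hypothetical.

```python
import struct

QK4_0 = 32  # weights per block in the assumed q4_0 layout

def quantize_q4_0_block(values):
    """Pack 32 floats into 18 bytes: one fp16 scale plus 16 bytes of 4-bit codes."""
    assert len(values) == QK4_0
    # The scale comes from the element with the largest magnitude, sign kept.
    extreme = max(values, key=abs)
    d = extreme / -8.0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    codes = [min(15, int(v * inv_d + 8.5)) for v in values]
    half = QK4_0 // 2
    # Low nibble holds element j, high nibble holds element j + 16.
    packed = bytes(codes[j] | (codes[j + half] << 4) for j in range(half))
    return struct.pack("<e", d) + packed

def dequantize_q4_0_block(block):
    """Recover approximate floats from an 18-byte block."""
    (d,) = struct.unpack("<e", block[:2])
    low = [(b & 0x0F) - 8 for b in block[2:]]
    high = [(b >> 4) - 8 for b in block[2:]]
    return [q * d for q in low + high]
```

Under this assumed layout each block of 32 weights occupies 18 bytes, and the reconstruction error per weight is bounded by roughly one quantization step.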

PyTorch Integration

import metalcore

# Automatically accelerate F.silu, F.gelu, F.embedding_bag, torch.linalg.svd/qr
metalcore.enable_pytorch_overrides()

# Works seamlessly with HuggingFace models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...", device_map="mps")

# Optional: Also patch RMSNorm and RoPE modules
metalcore.patch_transformers_rmsnorm(model)
metalcore.patch_transformers_rope(model)
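
The override call above replaces library functions at runtime. How metalcore does this internally is not documented here, so the following is only a generic sketch of the pattern: save the original function, install a dispatcher that takes a fast path when it can, and fall back otherwise. Every name in it (`F`, `fast_silu`, `enable_overrides`, `disable_overrides`) is a hypothetical stand-in.

```python
import math
import types

# Stand-in for a library namespace such as torch.nn.functional (hypothetical).
F = types.SimpleNamespace(silu=lambda x: x / (1.0 + math.exp(-x)))
_originals = {}

def fast_silu(x):
    """Placeholder for an accelerated kernel; here it only accepts floats."""
    return x * (1.0 / (1.0 + math.exp(-x)))

def enable_overrides():
    """Install a dispatcher in place of F.silu, keeping the original for fallback."""
    if "silu" in _originals:
        return  # already enabled
    original = F.silu
    _originals["silu"] = original

    def dispatch(x):
        if isinstance(x, float):
            return fast_silu(x)   # fast path
        return original(x)        # unsupported input: fall back

    F.silu = dispatch

def disable_overrides():
    """Restore the original function."""
    if "silu" in _originals:
        F.silu = _originals.pop("silu")
```

Keeping the original around makes the patch reversible and gives unsupported inputs a safe fallback.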

Quick Start

import torch
import metalcore

device = 'mps'

# SVD
A = torch.randn(100, 50, device=device)
U, S, V = metalcore.svd(A)

# Batched QR
B = torch.randn(100, 16, 16, device=device)
Q, R = metalcore.qr(B)

# Linear Solve (fp16/bf16 supported)
A = torch.randn(100, 32, 32, device=device)
b = torch.randn(100, 32, device=device)
x = metalcore.solve(A, b)

# Training Ops
from metalcore import MetalRMSNorm, MetalAdamW, metal_gelu

norm = MetalRMSNorm(512).to(device)
x = torch.randn(32, 128, 512, device=device)
y = norm(x)

model = torch.nn.Linear(512, 256).to(device)
optimizer = MetalAdamW(model.parameters(), lr=1e-3)

y = metal_gelu(x)
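
The optimizer constructed in the Quick Start performs an AdamW update. As a reference for the math, here is a scalar pure-Python sketch of one step with decoupled weight decay, following the standard formulation. The function name `adamw_step` and its defaults are assumptions for illustration and are not taken from metalcore.

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter; step counts from 1."""
    param = param * (1.0 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)              # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

A fused kernel would typically perform all of these element-wise operations in a single pass over the parameters, which is where a speedup over separate tensor operations can come from.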

Performance Highlights

Operation        Speedup
RMSNorm          ~1.5x
EmbeddingBag     6x (vs CPU fallback)
AdamW            2.4x
RoPE             3.4x
SiLU             1.1x
QR Batched       up to 20x
SVD (large)      up to 12x
Fused MLP Bwd    5-6x (vs Autograd)
Fused Attn Bwd   Parity with FP16
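
Speedup figures like these depend on how timing is done. The sketch below shows one way to measure a speedup: warm up, take the median of several runs, and synchronize the device before reading the clock. The helper names are hypothetical; on an MPS device, `torch.mps.synchronize` would be a natural candidate for the `sync` argument, since GPU work is launched asynchronously.

```python
import statistics
import time

def median_seconds(fn, warmup=3, repeats=10, sync=None):
    """Median wall-clock time of fn(), with optional device synchronization."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_fn, candidate_fn, **kwargs):
    """Ratio of baseline time to candidate time; above 1 means the candidate is faster."""
    return median_seconds(baseline_fn, **kwargs) / median_seconds(candidate_fn, **kwargs)
```

Without the synchronization step, a timer may only capture kernel launch overhead and report misleadingly large speedups.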

Requirements

  • macOS 12.0+ with Apple Silicon (M1/M2/M3/M4); the prebuilt 0.1.14 wheels are tagged for macOS 15.0+ (macosx_15_0_arm64)
  • Python 3.9 - 3.14
  • PyTorch 2.0+

Note: M3/M4 chips recommended for best bf16 performance. The library gracefully falls back to FP32 on older hardware.
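
The requirements above can be checked before installing. This is a minimal sketch using only the standard library; the function names are hypothetical, and the thresholds simply mirror the list above.

```python
import platform
import sys

def unmet_requirements(system, machine, macos_release, python_version):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("requires macOS on Apple Silicon (arm64)")
    else:
        parts = [int(p) for p in macos_release.split(".")[:2] if p.isdigit()]
        major_minor = tuple((parts + [0, 0])[:2])
        if major_minor < (12, 0):
            problems.append("requires macOS 12.0 or newer")
    if not (3, 9) <= tuple(python_version[:2]) <= (3, 14):
        problems.append("requires Python 3.9 - 3.14")
    return problems

def current_unmet_requirements():
    """Run the checks against the machine this script is executing on."""
    return unmet_requirements(platform.system(), platform.machine(),
                              platform.mac_ver()[0], sys.version_info)
```

Separating the pure check from the platform lookup keeps the logic easy to test on any machine.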

Author

Kris Bailey

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • metalcore-0.1.14-cp314-cp314-macosx_15_0_arm64.whl (1.2 MB): CPython 3.14, macOS 15.0+ ARM64
  • metalcore-0.1.14-cp313-cp313-macosx_15_0_arm64.whl (1.2 MB): CPython 3.13, macOS 15.0+ ARM64
  • metalcore-0.1.14-cp312-cp312-macosx_15_0_arm64.whl (1.2 MB): CPython 3.12, macOS 15.0+ ARM64
  • metalcore-0.1.14-cp311-cp311-macosx_15_0_arm64.whl (1.2 MB): CPython 3.11, macOS 15.0+ ARM64
  • metalcore-0.1.14-cp310-cp310-macosx_15_0_arm64.whl (1.2 MB): CPython 3.10, macOS 15.0+ ARM64
  • metalcore-0.1.14-cp39-cp39-macosx_15_0_arm64.whl (1.2 MB): CPython 3.9, macOS 15.0+ ARM64

File details

File hashes

Hashes for metalcore-0.1.14-cp314-cp314-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 efb0db2b3fff898143be6198c9c41bdcf05d19f0fb8fbd2610ab19cf144066b6
MD5 9b358ec9fae84d61cc10c65067c3d823
BLAKE2b-256 80383e2b84a76d2ddb6c31e6728fa08d1c730d7760602b7fee5f8b1e75f6b5ab

Hashes for metalcore-0.1.14-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 51b08ccaaab03341439dd70c03fe17dcfeaa23f68da8a2733742ca53a435bdc3
MD5 434fd8e876461d75f2802a98f31d6f65
BLAKE2b-256 2854d7fe27d51bfa307d0f707a2ee2b13bb6690274bc98f6f024f815806e27b9

Hashes for metalcore-0.1.14-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 d55370cbe73329d801cdfe5776a84701a9231d5ca7cc88f88f9de860b57848a8
MD5 f530f46afdcbf3416b3bcedbb4a1886f
BLAKE2b-256 28b34ca0608a916948af82c65be27da17b4f49e251339cf44dedb1c29a871bb2

Hashes for metalcore-0.1.14-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 03b3cdd40ff5b79c2ea4f492d6d34a6ce5fa949daac64baa7d92182e906f1636
MD5 802b970cb506e07d0b0e97ca01705118
BLAKE2b-256 840540d84c545944f33f5d355ea8477ddd94fa2a7b965a35dd3349f3607a1349

Hashes for metalcore-0.1.14-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 6c060e00abece8605d0ea456c604b34ce3628cf2090a9bb37555c6d039eb1d08
MD5 ab3402eece30112bb7c08858232fd0ba
BLAKE2b-256 1f9c265e02e0cea4a952f9623b667087c25bf2a4ed0fca4cd2dd97ee15f917bd

Hashes for metalcore-0.1.14-cp39-cp39-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 15ac1c7bd965f093aad09cb439ba7ee84bcee3ff01d5feaa4dafcad44dbcba3c
MD5 da34089b5f0b596152e018119d516c71
BLAKE2b-256 af9de7da97efabd4501540584cdb0ede3dd109c4d45e61e0df79f20b2eeebd59
