Foundational Metal Linear Algebra Primitives for PyTorch

metalcore

High-performance Metal-accelerated linear algebra and training operations for PyTorch on Apple Silicon.

Overview

metalcore provides optimized custom Metal kernels for PyTorch on macOS, bypassing generic MPS fallbacks for significantly faster computation.

Installation

pip install metalcore

Key Features

Linear Algebra

  • SVD: Jacobi algorithm, 25x faster for LLM weight matrices
  • QR: Blocked Householder, 20x faster batched
  • Eigh: Symmetric eigendecomposition, 3.5x faster
  • Cholesky: MAGMA-style, 33x faster batched
  • Solve: LU-based, 10x faster batched (fp16/bf16 supported)
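The SVD entry above names the Jacobi algorithm. As a rough, dependency-free illustration of that idea (not metalcore's Metal kernel, whose internals are not described here), the sketch below applies one-sided (Hestenes) Jacobi rotations in plain Python until every pair of columns is orthogonal, then reads the singular values off the column norms. The function name and its parameters are hypothetical.

```python
import math


def jacobi_singular_values(A, max_sweeps=30, tol=1e-12):
    """Singular values of A (a list of rows) via one-sided Jacobi rotations."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]  # working copy; its columns get orthogonalized
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[i][p] * U[i][p] for i in range(m))
                beta = sum(U[i][q] * U[i][q] for i in range(m))
                gamma = sum(U[i][p] * U[i][q] for i in range(m))
                # Columns p and q are already (numerically) orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the two columns.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = U[i][p], U[i][q]
                    U[i][p] = c * up - s * uq
                    U[i][q] = s * up + c * uq
        if not rotated:
            break  # a full sweep without rotations means convergence
    norms = [math.sqrt(sum(U[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)


singular_values = jacobi_singular_values([[4.0, 0.0], [3.0, -5.0]])
```

Each column pair can be rotated independently of disjoint pairs, which is one reason Jacobi-style methods are often considered a good fit for GPUs.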

Training Ops

  • RMSNorm (MetalRMSNorm): ~1.5x faster than PyTorch
  • AdamW (MetalAdamW): 2.4x faster optimizer
  • SiLU (metal_silu): 1.1x faster
  • EmbeddingBag: 6x faster (avoids CPU fallback)
  • LayerNorm, Softmax: Fused implementations

RoPE

  • apply_rotary_pos_emb: Metal-accelerated rotary embeddings (3.4x faster)
  • RotaryEmbedding: Drop-in HuggingFace replacement module
  • patch_transformers_rope: Auto-patches Llama/Mistral/Qwen models
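As a reference for what a rotary position embedding does, the sketch below rotates one head vector in plain Python using the "rotate half" pairing found in HuggingFace Llama-style models (element i is paired with element i + d/2). It illustrates the math only; the function name, the single-vector signature, and `base=10000.0` are assumptions, and metalcore's `apply_rotary_pos_emb` signature is not shown in this description.

```python
import math


def rope_rotate(x, pos, base=10000.0):
    """Apply a rotary embedding to one vector x (even length) at position pos."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2.0 * i / d)   # per-pair rotation frequency
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]
# Attention scores depend only on the relative offset between positions.
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 5), rope_rotate(k, 3))
```

Because each pair is rotated by the same angle for a given position, the dot product between a rotated query and key depends only on their position difference.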

INT4 Quantization

  • Hybrid approach (recommended): Int4Linear.from_float(linear, dequant_on_load=True)
    • Store as INT4 (7x disk compression), dequant to FP16 at load → 0.6ms matmul
  • GGML block_q4_0 (llama.cpp compatible): quantize_ggml_q4_0, matmul_ggml_q4_0
    • Ported from llama.cpp using simdgroup_multiply_accumulate
    • 4-15x slower than FP16 matmul, but 36x faster than a naive implementation
    • Enables larger models: 7B→3.5GB, 70B→35GB
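To make the block_q4_0 idea concrete, here is a simplified plain-Python sketch of quantizing and dequantizing one block of 32 weights, following the reference scheme as I understand it from llama.cpp (one scale per block, 4-bit codes offset by 8). It is not metalcore's `quantize_ggml_q4_0`: the function names are hypothetical, the scale is kept as a Python float rather than FP16, and the codes are not packed two per byte.

```python
QK4_0 = 32  # weights per block


def quantize_q4_0_block(xs):
    """Return (scale, codes) for one block; codes are integers in [0, 15]."""
    assert len(xs) == QK4_0
    vmax = max(xs, key=abs)            # signed value with the largest magnitude
    d = vmax / -8.0                    # scale chosen so vmax maps to code 0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    codes = [min(15, int(x * inv_d + 8.5)) for x in xs]
    return d, codes


def dequantize_q4_0_block(d, codes):
    return [(c - 8) * d for c in codes]


def q4_0_bytes(n_weights):
    """Storage estimate: a 2-byte scale plus 16 bytes of nibbles per block."""
    return (n_weights // QK4_0) * 18


block = [(i - 16) / 4.0 for i in range(QK4_0)]
scale, codes = quantize_q4_0_block(block)
restored = dequantize_q4_0_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Under this layout each weight costs about 4.5 bits once the per-block scale is counted, so a size estimate from `q4_0_bytes` comes out somewhat above a pure 4-bits-per-weight figure.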

PyTorch Integration

import metalcore

# Automatically accelerate F.silu, F.embedding_bag, torch.linalg.svd/qr
# Also replaces torch.optim.AdamW -> MetalAdamW, torch.nn.RMSNorm -> MetalRMSNorm
metalcore.enable_pytorch_overrides()

# Works seamlessly with HuggingFace models
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...", device_map="mps")

# Optional: Also patch RMSNorm and RoPE modules
metalcore.patch_transformers_rmsnorm(model)
metalcore.patch_transformers_rope(model)
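How `enable_pytorch_overrides()` works internally is not described here. As a generic illustration of the override pattern, the sketch below wraps a function on a namespace so that an accelerated path is used only when a predicate says so, and the original is called otherwise. Everything in it (`install_override`, the stand-in `fake_F` namespace, `fast_silu`) is hypothetical and uses no PyTorch or metalcore calls.

```python
import functools
import math
import types


def install_override(namespace, name, accelerated, should_accelerate):
    """Replace namespace.<name> with a dispatching wrapper; return the original."""
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        if should_accelerate(*args, **kwargs):
            return accelerated(*args, **kwargs)
        return original(*args, **kwargs)   # fall back for unsupported inputs

    setattr(namespace, name, wrapper)
    return original


# Stand-in for a functional namespace such as torch.nn.functional.
fake_F = types.SimpleNamespace(
    silu=lambda x, device="cpu": x / (1.0 + math.exp(-x))
)

fast_path_calls = []


def fast_silu(x, device="cpu"):
    fast_path_calls.append(device)         # record that the fast path ran
    return x / (1.0 + math.exp(-x))


original_silu = install_override(
    fake_F, "silu", fast_silu, lambda x, device="cpu": device == "mps"
)
fake_F.silu(1.0, device="mps")   # routed to fast_silu
fake_F.silu(1.0)                 # routed to the original implementation
```

Keeping a reference to the original makes it possible to undo the patch later with `setattr(namespace, name, original)`.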

Quick Start

import torch
import metalcore

device = 'mps'

# SVD
A = torch.randn(100, 50, device=device)
U, S, V = metalcore.svd(A)

# Batched QR
B = torch.randn(100, 16, 16, device=device)
Q, R = metalcore.qr(B)

# Linear Solve (fp16/bf16 supported)
A = torch.randn(100, 32, 32, device=device)
b = torch.randn(100, 32, device=device)
x = metalcore.solve(A, b)

# Training Ops
from metalcore import MetalRMSNorm, MetalAdamW, metal_gelu

norm = MetalRMSNorm(512).to(device)
x = torch.randn(32, 128, 512, device=device)
y = norm(x)

model = torch.nn.Linear(512, 256).to(device)
optimizer = MetalAdamW(model.parameters(), lr=1e-3)

y = metal_gelu(x)
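The Quick Start hardcodes `device = 'mps'`. In scripts that may also run on machines without a Metal device, a small guard like the following sketch can choose the device first. The helper name `pick_device` is hypothetical; `torch.backends.mps.is_available()` is the standard PyTorch check.

```python
def pick_device():
    """Return "mps" when PyTorch reports a usable Metal device, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"   # PyTorch is not installed in this environment
    return "mps" if torch.backends.mps.is_available() else "cpu"


device = pick_device()
```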

Performance Highlights

Operation        Speedup
RMSNorm          ~1.5x
EmbeddingBag     6x (vs CPU fallback)
AdamW            2.4x
RoPE             3.4x
SiLU             1.1x
QR Batched       up to 20x
SVD (large)      up to 12x
Fused MLP Bwd    5-6x (vs Autograd)
Fused Attn Bwd   Parity with FP16
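Speedup figures like these depend on how timing is done, since GPU work is queued asynchronously. The sketch below is one hedged way to measure a speedup yourself: a small timing helper that accepts an optional synchronization callback (on MPS, `torch.mps.synchronize` would be the natural choice). The helper names are hypothetical, and the benchmark methodology behind the numbers above is not stated here.

```python
import time


def bench(fn, sync=None, warmup=3, iters=20):
    """Median wall-clock seconds per call of fn()."""
    def run_once():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()   # wait for queued device work before stopping the clock
        return time.perf_counter() - start

    for _ in range(warmup):
        run_once()   # warm-up runs absorb compilation and cache effects
    times = sorted(run_once() for _ in range(iters))
    return times[len(times) // 2]


def speedup(baseline_fn, candidate_fn, **kwargs):
    """Ratio of median baseline time to median candidate time."""
    return bench(baseline_fn, **kwargs) / bench(candidate_fn, **kwargs)


median_seconds = bench(lambda: sum(range(1000)), warmup=1, iters=5)
```

Using the median rather than the mean makes the result less sensitive to occasional slow iterations.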

Requirements

  • macOS 12.0+ with Apple Silicon (M1/M2/M3/M4); note that the prebuilt 0.1.16 wheels are tagged macosx_15_0 (macOS 15.0+)
  • Python 3.9 - 3.14
  • PyTorch 2.0+

Note: M3/M4 chips recommended for best bf16 performance. The library gracefully falls back to FP32 on older hardware.
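The requirements above can be checked up front with the standard library. The following is a minimal sketch with a hypothetical helper name; it only inspects the platform and Python version and does not check the PyTorch or macOS version.

```python
import platform
import sys


def check_requirements(system=None, machine=None, python_version=None):
    """Return a list of unmet requirements (an empty list means all passed)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    python_version = tuple(python_version or sys.version_info[:2])
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    if not ((3, 9) <= python_version <= (3, 14)):
        problems.append("Python 3.9 - 3.14 is required")
    return problems


problems = check_requirements()
```

The optional arguments exist only so the check can be exercised with explicit values instead of the current machine's.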

Author

Kris Bailey

License

MIT

Download files

Source Distribution

metalcore-0.1.16.tar.gz (2.3 MB), Source

Built Distributions

metalcore-0.1.16-cp314-cp314-macosx_15_0_arm64.whl (1.2 MB), CPython 3.14, macOS 15.0+ ARM64
metalcore-0.1.16-cp313-cp313-macosx_15_0_arm64.whl (1.2 MB), CPython 3.13, macOS 15.0+ ARM64
metalcore-0.1.16-cp312-cp312-macosx_15_0_arm64.whl (1.2 MB), CPython 3.12, macOS 15.0+ ARM64
metalcore-0.1.16-cp311-cp311-macosx_15_0_arm64.whl (1.2 MB), CPython 3.11, macOS 15.0+ ARM64
metalcore-0.1.16-cp310-cp310-macosx_15_0_arm64.whl (1.2 MB), CPython 3.10, macOS 15.0+ ARM64
metalcore-0.1.16-cp39-cp39-macosx_15_0_arm64.whl (1.2 MB), CPython 3.9, macOS 15.0+ ARM64

File details

Details for the file metalcore-0.1.16.tar.gz.

File metadata

  • Download URL: metalcore-0.1.16.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for metalcore-0.1.16.tar.gz
Algorithm Hash digest
SHA256 86ee276ffc2d47872e7f157f58b95f5e5e70cf1f2b69a7d610692fe629caecab
MD5 f5068de692a3101a4dbf7fefc5588990
BLAKE2b-256 c18bf26150e20f208c658829bbcd92001bdcb0cfa5cfb55a561e9dc924d9ce26
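A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names are hypothetical and the local file path is up to the reader.

```python
import hashlib

# SHA256 digest listed above for metalcore-0.1.16.tar.gz
EXPECTED_SHA256 = "86ee276ffc2d47872e7f157f58b95f5e5e70cf1f2b69a7d610692fe629caecab"


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `matches("metalcore-0.1.16.tar.gz", EXPECTED_SHA256)` should return True for an intact download of that file.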

File details

Details for the file metalcore-0.1.16-cp314-cp314-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp314-cp314-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 412ae0da642d83d6733bd5273e5c3a7b498d81421520cc3d144018265200d3a9
MD5 967873fdba7da42a5274337523bcb84a
BLAKE2b-256 4c444876b7ef59c7e467c43f76255088bd9175b8c2fc09c253285370104b68bc

File details

Details for the file metalcore-0.1.16-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 d1aab917c3a1fcc3d8961d2c8683a57595fd9da1fa00e8708fecd199625226e2
MD5 0bd74951550fa22997adc3d872c2b85f
BLAKE2b-256 ba255e2ea5a0f78d9ba717e9cee0cd5dd9a6c14252582e7b6cc026eb2463ef78

File details

Details for the file metalcore-0.1.16-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 6fff1c94f2c5ea9ec72410893fec2bdc5a907384911c1ed22bf16b29348c1d08
MD5 9615aca3876a4bb3d1a960740c54ea42
BLAKE2b-256 20c23fa58e856d4d6f4a55124403a1f7080376b637ce65f21b9fc19b4a9e3f1f

File details

Details for the file metalcore-0.1.16-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 ca012b74861aa68fdabff372163be12bf4f0d3f7500dbbb913eeb968c9a87c7f
MD5 fa3ee8442bb01a98fb3c9f7cc3dfb0d7
BLAKE2b-256 de101076e55f2793a46091c2e2c85d4ecafc918d46101214169125dac8ccea80

File details

Details for the file metalcore-0.1.16-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 5a608e8783cf16fde8bc6231adebce1552991c807a2dfcd8d0d72b20bde76b4e
MD5 b3f88fb3b53ba3e3427b6e1580a17633
BLAKE2b-256 4362148e34fafb6c93afbc6bb00c24f6ae8b9eb6079aaaede2be0c27f0ac6e6e

File details

Details for the file metalcore-0.1.16-cp39-cp39-macosx_15_0_arm64.whl.

File hashes

Hashes for metalcore-0.1.16-cp39-cp39-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 be2e7b43d3190cc6ec9f78031e2b58736afc554153021f0df3ce30f2bbb83e2c
MD5 e5dca136258d6a4f1c2d60ed5fcadc54
BLAKE2b-256 4802f872a429dbde9f881cf0dbe9211d1c6884edc5e2612f62b5d095ce7c1e6e
