Momentum-Aligned Gradient Masking — block-wise stochastic masking wrapper for PyTorch optimizers
Magma
Momentum-Aligned Gradient Masking for Adaptive Optimizers
Magma is a lightweight wrapper that applies block-wise stochastic masking to any PyTorch optimizer, modulated by the alignment between gradient momentum and the current gradient. It implements the algorithm described in "On Surprising Effectiveness of Masking Updates in Adaptive Optimizers" (arXiv 2602.15322).
The core insight is deceptively simple. At each step, a per-parameter Bernoulli coin flip decides whether to keep or discard the update. Updates that survive are further scaled by a smoothed cosine-similarity score between the gradient and its exponential moving average. The base optimizer's internal states (e.g., Adam's running means or RMSProp's squared gradients) are always updated; only the parameter itself is masked.
This acts as a form of implicit regularization, particularly effective under the heterogeneous curvature and heavy-tailed gradient noise characteristic of transformer training.
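Concretely, a surviving update is a convex blend of the pre- and post-step parameter values. A toy illustration with made-up numbers (not library code):

```python
import torch

theta_old = torch.tensor([1.0, 2.0])  # parameter before the step
theta_new = torch.tensor([0.8, 2.1])  # parameter after the base optimizer step
s = 0.7                               # smoothed alignment score in (0, 1)
m = 1.0                               # Bernoulli draw: 1 keeps, 0 reverts

theta = s * m * theta_new + (1 - s * m) * theta_old
# m = 1 -> tensor([0.8600, 2.0700]); m = 0 would leave theta_old untouched
```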
Installation
```
pip install magma-optimizer
```
Or directly from source:
```
pip install git+https://github.com/andrijdavid/magma-optimizer.git
```
Usage
Magma wraps any instantiated PyTorch optimizer. The interface mirrors what you already know.
```python
from magma import Magma
import torch

model = ...  # your model
base = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer = Magma(
    base,
    mask_prob=0.5,       # probability of keeping an update
    tau=2.0,             # temperature for the alignment sigmoid
    momentum_beta=0.9,   # EMA coefficient for gradient momentum
    alignment_ema=0.9,   # EMA coefficient for smoothing the alignment score
    exclude=set(model.embed.parameters()),  # skip masking on embeddings
)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```
The exclude parameter accepts a set of tensors that should bypass masking entirely. The paper recommends excluding embedding layers, as their update dynamics differ from those of attention and MLP blocks.
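If a model has several embedding tables, one convenient way to build the exclude set is to walk the module tree. A sketch, assuming `model` is an ordinary `torch.nn.Module` and `base` is the wrapped optimizer from the example above:

```python
import torch.nn as nn

# Gather every parameter owned by an nn.Embedding so Magma leaves it unmasked.
exclude = {
    p
    for module in model.modules()
    if isinstance(module, nn.Embedding)
    for p in module.parameters()
}
optimizer = Magma(base, exclude=exclude)
```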
Algorithm
The procedure, applied at each step to each non-excluded parameter:

- Update the momentum EMA: μ ← β·μ + (1 − β)·g
- Compute the raw alignment: s̃ = sigmoid(cos(μ, g) / τ)
- Smooth the alignment: s ← α·s + (1 − α)·s̃, with α = alignment_ema (0.9 by default, i.e. s = 0.9·s_prev + 0.1·s̃)
- Run the base optimizer step (all internal states update normally)
- Sample the mask: m ~ Bernoulli(p), with p = mask_prob
- Apply: θ ← (s·m)·θ_new + (1 − s·m)·θ_old
When the mask is zero, the parameter reverts to its pre-step value. When the mask is one, the update is scaled by the alignment score. The base optimizer sees every gradient regardless.
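For concreteness, here is a minimal sketch of one masked step on a single parameter tensor, written against the list above. The buffer names `mu` and `s` and the `base_step` callback are hypothetical, introduced only for illustration; this is not the package's actual internals.

```python
import torch
import torch.nn.functional as F

def magma_step(param, grad, base_step, mu, s,
               mask_prob=0.5, tau=2.0,
               momentum_beta=0.9, alignment_ema=0.9):
    """One masked update for a single tensor.

    base_step() applies the wrapped optimizer's update to param in place;
    mu (momentum EMA) and s (smoothed alignment, a 0-dim tensor) are
    per-parameter buffers carried across steps.
    """
    # 1. Momentum EMA of the gradient: mu <- beta*mu + (1 - beta)*g
    mu.mul_(momentum_beta).add_(grad, alpha=1 - momentum_beta)
    # 2. Raw alignment between momentum and the current gradient.
    s_tilde = torch.sigmoid(
        F.cosine_similarity(mu.flatten(), grad.flatten(), dim=0) / tau)
    # 3. Smooth the alignment score.
    s.mul_(alignment_ema).add_(s_tilde, alpha=1 - alignment_ema)
    # 4. Base optimizer step; its internal state always updates.
    theta_old = param.detach().clone()
    base_step()
    # 5. Bernoulli mask: keep the update with probability mask_prob.
    m = float(torch.rand(()) < mask_prob)
    # 6. Convex combination: revert when masked out, scale by s when kept.
    keep = s * m
    param.data.copy_(keep * param.data + (1 - keep) * theta_old)
```

Note that step 4 runs unconditionally: masking only affects the final parameter write, so the base optimizer's moment estimates stay in sync with the full gradient stream.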
Citation
```bibtex
@article{joo2026magma,
  title={On Surprising Effectiveness of Masking Updates in Adaptive Optimizers},
  author={Joo, Taejong and Xia, Wenhan and Kim, Cheolmin and Zhang, Ming and Ie, Eugene},
  journal={arXiv preprint arXiv:2602.15322},
  year={2026}
}
```
License
MIT