optimizer & lr scheduler & objective function collections in PyTorch
Project description
pytorch-optimizer
pytorch-optimizer is a production-focused optimization toolkit for PyTorch with 100+ optimizers, 10+ learning rate schedulers, and 10+ loss functions behind a consistent API.
Use it when you want fast experimentation with modern training methods without rewriting optimizer boilerplate.
Highly inspired by jettify/pytorch-optimizer.
Why pytorch-optimizer
- Broad optimizer coverage, including many recent research variants.
- Consistent loader APIs for optimizers, schedulers, and losses.
- Practical features such as foreach, Lookahead, and Gradient Centralization integrations.
- Tested and actively maintained codebase.
- Works with optional ecosystem integrations like bitsandbytes, q-galore-torch, and torchao.
Installation
Requirements:
- Python >= 3.8
- PyTorch >= 1.10
pip install pytorch-optimizer
Optional integrations are not installed by default:
- bitsandbytes: https://github.com/TimDettmers/bitsandbytes?tab=readme-ov-file#tldr
- q-galore-torch: https://github.com/VITA-Group/Q-GaLore?tab=readme-ov-file#install-q-galore-optimizer
- torchao: https://github.com/pytorch/ao?tab=readme-ov-file#installation
Quick Start
1) Use an optimizer class directly
from pytorch_optimizer import AdamP
model = YourModel()
optimizer = AdamP(model.parameters(), lr=1e-3)
2) Load by name
from pytorch_optimizer import load_optimizer
model = YourModel()
optimizer = load_optimizer('adamp')(model.parameters(), lr=1e-3)
3) Build with create_optimizer()
from pytorch_optimizer import create_optimizer
model = YourModel()
optimizer = create_optimizer(
model,
optimizer_name='adamp',
lr=1e-3,
weight_decay=1e-3,
use_gc=True,
use_lookahead=True,
)
4) Optional: load via torch.hub
import torch
model = YourModel()
opt_cls = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt_cls(model.parameters(), lr=1e-3)
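The `use_lookahead=True` flag in example 3 wraps the base optimizer in Lookahead ("k steps forward, 1 step back"): fast weights take k inner steps, then the slow weights interpolate toward them. A minimal pure-Python sketch of that mechanism on a scalar parameter (illustrative only, not the library's implementation; all names here are made up for the example):

```python
def lookahead_run(grad_fn, w0, lr=0.1, k=5, alpha=0.5, cycles=3):
    """Illustrative Lookahead loop: k fast SGD steps, then one slow step."""
    slow = fast = w0
    for _ in range(cycles):
        for _ in range(k):                 # k fast (inner) steps of plain SGD
            fast -= lr * grad_fn(fast)
        slow += alpha * (fast - slow)      # 1 slow step: interpolate toward fast
        fast = slow                        # reset fast weights to slow weights
    return slow

# Minimizing f(w) = w^2 (gradient 2w): the iterate moves toward 0.
w = lookahead_run(lambda w: 2.0 * w, w0=1.0)
```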
Discover Available Components
Optimizers
from pytorch_optimizer import get_supported_optimizers
all_optimizers = get_supported_optimizers()
adam_family = get_supported_optimizers('adam*')
selected = get_supported_optimizers(['adam*', 'ranger*'])
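The string filters accept shell-style wildcards. Assuming standard wildcard semantics, the matching behavior can be sketched with the standard library's `fnmatch` (the name list below is a small illustrative subset, not the real registry):

```python
from fnmatch import fnmatch

# Illustrative subset of optimizer names; the real list comes from
# get_supported_optimizers() and is far longer.
names = ['adamp', 'adamw', 'lamb', 'ranger', 'ranger21', 'sgdw']

def filter_names(names, patterns):
    """Keep every name that matches at least one wildcard pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return sorted(n for n in names if any(fnmatch(n, p) for p in patterns))

filter_names(names, 'adam*')               # ['adamp', 'adamw']
filter_names(names, ['adam*', 'ranger*'])  # ['adamp', 'adamw', 'ranger', 'ranger21']
```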
Learning Rate Schedulers
from pytorch_optimizer import get_supported_lr_schedulers
all_schedulers = get_supported_lr_schedulers()
cosine_like = get_supported_lr_schedulers('cosine*')
Loss Functions
from pytorch_optimizer import get_supported_loss_functions
all_losses = get_supported_loss_functions()
focal_related = get_supported_loss_functions('*focal*')
Supported Optimizers
You can check the supported optimizers with the code below.
from pytorch_optimizer import get_supported_optimizers
supported_optimizers = get_supported_optimizers()
or you can search them with one or more filters.
from pytorch_optimizer import get_supported_optimizers
get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']
get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
| Optimizer | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | paper(cite) |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | paper(cite) |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | paper(cite) |
| AdamD | Improved bias-correction in Adam | paper(cite) | |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | paper(cite) |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | paper(cite) |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | paper(cite) |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | paper(cite) |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | paper(cite) |
| Ranger21 | a synergistic deep learning optimizer | github | paper(cite) |
| Lamb | Large Batch Optimization for Deep Learning | github | paper(cite) |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | paper(cite) |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | paper(cite) |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | paper(cite) |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | paper(cite) |
| SAM | Sharpness-Aware Minimization | github | paper(cite) |
| ASAM | Adaptive Sharpness-Aware Minimization | github | paper(cite) |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | paper(cite) |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | paper(cite) |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | paper(cite) |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | paper(cite) |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | paper(cite) |
| Lion | Symbolic Discovery of Optimization Algorithms | github | paper(cite) |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | paper(cite) |
| SM3 | Memory-Efficient Adaptive Optimization | github | paper(cite) |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | paper(cite) |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | paper(cite) |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | paper(cite) |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | paper(cite) |
| SGDW | Decoupled Weight Decay Regularization | github | paper(cite) |
| ASGD | Adaptive Gradient Descent without Descent | github | paper(cite) |
| Yogi | Adaptive Methods for Nonconvex Optimization | paper(cite) | |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | paper(cite) | |
| Fromage | On the distance between two neural networks and the stability of learning | github | paper(cite) |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | paper(cite) |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | paper(cite) |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | paper(cite) |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | paper(cite) |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | paper(cite) |
| Gravity | a Kinematic Approach on Optimization in Deep Learning | github | paper(cite) |
| AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | paper(cite) | |
| SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | paper(cite) |
| AvaGrad | Domain-independent Dominance of Adaptive Methods | github | paper(cite) |
| PCGrad | Gradient Surgery for Multi-Task Learning | github | paper(cite) |
| AMSGrad | On the Convergence of Adam and Beyond | paper(cite) | |
| Lookahead | k steps forward, 1 step back | github | paper(cite) |
| PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | paper(cite) |
| GC | Gradient Centralization | github | paper(cite) |
| AGC | Adaptive Gradient Clipping | github | paper(cite) |
| Stable WD | Understanding and Scheduling Weight Decay | github | paper(cite) |
| Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | paper(cite) | |
| Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | paper(cite) | |
| Norm Loss | An efficient yet effective regularization method for deep neural networks | paper(cite) | |
| AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | paper(cite) |
| AdaDelta | An Adaptive Learning Rate Method | paper(cite) | |
| Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | paper(cite) |
| SignSGD | Compressed Optimisation for Non-Convex Problems | github | paper(cite) |
| Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | paper(cite) |
| Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | paper(cite) |
| PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | paper(cite) |
| LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | paper(cite) |
| AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | paper(cite) |
| Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | cite |
| CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | paper(cite) |
| WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | paper(cite) |
| Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | paper(cite) |
| GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | paper(cite) |
| Adalite | Adalite optimizer | github | paper(cite) |
| bSAM | SAM as an Optimal Relaxation of Bayes | github | paper(cite) |
| Schedule-Free | Schedule-Free Optimizers | github | paper(cite) |
| FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | paper(cite) |
| Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | paper(cite) |
| Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | paper(cite) |
| StableAdamW | Stable and low-precision training for large-scale vision-language models | paper(cite) | |
| AdamMini | Use Fewer Learning Rates To Gain More | github | paper(cite) |
| TRAC | Adaptive Parameter-free Optimization | github | paper(cite) |
| AdamG | Towards Stability of Parameter-free Optimization | paper(cite) | |
| AdEMAMix | Better, Faster, Older | github | paper(cite) |
| SOAP | Improving and Stabilizing Shampoo using Adam | github | paper(cite) |
| ADOPT | Modified Adam Can Converge with Any β2 with the Optimal Rate | github | paper(cite) |
| FTRL | Follow The Regularized Leader | paper | |
| Cautious | Improving Training with One Line of Code | github | paper(cite) |
| DeMo | Decoupled Momentum Optimization | github | paper(cite) |
| MicroAdam | Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | github | paper(cite) |
| Muon | MomentUm Orthogonalized by Newton-schulz | github | paper(cite) |
| LaProp | Separating Momentum and Adaptivity in Adam | github | paper(cite) |
| APOLLO | SGD-like Memory, AdamW-level Performance | github | paper(cite) |
| MARS | Unleashing the Power of Variance Reduction for Training Large Models | github | paper(cite) |
| SGDSaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | github | paper(cite) |
| Grams | Gradient Descent with Adaptive Momentum Scaling | paper(cite) | |
| OrthoGrad | Grokking at the Edge of Numerical Stability | github | paper(cite) |
| Adam-ATAN2 | Scaling Exponents Across Parameterizations and Optimizers | paper(cite) | |
| SPAM | Spike-Aware Adam with Momentum Reset for Stable LLM Training | github | paper(cite) |
| TAM | Torque-Aware Momentum | paper(cite) | |
| FOCUS | First Order Concentrated Updating Scheme | github | paper(cite) |
| PSGD | Preconditioned Stochastic Gradient Descent | github | paper(cite) |
| EXAdam | The Power of Adaptive Cross-Moments | github | paper(cite) |
| GCSAM | Gradient Centralized Sharpness Aware Minimization | github | paper(cite) |
| LookSAM | Towards Efficient and Scalable Sharpness-Aware Minimization | github | paper(cite) |
| SCION | Training Deep Learning Models with Norm-Constrained LMOs | github | paper(cite) |
| COSMOS | SOAP with Muon | github | |
| StableSPAM | How to Train in 4-Bit More Stably than 16-Bit Adam | github | paper |
| AdaGC | Improving Training Stability for Large Language Model Pretraining | paper(cite) | |
| Simplified-Ademamix | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | github | paper(cite) |
| Fira | Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | github | paper(cite) |
| RACS & Alice | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | paper(cite) | |
| VSGD | Variational Stochastic Gradient Descent for Deep Neural Networks | github | paper(cite) |
| SNSM | Subset-Norm and Subspace-Momentum: Faster Memory-Efficient Adaptive Optimization with Convergence Guarantees | github | paper(cite) |
| AdamC | Why Gradients Rapidly Increase Near the End of Training | paper(cite) | |
| AdaMuon | Adaptive Muon Optimizer | paper(cite) | |
| SPlus | A Stable Whitening Optimizer for Efficient Neural Network Training | github | paper(cite) |
| EmoNavi | An emotion-driven optimizer that feels loss and navigates accordingly | github | |
| Refined Schedule-Free | Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training | paper(cite) | |
| FriendlySAM | Friendly Sharpness-Aware Minimization | github | paper(cite) |
| AdaGO | AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates | paper(cite) | |
| Conda | Column-Normalized Adam for Training Large Language Models Faster | github | paper(cite) |
| BCOS | Stochastic Approximation with Block Coordinate Optimal Stepsizes | github | paper(cite) |
| Cautious WD | Cautious Weight Decay | paper(cite) | |
| Ano | Faster is Better in Noisy Landscape | github | paper(cite) |
| Spectral Sphere | Controlled LLM Training on Spectral Sphere | github | paper(cite) |
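Several entries above (GC, Ranger) use Gradient Centralization, which subtracts the per-output-channel mean from each weight gradient before the update. A pure-Python sketch on a 2-D gradient, with rows standing in for output channels (illustrative only; the library operates on PyTorch tensors of any rank):

```python
def centralize(grad_rows):
    """Subtract each row's mean from that row (rows = output channels)."""
    out = []
    for row in grad_rows:
        mean = sum(row) / len(row)
        out.append([g - mean for g in row])
    return out

g = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
gc = centralize(g)  # each row of gc now sums to 0
```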
Supported LR Schedulers
You can check the supported learning rate schedulers with the code below.
from pytorch_optimizer import get_supported_lr_schedulers
supported_lr_schedulers = get_supported_lr_schedulers()
or you can search them with one or more filters.
from pytorch_optimizer import get_supported_lr_schedulers
get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']
get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']
| LR Scheduler | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | paper(cite) | |
| Chebyshev | Acceleration via Fractal Learning Rate Schedules | paper(cite) | |
| REX | Revisiting Budgeted Training with an Improved Schedule | github | paper(cite) |
| WSD | Warmup-Stable-Decay learning rate scheduler | github | paper(cite) |
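As a rough illustration of the cosine-with-warmup family listed above, here is a sketch of the common schedule formula (an assumption-level sketch, not the library's exact implementation): the learning rate ramps linearly over `warmup_steps`, then follows a cosine decay from `max_lr` down to `min_lr`.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup followed by cosine decay (illustrative formula)."""
    if step < warmup_steps:                  # linear warmup phase
        return max_lr * (step + 1) / warmup_steps
    # cosine decay phase; progress runs from 0 to ~1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

lrs = [warmup_cosine_lr(s, total_steps=100, warmup_steps=10, max_lr=1e-3)
       for s in range(100)]
```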
Supported Loss Functions
You can check the supported loss functions with the code below.
from pytorch_optimizer import get_supported_loss_functions
supported_loss_functions = get_supported_loss_functions()
or you can search them with one or more filters.
from pytorch_optimizer import get_supported_loss_functions
get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
| Loss Functions | Description | Official Code | Paper(Citation) |
|---|---|---|---|
| Label Smoothing | Rethinking the Inception Architecture for Computer Vision | paper(cite) | |
| Focal | Focal Loss for Dense Object Detection | paper(cite) | |
| Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | paper(cite) | |
| LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | paper(cite) |
| Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | paper(cite) | |
| Bi-Tempered | The Principle of Unchanged Optimality in Reinforcement Learning Generalization | paper(cite) | |
| Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | paper(cite) | |
| Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | paper(cite) |
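The Focal loss listed above down-weights easy examples via FL(p_t) = -(1 - p_t)^gamma * log(p_t); with gamma = 0 it reduces to ordinary cross-entropy. A scalar, pure-Python sketch of the binary case (illustrative; the library version works on tensors and adds alpha weighting and reduction options):

```python
import math

def binary_focal_loss(p, target, gamma=2.0):
    """Scalar binary focal loss; p is the predicted probability of class 1."""
    p_t = p if target == 1 else 1.0 - p   # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A confident correct prediction is down-weighted far more than a wrong one.
easy = binary_focal_loss(0.9, 1)   # small loss
hard = binary_focal_loss(0.1, 1)   # large loss
```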
Documentation
- Stable docs: https://pytorch-optimizers.readthedocs.io/en/stable/
- Latest docs: https://pytorch-optimizers.readthedocs.io/en/latest/
- Optimizer API reference: https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/
- LR scheduler API reference: https://pytorch-optimizers.readthedocs.io/en/latest/lr_scheduler/
- Loss API reference: https://pytorch-optimizers.readthedocs.io/en/latest/loss/
- FAQ: https://pytorch-optimizers.readthedocs.io/en/latest/qa/
- Visualization examples: https://pytorch-optimizers.readthedocs.io/en/latest/visualization/
License Notes
Most implementations are under MIT or Apache 2.0 compatible terms from their original sources.
Some algorithms (for example Fromage, Nero) are tied to CC BY-NC-SA 4.0, which is non-commercial.
Please verify the license of each optimizer before production or commercial use.
Contributing and Community
- Contributing guide: CONTRIBUTING
- Changelog: CHANGELOG
Citation
Please cite original optimizer authors when you use specific algorithms. If you use this repository, you can use the citation metadata in CITATION or GitHub's "Cite this repository".
@software{Kim_pytorch_optimizer_optimizer_2021,
author = {Kim, Hyeongchan},
title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
url = {https://github.com/kozistr/pytorch_optimizer},
year = {2021}
}
Maintainer
Hyeongchan Kim / @kozistr
Download files
- Source Distribution: pytorch_optimizer-3.10.0.tar.gz
- Built Distribution: pytorch_optimizer-3.10.0-py3-none-any.whl
File details
Details for the file pytorch_optimizer-3.10.0.tar.gz.
File metadata
- Download URL: pytorch_optimizer-3.10.0.tar.gz
- Upload date:
- Size: 21.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.7 (Ubuntu 24.04, CI)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | da50f3700e403ac0ce774cbbfdfe94a3f39162eca65eea79d67cfde28ead642b |
| MD5 | 177a4ebf458664cd38b6b2511f7a1087 |
| BLAKE2b-256 | 89fc313e6ad458114f3b5ad982c72343f88ef686a238787bcc39cad866a39282 |
File details
Details for the file pytorch_optimizer-3.10.0-py3-none-any.whl.
File metadata
- Download URL: pytorch_optimizer-3.10.0-py3-none-any.whl
- Upload date:
- Size: 287.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.7 (Ubuntu 24.04, CI)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 931b0c85d665fa635c8cd4d04f9a7fd8f23b915a8e19a23b2c07341f783c5799 |
| MD5 | 89170d5bcf02b2b8835fd08a8d5deb31 |
| BLAKE2b-256 | bde3bfc6f22bbcccf2b824c3e9d84a094d1ca9151a5d8a6c0b7990aeefa5cbb6 |