Automatic learning rate optimiser based on Prodigy and Schedule-Free

Prodigy + Schedule-Free

Eliminating hyperparameters, one commit at a time.

Current status: Experimental

Installation

pip install prodigy-plus-schedule-free

Usage

from prodigyplus.prodigy_plus_schedulefree import ProdigyPlusScheduleFree
optimizer = ProdigyPlusScheduleFree(
    model.parameters(),
    lr=1.0, betas=(0.9, 0.99), beta3=None,
    weight_decay=0.0, weight_decay_by_lr=True, d0=1e-6, d_coef=1.0,
    d_limiter=True, prodigy_steps=0, eps=1e-8,
    split_groups=True, split_groups_mean=False,
    factored=True, factored_fp32=True, use_bias_correction=False,
    use_stableadamw=True, use_schedulefree=True, use_speed=False,
    stochastic_rounding=True, fused_back_pass=False,
    use_cautious=False, use_grams=False, use_adopt=False,
    use_orthograd=False, use_focus=False,
)

As with the reference implementation of Schedule-Free, a constant scheduler should be used, along with the appropriate calls to optimizer.train() and optimizer.eval(). See the Schedule-Free documentation for more details: https://github.com/facebookresearch/schedule_free
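The train/eval calls are easy to forget around validation, so here is a minimal sketch of a helper that pairs them up. The name evaluation_mode is hypothetical and not part of this package; it only assumes that both the model and the optimiser expose train() and eval() methods.

```python
from contextlib import contextmanager


@contextmanager
def evaluation_mode(model, optimizer):
    """Temporarily put a model and a Schedule-Free style optimiser in eval mode.

    Hypothetical helper: it relies only on both objects having
    train() and eval() methods.
    """
    model.eval()
    optimizer.eval()
    try:
        yield
    finally:
        # Always return to training mode, even if evaluation raises.
        model.train()
        optimizer.train()
```

Call optimizer.train() once before the training loop starts, then wrap validation (and, per the Schedule-Free documentation, checkpoint saving) in `with evaluation_mode(model, optimizer):`.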

TLDR

The default settings should "just work", but there are a few configurations you can try to improve things.

Gradient scaling/clipping

By default, the optimiser uses StableAdamW to scale parameter updates, which reduces the need for external gradient scaling or clipping. However, this can also hamper Prodigy's ability to adapt the stepsize. While the optimiser includes internal logic to mostly mitigate this, you can set use_stableadamw=False and use external gradient clipping instead.

Training multiple networks

Unlike reference Prodigy, this optimiser will adjust the stepsize per parameter group, allowing one to train multiple networks at the same time. To use the original behaviour, set split_groups=False.
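As a rough sketch of what training multiple networks looks like in practice, each network can be given its own parameter group. The helper build_param_groups below is hypothetical; it assumes each network exposes a parameters() method (as PyTorch modules do) and produces the usual list-of-dicts format accepted by PyTorch optimisers.

```python
def build_param_groups(networks):
    """Return one optimiser parameter group per network.

    `networks` maps a label to any object exposing `.parameters()`.
    Hypothetical helper for illustration only.
    """
    groups = []
    for _label, net in networks.items():
        params = list(net.parameters())
        if params:  # skip networks with nothing to train
            groups.append({"params": params})
    return groups
```

The resulting list can be passed as the first argument of the optimiser in place of model.parameters(). With split_groups=True, each group should then receive its own adaptive stepsize.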

Turning off Prodigy

Earlier versions of the optimiser recommended setting prodigy_steps equal to 5-25% of your total step count, but this should not be necessary with recent updates. That said, you can still use the setting to make sure the LR does not change after a certain step, and free any memory used by Prodigy for adapting the step size.
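If you do want to freeze the LR after a fixed share of training, the value for prodigy_steps is simple arithmetic. The helper below is a hypothetical convenience, not part of the package; the 25% default fraction is just the upper end of the older recommendation.

```python
def prodigy_steps_for(total_steps, fraction=0.25):
    """Convert a fraction of total training steps into a prodigy_steps value.

    Hypothetical helper. Returns at least 1, because the default of 0
    appears to leave adaptation running for the whole run.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(total_steps * fraction))
```

For a 1000-step run, `prodigy_steps_for(1000, 0.05)` gives 50 and the default fraction gives 250.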

Changes in v2.0.0

Schedule-Free can be disabled using use_schedulefree=False. This reverts the optimiser to straight Prodigy, while keeping per-group learning rates and the rest of the features of the optimiser (StableAdamW, factorisation, and so on). In this mode, it is best paired with a decaying LR scheduler.
Changed split_groups_mean to False so full, per-group stepsize adaptation is active by default.
The Prodigy implementation has been adjusted to more closely match the original.
StableAdamW uses a soft scaling formula based on the square root of the RMS. This should result in more accurate LR adjustments.
SPEED has been completely reworked, and should be more stable and perform better on a wide range of tasks. Personally, I now prefer it over base Prodigy.
Removed Muon. It never really worked correctly when combined with Schedule-Free and Prodigy.
Removed the "confidence" learning rate limiter, which ended up being too aggressive for non-SDXL training and fine-tuning.
Added a limiter to d growth to prevent over-estimated LRs when gradients and EMAs are still stabilising. It can be disabled via d_limiter=False.
Added logging group parameter effective_lr. This value is for reporting only; rather than using d * lr, you can track d * effective_lr. This provides a closer approximation of the LR when Schedule-Free is on. Once the LR has settled, d * effective_lr should be around 10% the size of d * lr.
Suffice it to say, you should not resume training started with older versions of the optimiser with this one. It will break.
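For the effective_lr logging value mentioned in the changes above, a small sketch of how it might be read for reporting. It assumes the optimiser follows the usual PyTorch convention of exposing a param_groups list of dicts, and that the adapted stepsize is stored under the key "d"; the "d" key is an assumption on my part, so check the group contents before relying on it.

```python
def effective_learning_rates(param_groups):
    """Return d * effective_lr for each parameter group, for logging only.

    Assumes each group is a dict holding "d" and "effective_lr";
    groups missing either key yield None instead of raising.
    """
    rates = []
    for group in param_groups:
        d = group.get("d")
        effective_lr = group.get("effective_lr")
        if d is None or effective_lr is None:
            rates.append(None)
        else:
            rates.append(d * effective_lr)
    return rates
```

Typical use would be `effective_learning_rates(optimizer.param_groups)` once per logging interval.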

Details

An optimiser based on Prodigy that includes Schedule-Free logic and much, much lower memory usage, the aim being to remove the need to set any hyperparameters. Of course, that's never the case with any optimiser, but hopefully, this comes close!

Hyperparameters eliminated: Learning rate (Prodigy), LR scheduler (Schedule-Free), epsilon (Adam-atan2, optional, not enabled by default).

Based on code from:

Incorporates improvements from these pull requests (credit to https://github.com/dxqbYD, https://github.com/sangoi-exe and https://github.com/nhamanasu):

If you do use another scheduler, linear or cosine is preferred, as a restarting scheduler can confuse Prodigy's adaptation logic.
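To illustrate the kind of schedule meant here, this is a minimal sketch of a single, non-restarting cosine decay expressed as an LR multiplier. The function name and the floor parameter are hypothetical; the multiplier form fits because the optimiser multiplies the lr you provide with the adaptive one.

```python
import math


def cosine_multiplier(step, total_steps, floor=0.0):
    """LR multiplier that decays once from 1.0 to `floor`, with no restarts.

    Hypothetical helper; `floor` is the final multiplier value.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch, a function like this can be supplied as the lr_lambda of torch.optim.lr_scheduler.LambdaLR, for example `LambdaLR(optimizer, lambda s: cosine_multiplier(s, total_steps))`.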

Leave lr set to 1 unless you encounter instability. Do not use with gradient clipping, as this can hamper the optimiser's ability to predict stepsizes. Gradient clipping/normalisation is already handled in the following configurations:

use_stableadamw=True, eps=1e-8 (or any reasonable positive epsilon; this is the default).
eps=None (Adam-atan2, scale invariant. Will disable StableAdamW if enabled.)

The optimiser uses low-rank approximations for the second moment, much like Adafactor. There should be little to no difference in training performance, but your mileage may vary. If you encounter problems, you can try disabling factorisation by setting factored=False. If you're training in bfloat16, and need to squeeze out every last drop of memory, you can also set factored_fp32=False, which will make the factored second moment use the same precision as the weights, rather than float32 (to maximise stability).
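To give a feel for where the memory saving comes from, here is a pure-Python sketch of an Adafactor-style factorisation: a rows x cols second moment is stored as one row vector and one column vector. This illustrates the general technique only, under the assumption that the optimiser does something broadly similar; its actual code may differ.

```python
def factor_second_moment(v):
    """Compress a 2D second-moment matrix into row and column means."""
    rows, cols = len(v), len(v[0])
    row_mean = [sum(row) / cols for row in v]
    col_mean = [sum(v[i][j] for i in range(rows)) / rows for j in range(cols)]
    return row_mean, col_mean


def reconstruct_second_moment(row_mean, col_mean):
    """Rebuild a rank-1 approximation from the stored factors.

    Assumes at least one non-zero entry, so the overall mean is non-zero.
    """
    overall = sum(row_mean) / len(row_mean)
    return [[r * c / overall for c in col_mean] for r in row_mean]
```

Storage drops from rows * cols values to rows + cols. The reconstruction is exact when the matrix is rank 1 and approximate otherwise, which is why results can occasionally differ from the unfactored version.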

The optimiser also supports fused backward pass to significantly lower gradient memory usage. The fused_back_pass argument must be set to True so the optimiser knows not to perform the regular step. Please note however that your training scripts / UI of choice must support the feature for generic optimisers -- as of May 2025, Kohya hard-codes which optimisers have fused backward pass support, and so this optimiser's fused pass will not work out of the box with it.

In some scenarios, it can be advantageous to freeze Prodigy's adaptive stepsize after a certain number of steps. This can be controlled via the prodigy_steps setting. It's been suggested that all Prodigy needs to do is achieve "escape velocity" in terms of finding a good LR, which it usually achieves after ~25% of training, though this is very dependent on batch size and epochs.

This setting can be particularly helpful when training diffusion models, which have very different gradient behaviour than what most optimisers are tuned for. Prodigy in particular will increase the LR forever if it is not stopped or capped in some way (usually via a decaying LR scheduler). Even if you don't need to cap LR growth, the optimiser will free all Prodigy-specific state memory once prodigy_steps is exceeded, which may improve performance where memory usage is on the borderline.

Experimental features

Adam-atan2: eps=None. Outlined in Scaling Exponents Across Parameterizations and Optimizers, you can use atan2 in place of the regular division plus epsilon found in most Adam-style optimisers. This makes updates scale-invariant, and removes the need to tweak the epsilon. Disabled by default.
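A small scalar sketch of the difference, for intuition. Both functions are illustrative stand-ins rather than this optimiser's code, and published variants of Adam-atan2 include extra scaling constants that are omitted here.

```python
import math


def adam_style_update(m, v, eps=1e-8):
    """Regular Adam-style ratio: first moment over sqrt(second moment) plus eps."""
    return m / (math.sqrt(v) + eps)


def atan2_update(m, v):
    """atan2 replacement: no epsilon, bounded output, well defined at v == 0."""
    return math.atan2(m, math.sqrt(v))
```

Scaling the gradients by a constant c scales m by c and v by c squared. The atan2 form gives the same result either way, whereas the epsilon form drifts once sqrt(v) becomes comparable to eps.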

C-Optim: use_cautious=True. Outlined in Cautious Optimizers: Improving Training with One Line of Code. Applies a simple modification to parameter updates that promotes values that are aligned with the current gradient. This should result in faster convergence. While not 1:1 compatible with schedule-free, the implementation by nhamanasu does work, though improvements may be limited.
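The "one line" in question can be sketched in plain Python as follows. This follows the masking idea from the paper as I understand it, not the Schedule-Free-compatible implementation used here; the function name and min_keep are hypothetical, and `update` is the proposed direction before the sign flip applied when stepping.

```python
def cautious_update(update, grad, min_keep=1e-3):
    """Zero out update entries that disagree in sign with the current gradient.

    The surviving entries are rescaled by the kept fraction so the overall
    update magnitude stays comparable.
    """
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    keep_fraction = max(sum(mask) / len(mask), min_keep)
    return [u * m / keep_fraction for u, m in zip(update, mask)]
```

For example, with update [1, -2, 3, 4] and gradient [0.5, 1, -1, 2], only the first and last entries agree in sign, so half the entries are kept and those two are doubled.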

Grams: use_grams=True. Described in Grams: Gradient Descent with Adaptive Momentum Scaling. In a similar vein to C-Optim, the parameter update is modified to separate the update direction from momentum. Thanks to gesen2egee for the pull request.

ADOPT: use_adopt=True. A partial implementation of ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate, as we only update the second moment after the parameter update, so as to exclude the current gradient. Disabled by default.

OrthoGrad: use_orthograd=True. Updates weights using the component of the gradient that is orthogonal to the current weight direction, as described in Grokking at the Edge of Numerical Stability. Can help prevent overfitting and improve generalisation. Ignored for parameters using Muon.
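The projection itself is short enough to sketch on flat lists. This is an illustration of the idea rather than the optimiser's code; rescaling the result back to the original gradient norm is my reading of the paper's method, and the eps guard is an assumption added for numerical safety.

```python
import math


def orthogonalise_gradient(weights, grad, eps=1e-30):
    """Remove the component of `grad` that is parallel to `weights`.

    The result is rescaled to the original gradient norm.
    """
    w_dot_g = sum(w * g for w, g in zip(weights, grad))
    w_dot_w = sum(w * w for w in weights) + eps
    projected = [g - (w_dot_g / w_dot_w) * w for w, g in zip(weights, grad)]
    g_norm = math.sqrt(sum(g * g for g in grad))
    p_norm = math.sqrt(sum(p * p for p in projected)) + eps
    return [p * g_norm / p_norm for p in projected]
```

With weights [1, 0] and gradient [3, 4], the component along the weights (3) is removed and the remainder [0, 4] is rescaled to the original norm of 5.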

FOCUS: use_focus=True. Modifies the update step to better handle noise at large step sizes. From FOCUS: First-Order Concentrated Update Scheme. This method is incompatible with factorisation (which will increase state memory usage), Muon and Adam-atan2. Additionally, Prodigy modifies the second moment updates when d changes, which may limit the benefits of this method.

SPEED: use_speed=True. Something of my own creation I've dubbed Simplified Prodigy with rElativE D. It replaces Prodigy's numerator/denominator ratio with a momentum-based estimate of directional progress. SPEED uses less memory, is scale-insensitive, and can be a better choice when training multiple networks. However, it can be unstable when used with weight decay or for extremely long training runs (where using prodigy_steps is recommended).

Prodigy FAQ

Q: Why doesn't Prodigy ever lower the learning rate?

The original Prodigy's aim is not to act as a combined learning rate calculator and scheduler. It's meant to ballpark a good learning rate, and leave LR decay to your preferred scheduler (usually cosine). Prodigy + Schedule-Free does combine the two, but it doesn't adjust the LR directly -- in simple terms, it uses a smaller and smaller portion of the averaged updates as training goes on, roughly approximating a 1/t schedule.
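To make the "smaller and smaller portion" idea concrete, here is a toy sketch of uniform iterate averaging, where each new point is mixed in with weight 1/t. This is a simplification made for illustration; Schedule-Free's actual weighting and interpolation are more involved.

```python
def running_average(points):
    """Average a sequence of iterates, mixing each new one in with weight 1/t."""
    x = None
    history = []
    for t, z in enumerate(points, start=1):
        c = 1.0 / t  # the share given to the newest iterate shrinks over time
        x = z if x is None else (1.0 - c) * x + c * z
        history.append(x)
    return history
```

Because the newest iterate only contributes 1/t of the average, later updates move the averaged weights less and less, even when the underlying stepsize never drops.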

Looking at d alone tells only part of the story; this is just the LR Prodigy has calculated, minus any internal modifications. A better metric is the norm of the weights: you'll see their rate of growth decrease significantly over time, reflecting the small tail of a traditional LR schedule.

Q: Why isn't Prodigy increasing the LR?

If Prodigy fails to increase the LR over an extended period (say 100 or more steps), and you're not using bias correction, a non-constant LR scheduler or warmup, this usually indicates one of the following:

You haven't set the optimiser's lr argument to 1. For compatibility with external LR schedulers, the optimiser will multiply the LR you provide with the adaptive one, so if you forget to change this when switching optimisers, the LR will be tiny.
The ideal LR is less than d0 (Prodigy's initial LR guess). Try setting d0 to a lower value, such as 1e-7 or 1e-8. If this doesn't help, you can also try setting d_coef=2 (or higher), or use_speed=True.
The value for d0 is too conservative and starving Prodigy. Try raising d0 to 1e-5 or 1e-4.
External gradient clipping is enabled. This optimiser handles gradient scaling already, so turn off any external clipping/scaling. Alternatively, you can use external scaling, and disable the internal one via use_stableadamw=False.
Set d_limiter=False. The growth limiter should never prevent the LR from increasing, but it's possible your training scenario requires faster adjustments.

MNIST results

Generated from the MNIST example in the Schedule-Free repository, using the default settings.

Prodigy LR: 0.000862
Test set: Average loss: 0.0456, Accuracy: 9849/10000 (98.49%)
Test set: Average loss: 0.0347, Accuracy: 9881/10000 (98.81%)
Test set: Average loss: 0.0324, Accuracy: 9898/10000 (98.98%)
Test set: Average loss: 0.0308, Accuracy: 9911/10000 (99.11%)
Test set: Average loss: 0.0299, Accuracy: 9913/10000 (99.13%)
Test set: Average loss: 0.0285, Accuracy: 9919/10000 (99.19%)
Test set: Average loss: 0.0289, Accuracy: 9922/10000 (99.22%)
Test set: Average loss: 0.0300, Accuracy: 9925/10000 (99.25%)
Test set: Average loss: 0.0306, Accuracy: 9924/10000 (99.24%)
Test set: Average loss: 0.0319, Accuracy: 9927/10000 (99.27%)
Test set: Average loss: 0.0339, Accuracy: 9925/10000 (99.25%)
Test set: Average loss: 0.0349, Accuracy: 9928/10000 (99.28%)
Test set: Average loss: 0.0366, Accuracy: 9924/10000 (99.24%)
Test set: Average loss: 0.0377, Accuracy: 9926/10000 (99.26%)

With use_speed=True:

Prodigy LR: 0.002582
Test set: Average loss: 0.0401, Accuracy: 9861/10000 (98.61%)
Test set: Average loss: 0.0309, Accuracy: 9908/10000 (99.08%)
Test set: Average loss: 0.0276, Accuracy: 9916/10000 (99.16%)
Test set: Average loss: 0.0259, Accuracy: 9928/10000 (99.28%)
Test set: Average loss: 0.0258, Accuracy: 9930/10000 (99.30%)
Test set: Average loss: 0.0268, Accuracy: 9931/10000 (99.31%)
Test set: Average loss: 0.0288, Accuracy: 9926/10000 (99.26%)
Test set: Average loss: 0.0305, Accuracy: 9927/10000 (99.27%)
Test set: Average loss: 0.0309, Accuracy: 9934/10000 (99.34%)
Test set: Average loss: 0.0309, Accuracy: 9932/10000 (99.32%)
Test set: Average loss: 0.0323, Accuracy: 9933/10000 (99.33%)
Test set: Average loss: 0.0337, Accuracy: 9934/10000 (99.34%)
Test set: Average loss: 0.0345, Accuracy: 9932/10000 (99.32%)
Test set: Average loss: 0.0352, Accuracy: 9933/10000 (99.33%)

Release history

2.0.1

Sep 27, 2025

2.0.0

Sep 13, 2025

2.0.0rc2 pre-release

May 30, 2025

2.0.0rc1 pre-release

May 19, 2025

2.0.0b5 pre-release

May 14, 2025

This version

2.0.0b4 pre-release

May 13, 2025

2.0.0b3 pre-release

May 12, 2025

2.0.0b2 pre-release

May 12, 2025

2.0.0b1 pre-release

May 11, 2025

1.9.2

May 13, 2025

1.9.1

Mar 25, 2025

1.9.0

Jan 30, 2025

1.8.51

Jan 13, 2025

1.8.33

Jan 9, 2025

1.8.32

Jan 9, 2025

1.8.31

Jan 9, 2025

1.8.21

Jan 7, 2025

1.8.5

Jan 13, 2025

1.8.4

Jan 10, 2025

1.8.3

Jan 9, 2025

1.8.2

Jan 7, 2025

1.8.1

Jan 6, 2025

1.8.0

Dec 17, 2024

1.7.0

Dec 3, 2024

1.6.31

Nov 28, 2024

1.6.3

Nov 28, 2024

1.6.2

Nov 26, 2024

1.6.1

Nov 26, 2024

1.6.0

Nov 26, 2024

1.5.1

Nov 26, 2024

1.5.0

Nov 24, 2024

1.4.2

Nov 23, 2024

1.4.1

Nov 23, 2024

1.4.0

Nov 23, 2024

1.3.2

Nov 22, 2024

1.3.1 yanked

Nov 22, 2024

Reason this release was yanked:

Bugged release.

1.3.0

Nov 21, 2024

Download files

Source Distribution

prodigy_plus_schedule_free-2.0.0b4.tar.gz (23.0 kB view details)

Uploaded May 13, 2025 Source

Built Distribution

prodigy_plus_schedule_free-2.0.0b4-py3-none-any.whl (23.8 kB view details)

Uploaded May 13, 2025 Python 3

File details

Details for the file prodigy_plus_schedule_free-2.0.0b4.tar.gz.

File metadata

Download URL: prodigy_plus_schedule_free-2.0.0b4.tar.gz
Upload date: May 13, 2025
Size: 23.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for prodigy_plus_schedule_free-2.0.0b4.tar.gz
Algorithm	Hash digest
SHA256	`5aaf868d186b139f52e7eefc681a94f784e7d7c54edd9b2e7ec9262fa7d02e04`
MD5	`a74ce29c0e01fa3dccaa7f5c8ee847c8`
BLAKE2b-256	`fe90b0f648f7a723f350adb08b1eef1f9f220c768faba5480896d52b522e862d`

File details

Details for the file prodigy_plus_schedule_free-2.0.0b4-py3-none-any.whl.

File metadata

Download URL: prodigy_plus_schedule_free-2.0.0b4-py3-none-any.whl
Upload date: May 13, 2025
Size: 23.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for prodigy_plus_schedule_free-2.0.0b4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`676c89cac70b519d7e2b270b4e3e6a787c67d69b095bd46e711640963f97750f`
MD5	`dead27347a84b2ad650d3a1580e1d968`
BLAKE2b-256	`f9cde195cef804c8c125e5a4dce206a718f243b99a32a92ffc4cc65e6637d07f`

prodigy-plus-schedule-free 2.0.0b4
