
Automatic learning rate optimiser based on Prodigy and Schedule-Free


Prodigy + ScheduleFree

Eliminating hyperparameters, one commit at a time.

Current status: Experimental

Installation

pip install prodigy-plus-schedule-free

Usage

from prodigyplus.prodigy_plus_schedulefree import ProdigyPlusScheduleFree
optimizer = ProdigyPlusScheduleFree(model.parameters(), lr=1.0, betas=(0.9, 0.99), beta3=None,
                                    weight_decay=0.0, weight_decay_by_lr=True,
                                    use_bias_correction=False, d0=1e-6, d_coef=1.0,
                                    prodigy_steps=0, eps=1e-8,
                                    split_groups=True, split_groups_mean=True,
                                    factored=True, fused_back_pass=False, use_stableadamw=True,
                                    use_muon_pp=False, use_cautious=False, use_adopt=False,
                                    stochastic_rounding=True)

As with the reference implementation of schedule-free, a constant scheduler should be used, along with the appropriate calls to optimizer.train() and optimizer.eval(). See the schedule-free documentation for more details: https://github.com/facebookresearch/schedule_free
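The train/eval switching can be wrapped in a small helper so it is hard to forget. This is a minimal sketch, not part of the package: `eval_mode` is a hypothetical name, and it only assumes that both the model and the optimiser expose `train()` and `eval()` methods, as described above.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model, optimizer):
    """Switch model and optimiser to eval mode, then restore train mode.

    Hypothetical helper: relies only on the train()/eval() methods that
    schedule-free style optimisers expose.
    """
    model.eval()
    optimizer.eval()
    try:
        yield
    finally:
        # Always return to train mode, even if validation raised.
        model.train()
        optimizer.train()
```

Call `optimizer.train()` once before the training loop starts, then wrap validation passes and checkpoint saving in `with eval_mode(model, optimizer):`.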

Details

An optimiser based on Prodigy that includes schedule-free logic and much, much lower memory usage, the aim being to remove the need to set any hyperparameters. Of course, that's never the case with any optimiser, but hopefully, this comes close!

Hyperparameters eliminated: Learning rate (Prodigy), LR scheduler (ScheduleFree), epsilon (Adam-atan2, optional, not enabled by default).

Based on code from:

Incorporates improvements from these pull requests (credit to https://github.com/dxqbYD, https://github.com/sangoi-exe and https://github.com/nhamanasu):

If you do use another scheduler, linear or cosine is preferred, as a restarting scheduler can confuse Prodigy's adaptation logic.

Leave lr set to 1 unless you encounter instability. Do not use with gradient clipping, as this can hamper the optimiser's ability to predict stepsizes. Gradient clipping/normalisation is already handled in the following configurations:

  1. use_stableadamw=True, eps=1e-8 (or any reasonable positive epsilon). This is the default.
  2. eps=None (Adam-atan2, scale-invariant). This disables StableAdamW if it is enabled.
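To give a feel for why external gradient clipping is redundant in the first configuration, here is a rough sketch of the Adafactor-style update clipping that StableAdamW is known for. It is an illustration under assumptions, not this optimiser's code: `clipped_step_scale` is a hypothetical name, and the inputs are plain lists standing in for a parameter's gradient and second-moment estimate.

```python
import math


def clipped_step_scale(grads, second_moments, eps=1e-8):
    """Return a factor in (0, 1] that shrinks the step when updates are large.

    Sketch of update clipping: the step is divided by max(1, RMS), where
    RMS is the root mean square of grad / sqrt(second_moment).
    """
    ratios = [g * g / max(v, eps * eps) for g, v in zip(grads, second_moments)]
    rms = math.sqrt(sum(ratios) / len(ratios))
    return 1.0 / max(1.0, rms)


# Gradients twice as large as the running estimate suggests: step is halved.
print(clipped_step_scale([2.0, 2.0], [1.0, 1.0]))
# Gradients smaller than the running estimate: step is left alone.
print(clipped_step_scale([0.5, 0.5], [1.0, 1.0]))
```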

By default, split_groups and split_groups_mean are both set to True. Each parameter group then tracks its own d values, but all groups share the harmonic mean for the dynamic learning rate. To make each group use its own dynamic LR, set split_groups_mean to False. To use the reference Prodigy behaviour, where all groups are combined, set split_groups to False.
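The effect of split_groups_mean can be illustrated with a few lines of plain Python. This is only a sketch: `group_step_sizes` is a hypothetical function, the d values are made up, and the optimiser's real bookkeeping is more involved.

```python
from statistics import harmonic_mean


def group_step_sizes(d_values, lr=1.0, split_groups_mean=True):
    """Return the step size each group would use for the given d estimates."""
    if split_groups_mean:
        # Every group uses the same value: the harmonic mean of all d's.
        shared = harmonic_mean(d_values)
        return [lr * shared for _ in d_values]
    # Each group uses its own d.
    return [lr * d for d in d_values]


d_per_group = [1e-4, 4e-4]  # made-up estimates for two parameter groups
print(group_step_sizes(d_per_group))                           # shared value
print(group_step_sizes(d_per_group, split_groups_mean=False))  # per-group values
```

The harmonic mean is pulled towards the smaller values, so a group with a small d keeps the shared step size conservative.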

The optimiser uses low-rank approximations for the second moment, much like Adafactor. There should be little to no difference in training performance, but your mileage may vary. If you encounter problems, you can try disabling factorisation by setting factored to False.
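As a rough picture of what factorisation saves, the sketch below stores only row and column sums of a second-moment matrix and rebuilds a rank-1 approximation from them, in the style of Adafactor. The function names are hypothetical and the optimiser's own implementation may differ in detail (for example, it works on running averages of squared gradients rather than a fixed matrix).

```python
def factor(matrix):
    """Reduce an n x m matrix to n row sums and m column sums."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    return row_sums, col_sums


def reconstruct(row_sums, col_sums):
    """Rebuild a rank-1 approximation: outer(rows, cols) / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


second_moment = [[1.0, 2.0],
                 [2.0, 4.0]]
rows, cols = factor(second_moment)
approx = reconstruct(rows, cols)
```

For an n x m weight matrix this keeps n + m numbers instead of n * m. The reconstruction is exact only when the matrix is rank 1, as in this toy example; otherwise it is an approximation.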

The optimiser also supports fused backward pass to significantly lower gradient memory usage. The fused_back_pass argument must be set to True so the optimiser knows not to perform the regular step. Please note however that your training scripts / UI of choice must support the feature for generic optimisers -- as of January 2025, popular trainers such as OneTrainer and Kohya hard-code which optimisers have fused backward pass support, and so this optimiser's fused pass will not work out of the box with them.
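For trainers that do support it, a fused backward pass is typically wired up with PyTorch's `register_post_accumulate_grad_hook` (available on tensors since PyTorch 2.1), so each parameter is stepped as soon as its gradient is ready. The sketch below shows that wiring only. `attach_fused_hooks` is a hypothetical helper, and `step_param` is a caller-supplied callable: which per-parameter step method the optimiser exposes is an assumption you should check against its source.

```python
def attach_fused_hooks(param_groups, step_param):
    """Register a hook on every trainable parameter that steps it immediately.

    param_groups: an optimiser's param_groups (list of dicts with "params").
    step_param:   callable(param, group) that performs the per-parameter
                  update; assumed to be provided by the optimiser or trainer.
    """
    for group in param_groups:
        for param in group["params"]:
            if not param.requires_grad:
                continue

            # Bind the group as a default argument so each hook keeps its own.
            def hook(p, group=group):
                step_param(p, group)

            param.register_post_accumulate_grad_hook(hook)
```

With hooks like these in place, the trainer can free each gradient right after its parameter has been updated, which is where the memory saving comes from.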

In some scenarios, it can be advantageous to freeze Prodigy's adaptive stepsize after a certain number of steps. This can be controlled via the prodigy_steps setting. It's been suggested that all Prodigy needs to do is achieve "escape velocity" in terms of finding a good LR, which it usually achieves after ~25% of training, though this is very dependent on batch size and epochs.

This setting can be particularly helpful when training diffusion models, which have very different gradient behaviour than what most optimisers are tuned for. Prodigy in particular will increase the LR forever if it is not stopped or capped in some way (usually via a decaying LR scheduler).

Experimental features

Adam-atan2: Enabled by setting eps to None. Outlined in Scaling Exponents Across Parameterizations and Optimizers, this replaces the regular division plus epsilon found in most Adam-style optimisers with atan2. This makes updates scale-invariant and removes the need to tweak epsilon. Disabled by default.
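The scale invariance is easy to check on scalars. This is a simplified sketch with hypothetical function names: `m` and `v` stand for the first and second moment of a single weight, and any scaling constants a real implementation may apply around atan2 are left out.

```python
import math


def adam_update(m, v, eps=1e-8):
    """Regular Adam-style normalisation: division plus epsilon."""
    return m / (math.sqrt(v) + eps)


def atan2_update(m, v):
    """Adam-atan2 style normalisation: no epsilon, safe when v is zero."""
    return math.atan2(m, math.sqrt(v))


m, v = 0.3, 0.04
c = 1000.0  # multiplying gradients by c scales m by c and v by c squared

print(atan2_update(m, v), atan2_update(c * m, c * c * v))  # same value
print(adam_update(1.0, 1.0), adam_update(1e-9, 1e-18))     # eps distorts tiny gradients
```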

Muon: Enabled by setting use_muon_pp to True. This changes the fundamental behaviour of the optimiser for compatible parameters from AdamW to SGD with a quasi-second moment based on the RMS of the updates. As explained by Keller Jordan, and demonstrated (in various forms) by optimisers such as Shampoo, SOAP and Jordan's Muon, applying preconditioning to the gradient can improve convergence. However, this approach may not work in some situations (small batch sizes, fine-tuning) and as such, is disabled by default.

C-Optim: Enabled by setting use_cautious to True. Outlined in Cautious Optimizers: Improving Training with One Line of Code. Applies a simple modification to parameter updates that promotes values that are aligned with the current gradient. This should result in faster convergence. While not 1:1 compatible with schedule-free, the implementation by nhamanasu does work, though improvements may be limited.
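The "one line" modification can be sketched on plain lists as follows. This mirrors the masking idea from the paper rather than the schedule-free adaptation used here; `cautious_update` and `min_mean` are illustrative names, and `update` is the direction that would be subtracted from the weights.

```python
def cautious_update(update, grad, min_mean=1e-3):
    """Zero out update entries whose sign disagrees with the current gradient.

    The surviving entries are rescaled by the inverse of the kept fraction
    so the overall update magnitude stays comparable.
    """
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    kept_fraction = sum(mask) / len(mask)
    scale = 1.0 / max(kept_fraction, min_mean)
    return [u * m * scale for u, m in zip(update, mask)]


# Only the first entry agrees in sign with the gradient, so it alone survives.
print(cautious_update([1.0, -1.0, 2.0], [1.0, 1.0, -1.0]))
```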

ADOPT: Enabled by setting use_adopt to True. A partial implementation of ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate, as we only update the second moment after the parameter update, so as to exclude the current gradient. Disabled by default.
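The ordering described above can be shown for a single scalar weight. This is a deliberately stripped-down sketch (no momentum, no Prodigy step size, hypothetical function name), meant only to show that the current gradient is normalised by the previous second moment.

```python
import math


def adopt_style_step(param, grad, v_prev, lr=1e-3, beta2=0.99, eps=1e-8):
    """One step where the second moment is refreshed after the parameter update."""
    # Normalise with v_prev, which does not yet include the current gradient.
    update = grad / max(math.sqrt(v_prev), eps)
    new_param = param - lr * update
    # Only now fold the current gradient into the second moment.
    v_new = beta2 * v_prev + (1 - beta2) * grad * grad
    return new_param, v_new


new_param, v_new = adopt_style_step(1.0, 10.0, 4.0, lr=0.1, beta2=0.5)
```

In regular Adam the second moment is updated first, so a large gradient partly normalises itself; here it is measured against past gradients only.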

Recommended usage

Earlier versions of the optimiser recommended setting prodigy_steps equal to 5-25% of your total step count, but this should not be necessary with recent updates. That said, you can still use the setting to make sure the LR does not change after a certain step, and free any memory used by Prodigy for adapting the step size.
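If you do want to freeze the LR after a fixed share of training, the value for prodigy_steps is simple arithmetic on your total step count. The helper below is a hypothetical convenience, not part of the package.

```python
def prodigy_steps_for(total_steps, fraction=0.25):
    """Return the step at which to freeze adaptation, as a share of training."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return round(total_steps * fraction)


# 4,000 optimiser steps, freezing after the first 10%.
print(prodigy_steps_for(4000, fraction=0.10))
```

The result can be passed as the prodigy_steps argument when constructing the optimiser.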

