HeavyBall

Important: It is recommended to call heavyball.utils.set_torch() for faster training and lower memory usage.

A simple package of efficient optimizers

The goal is not to strive for completeness, full maintenance, or abstraction, but to provide a simple, largely static alternative to torch.optim with more and better optimizers.

Currently (2024-12-07, 1.0.0), the recommended stable optimizer is PrecondSchedulePaLMSOAP (see below). The recommended experimental optimizer is DelayedPSGDKron (tuning guide).

Features

  • Optax-like API: C = heavyball.chainable; grokfast = C.ChainOpt(p, lr, C.exp_avg, C.scale_by_adam)
  • Stochastic Rounding: FP32 convergence with BF16 parameters
  • Inplace EMA: Same math, but less memory, less compute and higher stability
  • Foreach: Fast multi-tensor application (turn it off to save memory via foreach=False)
  • PaLM Beta2: Fast initial convergence, stable late convergence
  • ScheduleFree: No learning rate schedule, but better convergence
  • Preconditioner Schedule: Improved loss-per-step in early convergence, better step-per-second in late convergence (explained below)
  • Memory-efficient storage: PSGD supports store_triu_as_line (default: True) and q_dtype to trade off memory usage for memory bandwidth; other optimizers have storage_dtype, supporting lower-precision EMAs at no(?) performance drop via stochastic rounding
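To make the stochastic rounding feature concrete, here is a minimal pure-Python sketch of rounding a float32 value to a BF16-representable value. It is an illustration of the general technique only, not HeavyBall's implementation; the function names are hypothetical and the sketch assumes finite inputs.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    """Inverse of float32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round a finite float32 value to a BF16-representable value.

    BF16 keeps the top 16 bits of a float32. Adding a uniform random
    16-bit integer before truncating rounds away from zero with
    probability equal to the discarded fraction, so the result is
    unbiased in expectation.
    """
    bits = float32_bits(x) + rng.randrange(1 << 16)
    return bits_to_float32(bits & 0xFFFF0000)


rng = random.Random(0)
x = 1.0 + 2.0 ** -10  # lies between the BF16 neighbours 1.0 and 1.0078125
samples = [stochastic_round_bf16(x, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

Round-to-nearest would map this x to 1.0 every time and lose the small increment. With stochastic rounding the average over many roundings stays close to x, which is the intuition behind keeping FP32-like convergence with BF16 storage.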

Getting started

pip install heavyball

import torch
import heavyball

# Create a model
model = torch.nn.Linear(16, 1)

# Create an optimizer
optimizer = heavyball.PrecondSchedulePaLMSOAP(model.parameters(), lr=1e-3)

x = torch.randn(128, 16)
y = torch.randn(128, 1)

for _ in range(1000):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

Optimizers

| Name | Description | Advantages / Disadvantages |
|------|-------------|----------------------------|
| AdamW | More efficient (speed, memory) AdamW | + Faster than AdamW<br>+ Possibly more (numerically) stable |
| LaProp | More efficient (speed, memory) LaProp | + Same cost as AdamW<br>+ Marginally better convergence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot" |
| ADOPT | More efficient (speed, memory) ADOPT | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- No BF16 |
| SFAdamW | More efficient (speed, memory) ScheduleFree AdamW | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters |
| PaLMSFAdamW | ForeachSFAdamW with PaLM's beta2 schedule | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- Slow early convergence |
| SOAP | More efficient (speed, memory) SOAP | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- More memory usage<br>- More hyperparameters<br>- Higher overhead than AdamW (can be amortized; better loss-at-second) |
| PaLMSOAP | ForeachSOAP with PaLM's beta2 schedule | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- More memory usage<br>- More hyperparameters<br>- Higher overhead than AdamW (can be amortized; better loss-at-second) |
| SFPaLMSOAP | ScheduleFree PaLMForeachSOAP | + Fast convergence (loss-at-step)<br>+ Less memory usage than PaLMForeachSOAP (more than AdamW)<br>- Slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- Higher overhead than AdamW (can be amortized) |
| PrecondScheduleSFPaLMSOAP | SFPaLMForeachSOAP with preconditioner schedule, matching the error of PrecondEvery=2 with the cost of PrecondEvery=512 | + Better initial convergence than SFPaLMForeachSOAP<br>+ Significantly faster (sec/it) later<br>+ Less memory usage than PaLMForeachSOAP (more than AdamW)<br>- Slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- Higher overhead than AdamW (can be amortized), goes to 0 with increasing number of steps |
| PrecondSchedulePaLMSOAP | PrecondScheduleSFPaLMForeachSOAP without schedule-free | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ High stability<br>- More memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- Higher overhead than AdamW (can be amortized), goes to 0 with increasing number of steps |
| PrecondScheduleSOAP | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- More memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- Higher overhead than AdamW (can be amortized), goes to 0 with increasing number of steps |
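
Several optimizers above use "PaLM's beta2 schedule". As a rough sketch, the schedule described in the PaLM paper sets beta2 = 1 - step^(-0.8) for a 1-indexed step. Whether HeavyBall uses exactly this form and default exponent is an assumption here, and the names palm_beta2, beta2_scale and ema_squared_grads are hypothetical.

```python
def palm_beta2(step: int, beta2_scale: float = 0.8) -> float:
    """beta2 for a 1-indexed step, following 1 - step ** -beta2_scale."""
    if step < 1:
        raise ValueError("step is 1-indexed")
    return 1.0 - step ** -beta2_scale


def ema_squared_grads(grads, beta2_scale: float = 0.8):
    """Running second-moment estimate of scalar gradients under the schedule."""
    v = 0.0
    history = []
    for step, g in enumerate(grads, start=1):
        beta2 = palm_beta2(step, beta2_scale)
        # Small beta2 early: the estimate tracks recent gradients closely.
        # beta2 near 1 late: the estimate changes slowly and stays stable.
        v = beta2 * v + (1.0 - beta2) * g * g
        history.append(v)
    return history


for step in (1, 10, 100, 1_000, 10_000):
    print(step, round(palm_beta2(step), 4))
```

At step 1 the schedule gives beta2 = 0, so the first estimate equals the squared gradient. As the step count grows, beta2 approaches 1, which matches the "fast initial convergence, stable late convergence" description.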

Precond Schedule

The default preconditioner schedule (f) would yield the following update intervals:

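| Steps | Interval, f | Total (schedule) | Total (constant, every 2) | Total (constant, every 16) |
|-------|-------------|------------------|---------------------------|----------------------------|
| 10 | 1.00005 | 10 | 5 (0.5x) | 0 (0.0x) |
| 100 | 1.026 | 99 | 50 (0.5x) | 6 (0.1x) |
| 1,000 | 2.0 | 738 | 500 (0.7x) | 62 (0.1x) |
| 10,000 | 14.3 | 2,168 | 5,000 (2.3x) | 625 (0.3x) |
| 100,000 | 100.2 | 4,049 | 50,000 (12.3x) | 6,250 (1.5x) |
| 1,000,000 | 513 | 7,245 | 500,000 (69.0x) | 62,500 (8.6x) |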
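
The two constant-interval columns in the table above can be reproduced with simple arithmetic: a constant interval of k yields steps // k updates, and the factor in parentheses is that count divided by the schedule's total. The sketch below copies the schedule totals from the table as data. It does not implement the schedule f itself, and the helper names are hypothetical.

```python
# (steps, total preconditioner updates under the default schedule), from the table
SCHEDULE_TOTALS = [
    (10, 10),
    (100, 99),
    (1_000, 738),
    (10_000, 2_168),
    (100_000, 4_049),
    (1_000_000, 7_245),
]


def constant_total(steps: int, every: int) -> int:
    """Number of preconditioner updates when updating every `every` steps."""
    return steps // every


def relative_cost(steps: int, every: int, schedule_total: int) -> str:
    """Constant-interval update count relative to the schedule's count."""
    return f"{constant_total(steps, every) / schedule_total:.1f}x"


for steps, total in SCHEDULE_TOTALS:
    row = [f"{constant_total(steps, k):,} ({relative_cost(steps, k, total)})" for k in (2, 16)]
    print(f"{steps:>9,} | {total:>5,} | " + " | ".join(row))
```

Early in training the schedule updates the preconditioner more often than either constant interval. By one million steps it has performed far fewer updates than updating every 2 or every 16 steps.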

Memory

Second-order optimizers make it difficult to estimate memory usage, as it depends on parameter shapes and hyperparameters. To estimate your memory usage, you may use test/test_memory.py, which attempts to ensure there are no memory regressions.
Furthermore, the figure below shows the real-world memory usage of a video diffusion model with 300M parameters: (figure: img.png)
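
To illustrate why shapes matter, here is a back-of-the-envelope sketch that counts the elements of Kronecker-style preconditioner factors for a single parameter. It assumes one square factor per dimension, stored either densely or as a packed upper triangle. This is a simplification for intuition, not HeavyBall's actual accounting: real usage also depends on hyperparameters, dtypes, and additional buffers, so measure with test/test_memory.py when it matters. All names below are hypothetical.

```python
from math import prod


def kron_factor_elements(shape, triangular: bool = False) -> int:
    """Elements needed for one square factor per dimension of `shape`."""
    if triangular:
        # Packed upper triangle of a d x d matrix: d * (d + 1) / 2 entries.
        return sum(d * (d + 1) // 2 for d in shape)
    return sum(d * d for d in shape)


def estimate_bytes(shape, bytes_per_element: int = 4, triangular: bool = False) -> int:
    """Rough factor storage in bytes (4 bytes per element assumes FP32)."""
    return kron_factor_elements(shape, triangular) * bytes_per_element


shape = (1024, 4096)
print("parameter elements: ", prod(shape))
print("dense factors:      ", kron_factor_elements(shape))
print("triangular factors: ", kron_factor_elements(shape, triangular=True))
print("dense factors (MiB):", estimate_bytes(shape) / 2**20)
```

For this shape the dense factors hold about four times as many elements as the parameter itself, dominated by the larger dimension. Packing the triangle roughly halves that.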

PSGD

HeavyBall offers various configurations of PSGD:

  • "PSGDKron" is the baseline, equivalent to kron_torch, but with lower compute and memory overhead.
  • "PurePSGD" has no momentum, further reducing memory and compute.
  • "DelayedPSGD" implements SOAP/ADOPT-style off-by-one momentum, which has worse initial convergence but higher stability.

Utils

To access heavyball.utils, you need to explicitly import heavyball.utils.
It has several handy functions:

  • set_torch() sets pytorch optimization settings (TF32, opt_einsum, benchmark, ...)
  • compile_mode, a string passed as-is to torch.compile(mode=compile_mode) in all compiled heavyball calls; compile_mode=None disables torch.compile
  • zeroth_power_mode, a string selecting the method used to approximate the eigenvectors: QR, newtonschulz, svd, or eigh
