An implementation of the PSGD Kron optimizer in PyTorch.

Project description

PSGD Kron

For the original PSGD repo, see psgd_torch.

For the JAX version, see psgd_jax.

Implementation of the PSGD Kron optimizer in PyTorch. PSGD is a second-order optimizer originally created by Xi-Lin Li that uses either a Hessian-based or whitening-based (gg^T) preconditioner and Lie groups to improve training convergence, generalization, and efficiency. I highly suggest taking a look at the readme of Xi-Lin's PSGD repo linked above for interesting details on how PSGD works and experiments using PSGD. There are also paper resources listed near the bottom of this readme.

kron:

The most versatile and easy-to-use PSGD optimizer is kron, which uses a Kronecker-factored preconditioner. It has fewer hyperparameters that need tuning than Adam and can generally act as a drop-in replacement.

Installation

pip install kron-torch

Basic Usage (Kron)

By default, Kron schedules the preconditioner update probability to start at 1.0 and anneal to 0.03 over the beginning of training, so training will be slightly slower at first but speeds up by around 4k steps.

For basic usage, use the Kron optimizer like any other PyTorch optimizer:

from kron_torch import Kron

# params is any iterable of parameters, e.g. model.parameters()
optimizer = Kron(params)

# standard PyTorch training step
optimizer.zero_grad()
loss.backward()
optimizer.step()
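
As a fuller illustration, here is a minimal training loop sketch; the model, data, loss, and learning rate below are placeholders for illustration, not part of kron_torch:

import torch
from kron_torch import Kron

# Toy model and data purely for illustration.
model = torch.nn.Linear(128, 10)
optimizer = Kron(model.parameters(), lr=3e-4)  # lr value is illustrative

for step in range(1000):
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()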

Basic hyperparameters:

TLDR: Learning rate and weight decay act similarly to Adam's, so start with Adam-like settings and go from there, perhaps with a slightly lower learning rate (e.g. half). There is no b2 or epsilon.

These next 3 settings control whether a dimension's preconditioner is diagonal or triangular. For example, for a layer with shape (256, 128), triangular preconditioners would have shapes (256, 256) and (128, 128), and diagonal preconditioners would have shapes (256,) and (128,). Depending on how these settings are chosen, kron can balance memory/speed against effectiveness. The defaults lead to most preconditioners being triangular, except for 1-dimensional layers and very large dimensions. A usage sketch follows these three settings.

max_size_triangular: Any dimension with size above this value will have a diagonal preconditioner.

min_ndim_triangular: Any tensor with fewer than this number of dims will have all-diagonal preconditioners. Default is 2, so single-dim layers like bias and scale will use diagonal preconditioners.

memory_save_mode: Can be None, 'one_diag', or 'all_diag'. None is the default and lets all preconditioners be triangular. 'one_diag' sets the largest or last dim per layer as diagonal using np.argsort(shape)[::-1][0]. 'all_diag' sets all preconditioners to be diagonal.
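
As a hedged sketch, here is how these settings might be passed to the constructor (the parameter names come from the descriptions above, but the specific values shown are illustrative, not documented defaults, and model is assumed to be an existing torch module):

from kron_torch import Kron

# Illustrative settings: keep triangular preconditioners for most layers,
# but fall back to diagonal for very large dims and for 1-D params.
optimizer = Kron(
    model.parameters(),
    max_size_triangular=8192,   # dims larger than this get a diagonal preconditioner
    min_ndim_triangular=2,      # tensors with fewer dims get all-diagonal preconditioners
    memory_save_mode=None,      # None, 'one_diag', or 'all_diag'
)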

preconditioner_update_probability: Preconditioner update probability uses a schedule by default that works well for most cases. It anneals from 1 to 0.03 at the beginning of training, so training will be slightly slower at the start but will speed up by around 4k steps. PSGD generally benefits from more preconditioner updates at the start of training, but once the preconditioner is learned it's okay to do them less often. An easy way to adjust update frequency is to define your own schedule using the precond_update_prob_schedule function in kron.py (just changing the min_prob value is easiest) and pass this into kron through the preconditioner_update_probability hyperparameter.
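
For instance, here is a minimal sketch of raising the schedule's floor, assuming precond_update_prob_schedule is importable from kron_torch.kron and accepts a min_prob argument as described above:

from kron_torch import Kron
from kron_torch.kron import precond_update_prob_schedule

# Raise the floor of the schedule so the preconditioner keeps updating
# more often later in training (min_prob is assumed to be a keyword arg).
schedule = precond_update_prob_schedule(min_prob=0.1)

optimizer = Kron(model.parameters(), preconditioner_update_probability=schedule)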

This is the default schedule defined in the precond_update_prob_schedule function at the top of kron.py:

[Figure: default preconditioner update probability schedule]

Resources

PSGD papers and resources, listed from Xi-Lin's repo:

  1. Xi-Lin Li. Preconditioned stochastic gradient descent, arXiv:1512.04202, 2015. (General ideas of PSGD, preconditioner fitting losses and Kronecker product preconditioners.)
  2. Xi-Lin Li. Preconditioner on matrix Lie group for SGD, arXiv:1809.10232, 2018. (Focus on preconditioners with the affine Lie group.)
  3. Xi-Lin Li. Black box Lie group preconditioners for SGD, arXiv:2211.04422, 2022. (Mainly about the LRA preconditioner. See these supplementary materials for detailed math derivations.)
  4. Xi-Lin Li. Stochastic Hessian fittings on Lie groups, arXiv:2402.11858, 2024. (Some theoretical works on the efficiency of PSGD. The Hessian fitting problem is shown to be strongly convex on set ${\rm GL}(n, \mathbb{R})/R_{\rm polar}$.)
  5. Omead Pooladzandi, Xi-Lin Li. Curvature-informed SGD via general purpose Lie-group preconditioners, arXiv:2402.04553, 2024. (Plenty of benchmark results and analyses for PSGD vs. other optimizers.)

License

CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

2024 Evan Walters, Omead Pooladzandi, Xi-Lin Li

