A family of highly efficient, lightweight yet powerful optimizers.

Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.

🔥 What's New

in 2.1.x

  • Added Signum (SignSGD with momentum) as a new optimizer in the family (SignSGD_adv).
  • More info coming soon.

in 2.0.x

  • Added torch.compile support for all advanced optimizers. Enable it with compiled_optimizer=True to fuse and optimize the optimizer step.
  • Improved the 1-bit factored mode (enabled with nnmf_factor=True).
  • Various improvements across the optimizers.

in 1.2.x

  • Added advanced variants of the Muon optimizer with features and settings from recent papers. Documentation coming soon.

    Optimizer | Description
    Muon_adv | Advanced Muon implementation with features such as CANS, NorMuon, and low-rank orthogonalization.
    AdaMuon_adv | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization.

  • Implemented Cautious Weight Decay for all advanced optimizers.

  • Improved parameter update and weight decay for BF16 with stochastic rounding. The updates are now accumulated in float32 and rounded once at the end.

  • Use fused and in-place operations whenever possible for all advanced optimizers.

  • Prodigy variants are now 50% faster by avoiding CUDA syncs. Thanks to @dxqb!
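
The BF16 item above (accumulate the update in higher precision, then round once with stochastic rounding) can be illustrated with a small dependency-free sketch. This is not the library's code: the helper names are hypothetical, and Python's struct module stands in for tensor bit operations.

```python
import random
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round a float32 value to BF16 precision (its top 16 bits) stochastically."""
    bits = f32_bits(x)
    # Add uniform noise to the 16 bits that BF16 drops, then clear them.
    # The chance of rounding away from zero equals the dropped fraction,
    # so the result is unbiased in expectation.
    bits = (bits + rng.getrandbits(16)) & 0xFFFF0000
    return bits_f32(bits)


def apply_update(param, update, lr, weight_decay, rng):
    """Hypothetical step: combine decay and update first, round only once."""
    new_value = param * (1.0 - lr * weight_decay) - lr * update
    return stochastic_round_bf16(new_value, rng)
```

Near 1.0, BF16 values are spaced 2^-7 apart, so an update of 2^-9 is always lost with round-to-nearest. With the sketch above it survives on average: about one call in four lands on the next representable value.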


📦 Installation

pip install adv_optm

🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:

Memory-Efficient Optimization (SMMF-inspired)

  • Paper: SMMF: Square-Matricized Momentum Factorization
  • Approach: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor)
  • Innovation:
    • First moment split into 1-bit sign + absolute value
    • Final storage: four factored vectors + one 1-bit sign state
    • Preserves Adam-like update quality with drastically reduced memory
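
As a rough illustration of the cycle above, the following pure-Python sketch factors the absolute value of a first-moment matrix into row and column sums, keeps the signs as a separate boolean (1-bit) state, and runs one factor → reconstruct → update → factor step. All function names are hypothetical, and the row-sum/column-sum factorization is one common rank-1 scheme assumed here, not necessarily the exact one the library uses.

```python
def compress(matrix):
    """Split into a 1-bit sign state plus rank-1 factors of |matrix|."""
    signs = [[v >= 0.0 for v in row] for row in matrix]
    abs_rows = [[abs(v) for v in row] for row in matrix]
    row_sums = [sum(row) for row in abs_rows]
    col_sums = [sum(col) for col in zip(*abs_rows)]
    return row_sums, col_sums, signs


def decompress(row_sums, col_sums, signs):
    """Rebuild an approximation: |M|[i][j] ~ row[i] * col[j] / total."""
    total = sum(row_sums)
    if total == 0.0:
        return [[0.0 for _ in col_sums] for _ in row_sums]
    return [
        [(r * c / total) if s else -(r * c / total) for c, s in zip(col_sums, srow)]
        for r, srow in zip(row_sums, signs)
    ]


def momentum_step(state, grad, beta=0.9):
    """One cycle: reconstruct, apply the EMA update, factor again."""
    m = decompress(*state)
    m = [
        [beta * mv + (1.0 - beta) * gv for mv, gv in zip(mrow, grow)]
        for mrow, grow in zip(m, grad)
    ]
    return compress(m), m
```

For an n × m layer this stores n + m floats plus n · m sign bits instead of n · m floats. The second moment is non-negative, so it presumably needs only its two vectors and no sign state, which would account for the four vectors and single sign state listed above.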

⚡ Performance Characteristics

Memory Efficiency (SDXL Model – 6.5GB)

Optimizer | Memory Usage | Description
Adopt_Factored | 328 MB | 4 small vectors + 1-bit state
Adopt_Factored + AdEMAMix | 625 MB | 6 small vectors + two 1-bit states

Speed Comparison (SDXL, Batch Size 4)

Optimizer | Speed | Notes
Adafactor | ~8.5s/it | Baseline
Adopt_Factored | ~10s/it | +18% overhead from compression
Adopt_Factored + AdEMAMix | ~12s/it | +41% overhead (3 factored states)
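
The overhead percentages in the speed table follow directly from the seconds-per-iteration figures relative to the Adafactor baseline. A small helper (hypothetical name) makes the arithmetic explicit:

```python
def overhead_percent(baseline_s_per_it: float, other_s_per_it: float) -> float:
    """Extra time per iteration relative to the baseline, in percent."""
    return (other_s_per_it / baseline_s_per_it - 1.0) * 100.0


print(round(overhead_percent(8.5, 10.0)))  # Adopt_Factored vs. Adafactor
print(round(overhead_percent(8.5, 12.0)))  # Adopt_Factored + AdEMAMix vs. Adafactor
```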

🧪 Available Optimizers

Standard Optimizers (All support factored=True/False)

Optimizer | Description | Best For
Adam_Adv | Advanced Adam implementation | General purpose
Adopt_Adv | Adam-variant with independent beta2 | Stable training for small batch size regimes
Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning
Lion_Adv | Advanced Lion implementation | Memory-constrained environments
Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning

⚙️ Feature Matrix

Feature Adam_Adv Adopt_Adv Prodigy_Adv Lion_Adv
Factored ✓ ✓
OrthoGrad
atan2
Stochastic Rounding
Fused Backward Pass
Kourkoutas-β

🛠️ Comprehensive Feature Guide

A. Universal Safe Features

These features work with all optimizers and are generally safe to enable.

Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility
Fused Backward Pass | Fuses the optimizer step into the backward pass; gradients are used immediately and their memory is freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers
Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers
OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | Grokking at Edge | All optimizers
Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers
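
To make the OrthoGrad row above concrete, here is a minimal sketch of removing the gradient component that is parallel to the weights. It works on flat Python lists for clarity; the function name and the eps guard are assumptions, and the library's tensor implementation may differ in detail.

```python
def orthogonalize_gradient(weights, grad, eps=1e-30):
    """Project grad onto the subspace orthogonal to weights.

    g_perp = g - (<w, g> / (<w, w> + eps)) * w
    """
    w_dot_g = sum(w * g for w, g in zip(weights, grad))
    w_dot_w = sum(w * w for w in weights)
    scale = w_dot_g / (w_dot_w + eps)
    return [g - scale * w for w, g in zip(weights, grad)]


w = [1.0, 2.0]
g = [0.5, -1.5]
g_perp = orthogonalize_gradient(w, g)
# The projected gradient no longer changes the weight norm to first order.
print(g_perp, sum(wi * gi for wi, gi in zip(w, g_perp)))
```

The extra dot products per parameter tensor are the likely source of the time overhead quoted in the table.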

B. Individual Features

Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility
atan2 | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (recommended for Adopt, which needs it) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy
Kourkoutas-β | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | Kourkoutas-β | Adam/Adopt/Prodigy

🔍 Feature Deep Dives

atan2

  • Replaces eps in Adam-family optimizers with a scale-invariant, bounded update rule.
  • Automatically clips updates to [-2, 2], preventing destabilizing jumps.
  • Highly recommended for Adopt_Adv, which is prone to instability without clipping.
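
The contrast between the eps-based rule and the atan2 rule can be sketched for a single scalar as follows. The scale constant 4/π is an assumption chosen so that the bound matches the [-2, 2] range stated above; the library's actual constants are not given in this description.

```python
import math

# Assumed scale: atan2 with a non-negative second argument lies in
# [-pi/2, pi/2], so 4/pi maps the update range to [-2, 2].
ATAN2_SCALE = 4.0 / math.pi


def adam_eps_update(m: float, v: float, eps: float = 1e-8) -> float:
    """Classic Adam direction: m / (sqrt(v) + eps)."""
    return m / (math.sqrt(v) + eps)


def adam_atan2_update(m: float, v: float, scale: float = ATAN2_SCALE) -> float:
    """atan2 direction: bounded, and unchanged if gradients are rescaled."""
    return scale * math.atan2(m, math.sqrt(v))


# With tiny gradients, eps dominates the classic rule and shrinks the step,
# while the atan2 rule depends only on the ratio m / sqrt(v).
print(adam_eps_update(1e-12, 1e-24), adam_atan2_update(1e-12, 1e-24))
```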

📚 Reference: Scaling Exponents Across Parameterizations and Optimizers


Kourkoutas-β

Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv.

Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:

  • During gradient bursts → β₂ decreases toward the lower β₂ bound → faster reaction
  • During calm phases → β₂ rises toward the selected β₂ → stronger smoothing

This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
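
One plausible form of this modulation is sketched below for a single layer. The exact ratio the library computes is not spelled out here, so treat the formula as an assumption: the function name, the EMA decay, the lower bound of 0.88, and the use of a running gradient-norm average are all illustrative choices.

```python
def kourkoutas_beta2(grad_norm, ema_norm, beta2_max=0.999, beta2_min=0.88,
                     ema_decay=0.9, tiny=1e-12):
    """Return (beta2, updated_ema_norm) for one layer at one step.

    beta2_max plays the role of the selected beta2, beta2_min the lower bound.
    """
    # Track a running average of this layer's gradient norm.
    ema_norm = ema_decay * ema_norm + (1.0 - ema_decay) * grad_norm
    # Compare the current norm to the running average and squash into [0, 1).
    raw = grad_norm / (ema_norm + tiny)
    sunspike = raw / (1.0 + raw)
    # Large spikes pull beta2 toward beta2_min; quiet steps leave it near beta2_max.
    beta2 = beta2_max - (beta2_max - beta2_min) * sunspike
    return beta2, ema_norm


burst_beta2, _ = kourkoutas_beta2(grad_norm=100.0, ema_norm=1.0)
calm_beta2, _ = kourkoutas_beta2(grad_norm=0.01, ema_norm=1.0)
print(burst_beta2, calm_beta2)  # the burst value is the lower of the two
```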

Pros/Cons

Pros:
  • Layer-wise adaptation blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction).
  • Robust to sudden loss landscape shifts: reacts quickly during gradient bursts and smooths during calm phases.
  • High tolerance to aggressive learning rates.

⚠️ Cons:
  • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β Warmup Steps.

💡 Best Practice: Set K_warmup_steps equal to your standard LR warmup steps. During warmup, the optimizer uses the static beta2; adaptation begins only after warmup ends.
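
The warmup gating described above can be pictured as a simple switch that shares its step count with the LR schedule. This is a sketch under assumptions: the helper names are hypothetical, the linear LR warmup is just an example schedule, and whether the boundary step itself counts as warmup is a guess.

```python
def linear_lr_warmup(step: int, base_lr: float, warmup_steps: int) -> float:
    """Example LR schedule: ramp linearly to base_lr over warmup_steps."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def select_beta2(step: int, k_warmup_steps: int,
                 static_beta2: float, adaptive_beta2: float) -> float:
    """Use the static beta2 during warmup, the adaptive value afterwards."""
    if step < k_warmup_steps:
        return static_beta2
    return adaptive_beta2


WARMUP = 100  # same value for the LR warmup and K_warmup_steps
for step in (0, 99, 100):
    lr = linear_lr_warmup(step, base_lr=1e-3, warmup_steps=WARMUP)
    beta2 = select_beta2(step, WARMUP, static_beta2=0.999, adaptive_beta2=0.93)
    print(step, lr, beta2)
```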

📚 Reference: Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair


📚 References

  1. Revisiting BFloat16 Training
  2. SMMF: Square-Matricized Momentum Factorization
  3. Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair
  4. Scaling Exponents Across Parameterizations and Optimizers
