8-bit Adafactor Optimizer with Fused CUDA Kernels

8-bit Adafactor with Fused CUDA Kernels

An 8-bit Adafactor optimizer with fused CUDA kernels and log-space block-wise quantization. It is designed to further reduce optimizer state memory while keeping step overhead low and training stable, and is suited to large models such as LLMs and diffusion models.

Key Features

Log-Space Quantization: Maps the second moment (variance) to the log2 space before 8-bit quantization. This approach accommodates the long-tail distribution of variances, reducing the risk of small second-moment estimates being truncated to zero and improving overall training stability.
Fused CUDA Kernels: Combines dequantization, EMA updates, Warp-Shuffle reductions, and requantization into single kernels. It utilizes float4 vectorization to optimize memory bandwidth usage.
Zero CPU-GPU Sync: Eliminates implicit synchronizations (e.g., D2H copies) in the control flow, ensuring the GPU computation pipeline runs without blocking.
Cross-Platform JIT: Uses Just-In-Time (JIT) compilation for straightforward setup across both Windows and Linux environments.
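
To make the log-space idea concrete, here is a minimal plain-Python sketch of block-wise quantization in log2 space, compared against a linear 8-bit mapping. The function names, the per-block (min, scale) layout, and the eps floor are illustrative assumptions; the package's actual CUDA kernels and storage format may differ.

```python
import math

def quantize_log2(block, eps=1e-30):
    """Map non-negative values to uint8 codes in log2 space (one block)."""
    # eps is an assumed floor so that log2(0) is never evaluated.
    logs = [math.log2(max(v, eps)) for v in block]
    lo, hi = min(logs), max(logs)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [min(255, max(0, round((x - lo) / scale))) for x in logs]
    return codes, lo, scale

def dequantize_log2(codes, lo, scale):
    """Invert quantize_log2 using the block's stored offset and scale."""
    return [2.0 ** (lo + c * scale) for c in codes]

def quantize_linear(block):
    """Reference: plain linear 8-bit quantization against the block maximum."""
    hi = max(block)
    scale = hi / 255 if hi > 0 else 1.0
    return [round(v / scale) for v in block], scale

# Second-moment values spanning many orders of magnitude.
block = [1e-12, 3e-9, 2e-6, 5e-4, 1e-1]

codes, lo, scale = quantize_log2(block)
restored = dequantize_log2(codes, lo, scale)
rel_errors = [abs(r - v) / v for r, v in zip(restored, block)]

lin_codes, lin_scale = quantize_linear(block)

print("log2 codes:  ", codes)
print("linear codes:", lin_codes)  # the small entries collapse to code 0
print("max relative error (log2):", max(rel_errors))
```

In this toy block the linear mapping sends the three smallest values to code 0, while the log2 mapping keeps every value within a few percent of the original.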

Performance

Memory Footprint: Due to Adafactor's factorized second-moment estimation and 8-bit quantization, the optimizer state memory usage is generally lower than that of AdamW8Bit.
Training Speed: The fused kernel design and reduced synchronization overhead allow it to achieve step times comparable to other mainstream 8-bit optimizers.
Quantization Precision: The second moment (variance) in Adafactor is strictly non-negative and spans multiple orders of magnitude. By mapping it to UINT8 in log2 space rather than linear space, the optimizer preserves relative precision for small variances, mitigating the instability often caused by outlier gradients in standard 8-bit quantization.
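
As a rough illustration of the memory argument, the sketch below counts second-moment state bytes for a single 2D weight matrix. It assumes one byte per stored value, ignores per-block scale metadata, and assumes the Adam-style optimizer keeps two full-size moments; none of these numbers are measurements of this package.

```python
def adafactor_factored_bytes(rows, cols, bytes_per_value=1):
    """Factorized second moment: one row vector plus one column vector."""
    return (rows + cols) * bytes_per_value

def adam_8bit_bytes(rows, cols, bytes_per_value=1):
    """Adam-style state: first and second moment, one value per parameter each."""
    return 2 * rows * cols * bytes_per_value

rows, cols = 4096, 4096
factored = adafactor_factored_bytes(rows, cols)
full = adam_8bit_bytes(rows, cols)

print(f"factored second moment: {factored} bytes")
print(f"two full 8-bit moments: {full} bytes")
```

Real savings depend on which states the optimizer keeps and on how many parameters are left unfactored (1D tensors, for example).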

Installation

This project uses JIT (Just-In-Time) compilation.

Please ensure torch and ninja are installed, and that the CUDA Toolkit compiler (nvcc) and a host C++ compiler (such as MSVC or GCC) are available in your environment.

If CUDA compilation fails, the optimizer will automatically fall back to the pure PyTorch implementation.
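
Before the first run, it can help to check which pieces of the toolchain are visible. The helper below is a hypothetical convenience script, not part of the package. It only inspects the PATH and importable modules, so a CUDA Toolkit that is installed but not on the PATH will show as missing.

```python
import importlib.util
import shutil

def check_jit_toolchain():
    """Report which tools needed for JIT-compiling a CUDA extension are visible."""
    host_compilers = ("cl", "g++", "gcc", "clang++")
    return {
        "torch": importlib.util.find_spec("torch") is not None,
        "ninja": shutil.which("ninja") is not None
        or importlib.util.find_spec("ninja") is not None,
        "nvcc": shutil.which("nvcc") is not None,
        "host_compiler": any(shutil.which(c) is not None for c in host_compilers),
    }

status = check_jit_toolchain()
for name, found in status.items():
    print(f"{name:14s} {'found' if found else 'missing'}")
```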

From PyPI

pip install -U adafactor8bit

From Source

pip install git+https://github.com/yanfeiwong/adafactor-8bit.git

Note: The first time you instantiate the optimizer (or run the example script), it automatically triggers JIT compilation of the CUDA source code. This may take anywhere from a few seconds to a couple of minutes depending on your system, and the terminal might appear unresponsive while it runs. Once compiled, the binary is cached, so subsequent runs skip compilation.

Usage Example

It is recommended to use param_groups to keep sensitive layers (Embedding, Norm, Bias) in FP32, enabling 8-bit quantization only for large 2D weight matrices.

import torch
import torch.nn as nn
from adafactor8bit import Adafactor8Bit

def get_param_groups(model, weight_decay=1e-2):
    """Split parameters into a quantized group and a protected FP32 group."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Protect 1D tensors, biases, norms, and embeddings
        # (case-insensitive match, so names like "LayerNorm" are caught too)
        lname = name.lower()
        if param.ndim <= 1 or any(key in lname for key in ("bias", "norm", "embed")):
            no_decay.append(param)
        else:
            decay.append(param)

    return [
        {"params": decay, "weight_decay": weight_decay, "quantize": True},
        {"params": no_decay, "weight_decay": 0.0, "quantize": False},
    ]

model = MyModel().cuda()  # MyModel is a placeholder for your own nn.Module
optimizer = Adafactor8Bit(
    get_param_groups(model),
    lr=1e-3,
    # For continual learning with an external scheduler
    relative_step=False,     # Disable internal LR scheduling
    beta2=0.999,             # Lock EMA window to prevent "blunting" over steps
)

# Training loop...

For a complete example, please refer to basic_usage.py.

Advanced Configuration

Continual Learning (`beta2` & `relative_step`)

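By default, Adafactor's second-moment decay rate follows a step-dependent schedule (in the original paper, beta2 at step t is 1 - t^(-0.8)), so the averaging window keeps growing as training proceeds, and the internal learning rate schedule (relative_step) scales the learning rate with the step count.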

For endless fine-tuning or lifelong learning, this often leads to overly small learning rates and "blunted" second-moment estimates. To avoid these issues and keep the optimizer responsive:

Set relative_step=False to disable the built-in LR schedule (allowing you to use an external scheduler).
Set beta2=0.999 to lock the EMA window (similar to Adam).
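
The effect of these two settings can be seen with a few lines of arithmetic. The sketch below uses the schedules from the original Adafactor paper (beta2 at step t equal to 1 - t^(-0.8), relative step size min(1e-2, 1/sqrt(t))); whether this package uses exactly these constants is an assumption, and the helper names are illustrative.

```python
import math

def beta2_schedule(step, decay_exponent=0.8):
    """Step-dependent second-moment decay rate from the Adafactor paper."""
    return 1.0 - step ** (-decay_exponent)

def relative_step_size(step):
    """Relative step size from the Adafactor paper."""
    return min(1e-2, 1.0 / math.sqrt(step))

def effective_window(beta2):
    """Approximate number of steps an EMA with this decay rate averages over."""
    return 1.0 / (1.0 - beta2)

for step in (10, 1_000, 100_000, 10_000_000):
    b2 = beta2_schedule(step)
    print(
        f"step={step:>10,d}  window={effective_window(b2):>10.0f}  "
        f"relative_step={relative_step_size(step):.2e}"
    )

print(f"fixed beta2=0.999 -> window={effective_window(0.999):.0f}")
```

With the default schedule the averaging window grows without bound and the relative step keeps shrinking, whereas a fixed beta2=0.999 holds the window at roughly 1000 steps.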

Decoupled Weight Decay (`scale_weight_decay=False`)

By default, Adafactor's weight decay is coupled with the parameter's RMS scale.

If you prefer the AdamW-style decoupled weight decay, set scale_weight_decay=False.

No-Compiler Environments (`use_cuda_kernel=False`)

If you are in an environment without a CUDA compiler and want to bypass JIT compilation entirely:

Set use_cuda_kernel=False to fall back to the pure PyTorch implementation.

Learning Rate Guide for Beginners

If you are migrating from optimizers like AdamW, Adafactor's learning rate behavior might feel a bit different. This is mainly due to the scale_parameter option.

scale_parameter=True (default): Because of RMS scaling, a very small lr (e.g., 1e-5) often leads to extremely slow progress. Start with lr=1e-3 and adjust in the range 1e-4 to 5e-3 if needed.
scale_parameter=False: Disables RMS scaling, making the update scale more similar to AdamW. Use the learning rates you're familiar with for AdamW and tune as usual. (Note: the second moment is still factorized, so behavior is not identical.)

These are safe starting points; always validate on your own task and batch size.
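
To see why the two modes call for different learning rates, the sketch below follows the original Adafactor paper, where the step size is multiplied by max(eps2, RMS(parameter)) with eps2 = 1e-3. The function names and the eps2 default are assumptions for illustration; the package's internals may differ.

```python
import math

def rms(values):
    """Root mean square of a flat list of weights."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def effective_step_scale(lr, weights, scale_parameter=True, eps2=1e-3):
    """Step-size multiplier applied to the normalized update."""
    if scale_parameter:
        return lr * max(eps2, rms(weights))
    return lr

# A weight tensor with typical small-initialization magnitude.
weights = [0.02, -0.02, 0.02, -0.02]

print(effective_step_scale(1e-3, weights, scale_parameter=True))
print(effective_step_scale(1e-5, weights, scale_parameter=True))
print(effective_step_scale(1e-5, weights, scale_parameter=False))
```

With weights of RMS 0.02, lr=1e-5 under scale_parameter=True gives a step scale 50 times smaller than the same lr with scaling disabled, which is why an AdamW-sized learning rate tends to stall in the default mode.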

Acknowledgements

Thanks to Noam Shazeer and Mitchell Stern for proposing the original Adafactor algorithm in the paper Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.

Thanks to Tim Dettmers for the inspiration from the paper 8-bit Optimizers via Block-wise Quantization and the bitsandbytes library.

Thanks to the PyTorch team for providing the foundational Optimizer implementation and the C++ Extension toolchain.

Thanks to the large language models Qwen and DeepSeek for valuable technical discussions and code reviews on CUDA low-level optimization, memory safety mechanisms, and cross-platform compilation pipeline design.

License

The project is released under the MIT License.

Release history

0.1.8: Jun 13, 2026
0.1.7: Jun 9, 2026
0.1.6: Jun 9, 2026 (this version)
0.1.5: Jun 9, 2026
0.1.4: Jun 8, 2026
0.1.3: Jun 8, 2026
0.1.2: Jun 7, 2026
0.1.1: Jun 7, 2026 (yanked)
0.1.0: Jun 7, 2026

Reason 0.1.1 was yanked: Refined N-D tensor factorization logic to strictly align with the original Adafactor paper for 3D/4D layers (e.g., Conv2d, MoE). v0.1.1 may exhibit suboptimal training dynamics for these layers. Please upgrade to adafactor8bit>=0.1.2 for improved stability.

Download files

Source Distribution: adafactor8bit-0.1.6.tar.gz (17.4 kB), uploaded Jun 9, 2026

Built Distribution: adafactor8bit-0.1.6-py3-none-any.whl (14.6 kB), uploaded Jun 9, 2026, Python 3

File details

Details for the file adafactor8bit-0.1.6.tar.gz.

File metadata

Download URL: adafactor8bit-0.1.6.tar.gz
Upload date: Jun 9, 2026
Size: 17.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for adafactor8bit-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`83275394b23108812d9701a2e61f8b50ddb8b1a514a98591cbf6e1c03a30f832`
MD5	`fb0aa3afc5a0d450d1b5b5991517c004`
BLAKE2b-256	`880aa166feda84bc7cc185e5958c810ec32cf14c7783c0628d090d22817beebb`

File details

Details for the file adafactor8bit-0.1.6-py3-none-any.whl.

File metadata

Download URL: adafactor8bit-0.1.6-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 14.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for adafactor8bit-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed2ed3d9287e4fce45c8256c9daacd11ec1ef93266cbe958ccafe21722703ac2`
MD5	`d7b42957717e451cfdf577b5ee6aa241`
BLAKE2b-256	`cd6ba2557a19a37625ec12a50fde79a27dcc0405cab1babc5751cf1678e69b48`
