
ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Official PyTorch Lightning implementation of our paper:

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra

BluOrion

Paper

---

🚀 Installation

You can install this package using pip:

Basic Installation

pip install git+https://github.com/bluorion-com/ZClip.git

With PyTorch Lightning Support

pip install "git+https://github.com/bluorion-com/ZClip.git#egg=zclip[lightning]"

🧠 Algorithm Overview

ZClip is an adaptive gradient clipping technique designed to mitigate gradient spikes by tracking running statistics of gradient norms through Exponential Moving Averages (EMA). At each training step, it updates the mean and variance of the gradient norm without storing historical data, allowing it to respond quickly to shifts in training dynamics.

When the current gradient norm deviates significantly from recent trends, ZClip dynamically computes a clipping threshold from the observed variance. This automatically suppresses unusually large gradient updates, a common cause of loss spikes, without relying on fixed, manually tuned thresholds.

By continuously adapting to the evolving scale and variability of gradients, ZClip ensures greater training stability and maintains learning efficiency, even under high learning rates or aggressive scheduling.
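The statistic update described above can be sketched in a few lines of plain Python. This is a minimal illustration of the z-score formulation, not the library's internal API: the function name is made up, and the reciprocal z_thresh² / z scaling rule should be treated as an assumption about the adaptive threshold.

```python
def zclip_threshold(grad_norm, mean, var, alpha=0.97, z_thresh=2.5, eps=1e-12):
    """Sketch of one ZClip step: returns (threshold, new_mean, new_var)."""
    std = var ** 0.5
    z = (grad_norm - mean) / (std + eps)  # how anomalous is this step?

    if z > z_thresh:
        # Spike detected: the adaptive threshold tightens as the spike
        # grows, pulling extreme norms back toward the EMA mean.
        threshold = mean + (z_thresh ** 2 / z) * std
    else:
        threshold = grad_norm  # within normal range, no clipping

    # Update the EMA statistics with the clipped norm, so a single
    # spike cannot contaminate the running estimates.
    clipped = min(grad_norm, threshold)
    new_mean = alpha * mean + (1 - alpha) * clipped
    new_var = alpha * var + (1 - alpha) * (clipped - new_mean) ** 2
    return threshold, new_mean, new_var
```

For example, with an EMA mean of 1.0 and variance of 0.01 (std 0.1), a gradient norm of 2.0 has a z-score of about 10 and is clipped near 1.06, while a norm of 1.05 passes through untouched.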

📚 Usage

Basic Usage

import torch

from zclip import ZClip

model = YourModel()  # Your PyTorch model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Initialize ZClip
zclip = ZClip(alpha=0.97, z_thresh=2.5)

# Training loop
for batch in dataloader:
    # Forward and backward pass
    loss = model(batch)
    loss.backward()
    
    # Apply ZClip before optimizer step
    zclip.step(model)
    
    # Update weights
    optimizer.step()
    optimizer.zero_grad()

PyTorch Lightning (with optional dependency)

from lightning import Trainer
from zclip import ZClipLightningCallback

# Create a Lightning Trainer with ZClip
trainer = Trainer(
    callbacks=[
        ZClipLightningCallback(alpha=0.97, z_thresh=2.5)
    ]
)

# Train your model
trainer.fit(model, dataloader)

📉 Example Impact

(Figures omitted: training loss, and gradient norm after clipping.)

⚙️ Implementation Details

Our code is built within the PyTorch Lightning framework, utilizing its callback system for seamless integration into the training pipeline. It is fully compatible with FSDP and requires no code changes to work out of the box.

You can also use ZClip directly with standard PyTorch by calling .step(model) after loss.backward() and before optimizer.step().


🔬 Testing & Development

ZClip comes with a comprehensive test suite to ensure reliability and correctness.

Running Tests

./run_tests.sh

Continuous Integration

We use CircleCI for continuous integration; it runs the test suite on every commit and pull request.

🧪 Usage (All Options)

PyTorch

from zclip import ZClip

zclip = ZClip(
    mode="zscore",
    alpha=0.97,
    z_thresh=2.5,
    clip_option="adaptive_scaling",
    max_grad_norm=1.0,
    clip_factor=1.0,
)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    zclip.step(model)
    optimizer.step()

PyTorch Lightning

import lightning as L

from zclip import ZClipLightningCallback

zclip_cb = ZClipLightningCallback(
    mode="zscore",
    alpha=0.97,
    z_thresh=2.5,
    clip_option="adaptive_scaling",
    max_grad_norm=1.0,
    clip_factor=1.0,
)

trainer = L.Trainer(
    callbacks=[zclip_cb]
)

🔍 ZClip Parameters

  • mode: Clipping mode. Options: "zscore" (z-score based clipping) and "percentile" (fixed threshold clipping, defined as the EMA mean plus z_thresh × std). Default: "zscore".

  • z_thresh: Threshold value. In "zscore" mode it sets the z-score threshold; in "percentile" mode it is the multiplier for the std. Default: 2.5.

  • alpha: EMA smoothing factor for updating the gradient norm statistics. Default: 0.97.

  • clip_option: (only for "zscore" mode) Clipping strategy. "adaptive_scaling" computes an adaptive threshold when the z-score is high; "mean" clips to the EMA mean. Default: "adaptive_scaling".

  • clip_factor: Constant multiplier for the adaptive scaling threshold. A value between 0.5 and 0.9 yields more aggressive clipping, while a higher value (default 1.0) is less aggressive. Default: 1.0.

  • max_grad_norm: Optional maximum gradient norm that caps the clipping threshold. Default: 1.0.

  • warmup_steps: Number of steps over which gradient norms are collected to initialize the EMA statistics. Default: 25.
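As a quick numeric illustration of how the two modes derive a threshold from the same EMA statistics (plain arithmetic following the descriptions above, not library internals; the reciprocal z_thresh² / z scaling rule for "adaptive_scaling" is an assumption):

```python
mean, std = 1.0, 0.1
z_thresh = 2.5
grad_norm = 2.0  # a spike: z-score of 10

# "percentile" mode: fixed threshold, independent of how large the spike is
percentile_threshold = mean + z_thresh * std            # 1.0 + 0.25 = 1.25

# "zscore" mode with clip_option="adaptive_scaling": the threshold
# tightens as the z-score grows
z = (grad_norm - mean) / std
adaptive_threshold = mean + (z_thresh ** 2 / z) * std   # 1.0 + 0.0625 = 1.0625
```

Note that the larger the spike, the further the adaptive threshold drops below the fixed percentile threshold.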

Aggressive Hyperparameter Settings

When training models with volatile gradients, noisy data, or when using curriculum learning strategies, more aggressive gradient clipping can be beneficial. In such scenarios, consider adjusting the following parameters:

  • alpha:
    The alpha parameter controls the smoothing of the EMA for gradient norm statistics. A lower value (e.g. around 0.90-0.95) makes the EMA more responsive to recent gradients, which can be beneficial for rapidly changing gradient distributions. However, setting it too low might introduce noise into the EMA estimate, so it must be balanced carefully.

  • z_thresh:
    Reducing z_thresh slightly (for example, from the default 2.5 to around 2.0) tightens the clipping criterion, so more gradient norms are treated as spikes.

  • clip_factor:
    Lowering the clip_factor to a value between 0.5 and 0.9 will reduce the adaptive threshold in the "adaptive_scaling" mode, resulting in more aggressive clipping. This can help stabilize training by curbing large gradient spikes.

These settings are particularly useful in scenarios where the gradient distribution is highly dynamic. Adjust and monitor these hyperparameters based on your specific model, dataset, and training dynamics to achieve optimal performance.
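Putting the three adjustments together, an aggressive configuration might look like the following sketch. The parameter names come from the table above; the specific values are starting points to tune, not recommendations.

```python
from zclip import ZClip

# More reactive EMA, tighter spike criterion, stronger clipping
zclip = ZClip(
    mode="zscore",
    alpha=0.93,                       # more responsive to recent norms
    z_thresh=2.0,                     # flags spikes earlier than the default 2.5
    clip_option="adaptive_scaling",
    clip_factor=0.7,                  # lowers the adaptive threshold
    max_grad_norm=1.0,
)
```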

Citation

@misc{kumar2025zclipadaptivespikemitigation,
      title={ZClip: Adaptive Spike Mitigation for LLM Pre-Training}, 
      author={Abhay Kumar and Louis Owen and Nilabhra Roy Chowdhury and Fabian Güra},
      year={2025},
      eprint={2504.02507},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.02507}, 
}

📜 License

Apache-2.0 license
