AION-Torch
[WARNING] Alpha Version: This library is currently in alpha. APIs may change without notice. Use at your own risk.
Adaptive Input/Output Normalization for deep neural networks. AION dynamically adjusts residual connection scaling for stable training of extremely deep networks.
What is AION?
AION (Adaptive Input/Output Normalization) is an adaptive residual scaling layer
that keeps the energy of residual branches in balance. Instead of using a fixed
scale for x + y, AION dynamically adjusts α in x + α·y based on the input
and output energies. This stabilizes very deep networks (hundreds of layers)
and improves convergence without manual tuning.
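To make the idea concrete, here is a minimal sketch of an energy-balanced residual. The update rule and the function name are illustrative assumptions for exposition, not the library's exact formula:

import torch

def adaptive_residual(x, y, alpha0=0.1, eps=1e-6):
    # "Energy" of each branch, measured as the mean squared activation
    ex = x.pow(2).mean()
    ey = y.pow(2).mean()
    # Scale the residual branch so its energy stays in balance with the input's
    alpha = alpha0 * torch.sqrt(ex / (ey + eps))
    return x + alpha * y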
The Proof
Crash Test Results (600-layer Transformer, GPU)
AION demonstrates superior numerical stability and faster convergence.
600-layer transformer test on GPU: Both models completed all 150 training steps successfully. AION Transformer achieved significantly lower loss (0.0011 ± 0.0003) and more stable gradients compared to Standard Transformer (0.0075 ± 0.0015).
Benchmark Methodology:
Both models use a Pre-LayerNorm architecture (normalization applied before each sublayer), the standard practice in modern transformers (e.g., GPT-2 and most recent LLMs). Pre-LayerNorm already lets standard transformers train at great depth by normalizing activations before each transformation, which helps maintain stable gradient flow. We tested 600 layers to demonstrate AION's advantages at extreme depth while ensuring both models complete the full training run without memory constraints. The comparison is therefore fair: both models use the same modern best practices, and AION still demonstrates superior stability and convergence speed at these extreme depths.
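For reference, the baseline's Pre-LayerNorm feedforward sublayer looks roughly like this (illustrative sketch, not the exact benchmark code):

import torch.nn as nn

class PreLNBlock(nn.Module):
    # Baseline sublayer: LayerNorm before the transform, plain residual after it
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
    def forward(self, x):
        return x + self.ffn(self.norm(x))  # standard residual: x + y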
Key Findings:
- Standard Transformer: Completed all 150 steps, final loss: 0.0075 ± 0.0015, crash rate: 0%
- AION Transformer: Completed all 150 steps, final loss: 0.0011 ± 0.0003, crash rate: 0%
- Gradient Stability: AION maintained more stable and lower gradient norms (0.0135 ± 0.0033) vs Standard (0.0665 ± 0.0116)
- Training Efficiency: AION achieved ~7x lower final loss, demonstrating significantly faster convergence
These results suggest that AION can improve numerical stability and convergence speed at extreme depths (600 layers), even on top of modern Pre-LayerNorm architectures.
Installation
Install from PyPI:
pip install aion-torch
Or install in development mode with dev dependencies:
pip install -e ".[dev]"
Quick Start
import torch
from aion_torch import AionResidual
# Create AION layer
layer = AionResidual(alpha0=0.1, beta=0.05)
# Use in residual connection
x = torch.randn(8, 128, 512) # [batch, seq, features]
y = torch.randn(8, 128, 512) # Output from FFN/attention
out = layer(x, y) # Adaptive residual: x + α·y
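In a Pre-LayerNorm transformer, AionResidual typically replaces the plain x + y at each sublayer. The block below is an illustrative sketch; the surrounding module structure is an assumption, and only AionResidual and its alpha0/beta arguments come from the Quick Start above:

import torch
import torch.nn as nn
from aion_torch import AionResidual

class AionFFNBlock(nn.Module):
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.residual = AionResidual(alpha0=0.1, beta=0.05)
    def forward(self, x):
        y = self.ffn(self.norm(x))  # Pre-LN: normalize before the transform
        return self.residual(x, y)  # adaptive residual: x + α·y

block = AionFFNBlock()
out = block(torch.randn(8, 128, 512))  # [batch, seq, features]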
Overhead Benchmark Results (GPU)
AION adds ~36% computational overhead per training step.
Benchmark configuration: 4-layer transformer, batch size 8, sequence length 128, dimension 512. Results averaged over 150 training steps (after 20 warmup steps).
Performance Metrics (Unoptimized Baseline):
- Standard Residual: 9.79 ms/step (102.11 steps/sec)
- AION Residual: 13.36 ms/step (74.84 steps/sec)
- Overhead: +36.44% per training step
The overhead comes from AION's adaptive scaling calculations, which provide the stability benefits shown in the crash test.
There are several ways to reduce this cost in practice:
- k-update optimization: use k_update > 1 to update α less frequently (e.g., k_update=4 reduces the AION-specific computation by ~75%); see the sketch after this list.
- Engineering optimizations: fusing operations, reusing statistics, or using lower precision for energy tracking. With careful optimization, we expect the overhead to drop below ~5% in production setups.
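For example, assuming k_update is exposed as a constructor argument (illustrative; check the current API for the exact name and placement):

from aion_torch import AionResidual

# Update α only every 4th step to amortize the adaptive-scaling cost
layer = AionResidual(alpha0=0.1, beta=0.05, k_update=4)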
Features
- Adaptive scaling: Automatically adjusts to network dynamics
- Training stability: Mitigates gradient explosion and vanishing
- Deep network support: Validated on transformers up to 600 layers
- Faster convergence: Achieves lower loss faster than standard residuals
- PyTorch 2.0+: Fully compatible with modern PyTorch
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Format code
make format
# Run linting
make lint
# Run tests
make test
# Install pre-commit hooks
make pre-commit-install
License
MIT License - see LICENSE file for details.
Note: This is an Alpha version. APIs may change without notice. Use at your own risk.
Author
Abbasagha Babayev