Skip to main content

Ultra-Fused Transformer with SDLA, MX Quantization, and FQT

Project description

Ultra-Fused Transformer v6.1 — SDLA with DeepSeek-MLA Compression

High-performance transformer library featuring Selective Differential Linear Attention (SDLA) with DeepSeek-MLA style Low-rank Compression and YaRN Long-Context Extension.

Key Innovations v6.1

1. DeepSeek-MLA Style Low-rank Compression

  • Latent Projection: d_model → d_model // compression_ratio before attention
  • Decoupled RoPE: Separate content and positional projections (like DeepSeek-V3)
  • Full QKV Compression: Not just KV, but entire attention space compressed
  • Memory Savings: 4-16x reduction vs standard attention

2. YaRN / SuFT Long Context Extension

  • 4-8x context extension without retraining
  • NTK-by-parts interpolation with temperature scaling
  • Supports sequences up to 8x training length at inference
  • Based on: YaRN: Efficient Context Window Extension

3. Learnable Lambda + RMSNorm Stabilization

  • Per-head learnable λ: Each head learns its own denoising strength
  • Layer-scale λ: Global multiplier across all heads
  • RMSNorm after differential: Maintains variance ≈ 1.0 after Q1K1 - λ·Q2K2
  • Prevents gradient explosion in deep differential networks

4. Dynamic Router (3-Level Entropy-Based)

  • Level 1 (Early Exit): Cheap FFN only — saves 80% compute
  • Level 2 (Alpha Blend): Weighted mix of cheap + full FFN
  • Level 3 (Full Compute): Both branches at full strength
  • Router decides per-token based on attention entropy proxy

5. Fused Triton Kernel

  • Single kernel fuses: RMSNorm + QKV Projection + Differential prep
  • 2-3x speedup over sequential execution on GPU
  • Ready for CUDA deployment with Triton

6. Microscaling (MX) Quantization + FQT

  • OCP-compliant MXFP4 with block-wise E8M0 scales
  • FP8/INT8 backward pass for Fully Quantized Training
  • Outlier Isolation (IQR 3.5) for near-lossless compression

Architecture Comparison

Feature Transformer Mamba MLA SDLA v6.1
Complexity O(N²) O(N) O(N²) O(N)
KV Memory O(N) O(1) O(N·r) O(1) fixed state
Long Context ⚠️ ⚠️ ✅ YaRN 4-8x
Low-rank Compression ✅ (KV only) ✅ (Full QKV)
Selective Focus ✅ (entropy gate)
Noise Filtering ✅ (differential)
Dynamic Compute ✅ (3-level router)
Learnable λ N/A N/A N/A ✅ Per-head + Layer
Variance Stabilization ✅ RMSNorm post-diff
Quantization FP16 FP16 FP16 MXFP4 + FQT

Training Results (100 steps, CPU)

Metric SDLA (Ours) MLA Baseline
Parameters 0.96M 0.42M
Final Loss 29.73 16.20
Avg Step Time 0.317s 0.030s
Total Time 31.7s 3.0s

Note: SDLA has higher computational cost due to recurrent state updates, differential attention, and dynamic routing — but offers significantly richer capabilities (O(N) complexity, selective attention, long-context extension). MLA is simpler and faster but lacks these advanced features.

Quick Start

# Install
pip install -e .

# Train both SDLA and MLA baseline
python scripts/train.py

# Run tests
python tests/test_import.py

# Load pretrained SDLA
python -c "
import torch
from ultra_fused.model.transformer import UltraFusedTransformer
ckpt = torch.load('checkpoints/sdla_100step.pt')
model = UltraFusedTransformer(ckpt['config'])
model.load_state_dict(ckpt['model'])
print('SDLA model loaded successfully')
"

Project Structure

src/ultra_fused/
├── config.py                  # UFTConfig with all v6.1 features
├── model/transformer.py       # Dual-mode: SDLA or MLA baseline
├── layers/
│   ├── sdla_attention.py      # SDLA v2.0: MLA + YaRN + Dynamic Router
│   ├── mla_baseline.py        # DeepSeek-MLA baseline for comparison
│   ├── quant_linear.py        # MXLinear (MXFP4 + FQT)
│   └── parallel_block.py      # Parallel Block with Dynamic Router
├── kernels/
│   ├── triton_kernels.py      # MXFP4 GEMM, Online TTT
│   └── fused_sdla_kernel.py   # Fused RMSNorm+QKV+Differential
└── utils/
    ├── mx_utils.py            # OCP Microscaling
    └── yarn_rope.py           # YaRN/SuFT long-context RoPE

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ultra_fused_transformer-6.0.0.tar.gz (19.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ultra_fused_transformer-6.0.0-py3-none-any.whl (20.8 kB view details)

Uploaded Python 3

File details

Details for the file ultra_fused_transformer-6.0.0.tar.gz.

File metadata

  • Download URL: ultra_fused_transformer-6.0.0.tar.gz
  • Upload date:
  • Size: 19.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for ultra_fused_transformer-6.0.0.tar.gz
Algorithm Hash digest
SHA256 beae12bf68edbd9c92cae64c7cffcd4b5e08cbc5a53f2f10ba5ae2b82a793111
MD5 30cd39e946c465a54350c21ee35127ed
BLAKE2b-256 5b9a9b21b832b69d0a0acf7a3c74b7d517959f9b709d2fa2a26d6b0991b6cddf

See more details on using hashes here.

File details

Details for the file ultra_fused_transformer-6.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ultra_fused_transformer-6.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f9cf167dbc76b413afbaf946075163e1fdc0acf32236faa5baf390919d237da5
MD5 7a6c951a88225cd00acaad1051c9956b
BLAKE2b-256 cba1052cb771035e3443a89a32a92906e8d5c9a92431854f74f43baa6c8b7028

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page