# convert_to_quant

Convert safetensors weights to quantized formats (FP8, INT8) with learned rounding optimization for ComfyUI inference.
## Installation

> [!IMPORTANT]
> PyTorch must be installed first, with the correct CUDA version for your GPU. This package does not install PyTorch automatically, to avoid conflicts with your existing setup.

### Step 1: Install PyTorch (GPU-specific)

Visit [pytorch.org](https://pytorch.org) to get the correct install command for your system.

Examples:
```bash
# CUDA 13.0 (newest)
pip install torch --index-url https://download.pytorch.org/whl/cu130

# CUDA 12.8 (stable)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# CUDA 12.6
pip install torch --index-url https://download.pytorch.org/whl/cu126

# CPU only (no GPU acceleration)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
### Step 2: Install convert_to_quant

```bash
# Install from PyPI (when available)
pip install convert_to_quant

# Or install from source
git clone https://github.com/silveroxides/convert_to_quant.git
cd convert_to_quant
pip install -e .
```
### Optional: Triton (needed for INT8)

On Linux:

```bash
pip install -U triton
```

On Windows, match the `triton-windows` version to your PyTorch version:

```bash
# torch >= 2.9
pip install -U "triton-windows<3.6"

# torch >= 2.8
pip install -U "triton-windows<3.5"

# torch >= 2.7
pip install -U "triton-windows<3.4"

# torch >= 2.6
pip install -U "triton-windows<3.3"
```
## Quick Start

```bash
# Basic FP8 quantization
convert_to_quant -i model.safetensors

# FP8 with ComfyUI metadata (recommended)
convert_to_quant -i model.safetensors --comfy_quant

# With a custom learning rate (adaptive schedule by default)
convert_to_quant -i model.safetensors --comfy_quant --lr 0.01

# With a plateau LR schedule for better convergence
convert_to_quant -i model.safetensors --comfy_quant --lr_schedule plateau --lr_patience 9 --lr_factor 0.92
```
Load the output .safetensors file in ComfyUI like any other model.
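To build intuition for what the learned-rounding options above are tuning, here is a toy pure-Python sketch: instead of always rounding each scaled weight to the nearest integer, individual weights are flipped between floor and ceil whenever that reduces the *layer output* error on a calibration input. The greedy search and the 1-D "layer" below are illustrative only, not the tool's actual SVD-based optimizer.

```python
import math

def learned_rounding(weights, scale, calib):
    """Toy learned rounding: start from round-to-nearest, then flip each
    quantized weight between floor and ceil if that reduces the error of the
    layer's output (not the weights themselves) on a calibration input."""
    q = [round(w * scale) for w in weights]  # round-to-nearest baseline

    def output_error(qs):
        # 1-D "layer": error between full-precision and dequantized outputs
        return abs(sum((w - qi / scale) * x
                       for w, qi, x in zip(weights, qs, calib)))

    for i, w in enumerate(weights):
        for cand in (math.floor(w * scale), math.ceil(w * scale)):
            trial = q[:i] + [cand] + q[i + 1:]
            if output_error(trial) < output_error(q):
                q[i] = cand
    return q
```

Note how two weights of 0.26 at scale 10 come out as `[2, 3]` rather than the nearest-rounding `[3, 3]`: the flips cancel each other's error in the output, which is the core idea behind optimizing rounding decisions jointly.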
## Supported Quantization Formats

| Format | CLI Flag | Hardware | Optimization |
|---|---|---|---|
| FP8 (E4M3) | (default) | Ada/Hopper+ | Learned Rounding (SVD) |
| INT8 Block-wise | `--int8` | Any GPU | Learned Rounding (SVD) |
| INT8 Tensor-wise | `--int8 --scaling_mode tensor` | Any GPU | High-perf `_scaled_mm` |
| NVFP4 (4-bit) | `--nvfp4` | Blackwell | Dual-scale optimization |
| MXFP8 | `--mxfp8` | Blackwell | Microscaling (E8M0) |
For a deep dive into how these formats work and their technical implementation, see FORMATS.md.
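As a rough illustration of the tensor-wise FP8 path, a common convention (assumed here; see FORMATS.md for the scheme this tool actually uses) maps the tensor's absolute maximum onto E4M3's largest finite value, 448, and rounds scaled values to the 3-mantissa-bit grid:

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale(weights):
    """Scale that maps the tensor's absolute maximum onto E4M3_MAX."""
    amax = max(abs(w) for w in weights)
    return E4M3_MAX / amax if amax > 0 else 1.0

def round_e4m3(x):
    """Round a scaled value to the nearest E4M3 grid point (3 mantissa bits);
    subnormal and saturation handling omitted for brevity."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)  # spacing of representable values in this binade
    return round(x / step) * step
```

For example, `fp8_scale([0.0, 2.0, -4.0])` is `112.0` (448 / 4), and values within a binade collapse onto 8 mantissa steps, which is exactly the resolution loss that learned rounding tries to spend wisely.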
## Model-Specific Presets

| Model | Flag | Notes |
|---|---|---|
| Flux.2 | `--flux2` | Keeps modulation/guidance/time/final layers high-precision |
| Chroma / Radiance | `--distillation_large` / `--nerf_large` | Distillation layers excluded |
| T5-XXL Text Encoder | `--t5xxl` | Decoder removed |
| Mistral Text Encoder | `--mistral` | Norms/biases excluded |
| Visual Encoder | `--visual` | MLP layers excluded |
| Hunyuan Video | `--hunyuan` | Attention norms excluded |
| WAN Video | `--wan` | Time embeddings excluded |
| Qwen Image | `--qwen` | Image layers excluded |
| Z-Image | `--zimage` / `--zimage_refiner` | Refiner variant excludes context/noise refiner |
## Documentation
- 📖 MANUAL.md - Complete usage guide with examples and troubleshooting
- 📚 FORMATS.md - Technical reference for quantization formats and SVD optimization
- 📋 AGENTS.md - Developer guide & registry architecture
- ✨ ACTIVE.md - Current status and active implementations
- 🧪 DEVELOPMENT.md - Changelog and research notes
- 🔗 quantization.examples.md - ComfyUI integration patterns
## Project Structure

```
convert_to_quant/
├── convert_to_quant/            # Main package
│   ├── cli/                     # CLI entry point & argument parsing
│   ├── converters/              # Core quantization logic (FP8, INT8, NVFP4)
│   ├── formats/                 # Format-specific conversion flows
│   ├── comfy/                   # ComfyUI integration components
│   ├── config/                  # Layer configuration & templates
│   ├── utils/                   # Shared utilities (tensor, memory)
│   ├── constants.py             # Model Filter Registry & constants
│   └── convert_to_quant.py      # Backward-compatibility wrapper
├── pyproject.toml               # Package configuration
├── MANUAL.md                    # User documentation
└── ...
```
## Key Features

- Learned Rounding: SVD-based optimization minimizes quantization error along the weight matrix's principal directions
- Multiple Optimizers: Original (adaptive LR), AdamW, RAdam
- Bias Correction: Automatic bias adjustment using synthetic calibration data
- Model-Specific Support: Exclusion lists for sensitive layers (norms, embeddings, distillation)
- Triton Kernels: GPU-accelerated quantization/dequantization with fallback to PyTorch
- Three-Tier Quantization: Mix different formats per layer using `--custom-layers` and `--fallback`
- Layer Config JSON: Fine-grained per-layer control with regex pattern matching
- LR Schedules: Adaptive, exponential, and plateau learning rate scheduling
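The plateau schedule (`--lr_schedule plateau` with `--lr_patience` and `--lr_factor` in the Quick Start) follows the usual reduce-on-plateau pattern. A minimal sketch, assuming the standard semantics (the converter's internal implementation may differ):

```python
class PlateauLR:
    """Reduce-on-plateau sketch: if the loss fails to improve for `patience`
    consecutive steps, multiply the learning rate by `factor`."""

    def __init__(self, lr, patience=9, factor=0.92):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss          # new best: reset the patience counter
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr *= self.factor  # plateau detected: decay the LR
                self.bad_steps = 0
        return self.lr
```

With the defaults shown in the Quick Start (patience 9, factor 0.92), the LR decays gently only when the rounding loss actually stalls, rather than on a fixed timetable.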
## Advanced Usage

### Layer Config JSON

Define per-layer quantization settings with regex patterns:

```bash
# Generate a template from your model
convert_to_quant -i model.safetensors --dry-run --layer-config-template layers.json

# Apply a custom layer config
convert_to_quant -i model.safetensors --layer-config layers.json --comfy_quant
```
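The authoritative file schema is whatever `--layer-config-template` emits for your model; the sketch below only illustrates how regex-based per-layer resolution typically works (the rule list and function here are invented for illustration, not the tool's API):

```python
import re

def resolve_layer_format(layer_name, rules, default="fp8"):
    """First-match-wins resolution: each rule pairs a regex pattern with a
    per-layer format; unmatched layers fall back to the default."""
    for pattern, fmt in rules:
        if re.search(pattern, layer_name):
            return fmt
    return default

# Hypothetical rules standing in for entries of layers.json
rules = [
    (r"\bnorm\b", "skip"),   # keep norm layers high-precision
    (r"attn", "int8"),       # send attention blocks to INT8
]
```

So `blocks.0.attn.qkv.weight` would resolve to `int8`, `blocks.0.norm.weight` to `skip`, and anything else to the `fp8` default.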
### Scaling Modes

```bash
# Tensor-wise scaling (default)
convert_to_quant -i model.safetensors --scaling-mode tensor --comfy_quant

# Block-wise scaling for better accuracy
convert_to_quant -i model.safetensors --scaling-mode block --block_size 64 --comfy_quant
```
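Block-wise scaling trades one scale per tensor for one scale per block, so each block's scale tracks its local dynamic range. A pure-Python sketch of the idea, assuming a symmetric INT8 scheme (block amax mapped onto 127); the converter's real kernels operate on torch/Triton tensors:

```python
def int8_block_scales(weights, block_size=64):
    """Quantize a flat weight list block by block: each block of
    `block_size` values gets its own scale (dequant: q * scale)."""
    quantized, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block) or 1.0
        scale = amax / 127.0
        scales.append(scale)
        quantized.append([round(w / scale) for w in block])
    return quantized, scales
```

A single outlier now only widens the scale of its own block instead of the whole tensor, which is why block-wise mode tends to be more accurate at the cost of storing one scale per block.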
### Additional Help

```bash
# View experimental features
convert_to_quant --help-experimental

# View model-specific filter presets
convert_to_quant --help-filters
```
## Usage Examples

INT8 with performance heuristics:

```bash
convert_to_quant -i model.safetensors --int8 --block_size 128 --comfy_quant --heur
```

Blackwell NVFP4 (4-bit):

```bash
convert_to_quant -i model.safetensors --nvfp4 --comfy_quant
```
## Requirements
- Python 3.9+
- PyTorch 2.1+ (with CUDA for GPU acceleration)
- safetensors >= 0.4.2
- tqdm
- (Optional) triton >= 2.1.0 for INT8 kernels
## Acknowledgements

Special thanks to:

- Clybius – for inspiring me to take on quantization, and for his Learned-Rounding repository.
- lyogavin – for ComfyUI PR #10864, adding `int8_blockwise` format support and int8 kernels.
## References

- DeepSeek scaled FP8 matmul: https://github.com/deepseek-ai/DeepSeek-V3
- JetFire paper: https://arxiv.org/abs/2403.12422

## License

MIT License