ezexl3
Single-command EXL3 quantization, measurement, and reporting pipeline.
ezexl3 is a single-command quantization and measurement pipeline that generates high-quality, HuggingFace-ready EXL3 repos automatically.
It wraps the exllamav3 quantization and evaluation workflow into a tool that has:
- Interleaved quantize → verify pipeline: each BPW is quantized then immediately verified (KL + PPL) before proceeding, halting on error
- Multi-GPU acceleration for both quantization and verification — KL and PPL run in parallel on 2+ GPUs
- Supports optimized fractional BPWs (e.g. 2.1 bpw, 3.5 bpw)
- Measures KL divergence + PPL @ 200k tokens, recording data to CSV
- Generates a HuggingFace-ready README.md with your measurements, using customizable templates
- Embeds an SVG graph from the measurement CSV in the README
- Optional catbench integration — generates SVG kitten drawings at each BPW and assembles them into a grid
- Checkpoints and resumes intelligently, all from one command
Pipeline:
model → [quantize → verify KL+PPL] per BPW → optimize → catbench → graph → README
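The interleaved quantize → verify stage can be pictured as a simple loop. This is only a sketch, not ezexl3's actual code: quantize_one and verify_one are hypothetical stand-ins for the real pipeline stages.

```shell
# Sketch of the interleaved pipeline: each BPW is verified immediately
# after quantization, and any failure stops the whole run.
quantize_one() { echo "quantized $1 bpw"; }
verify_one()   { echo "verified  $1 bpw (KL + PPL)"; }

for bpw in 2 2.5 3 4 5 6; do
    quantize_one "$bpw" || exit 1   # halt on error
    verify_one  "$bpw"  || exit 1   # so later BPWs aren't wasted on a bad quant
done
```

Halting early means a bad calibration or broken quant is caught at the first BPW instead of after hours of wasted GPU time.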
Installation
This tool requires a local installation of exllamav3.
# 1. Make sure you have exllamav3 installed.
# 2. Clone and install ezexl3
git clone https://github.com/UnstableLlama/ezexl3
cd ezexl3
pip install -e .
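Since exllamav3 is a hard prerequisite, a quick sanity check before installing can save a confusing failure later. This check is not part of ezexl3 itself, just an optional one-liner:

```shell
# Confirm the exllamav3 prerequisite is importable in this environment.
# Prints "ok" if found, "missing" if not.
python3 -c "import importlib.util as u; print('ok' if u.find_spec('exllamav3') else 'missing')"
```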
Usage
Quantize a full repository
Run the entire pipeline (quantize → verify → README):
ezexl3 repo -m /path/to/base_model -b 2,2.5,3,4,5,6 -d 0,1 -t basic
Then ezexl3 automatically:
- Quantizes each BPW one at a time, immediately running KL divergence and perplexity verification after each one. If verification fails, the pipeline halts — no time wasted quantizing remaining BPWs on top of a bad quant. With 2+ GPUs, KL and PPL run in parallel during verification.
- Saves measurements to modelNameMeasured.csv in the base model folder, and makes a stylish dark-mode SVG graph from the data.
- Generates a README.md for a HuggingFace repo in the base model folder (with optional customizable templates).
Single-stage subcommands
If you only want to run specific stages:
# Quantize only
ezexl3 quantize -m /path/to/base_model -b 2,2.5,3,4,5,6 -d 0,1
# Quantize with optimized target (automatically ensures integer neighbors)
ezexl3 repo -m /path/to/base_model -b 4.07 -d 0
# Measure only
ezexl3 measure -m /path/to/base_model -b 2,3,4,5,6 -d 0,1
# Generate README only (from existing CSV)
ezexl3 readme -m /path/to/base_model -t fire
(Everything is checkpointed, so it usually doesn't hurt to just run the repo command every time.)
Template System
You can customize the generated README by providing a template name via --template or -t.
Templates are stored in the /ezexl3/templates/ directory — just use the short name:
ezexl3 repo -m /path/to/base_model -t fire -b 2,3,4,5,6 -d 0,1
If no template is specified, it defaults to basic.
Easily generate your own custom template with AI assistance!
Copy and paste any template from /ezexl3/templates/ into your favorite LLM (Gemini, Claude, ChatGPT) along with this example prompt, followed by your own description:
Take this template, keep the main layout and variables, and modify it aesthetically based on my following prompts. Preserve all of the labels and title strings, only change the aesthetic, not the words or numbers:
*Make it dark and understated, high contrast, professional, metallic.*
Then save the result in /ezexl3/templates/ and use it with -t yourname.
Catbench
SVG Catbench is available as a measurement option via the -cb flag. It runs catbench inference at every BPW level (including optimized fractionals), extracts SVGs, and assembles them into a grid in the final README.
ezexl3 repo -m /path/to/base_model -b 2,3,4,5,6,8 -d 0,1 -t punk -cb
- -cb alone runs 3 samples per BPW (default); -cb 5 runs 5
- Catbench runs as a batch pass after all per-BPW verification completes, using the multi-GPU queue
- VRAM pre-flight check before each catbench load — skips gracefully if model won't fit, automatically uses multi-GPU for large models
- Best valid SVG is selected from N samples for the grid
- SVG extraction and grid assembly happen in a batch pass after all inference completes
- Catbench results are checkpointed like everything else — rerunning skips completed samples
- bf16 baseline included when VRAM allows
Inference Evaluation with WebUI
ezexl3 includes a lightweight chat web interface for quickly testing quantized models. It is ExLlama-native, built on exllamav3's chat.py example and generator.
ezexl3 chat -m /path/to/quantized_model -d 0
Advanced: Passthrough Flags
You can pass custom arguments directly to the underlying quantization (multiConvert) or measurement scripts using the --quant-args and --measure-args flags.
Important: These flags require a double-dash -- delimiter to separate the passthrough block from the rest of the arguments.
# Pass custom calibration dataset to quantization
ezexl3 repo -m /path/to/model -b 4.0 --quant-args -- -pm
# Pass custom rows/device settings to measurement
ezexl3 repo -m /path/to/model -b 4.0 --measure-args -- -r 200 -d 0
Common Use Cases:
- Quantization: -pm (MoE speedup)
- Measurement: -r / --rows (number of rows for PPL)
Note: passthrough blocks consume remaining args until another passthrough block starts, so keep normal CLI flags (like --no-readme) before --measure-args -- ...
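The "consumes remaining args" rule can be illustrated with a small standalone function. This is a hypothetical sketch of how such a parser might split argv, not ezexl3's actual implementation:

```shell
# Print every argument that follows "<key> --" in the argument list,
# illustrating why normal CLI flags must appear before the passthrough
# block: once the block starts, everything after it is forwarded verbatim.
split_after() {
    key=$1; shift
    while [ $# -gt 0 ]; do
        if [ "$1" = "$key" ] && [ "$2" = "--" ]; then
            shift 2
            printf '%s ' "$@"
            return
        fi
        shift
    done
}

# Flags before the block (--no-readme) stay with ezexl3; flags after
# the block (-r 200 -d 0) are forwarded to the measurement script.
split_after --measure-args repo -m model --no-readme --measure-args -- -r 200 -d 0
```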
--no-verify (Legacy Batch Mode)
By default, ezexl3 interleaves quantization with KL/PPL verification per BPW. Use --no-verify (or -nv) to revert to the old batch pipeline (all quants first, then all measurements):
ezexl3 repo -m /path/to/model -b 2,3,4,5,6 -d 0,1 --no-verify
This is useful if you're confident in your quantization setup and want to let everything run unattended without per-BPW halting.
Optimized BPW workflow
If you request an optimized BPW (for example 4.07), ezexl3 executes the following order:
- Detect optimized targets and remove them from the initial integer quant queue.
- Ensure required neighboring integers exist in the quant queue (4 and 5 for 4.07).
- Quantize each integer BPW one at a time, verifying KL+PPL immediately after each (halts on error). With 2+ GPUs, KL and PPL run in parallel during verification.
- Run exllamav3 util/measure.py in a dynamic multi-GPU queue for required integer pairs (resume-safe: skips if measurements/<low>-<high>_measurement.json exists), with terminal logs when jobs are assigned and completed per GPU.
- Run exllamav3 util/optimize.py to build the optimized output directory.
- Verify each optimized BPW with KL+PPL measurement (halts on error).
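The neighbor rule can be sketched in a couple of lines of shell, one plausible reading of the "4 and 5 for 4.07" example: the fractional target needs its flooring integer and the next integer up quantized first.

```shell
# Derive the integer neighbors a fractional BPW target requires.
bpw=4.07
low=${bpw%.*}        # integer part via parameter expansion: 4
high=$((low + 1))    # next integer up: 5
echo "target=$bpw needs integers $low and $high"
```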
To locate exllamav3 utility scripts, ezexl3 uses bundled vendored copies (no manual path configuration needed).
Headless Mode
For automated pipelines, use the --no-prompt (or -np) flag to skip interactive metadata collection for the README. It will use sensible defaults based on the model directory name and your environment.
ezexl3 repo -m /path/to/model -b 4.0 --no-prompt
The chat subcommand also supports multi-GPU (-d 0,1), configurable cache size (-cs 32768), and cache quantization (-cq 6,6), and auto-detects the prompt format from the model name. It is useful for spot-checking quant quality at different BPW levels before uploading.
File details
Details for the file ezexl3-0.0.8.tar.gz.
File metadata
- Download URL: ezexl3-0.0.8.tar.gz
- Upload date:
- Size: 493.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ae9fde544e1617c4f130a9a5d071449cf465ad0cd97303c0cf2b0702a6942c06 |
| MD5 | 4b49877bcec10cb6429e7b37ebd90185 |
| BLAKE2b-256 | e8a839c963a7e1cf20bcdaa8a189de177a751321783c09d8cc356d63ad361131 |
File details
Details for the file ezexl3-0.0.8-py3-none-any.whl.
File metadata
- Download URL: ezexl3-0.0.8-py3-none-any.whl
- Upload date:
- Size: 484.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f76c1b3259d8bd5c163f4fdbfdfd16d4b44812c3858f46d17c452f49e7cd822 |
| MD5 | 03bab9ac3516aacd69448e22c139b7c7 |
| BLAKE2b-256 | 41b0394a6b6a7ab5ad2274648829fb5ce11dfe8d938e87255ca8f60228f817ac |