
Single-command EXL3 quantization + measurement + reporting pipeline


ezexl3

ezexl3 is a simplified interface for exllamav3: quantize, verify, benchmark, visualize, upload, and chat. One pip install, one CLI.

pip install ezexl3

Or, to create custom templates, use a local editable install:

git clone https://github.com/UnstableLlama/ezexl3/
cd ezexl3
pip install -e .

Requires a local installation of exllamav3.


Quick Start

Dashboard

ezexl3 ui

Launches a web dashboard on port 8801. Every CLI subcommand is a clickable form with live terminal output via SSE streaming. Real-time measurement table and SVG graph update as your quant runs. GPU auto-detection. Boolean arguments exposed as toggles. This is the easiest way to use ezexl3.

The Evals tab shows perf measurements (prefill and generation tokens/s across context lengths) on a dual-axis chart, and the catbench gallery if you ran one. Switch between BPWs with the dropdown.

Chat

ezexl3 chat

Launches a lightweight chat web interface for testing quantized models. Browse to your model in the file picker, select GPUs, click load. Branching conversation tree with regeneration, message editing, and sibling navigation. Exllama native, based on chat.py and the generator. No CLI flags needed.

Supports multi-GPU (-d 0,1), configurable sequence length (cache is sized 2x behind the scenes), and cache quantization (-cq 6,6). Auto-detects prompt format from the model name. Useful for spot-checking quant quality at different BPW levels before uploading.
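The cache sizing and `-cq` parsing can be sketched as follows; the 2x sizing is documented above, but the assumption that the two `-cq` numbers map to key/value cache bits is for illustration only:

```python
def cache_size(seq_len: int) -> int:
    """Mirror the documented behavior: the KV cache is sized
    at 2x the configured sequence length behind the scenes."""
    return 2 * seq_len

def parse_cache_quant(spec: str) -> tuple[int, int]:
    """Parse a '-cq K,V' spec into a (key_bits, value_bits) pair.

    That the pair maps to key/value bits is an assumption."""
    k_bits, v_bits = (int(x) for x in spec.split(","))
    return k_bits, v_bits
```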

CLI Pipeline

Run the full pipeline from the command line:

ezexl3 repo -m /path/to/base_model -b 2,2.5,3,4,5,6 -d 0,1 -t basic

What the pipeline does

ezexl3 wraps the exllamav3 quantization and evaluation workflow into a single command that:

  • Interleaves quantize → verify per BPW: each BPW is quantized then immediately verified (KL + PPL) before proceeding, halting on error
  • Multi-GPU acceleration for both quantization and verification. KL and PPL run in parallel on 2+ GPUs
  • Supports optimized fractional BPWs (e.g. 2.1 bpw, 3.5 bpw)
  • Measures KL divergence + PPL @ 200k tokens, recording data to CSV
  • Optional perf measurement (prefill and generation tokens/s across context lengths) with its own SQLite database
  • Generates a HuggingFace-ready README.md with your measurements using customizable templates
  • Embeds an SVG graph from the measurement CSV in the README
  • Optional catbench integration. Generates SVG kitten drawings at each BPW and assembles them into a grid
  • Optional HuggingFace upload, with metadata locks and a dry-run preview before any repos are created
  • Checkpoints and resumes intelligently

model → [quantize → verify KL+PPL] per BPW → optimize → evals → graph → README → upload
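The interleaved quantize → verify ordering can be sketched as a simple loop; `quantize` and `verify` here are stand-ins for the real stages, illustrating only the ordering and halt-on-error behavior:

```python
def run_pipeline(bpws, quantize, verify):
    """Interleave quantize -> verify per BPW, halting on first failure.

    ``quantize`` and ``verify`` are injected stubs standing in for
    the real stages; this shows the ordering, not ezexl3 internals.
    """
    for bpw in bpws:
        quantize(bpw)
        if not verify(bpw):          # KL + PPL check
            raise RuntimeError(f"verification failed at {bpw} bpw")

log = []
run_pipeline(
    [2, 3],
    quantize=lambda b: log.append(f"quant {b}"),
    verify=lambda b: (log.append(f"verify {b}") or True),
)
# log is now: ['quant 2', 'verify 2', 'quant 3', 'verify 3']
```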

Single-stage subcommands

If you only want to run specific stages:

# Quantize only
ezexl3 quantize -m /path/to/base_model -b 2,2.5,3,4,5,6 -d 0,1

# Quantize with optimized target (automatically ensures integer neighbors)
ezexl3 repo -m /path/to/base_model -b 4.07 -d 0

# Measure only
ezexl3 measure -m /path/to/base_model -b 2,3,4,5,6 -d 0,1

# Generate README only (from existing CSV)
ezexl3 readme -m /path/to/base_model -t fire

# Upload to HuggingFace (dry-run by default)
ezexl3 upload -m /path/to/base_model

(Since everything is checkpointed, it usually doesn't hurt to just run the "repo" command every time.)

Per-BPW Paint Flags

The dashboard exposes four paint buttons that toggle quantization flags on individual BPW tokens. Click a button, then click a BPW in the parsed-token row to apply it:

  • -hq — high-quality boost, useful on low BPWs where the head needs the extra precision
  • -hb 8 — 8-bit head, useful on high BPWs where the rest is small enough to spare the head
  • -opt — opt-in optimized fractional pipeline (only applies to fractional BPWs)
  • -pm — global MoE speedup, applies to all BPWs at once

The same flags work from the CLI via `--quant-args`, but the dashboard is faster for mixing them across BPWs.

Template System

You can customize the generated README by providing a template name via --template or -t. Templates are stored in the /ezexl3/templates/ directory — just use the short name:

ezexl3 repo -m /path/to/base_model -t fire -b 2,3,4,5,6 -d 0,1

If no template is specified, it defaults to basic.
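Conceptually, a template is a README skeleton with placeholders filled from your measurements and metadata. A minimal sketch using Python's `string.Template` with hypothetical variable names (the real templates in /ezexl3/templates/ define their own layout and variables):

```python
from string import Template

# Hypothetical fragment in the spirit of a template; the actual
# templates and their variable names may differ.
fragment = Template("# $model_name (EXL3)\nQuantized by $quantized_by")

readme = fragment.substitute(
    model_name="Llama-3-8B",
    quantized_by="UnstableLlama",
)
```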

Easily generate your own custom template with AI assistance!

Copy and paste any template from /ezexl3/templates/ into your favorite LLM (Gemini, Claude, ChatGPT) along with this example prompt, followed by your own description:

Take this template, keep the main layout and variables, and modify it aesthetically based on my following prompts. Preserve all of the labels and title strings, only change the aesthetic, not the words or numbers:

*Make it dark and understated, high contrast, professional, metallic.*

Then save the result in /ezexl3/templates/ and use it with -t yourname.

Catbench

SVG Catbench is available as a measurement option via the -cb flag. It runs catbench inference at every BPW level (including optimized fractionals), extracts SVGs, and assembles them into a grid in the final README.

ezexl3 repo -m /path/to/base_model -b 2,3,4,5,6,8 -d 0,1 -t punk -cb

  • -cb alone runs 3 samples per BPW (default), -cb 5 runs 5
  • Catbench runs as a batch pass after KL/PPL/perf complete, using the multi-GPU queue
  • VRAM pre-flight check before each catbench load — skips gracefully if model won't fit, automatically uses multi-GPU for large models
  • Best valid SVG is selected from N samples for the grid
  • SVG extraction and grid assembly happen in a batch pass after all inference completes
  • Catbench results are checkpointed like everything else — rerunning skips completed samples
  • bf16 baseline included when VRAM allows

HuggingFace Upload

The Upload tab (or ezexl3 upload) creates HuggingFace repos for your quants. Defaults to dry-run mode so you see exactly what repo names will be created before anything is published.

# Preview what would be created
ezexl3 upload -m /path/to/base_model

# Actually create and upload
ezexl3 upload -m /path/to/base_model --no-dry-run

  • Single mode (default): one standalone repo per BPW, named MODEL-exl3-BPW. Recommended.
  • Branched mode: one repo with each BPW as a separate branch. Note that HuggingFace's download counter does not count branches — branched repos show only the main branch's downloads. Standalone repos preserve your download numbers.
  • Metadata fields (Author, Model Name, Repo Link, Quantized By) lock during the README write phase so the values can't drift mid-pipeline.
  • Preflight check verifies your HF token before any repos are created.
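The single-mode naming scheme can be sketched as below; the exact separator formatting is an assumption beyond the documented MODEL-exl3-BPW shape, so check a dry-run for the real names:

```python
def repo_name(model: str, bpw: str) -> str:
    """Build a standalone repo name in the documented
    MODEL-exl3-BPW shape (separator details assumed)."""
    return f"{model}-exl3-{bpw}"

names = [repo_name("Llama-3-8B", b) for b in ["2.5", "4.07"]]
# names: ['Llama-3-8B-exl3-2.5', 'Llama-3-8B-exl3-4.07']
```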

Inference Evaluation with WebUI

ezexl3 includes a lightweight chat web interface for quickly testing quantized models. Exllama native, based on chat.py and the generator.

ezexl3 chat -m /path/to/quantized_model -d 0

Advanced: Passthrough Flags

You can pass custom arguments directly to the underlying quantization (multiConvert) or measurement scripts using the --quant-args and --measure-args flags.

Important: These flags require a double-dash -- delimiter to separate the passthrough block from the rest of the arguments.

# Pass extra flags (e.g. MoE speedup) to quantization
ezexl3 repo -m /path/to/model -b 4.0 --quant-args -- -pm

# Pass custom rows/device settings to measurement
ezexl3 repo -m /path/to/model -b 4.0 --measure-args -- -r 200 -d 0

Common Use Cases:

  • Quantization: -pm (MoE speedup)
  • Measurement: -r / --rows (number of rows for PPL)

Note: passthrough blocks consume remaining args until another passthrough block starts, so keep normal CLI flags (like --no-readme) before --measure-args -- ...
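The consume-until-the-next-block behavior can be sketched as a small argv splitter (an illustration of the documented rule, not ezexl3's actual parser):

```python
PASSTHROUGH_FLAGS = {"--quant-args", "--measure-args"}

def split_passthrough(argv: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Split argv into normal flags and per-flag passthrough blocks.

    After '--quant-args --', everything is consumed until the next
    passthrough flag (or the end of argv), per the documented rule.
    """
    normal: list[str] = []
    blocks: dict[str, list[str]] = {}
    current: list[str] | None = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in PASSTHROUGH_FLAGS:
            # skip the '--' delimiter expected right after the flag
            if i + 1 < len(argv) and argv[i + 1] == "--":
                i += 1
            current = blocks.setdefault(arg, [])
        elif current is not None:
            current.append(arg)
        else:
            normal.append(arg)
        i += 1
    return normal, blocks

argv = ["-m", "model", "--quant-args", "--", "-pm",
        "--measure-args", "--", "-r", "200"]
normal, blocks = split_passthrough(argv)
# normal: ['-m', 'model']
# blocks: {'--quant-args': ['-pm'], '--measure-args': ['-r', '200']}
```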

--no-verify (Legacy Batch Mode)

By default, ezexl3 interleaves quantization with KL/PPL verification per BPW. Use --no-verify (or -nv) to revert to the old batch pipeline (all quants first, then all measurements):

ezexl3 repo -m /path/to/model -b 2,3,4,5,6 -d 0,1 --no-verify

This is useful if you're confident in your quantization setup and want to let everything run unattended without per-BPW halting.

Optimized BPW workflow

If you request an optimized BPW (for example 4.07), ezexl3 executes the following order:

  1. Detect optimized targets and remove them from the initial integer quant queue.
  2. Ensure required neighboring integers exist in the quant queue (4 and 5 for 4.07).
  3. Quantize each integer BPW one at a time, verifying KL+PPL immediately after each (halts on error). With 2+ GPUs, KL and PPL run in parallel during verification.
  4. Run exllamav3 util/measure.py in a dynamic multi-GPU queue for required integer pairs (resume-safe: skips if measurements/<low>-<high>_measurement.json exists), with terminal logs when jobs are assigned and completed per GPU.
  5. Run exllamav3 util/optimize.py to build the optimized output directory.
  6. Verify each optimized BPW with KL+PPL measurement (halts on error).
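The neighbor expansion in step 2 can be sketched as taking the floor/ceil of each fractional target (a sketch of the documented behavior, not the actual ezexl3 scheduler):

```python
import math

def required_integers(bpws: list[float]) -> set[int]:
    """Collect the integer BPWs the quant queue needs.

    Fractional targets pull in their floor/ceil neighbors
    (4 and 5 for 4.07); integer targets are kept as-is.
    """
    needed: set[int] = set()
    for b in bpws:
        if b == int(b):
            needed.add(int(b))
        else:
            needed.add(math.floor(b))
            needed.add(math.ceil(b))
    return needed

# required_integers([4.07, 3]) -> {3, 4, 5}
```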

To locate exllamav3 utility scripts, ezexl3 uses bundled vendored copies (no manual path configuration needed).

Headless Mode

For automated pipelines, use the --no-prompt (or -np) flag to skip interactive metadata collection for the README. It will use sensible defaults based on the model directory name and your environment.

ezexl3 repo -m /path/to/model -b 4.0 --no-prompt
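A sketch of how such defaults might be derived; the field names and fallbacks here are illustrative assumptions, not ezexl3's actual keys:

```python
import os
from pathlib import Path

def default_metadata(model_dir: str) -> dict[str, str]:
    """Derive README metadata defaults for --no-prompt runs.

    Hypothetical sketch: model name from the directory basename,
    author from the environment.
    """
    return {
        "model_name": Path(model_dir).name,
        "author": os.environ.get("USER", "anonymous"),
    }

meta = default_metadata("/path/to/Llama-3-8B")
# meta["model_name"] == "Llama-3-8B"
```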
