NeuroQuant v2.0

Production-grade neural-network quantization framework with multi-objective NSGA search, ONNX deployment, and hardware-aware optimisation.

NeuroQuant takes a pre-trained PyTorch model and produces deployable INT8 / mixed-precision artefacts that have been measured (not estimated) on the same runtime that ships in production. Every public number is the result of running a real quantized graph through ONNX Runtime — no synthetic shortcuts.


What it does

   ┌────────────────────────────────────────────────────────────────────────┐
   │                                                                        │
   │  FP32 PyTorch model  ─────►  10-phase pipeline  ─────►  INT8 .onnx     │
   │                                                          + metrics     │
   │  ┌──────────────────────────────────────────────────────────────┐     │
   │  │  P0  Prepare model + dataset, FP32 baseline                  │     │
   │  │  P1a Hessian / Fisher per-layer sensitivity                  │     │
   │  │  P1b FITCompress warm-start seed                             │     │
   │  │  P1c NSGA multi-objective search (2- or 3-obj)               │     │
   │  │  P1d AdaRound canonical-order weight rounding                │     │
   │  │  P1e Real W+A QAT with FP32 teacher distillation             │     │
   │  │  P1f GPTQ + SmoothQuant + AWQ + SmoothQuant→GPTQ             │     │
   │  │  P2  Pareto analysis + plots                                 │     │
   │  │  P3  Grad-CAM + SHAP explainability                          │     │
   │  │  P4  MLflow finalisation + reproducibility manifest          │     │
   │  └──────────────────────────────────────────────────────────────┘     │
   │                                                                        │
   └────────────────────────────────────────────────────────────────────────┘

The pipeline runs to completion in ~60 seconds on CPU for a CIFAR-class model.
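
Phases P1c and P2 both rest on Pareto dominance: a candidate survives only if no other candidate is at least as good on every objective and strictly better on at least one. A minimal sketch of that filter, assuming every objective is minimised; the function names and the sample numbers are illustrative and are not NeuroQuant's API or results:

```python
from typing import Sequence

Objectives = Sequence[float]  # every entry is minimised


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: list[Objectives]) -> list[Objectives]:
    """Return the non-dominated subset, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, size_mb, latency_ms) triples, made up for illustration.
candidates = [
    (0.08, 3.5, 4.0),
    (0.10, 2.0, 3.0),
    (0.12, 3.6, 4.5),  # worse than the first candidate on every objective
    (0.07, 9.0, 7.0),
]
front = pareto_front(candidates)
```

Here `front` keeps three of the four candidates. A full NSGA-style search would typically go further and rank successive fronts, which this sketch does not attempt.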


Why it is production-grade

This framework was built deliberately to avoid the "research prototype" failure modes that disqualify most academic quantization tooling from real deployment:

| Concern | What NeuroQuant does |
| --- | --- |
| Real INT inference | Wave 4 emits true static-INT8 ONNX graphs via onnxruntime.quantization.quantize_static, not FP32 simulation. |
| Real on-disk size | model_size_mb is the literal .onnx filesystem size, not numel × bw / 8. The synthetic estimate is kept as theoretical_size_mb for ablation. |
| Real latency | latency_ms is measured under ONNX Runtime on the same machine that will deploy the artefact. |
| Hardware-aware search | The NSGA third objective sums a per-layer ORT latency LUT (Wave 4 C2). Every gene's latency cost is a real timing. |
| No leakage between splits | Data is split into train / search / val (80/10/10) plus the held-out test set; NSGA fitness reads search, QAT early-stop reads val, and headline numbers read test. |
| Strict determinism | set_seed(strict=True) enforces CUBLAS_WORKSPACE_CONFIG, use_deterministic_algorithms, and cudnn.deterministic. |
| Safe checkpoints | Every checkpoint is loaded with torch.load(weights_only=True); the pickle path is closed. Architectural wrappers persist as JSON manifests. |
| Real W+A QAT | INT8 activations always; weight parametrisation via torch.nn.utils.parametrize (autograd-aware STE). |
| Validated config | Pydantic v2 dataclasses with field validators: bad values fail at load, not deep in a phase. |
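
The difference between the synthetic estimate (numel × bw / 8) and a literal filesystem size can be seen with a few lines of arithmetic. This is a sketch only: the helper names are hypothetical, MB is assumed to mean 10^6 bytes, the parameter count is illustrative, and a throwaway file stands in for a real .onnx export.

```python
import os
import tempfile


def theoretical_size_mb(numel: int, bitwidth: int) -> float:
    """Synthetic estimate: parameter count x bits per weight / 8, in MB (10^6 bytes)."""
    return numel * bitwidth / 8 / 1e6


def on_disk_size_mb(path: str) -> float:
    """Literal filesystem size of an artefact, in MB (10^6 bytes)."""
    return os.path.getsize(path) / 1e6


# 3.5 M parameters is an illustrative count, not a measured NeuroQuant result.
est_int8 = theoretical_size_mb(3_500_000, 8)
est_int4 = theoretical_size_mb(3_500_000, 4)

# Stand-in for an exported graph: raw weight bytes plus 4 KiB of made-up overhead.
with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
    f.write(b"\0" * (3_500_000 + 4096))
    fake_path = f.name
measured = on_disk_size_mb(fake_path)
os.remove(fake_path)
```

A real exported graph also stores structure, scales, zero-points, and metadata, which is why the on-disk figure generally differs from the estimate.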

Install

From the wheel

pip install neuroquant-2.0.0-py3-none-any.whl
neuroquant --help

From source

git clone https://github.com/AbdelazizElHelaly11/NeuroQuant
cd NeuroQuant
pip install -e ".[dev]"        # editable + dev extras

GPU users:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e ".[dev]"

Run

The console-script neuroquant is installed by the wheel; it accepts the same flags as python main.py.

# Full pipeline on the bundled config (CIFAR-10 + MobileNetV2)
neuroquant --config config.yaml --epochs 20

# Fast smoke (CPU, no training, first three phases)
neuroquant --config config.yaml --epochs 0 --device cpu \
  --phases phase_0_preparation phase_1a_hessian_clustering phase_1b_fitcompress

# Resume after interruption
neuroquant --config config.yaml --epochs 20 --resume

# Hardware-aware mode (3-objective NSGA + ORT latency LUT)
# Set hardware_aware_search: true in config.yaml, then:
neuroquant --config config.yaml --epochs 20

The pipeline writes everything to output_dir (default ./artifacts/):

artifacts/
├── checkpoints/          # per-phase resume points
├── onnx/                 # FP32 + per-method INT8 .onnx files
├── pareto/               # Pareto plots + JSON
├── reports/              # pipeline_report.txt, pareto_summary.json
├── reproducibility_manifest.json
├── latency_lut.json      # only when hardware_aware_search=true
└── pipeline_report.txt
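
In hardware-aware mode the third NSGA objective is described as a sum over a per-layer latency lookup table. A minimal sketch of that sum follows. The LUT layout (layer name, then bitwidth, then milliseconds) is an assumption, since the actual schema of latency_lut.json is not documented here, and all timings are made up.

```python
import json

# Hypothetical LUT: layer name -> bitwidth (a string, since JSON keys are
# strings) -> latency in milliseconds. The values are invented.
lut_json = """
{
  "conv1":  {"4": 0.25,  "8": 0.5},
  "block1": {"4": 1.0,   "8": 1.5},
  "fc":     {"4": 0.125, "8": 0.25}
}
"""
lut = json.loads(lut_json)


def gene_latency_ms(gene: dict[str, int], lut: dict[str, dict[str, float]]) -> float:
    """Sum the looked-up latency of each layer at its assigned bitwidth."""
    return sum(lut[layer][str(bits)] for layer, bits in gene.items())


# One candidate: INT8 on the first and last layers, INT4 in the middle.
gene = {"conv1": 8, "block1": 4, "fc": 8}
total = gene_latency_ms(gene, lut)
```

A per-layer sum is cheap enough to evaluate for every candidate in a population, which is the usual reason for building a lookup table instead of timing each full model during search.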

Configuration

All knobs live in config.yaml. Common overrides:

model:
  name: resnet18              # any torchvision name
  num_classes: 10
  input_shape: [3, 32, 32]

dataset:
  name: cifar10               # cifar10 | cifar100 | imagefolder | synthetic | custom
  class: null                 # optional "pkg.module.MyDataset"
  train_dir: null             # optional ImageFolder split dirs
  val_dir: null
  test_dir: null
  batch_size: 128

methods: [ptq, qat, gptq, smoothquant, awq]
bitwidths:
  supported: [4, 8]
  io_layer: 8                 # force first/last layers to INT8

hyperparams:
  hardware_aware_search: true     # Wave 4 J4: 3-obj NSGA
  onnx_export_enabled: true       # Wave 4 J1/J2/J3
  qat_distill_alpha: 0.5          # Wave 2 E5: KD with FP32 teacher
  smoothquant_per_layer_alpha: true  # Wave 3 F3
  hessian_estimator: fisher       # Wave 3 B2: 3× faster than diag
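
The bitwidths block constrains what the search may assign to each layer. A sketch of how io_layer could be applied to one candidate, assuming a hypothetical apply_io_constraint helper and a candidate represented as a list of per-layer bitwidths in network order (neither is confirmed by the project):

```python
SUPPORTED = [4, 8]   # mirrors bitwidths.supported
IO_LAYER = 8         # mirrors bitwidths.io_layer


def apply_io_constraint(bits: list[int], io_bits: int = IO_LAYER) -> list[int]:
    """Return a copy with the first and last layers pinned to io_bits."""
    if any(b not in SUPPORTED for b in bits):
        raise ValueError(f"bitwidths must be drawn from {SUPPORTED}")
    if not bits:
        return []
    pinned = list(bits)
    pinned[0] = pinned[-1] = io_bits
    return pinned


candidate = [4, 4, 8, 4, 4]
constrained = apply_io_constraint(candidate)
```

With these values `constrained` is `[8, 4, 8, 4, 8]`: only the input and output layers are changed, and the original list is left untouched.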

Pydantic field validators run at load time — invalid values surface immediately with the offending field path:

ValueError: Configuration validation failed:
  num_classes must be >= 2.
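
A check of this kind can be written with a Pydantic v2 field validator. The sketch below assumes a hypothetical ModelConfig class built on BaseModel; the project's real classes, and the wrapper that produces the "Configuration validation failed" message, are not shown in this README.

```python
from pydantic import BaseModel, ValidationError, field_validator


class ModelConfig(BaseModel):
    """Hypothetical stand-in for the model section of config.yaml."""

    name: str
    num_classes: int
    input_shape: list[int]

    @field_validator("num_classes")
    @classmethod
    def check_num_classes(cls, value: int) -> int:
        # Reject degenerate classifiers at load time.
        if value < 2:
            raise ValueError("num_classes must be >= 2.")
        return value


try:
    ModelConfig(name="resnet18", num_classes=1, input_shape=[3, 32, 32])
    errors = []
except ValidationError as exc:
    # Each entry carries the offending field path in "loc".
    errors = exc.errors()
```

Each entry in `errors` records the field path under `loc` and the message under `msg`, which is the information a loader needs to report the offending field.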

Architecture

The framework was built in seven waves, each ending with a strict-format report. Per-wave architecture notes live in docs/architecture/:

| Wave | Theme | Notes |
| --- | --- | --- |
| 1 | Foundation (security + leakage) | wave1.md |
| 2 | Real W+A QAT pipeline | wave2.md |
| 3 | Method audits + Fisher | wave3.md |
| 4 | ONNX + hardware-aware search | wave4.md |
| 5 | Reporting + MLflow | wave5.md |
| 6 | Config validation (Pydantic) | wave6.md |
| 7 | Packaging + docs | wave7.md |

Quantization methods

| Method | When to use | Module |
| --- | --- | --- |
| PTQ | Fast baseline; INT8 with bitwidth-aware calibration. | quantization/ptq.py |
| QAT | Best accuracy at INT8; requires fine-tuning data. | quantization/qat.py |
| GPTQ | Best accuracy at INT4 weights; data-aware optimal rounding. | quantization/gptq.py |
| SmoothQuant | Activation-friendly INT8; per-layer α grid search. | quantization/smoothquant.py |
| AWQ | INT4 with salient-channel preservation; per-layer α + FP16 carve-out. | quantization/awq.py |
| SmoothQuant→GPTQ | Production recipe: strict-Pareto improvement over either method alone. | quantization/smoothquant_gptq.py |
| AdaRound | Post-PTQ refinement; canonical input→output traversal. | quantization/adaround.py |
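
The α in the SmoothQuant row controls how much quantization difficulty is moved from activations to weights. The sketch below computes the per-channel smoothing scale from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1−α), on made-up statistics. The function name is hypothetical and this is not NeuroQuant's implementation.

```python
def smoothing_scales(
    act_absmax: list[float], weight_absmax: list[float], alpha: float
) -> list[float]:
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a**alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Made-up per-channel absolute maxima for two input channels.
act_absmax = [16.0, 1.0]     # channel 0 has an activation outlier
weight_absmax = [1.0, 4.0]

scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# Activations are divided by s and weights multiplied by s, so the layer's
# output is mathematically unchanged while the ranges are rebalanced.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
```

With α = 0.5 the scales come out as 4.0 and 0.5, and each channel's activation and weight maxima become equal (4.0 and 2.0). A grid search over α would repeat this for several values and keep whichever gives the lowest quantization error for the layer.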

License

MIT. See LICENSE for the full text.


Acknowledgements

The seven-wave production hardening was specified, implemented, and refined in collaboration with Claude Opus 4.7 (1M context). Per-wave architecture notes live under docs/architecture/.
