Skip to main content

ISAT: Inference Stack Auto-Tuner — 55-command CLI to auto-tune, profile, prune, deploy, and monitor ONNX models on any GPU. By Sudheer Ibrahim Daniel Devu.

Project description

ISAT -- Inference Stack Auto-Tuner

PyPI version PyPI downloads Python 3.9+ License: Apache 2.0 GitHub stars GitHub release

By Sudheer Ibrahim Daniel Devu

One command to find the fastest inference configuration for any ONNX model on any GPU.

ISAT is a 55-command production-grade CLI toolkit for ONNX inference optimization. It jointly searches across memory strategy, kernel backend, precision, graph transforms, batch size, and thread tuning -- then benchmarks each combination with thermal-aware cooldowns, statistical rigor, and Bayesian optimization.

pip install isat-tuner
isat tune model.onnx --profile cloud

Why ISAT?

Deploying an ONNX model today means manually tweaking dozens of settings:

Setting Choices Impact
HSA_XNACK 0 or 1 Up to 30% on APUs
MIGRAPHX_DISABLE_MLIR 0 or 1 5-15% GEMM performance
MIGRAPHX_SET_GEMM_PROVIDER default, rocblas, hipblaslt 10-25% on GEMM-heavy models
Precision FP32, FP16, INT8 2-4x throughput
Batch size 1 to 256 Linear throughput scaling
Graph optimization level 0-99 5-20% latency reduction
Inter/intra op threads 1 to N CPU-side parallelism

A single wrong choice can leave 40%+ performance on the table. With 6 dimensions and 4+ choices each, there are thousands of combinations. Nobody has time to test them all manually.

ISAT does it automatically.


All 55 Commands

Auto-Tuning & Search

Command What it does
isat tune Auto-tune an ONNX model across all search dimensions
isat profiles List available tuning profiles (edge, cloud, latency, etc.)
isat init Generate a default isat.yaml config file
isat batch Find optimal batch size (latency vs throughput tradeoff)
isat shapes Benchmark model across dynamic input shapes

Model Analysis & Inspection

Command What it does
isat inspect Deep fingerprint a model without benchmarking
isat diff Structural diff between two ONNX models
isat fusion Analyze operator fusion (fused vs unfused ops)
isat attention Profile attention heads in transformer models
isat weight-sharing Detect shared/tied weights across layers
isat visualize Visualize ONNX graph (DOT, ASCII, histogram)
isat scan Security and compliance scan of ONNX model
isat compat-matrix Operator compatibility across providers

Benchmarking & Profiling

Command What it does
isat profile Decompose latency into load/compile/inference phases
isat llm-bench LLM token throughput (TPS, TTFT, ITL with P95)
isat compiler-compare Benchmark same model across ALL execution providers
isat stress Sustained/burst/ramp stress testing
isat leak-check Detect memory leaks during inference
isat power Profile power efficiency (perf/watt, energy/inference)
isat thermal Thermal throttle detection during inference
isat gpu-frag GPU memory fragmentation analysis
isat warmup Analyze warmup behavior, find optimal iterations

Model Optimization

Command What it does
isat optimize Optimize ONNX model (simplify, quantize, export)
isat prune Prune model weights (magnitude/percentage/global)
isat surgery ONNX graph surgery (remove/rename/extract nodes)
isat quant-sensitivity Per-layer quantization sensitivity analysis
isat distill Knowledge distillation planning for teacher models

Production Deployment

Command What it does
isat serve Launch REST API server (FastAPI)
isat triton Generate Triton Inference Server config
isat canary Canary deployment between two model versions
isat ensemble Run model ensemble with aggregation
isat guard Validate inference inputs against model schema
isat codegen Generate standalone C++ inference code

Monitoring & Operations

Command What it does
isat alerts Inference alert rules engine (P99, error rate, GPU temp)
isat trace OpenTelemetry-compatible request tracing
isat drift Monitor output quality and detect confidence drift
isat regression Performance regression detection across versions
isat replay Record or replay inference requests

Planning & Cost

Command What it does
isat cost Estimate cloud inference cost
isat sla Validate inference against SLA requirements
isat recommend Hardware recommendation for a model
isat migrate Generate migration plan between providers
isat memory Estimate memory usage and predict OOM risk

Infrastructure & Utilities

Command What it does
isat hwinfo Print hardware fingerprint
isat doctor Pre-flight system health and compatibility check
isat history Show past tuning results from database
isat export Re-generate reports from database
isat compare Compare two configs with significance testing
isat abtest A/B test two models with statistical rigor
isat snapshot Capture environment state for reproducibility
isat cache Manage compilation cache (MIGraphX/ORT)
isat zoo List pre-tuned model configurations
isat download Download ONNX model by name or URL
isat registry Model version registry (register, promote, diff)
isat pipeline Profile multi-model inference pipeline

Installation

# From PyPI
pip install isat-tuner

# From GitHub (latest)
pip install git+https://github.com/SID-Devu/isat-tuner.git

# With all optional features
pip install "isat-tuner[all]"

# Platform-specific
pip install "isat-tuner[rocm]"      # ROCm GPU support
pip install "isat-tuner[cuda]"      # NVIDIA CUDA support
pip install "isat-tuner[server]"    # REST API server
pip install "isat-tuner[bayesian]"  # Bayesian optimization (scipy)

# Development
git clone https://github.com/SID-Devu/isat-tuner.git
cd isat && pip install -e ".[dev,all]"

Quick Start

# One-command auto-tune
isat tune model.onnx --warmup 3 --runs 5 --cooldown 60

# Use a deployment profile
isat tune model.onnx --profile edge
isat tune model.onnx --profile cloud

# Bayesian optimization (smarter than grid search)
isat tune model.onnx --bayesian --max-configs 20

# Inspect model
isat inspect model.onnx

# Check your hardware
isat hwinfo

# System health check
isat doctor

# LLM token benchmarking
isat llm-bench model.onnx --seq-lengths 32,64,128,256

# Compare across all available providers
isat compiler-compare model.onnx

# Prune a model
isat prune model.onnx --strategy magnitude --sparsity 0.5

# Analyze operator fusion
isat fusion model.onnx

# Generate C++ inference code
isat codegen model.onnx --output-dir cpp_build/

# Canary deployment (safe model rollout)
isat canary baseline.onnx candidate.onnx

# Monitor output drift
isat drift model.onnx

# Graph surgery (remove Identity/Dropout nodes)
isat surgery model.onnx --remove-op Identity --remove-op Dropout

# Launch REST API
isat serve --port 8000

Search Dimensions

1. Memory Strategy

Config Environment When to use
xnack0_default HSA_XNACK=0 Discrete GPUs, no demand paging
xnack1_default HSA_XNACK=1 APUs, unified memory
xnack1_coarse_grain XNACK=1 + coarse-grain Large models on APU
xnack1_oversubscribe XNACK=1 + queue limit Models exceeding VRAM

2. Kernel Backend

Config Environment When to use
mlir_default (default) General-purpose, fused kernels
rocblas_explicit MIGRAPHX_DISABLE_MLIR=1 GEMM-heavy models
hipblaslt_explicit MIGRAPHX_SET_GEMM_PROVIDER=hipblaslt Latest GEMM tuning

3. Precision

Config Method Typical speedup
fp32_native Original Baseline
fp16_migraphx MIGraphX built-in 1.5-2x
int8_qdq ORT static quantization 2-4x

4. Graph Transforms

Config Transform Effect
raw_opt99 None + full ORT opt Default
sim_opt99 onnxsim + full ORT opt Remove dead ops
pinned_opt99 Freeze dynamic dims Better kernel selection

5. Batch Size

Auto-explores powers of 2 up to GPU memory limit.

6. Thread Tuning

Explores inter/intra thread counts and sequential vs parallel execution modes.

7. Execution Provider (Multi-Platform)

Auto-detects available providers: MIGraphX, CUDA, TensorRT, OpenVINO, ROCm, DirectML, CPU.


Deployment Profiles

Profile Warmup Runs Cooldown Priority Use case
edge 3 10 30s Latency IoT, mobile, embedded
cloud 5 20 120s Throughput Serving, batch processing
latency 5 30 60s P99 Real-time inference
throughput 3 15 120s FPS Max batch throughput
power 3 10 60s Perf/watt Battery, thermal-constrained
quick 1 3 15s Latency Fast exploration
exhaustive 5 50 180s Latency Leave no stone unturned
apu 3 10 60s Latency APU-specific optimization

Output & Reports

File Description
isat_report.html Interactive HTML dashboard
isat_report.json Machine-readable results for automation
best_config.sh Shell script -- source it to apply best env vars
isat_results.db SQLite database of all historical results
config.pbtxt Triton Inference Server config
isat.prom Prometheus metrics
traces_*.json OpenTelemetry-compatible trace export
isat_inference.cpp Generated C++ inference code

REST API Server

isat serve --port 8000
Endpoint Method Description
/api/v1/tune POST Submit a tuning job
/api/v1/jobs GET List all jobs
/api/v1/jobs/{id} GET Get job status + results
/api/v1/jobs/{id}/report GET Get JSON report
/api/v1/jobs/{id}/report/html GET Get HTML dashboard
/api/v1/inspect POST Fingerprint a model
/api/v1/hardware GET Get hardware fingerprint
/api/v1/history GET Query historical results
/health GET Health check

Docker

docker-compose up -d

# Or standalone
docker build -t isat .
docker run --device /dev/kfd --device /dev/dri --group-add video \
  -v ./models:/models isat tune /models/model.onnx

Using as a Library

from isat.fingerprint import fingerprint_hardware, fingerprint_model
from isat.search import SearchEngine
from isat.benchmark import BenchmarkRunner
from isat.analysis import ParetoFrontier
from isat.pruning.pruner import ModelPruner
from isat.fusion.analyzer import FusionAnalyzer
from isat.guard.validator import InputGuard
from isat.inference_cache.cache import InferenceCache

# Auto-tune
hw = fingerprint_hardware()
model = fingerprint_model("model.onnx")
engine = SearchEngine(hw, model, warmup=3, runs=5, cooldown=60)
candidates = engine.generate_candidates()

# Prune a model
pruner = ModelPruner("model.onnx")
result = pruner.prune(strategy="magnitude", sparsity=0.5)

# Analyze fusion
analyzer = FusionAnalyzer("model.onnx")
report = analyzer.analyze()

# Validate inputs before inference
guard = InputGuard(model_path="model.onnx")
result = guard.validate({"input": my_tensor})

# Cache inference results
cache = InferenceCache(max_memory_entries=1000, disk_cache_dir="./cache")

CI/CD Integration

# Fail CI if latency > 50ms or throughput < 100 fps
isat tune model.onnx --gate-latency 50 --gate-throughput 100
echo $?  # 0 = pass, 1 = fail

Architecture

isat/
├── cli.py                 # 55 subcommands
├── fingerprint/           # Hardware + model fingerprinting
├── search/                # 7-dimension search engine + Bayesian optimization
├── benchmark/             # Runner, stats, thermal monitoring, multi-GPU
├── analysis/              # Outliers, significance, Pareto, regression
├── pruning/               # Magnitude/percentage/global weight pruning
├── distillation/          # Knowledge distillation planning
├── fusion/                # Operator fusion analysis
├── attention/             # Transformer attention head profiling
├── surgery/               # ONNX graph surgery (remove/rename/extract)
├── guard/                 # Input validation and schema enforcement
├── ensemble/              # Multi-model ensemble with aggregation
├── canary/                # Canary deployment with auto-rollback
├── alerts/                # Alert rules engine (P99, error rate, temp)
├── tracing/               # OpenTelemetry-compatible request tracing
├── inference_cache/       # LRU + disk inference result caching
├── replay/                # Record and replay inference requests
├── output_monitor/        # Confidence drift detection (KS test)
├── llm_bench/             # LLM token throughput (TPS, TTFT, ITL)
├── compiler_compare/      # Cross-provider benchmark comparison
├── codegen/               # ONNX to C++ code generator
├── weight_analysis/       # Weight sharing detection
├── continuous_profiler/   # Always-on production profiling
├── gpu_frag/              # GPU memory fragmentation analysis
├── batching/              # Dynamic request batching engine
├── scanner/               # ONNX security/compliance scanner
├── compat_matrix/         # Operator compatibility matrix
├── thermal/               # Thermal throttle detection
├── quant_sensitivity/     # Per-layer quantization sensitivity
├── pipeline/              # Multi-model pipeline optimizer
├── recommend/             # Hardware recommendation engine
├── registry/              # Model version registry
├── regression/            # Performance regression detector
├── optimizer/             # Graph transforms + quantization
├── profiler/              # Latency decomposition
├── cost/                  # Cloud cost estimation
├── sla/                   # SLA validation
├── memory/                # Memory planning + OOM prediction
├── power/                 # Power efficiency profiling
├── health/                # System health checks
├── cache/                 # Compilation cache management
├── migration/             # Provider migration planning
├── warmup/                # Warmup analysis
├── shapes/                # Dynamic shape benchmarking
├── hub/                   # Model download from HuggingFace/ONNX Zoo
├── scheduler/             # Adaptive batch scheduling
├── snapshot/              # Environment snapshotting
├── abtesting/             # A/B testing framework
├── visualizer/            # Graph visualization (DOT, ASCII)
├── stress/                # Stress testing + memory leak detection
├── notifications/         # Webhook, Slack, console notifications
├── server/                # FastAPI REST API
├── integrations/          # Triton, Prometheus, CI/CD
├── database/              # SQLite results database
├── report/                # JSON, HTML, console reports
├── config/                # YAML/JSON config loader
├── profiles/              # 8 deployment profiles
├── model_zoo.py           # Pre-tuned model configurations
├── plugins.py             # Plugin system with lifecycle hooks
├── retry.py               # Exponential backoff retry logic
└── utils/                 # sysfs, rocm, onnx utilities

Requirements

  • Python >= 3.9
  • onnxruntime (CPU), onnxruntime-rocm (ROCm), or onnxruntime-gpu (CUDA)
  • onnx, numpy

Optional: scipy, fastapi, uvicorn, onnxsim, prometheus-client, pyyaml, jinja2


Version History

Version Date Highlights
v0.7.x Apr 2026 Pruning, distillation, fusion analysis, LLM bench, compiler comparison, replay, drift monitor, codegen (55 commands)
v0.6.0 Apr 2026 Tracing, canary deploy, alerts, graph surgery, caching, input guard, ensemble, GPU frag (45 commands)
v0.5.0 Apr 2026 Regression detector, security scanner, compat matrix, thermal monitor, quant sensitivity, pipeline optimizer, HW recommender, model registry (38 commands)
v0.4.0 Apr 2026 Dynamic shapes, model hub, power profiler, memory planner, A/B testing, graph visualizer, env snapshot, batch scheduler (30 commands)
v0.3.0 Apr 2026 Latency profiler, cost estimator, SLA validator, health checker, migration tool, notifications (22 commands)
v0.2.0 Apr 2026 Config system, model optimization, stress testing, plugin system, model zoo (14 commands)
v0.1.0 Apr 2026 Initial release: auto-tuning, Bayesian search, multi-provider support (9 commands)

Citation

@software{isat_tuner,
  author = {Sudheer Ibrahim Daniel Devu},
  title = {ISAT: Inference Stack Auto-Tuner},
  year = {2026},
  version = {0.7.2},
  url = {https://github.com/SID-Devu/isat-tuner},
  license = {Apache-2.0}
}

License

Apache 2.0 -- see LICENSE

Copyright 2026 Sudheer Ibrahim Daniel Devu

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

isat_tuner-0.7.2.tar.gz (186.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

isat_tuner-0.7.2-py3-none-any.whl (203.3 kB view details)

Uploaded Python 3

File details

Details for the file isat_tuner-0.7.2.tar.gz.

File metadata

  • Download URL: isat_tuner-0.7.2.tar.gz
  • Upload date:
  • Size: 186.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for isat_tuner-0.7.2.tar.gz
Algorithm Hash digest
SHA256 892b8cb4851dedb107e763cef81c3cfc8f9bcfd98233b7e5a64b58b66bf63ecd
MD5 d93597f38476f3c84cb38995dd4e9b30
BLAKE2b-256 08b3168894d18979d6af79288f5919abbcbb1c6fe740f76724f0b5732836cb6d

See more details on using hashes here.

File details

Details for the file isat_tuner-0.7.2-py3-none-any.whl.

File metadata

  • Download URL: isat_tuner-0.7.2-py3-none-any.whl
  • Upload date:
  • Size: 203.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for isat_tuner-0.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 f7423932faf86ef2059c35cd969baf2e9620ba6dcbd86399214a65bee4016d10
MD5 1746d8b018c7b58baf7762e234440432
BLAKE2b-256 6067f975ab6faf5edd3ed0887528ef49506edfaf531efed37e3c504ea1ce22fd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page