ISAT: Inference Stack Auto-Tuner — CLI toolkit to auto-tune, profile, prune, deploy, and monitor ONNX models on any GPU.

Project description

ISAT -- Inference Stack Auto-Tuner

One command to detect your hardware, recommend the optimal setup, and auto-tune any ONNX model -- on any GPU from any vendor.

ISAT is a production-grade CLI toolkit for ONNX inference optimization. It auto-detects your hardware (AMD, NVIDIA, Intel, Apple, Qualcomm), classifies it (iGPU/dGPU/APU/SoC), and generates copy-paste-ready inference configurations. Then it jointly searches across memory strategy, kernel backend, precision, graph transforms, batch size, thread tuning, and execution provider -- benchmarking each combination with thermal-aware cooldowns, statistical rigor, and Bayesian optimization.

pip install isat-tuner

# Detect your hardware and get instant recommendations:
isat tune

# Detect + recommend + auto-tune a specific model:
isat tune model.onnx

# Full production tuning with cloud profile:
isat tune model.onnx --profile cloud

Install note: On modern Linux (Ubuntu 23.04+, Debian 12+), bare pip install is blocked by PEP 668. Use pipx install isat-tuner instead -- it creates an isolated environment and puts isat on your PATH automatically. If you don't have pipx: sudo apt install pipx && pipx ensurepath.
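
For copy-paste convenience, the full pipx route is below (the final --help call is only a sanity check that the entry point landed on your PATH):

sudo apt install pipx && pipx ensurepath
pipx install isat-tuner
isat --help   # verify the CLI is reachable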


Why ISAT?

Deploying an ONNX model today means manually tweaking dozens of settings:

| Setting | Choices | Impact |
|---------|---------|--------|
| HSA_XNACK | 0 or 1 | Up to 30% on APUs |
| MIGRAPHX_DISABLE_MLIR | 0 or 1 | 5-15% GEMM performance |
| MIGRAPHX_SET_GEMM_PROVIDER | default, rocblas, hipblaslt | 10-25% on GEMM-heavy models |
| Precision | FP32, FP16, INT8 | 2-4x throughput |
| Batch size | 1 to 256 | Linear throughput scaling |
| Graph optimization level | 0-99 | 5-20% latency reduction |
| Inter/intra op threads | 1 to N | CPU-side parallelism |

A single wrong choice can leave 40%+ performance on the table. With seven search dimensions and several choices in each, the space quickly runs into thousands of combinations (even four choices per dimension gives 4^7 = 16,384). Nobody has time to test them all manually.

ISAT does it automatically.


All 55 Commands

Auto-Tuning & Search

| Command | What it does |
|---------|--------------|
| isat tune | Auto-detect hardware + recommend + tune (works with or without a model) |
| isat profiles | List available tuning profiles (edge, cloud, latency, etc.) |
| isat init | Generate a default isat.yaml config file |
| isat batch | Find optimal batch size (latency vs throughput tradeoff) |
| isat shapes | Benchmark model across dynamic input shapes |

Model Analysis & Inspection

| Command | What it does |
|---------|--------------|
| isat inspect | Deep fingerprint a model without benchmarking |
| isat diff | Structural diff between two ONNX models |
| isat fusion | Analyze operator fusion (fused vs unfused ops) |
| isat attention | Profile attention heads in transformer models |
| isat weight-sharing | Detect shared/tied weights across layers |
| isat visualize | Visualize ONNX graph (DOT, ASCII, histogram) |
| isat scan | Security and compliance scan of ONNX model |
| isat compat-matrix | Operator compatibility across providers |

Benchmarking & Profiling

| Command | What it does |
|---------|--------------|
| isat profile | Decompose latency into load/compile/inference phases |
| isat llm-bench | LLM token throughput (TPS, TTFT, ITL with P95) |
| isat compiler-compare | Benchmark same model across ALL execution providers |
| isat stress | Sustained/burst/ramp stress testing |
| isat leak-check | Detect memory leaks during inference |
| isat power | Profile power efficiency (perf/watt, energy/inference) |
| isat thermal | Thermal throttle detection during inference |
| isat gpu-frag | GPU memory fragmentation analysis |
| isat warmup | Analyze warmup behavior, find optimal iterations |

Model Optimization

| Command | What it does |
|---------|--------------|
| isat optimize | Optimize ONNX model (simplify, quantize, export) |
| isat prune | Prune model weights (magnitude/percentage/global) |
| isat surgery | ONNX graph surgery (remove/rename/extract nodes) |
| isat quant-sensitivity | Per-layer quantization sensitivity analysis |
| isat distill | Knowledge distillation planning for teacher models |

Production Deployment

| Command | What it does |
|---------|--------------|
| isat serve | Launch REST API server (FastAPI) |
| isat triton | Generate Triton Inference Server config |
| isat canary | Canary deployment between two model versions |
| isat ensemble | Run model ensemble with aggregation |
| isat guard | Validate inference inputs against model schema |
| isat codegen | Generate standalone C++ inference code |

Monitoring & Operations

| Command | What it does |
|---------|--------------|
| isat alerts | Inference alert rules engine (P99, error rate, GPU temp) |
| isat trace | OpenTelemetry-compatible request tracing |
| isat drift | Monitor output quality and detect confidence drift |
| isat regression | Performance regression detection across versions |
| isat replay | Record or replay inference requests |

Planning & Cost

| Command | What it does |
|---------|--------------|
| isat cost | Estimate cloud inference cost |
| isat sla | Validate inference against SLA requirements |
| isat recommend | Hardware recommendation for a model |
| isat migrate | Generate migration plan between providers |
| isat memory | Estimate memory usage and predict OOM risk |

Infrastructure & Utilities

| Command | What it does |
|---------|--------------|
| isat hwinfo | Print hardware fingerprint |
| isat doctor | Pre-flight system health and compatibility check |
| isat history | Show past tuning results from database |
| isat export | Re-generate reports from database |
| isat compare | Compare two configs with significance testing |
| isat abtest | A/B test two models with statistical rigor |
| isat snapshot | Capture environment state for reproducibility |
| isat cache | Manage compilation cache (MIGraphX/ORT) |
| isat zoo | List pre-tuned model configurations |
| isat download | Download ONNX model by name or URL |
| isat registry | Model version registry (register, promote, diff) |
| isat pipeline | Profile multi-model inference pipeline |

Installation

# From PyPI
pip install isat-tuner

# From GitHub (latest)
pip install git+https://github.com/SID-Devu/isat-tuner.git

# With all optional features
pip install "isat-tuner[all]"

# Platform-specific
pip install "isat-tuner[rocm]"      # ROCm GPU support
pip install "isat-tuner[cuda]"      # NVIDIA CUDA support
pip install "isat-tuner[server]"    # REST API server
pip install "isat-tuner[bayesian]"  # Bayesian optimization (scipy)

# Development
git clone https://github.com/SID-Devu/isat-tuner.git
cd isat-tuner && pip install -e ".[dev,all]"

Quick Start

# One-command auto-tune
isat tune model.onnx --warmup 3 --runs 5 --cooldown 60

# Use a deployment profile
isat tune model.onnx --profile edge
isat tune model.onnx --profile cloud

# Bayesian optimization (smarter than grid search)
isat tune model.onnx --bayesian --max-configs 20

# Inspect model
isat inspect model.onnx

# Check your hardware
isat hwinfo

# System health check
isat doctor

# LLM token benchmarking
isat llm-bench model.onnx --seq-lengths 32,64,128,256

# Compare across all available providers
isat compiler-compare model.onnx

# Prune a model
isat prune model.onnx --strategy magnitude --sparsity 0.5

# Analyze operator fusion
isat fusion model.onnx

# Generate C++ inference code
isat codegen model.onnx --output-dir cpp_build/

# Canary deployment (safe model rollout)
isat canary baseline.onnx candidate.onnx

# Monitor output drift
isat drift model.onnx

# Graph surgery (remove Identity/Dropout nodes)
isat surgery model.onnx --remove-op Identity --remove-op Dropout

# Launch REST API
isat serve --port 8000

Search Dimensions

1. Memory Strategy

| Config | Environment | When to use |
|--------|-------------|-------------|
| xnack0_default | HSA_XNACK=0 | Discrete GPUs, no demand paging |
| xnack1_default | HSA_XNACK=1 | APUs, unified memory |
| xnack1_coarse_grain | XNACK=1 + coarse-grain | Large models on APU |
| xnack1_oversubscribe | XNACK=1 + queue limit | Models exceeding VRAM |

2. Kernel Backend

| Config | Environment | When to use |
|--------|-------------|-------------|
| mlir_default | (default) | General-purpose, fused kernels |
| rocblas_explicit | MIGRAPHX_DISABLE_MLIR=1 | GEMM-heavy models |
| hipblaslt_explicit | MIGRAPHX_SET_GEMM_PROVIDER=hipblaslt | Latest GEMM tuning |
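
ISAT toggles these environment variables per candidate during the search, but you can also apply a combination by hand before launching your own inference process. A minimal sketch -- the specific values are illustrative, not a recommendation, and run_inference.py is a hypothetical script:

# Illustrative values only; the right combination depends on your hardware
export HSA_XNACK=1                            # memory strategy: unified memory (APUs)
export MIGRAPHX_SET_GEMM_PROVIDER=hipblaslt   # kernel backend: hipBLASLt GEMMs
python run_inference.py                       # hypothetical inference entry point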

3. Precision

| Config | Method | Typical speedup |
|--------|--------|-----------------|
| fp32_native | Original | Baseline |
| fp16_migraphx | MIGraphX built-in | 1.5-2x |
| int8_qdq | ORT static quantization | 2-4x |

4. Graph Transforms

| Config | Transform | Effect |
|--------|-----------|--------|
| raw_opt99 | None + full ORT opt | Default |
| sim_opt99 | onnxsim + full ORT opt | Remove dead ops |
| pinned_opt99 | Freeze dynamic dims | Better kernel selection |

5. Batch Size

Auto-explores powers of 2 up to GPU memory limit.

6. Thread Tuning

Explores inter/intra thread counts and sequential vs parallel execution modes.

7. Execution Provider (Multi-Platform)

Auto-detects available providers: MIGraphX, CUDA, TensorRT, OpenVINO, ROCm, DirectML, CPU.
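
Independently of ISAT, you can confirm which providers your installed onnxruntime build actually exposes -- a useful sanity check before tuning:

python -c "import onnxruntime as ort; print(ort.get_available_providers())"

If the provider you expect (e.g. MIGraphXExecutionProvider or CUDAExecutionProvider) is missing from the output, install the matching onnxruntime-rocm or onnxruntime-gpu wheel first.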


Deployment Profiles

| Profile | Warmup | Runs | Cooldown | Priority | Use case |
|---------|--------|------|----------|----------|----------|
| edge | 3 | 10 | 30s | Latency | IoT, mobile, embedded |
| cloud | 5 | 20 | 120s | Throughput | Serving, batch processing |
| latency | 5 | 30 | 60s | P99 | Real-time inference |
| throughput | 3 | 15 | 120s | FPS | Max batch throughput |
| power | 3 | 10 | 60s | Perf/watt | Battery, thermal-constrained |
| quick | 1 | 3 | 15s | Latency | Fast exploration |
| exhaustive | 5 | 50 | 180s | Latency | Leave no stone unturned |
| apu | 3 | 10 | 60s | Latency | APU-specific optimization |

Output & Reports

| File | Description |
|------|-------------|
| isat_report.html | Interactive HTML dashboard |
| isat_report.json | Machine-readable results for automation |
| best_config.sh | Shell script -- source it to apply best env vars |
| isat_results.db | SQLite database of all historical results |
| config.pbtxt | Triton Inference Server config |
| isat.prom | Prometheus metrics |
| traces_*.json | OpenTelemetry-compatible trace export |
| isat_inference.cpp | Generated C++ inference code |
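
In automation, the shell artifact is meant to be sourced and the JSON report parsed. A sketch of that pattern -- note the .best_config key is an assumption about the report layout, not a documented schema:

# Apply the winning environment, then pull results for a dashboard
source best_config.sh
jq '.best_config' isat_report.json   # key name is an assumption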

REST API Server

isat serve --port 8000

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/tune | POST | Submit a tuning job |
| /api/v1/jobs | GET | List all jobs |
| /api/v1/jobs/{id} | GET | Get job status + results |
| /api/v1/jobs/{id}/report | GET | Get JSON report |
| /api/v1/jobs/{id}/report/html | GET | Get HTML dashboard |
| /api/v1/inspect | POST | Fingerprint a model |
| /api/v1/hardware | GET | Get hardware fingerprint |
| /api/v1/history | GET | Query historical results |
| /health | GET | Health check |
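
A minimal client session against the running server might look like the following; the JSON fields in the POST body are assumptions, since the request schema is not documented here:

# Health check, submit a tuning job, then list jobs
curl http://localhost:8000/health
curl -X POST http://localhost:8000/api/v1/tune \
  -H "Content-Type: application/json" \
  -d '{"model": "model.onnx", "profile": "cloud"}'   # body fields are assumptions
curl http://localhost:8000/api/v1/jobs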

Docker

docker-compose up -d

# Or standalone
docker build -t isat .
docker run --device /dev/kfd --device /dev/dri --group-add video \
  -v "$(pwd)/models":/models isat tune /models/model.onnx

Using as a Library

from isat.fingerprint import fingerprint_hardware, fingerprint_model
from isat.search import SearchEngine
from isat.benchmark import BenchmarkRunner
from isat.analysis import ParetoFrontier
from isat.pruning.pruner import ModelPruner
from isat.fusion.analyzer import FusionAnalyzer
from isat.guard.validator import InputGuard
from isat.inference_cache.cache import InferenceCache

# Auto-tune
hw = fingerprint_hardware()
model = fingerprint_model("model.onnx")
engine = SearchEngine(hw, model, warmup=3, runs=5, cooldown=60)
candidates = engine.generate_candidates()

# Prune a model
pruner = ModelPruner("model.onnx")
result = pruner.prune(strategy="magnitude", sparsity=0.5)

# Analyze fusion
analyzer = FusionAnalyzer("model.onnx")
report = analyzer.analyze()

# Validate inputs before inference
guard = InputGuard(model_path="model.onnx")
result = guard.validate({"input": my_tensor})

# Cache inference results
cache = InferenceCache(max_memory_entries=1000, disk_cache_dir="./cache")

CI/CD Integration

# Fail CI if latency > 50ms or throughput < 100 fps
isat tune model.onnx --gate-latency 50 --gate-throughput 100
echo $?  # 0 = pass, 1 = fail
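
Wrapped in a CI step, the exit code does the gating. A hypothetical shell script for such a step:

#!/usr/bin/env bash
# Hypothetical CI gate built on the documented exit codes (0 = pass, 1 = fail)
set -euo pipefail
pip install isat-tuner
isat tune model.onnx --gate-latency 50 --gate-throughput 100 ||
  { echo "ISAT performance gate failed"; exit 1; }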

Architecture

isat/
├── cli.py                 # 55 subcommands
├── fingerprint/           # Hardware + model fingerprinting
├── search/                # 7-dimension search engine + Bayesian optimization
├── benchmark/             # Runner, stats, thermal monitoring, multi-GPU
├── analysis/              # Outliers, significance, Pareto, regression
├── pruning/               # Magnitude/percentage/global weight pruning
├── distillation/          # Knowledge distillation planning
├── fusion/                # Operator fusion analysis
├── attention/             # Transformer attention head profiling
├── surgery/               # ONNX graph surgery (remove/rename/extract)
├── guard/                 # Input validation and schema enforcement
├── ensemble/              # Multi-model ensemble with aggregation
├── canary/                # Canary deployment with auto-rollback
├── alerts/                # Alert rules engine (P99, error rate, temp)
├── tracing/               # OpenTelemetry-compatible request tracing
├── inference_cache/       # LRU + disk inference result caching
├── replay/                # Record and replay inference requests
├── output_monitor/        # Confidence drift detection (KS test)
├── llm_bench/             # LLM token throughput (TPS, TTFT, ITL)
├── compiler_compare/      # Cross-provider benchmark comparison
├── codegen/               # ONNX to C++ code generator
├── weight_analysis/       # Weight sharing detection
├── continuous_profiler/   # Always-on production profiling
├── gpu_frag/              # GPU memory fragmentation analysis
├── batching/              # Dynamic request batching engine
├── scanner/               # ONNX security/compliance scanner
├── compat_matrix/         # Operator compatibility matrix
├── thermal/               # Thermal throttle detection
├── quant_sensitivity/     # Per-layer quantization sensitivity
├── pipeline/              # Multi-model pipeline optimizer
├── recommend/             # Hardware recommendation engine
├── registry/              # Model version registry
├── regression/            # Performance regression detector
├── optimizer/             # Graph transforms + quantization
├── profiler/              # Latency decomposition
├── cost/                  # Cloud cost estimation
├── sla/                   # SLA validation
├── memory/                # Memory planning + OOM prediction
├── power/                 # Power efficiency profiling
├── health/                # System health checks
├── cache/                 # Compilation cache management
├── migration/             # Provider migration planning
├── warmup/                # Warmup analysis
├── shapes/                # Dynamic shape benchmarking
├── hub/                   # Model download from HuggingFace/ONNX Zoo
├── scheduler/             # Adaptive batch scheduling
├── snapshot/              # Environment snapshotting
├── abtesting/             # A/B testing framework
├── visualizer/            # Graph visualization (DOT, ASCII)
├── stress/                # Stress testing + memory leak detection
├── notifications/         # Webhook, Slack, console notifications
├── server/                # FastAPI REST API
├── integrations/          # Triton, Prometheus, CI/CD
├── database/              # SQLite results database
├── report/                # JSON, HTML, console reports
├── config/                # YAML/JSON config loader
├── profiles/              # 8 deployment profiles
├── model_zoo.py           # Pre-tuned model configurations
├── plugins.py             # Plugin system with lifecycle hooks
├── retry.py               # Exponential backoff retry logic
└── utils/                 # sysfs, rocm, onnx utilities

Requirements

  • Python >= 3.9
  • onnxruntime (CPU), onnxruntime-rocm (ROCm), or onnxruntime-gpu (CUDA)
  • onnx, numpy

Optional: scipy, fastapi, uvicorn, onnxsim, prometheus-client, pyyaml, jinja2


Version History

| Version | Date | Highlights |
|---------|------|------------|
| v0.7.x | Apr 2026 | Pruning, distillation, fusion analysis, LLM bench, compiler comparison, replay, drift monitor, codegen (55 commands) |
| v0.6.0 | Apr 2026 | Tracing, canary deploy, alerts, graph surgery, caching, input guard, ensemble, GPU frag (45 commands) |
| v0.5.0 | Apr 2026 | Regression detector, security scanner, compat matrix, thermal monitor, quant sensitivity, pipeline optimizer, HW recommender, model registry (38 commands) |
| v0.4.0 | Apr 2026 | Dynamic shapes, model hub, power profiler, memory planner, A/B testing, graph visualizer, env snapshot, batch scheduler (30 commands) |
| v0.3.0 | Apr 2026 | Latency profiler, cost estimator, SLA validator, health checker, migration tool, notifications (22 commands) |
| v0.2.0 | Apr 2026 | Config system, model optimization, stress testing, plugin system, model zoo (14 commands) |
| v0.1.0 | Apr 2026 | Initial release: auto-tuning, Bayesian search, multi-provider support (9 commands) |

Citation

@software{isat_tuner,
  author = {Sudheer Ibrahim Daniel Devu},
  title = {ISAT: Inference Stack Auto-Tuner},
  year = {2026},
  version = {0.7.2},
  url = {https://github.com/SID-Devu/isat-tuner},
  license = {Apache-2.0}
}

License

Apache 2.0 -- see LICENSE

Copyright 2026 Sudheer Ibrahim Daniel Devu

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

isat_tuner-0.8.4.tar.gz (213.9 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

isat_tuner-0.8.4-py3-none-any.whl (228.9 kB)


File details

Details for the file isat_tuner-0.8.4.tar.gz.

File metadata

  • Download URL: isat_tuner-0.8.4.tar.gz
  • Upload date:
  • Size: 213.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for isat_tuner-0.8.4.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 7dac46e6c8f32c290a58afb6bf39baaa8d591f8de1e8ec963414d6daae3a1756 |
| MD5 | 29cfe21af607d8d83f02abde90dab246 |
| BLAKE2b-256 | 2c796756e0b2166266cc5944f5f41842e555ddefe50092051e40aba72acae2f6 |

See more details on using hashes here.

File details

Details for the file isat_tuner-0.8.4-py3-none-any.whl.

File metadata

  • Download URL: isat_tuner-0.8.4-py3-none-any.whl
  • Upload date:
  • Size: 228.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for isat_tuner-0.8.4-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | bcb3a91d2264059c001ecd2cc8bde75ca82a4178d3926056b5f79448d5cbd4b6 |
| MD5 | cbcc163c26833c877793866b05099e35 |
| BLAKE2b-256 | 30da3ab8328d4f429956937b4a52186353f1dbb158efcf609dccb8f1a9246ee0 |

See more details on using hashes here.
