Accelerate Model Deployment on WinML
WinML CLI
WinML CLI is a CLI toolkit to build portable, performant, and high-quality models for Windows ML. It covers the entire journey from pretrained model to on-device inference — export, optimization, quantization, compilation, and benchmarking — across all execution providers, regardless of silicon.
:dart: WinML CLI Is Right for You If
- You want to build models that run on any Windows device — Qualcomm, Intel, AMD, NVIDIA, or CPU
- You want to benchmark a model with one command — latency, throughput, and live hardware utilization
- You want to catch compatibility issues ahead of time — unsupported ops, shape mismatches, EP gaps
- You want deep insights into your model — I/O shapes, task mapping, operator coverage per EP
- You want a repeatable and traceable model building process — config-driven, inspectable at every stage
- You want AI agents to build and profile models for you — agent-ready skills for coding assistants
:desktop_computer: Supported Hardware
| Execution Provider | Hardware | Status | EP Flag | Device Flag |
|---|---|---|---|---|
| QNN | Qualcomm NPU (Snapdragon X Elite) | 🟢 Ready | --ep qnn | --device npu |
| OpenVINO | Intel NPU (Meteor Lake / Lunar Lake) | 🟢 Ready | --ep openvino | --device npu |
| VitisAI | AMD NPU (Ryzen AI) | 🟢 Ready | --ep vitisai | --device npu |
| NvTensorRTRTX | NVIDIA discrete GPUs | 🔶 Planned | --ep nv_tensorrt_rtx | --device gpu |
| MIGraphX | AMD discrete GPUs | 🔶 Planned | --ep migraphx | --device gpu |
| Dml | Hardware-agnostic GPU backend | 🔶 Planned | --ep dml | --device gpu |
| CPU | Cross-platform fallback | ⚪ Always available | --ep cpu | --device cpu |
Tip: Use --device auto and WinML CLI picks the best available device: NPU first, then GPU, then CPU.
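The fallback order in the tip can be pictured with a small sketch. This is only an illustration of the NPU, GPU, CPU priority; the function name `pick_device` and the idea of passing in a list of detected devices are assumptions made for the example, not how WinML CLI is implemented.

```python
# Priority order taken from the tip above: NPU first, then GPU, then CPU.
DEVICE_PRIORITY = ("npu", "gpu", "cpu")


def pick_device(available):
    """Return the highest-priority device name found in `available`.

    `available` is a hypothetical list of detected device names,
    for example ["cpu", "gpu"].
    """
    present = {name.lower() for name in available}
    for device in DEVICE_PRIORITY:
        if device in present:
            return device
    # The hardware table lists CPU as always available, so fall back to it.
    return "cpu"
```

For instance, `pick_device(["cpu", "gpu"])` selects the GPU because no NPU is present.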
:clipboard: Prerequisites
Required Software
| Component | How to Get It |
|---|---|
| Windows 11 (x64 or ARM64) | Windows 11 24H2+ required for NPU support |
| UV | Install UV |
| WinML CLI (Python wheel) | Releases |
Required Hardware
WinML CLI targets NPUs. We recommend testing on one of the following NPU devices:
| Device | EP | Flag |
|---|---|---|
| Snapdragon X Elite (Qualcomm) | QNN | --ep qnn --device npu |
| Intel AI Boost (Meteor Lake / Lunar Lake) | OpenVINO | --ep openvino --device npu |
| AMD Ryzen AI (Phoenix / Hawk Point / Strix) | VitisAI | --ep vitisai --device npu |
No NPU? Use --device auto — WinML CLI will fall back to the best available device (GPU → CPU). Note that winml compile requires NPU and cannot run without one.
Accepted Inputs
- Hugging Face model ID (e.g., microsoft/resnet-50): weights are downloaded on first run
- Local ONNX file (e.g., model.onnx): from winml export, winml build, or any ONNX you already have
The Golden Rule: Inspect First
Before running any pipeline command, always verify the model is supported:
winml inspect -m <model-id>
If inspect prints an error or shows Unsupported, skip that model. Only models that pass inspect are valid inputs for export, analyze, build, perf, and eval.
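The inspect-first rule can be scripted as a gate in front of the other commands. The sketch below is a minimal wrapper, assuming that an inspect error shows up as a non-zero exit code and that an unsupported model is reported with the literal word "Unsupported" in the output; both are assumptions drawn from the sentence above, and the helper names are hypothetical.

```python
import subprocess


def passes_inspect(returncode, output):
    """Apply the inspect-first rule to one `winml inspect` result.

    Assumption: an inspect error surfaces as a non-zero exit code.
    """
    if returncode != 0:
        return False
    # Assumption: unsupported models are reported with this literal word.
    return "Unsupported" not in output


def inspect_model(model_id):
    """Run `winml inspect -m <model-id>` and report whether the model passed."""
    result = subprocess.run(
        ["winml", "inspect", "-m", model_id],
        capture_output=True,
        text=True,
    )
    return passes_inspect(result.returncode, result.stdout + result.stderr)
```

A script could call `inspect_model("microsoft/resnet-50")` and continue with export or build only when it returns `True`.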
:package: Installation
WinML CLI requires Python 3.11 and is distributed as a Python wheel. We recommend uv for fast, reproducible environment setup.
1. Create a Python 3.11 environment
uv venv --python 3.11
Activate it:
# Windows (PowerShell)
.venv\Scripts\activate
# Windows (Git Bash)
source .venv/Scripts/activate
2. Install from wheel
uv pip install winml_cli-<version>-py3-none-any.whl
3. Verify your environment
winml sys --list-device --list-ep
Confirm that your target device and EP appear in the output:
- Snapdragon X Elite: look for QNNExecutionProvider
- Intel AI Boost: look for OpenVINOExecutionProvider
- AMD Ryzen AI: look for VitisAIExecutionProvider
If no NPU is detected, you can still use WinML CLI with --device auto for most commands. The only exception is winml compile, which requires an NPU device.
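The provider check above can also be done on captured text, for example after redirecting the output of winml sys --list-device --list-ep to a file. The sketch below only searches for the provider names listed above; it assumes those names appear verbatim in the output, and the helper names are hypothetical.

```python
# Provider names as listed above, keyed by the matching `--ep` flag value.
EXPECTED_PROVIDER = {
    "qnn": "QNNExecutionProvider",            # Snapdragon X Elite
    "openvino": "OpenVINOExecutionProvider",  # Intel AI Boost
    "vitisai": "VitisAIExecutionProvider",    # AMD Ryzen AI
}


def provider_listed(ep_flag, sys_output):
    """Return True if the provider for `ep_flag` appears in the captured text."""
    return EXPECTED_PROVIDER[ep_flag] in sys_output


def missing_providers(sys_output):
    """Return the `--ep` values whose provider name is absent from the text."""
    return [ep for ep, name in EXPECTED_PROVIDER.items() if name not in sys_output]
```

On a single-vendor machine, two of the three entries are expected to be missing; only the provider for your own NPU needs to be present.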
:wrench: Commands
| Category | Commands | Purpose |
|---|---|---|
| Primitives | inspect export optimize quantize compile | Single-stage building blocks |
| Pipeline | config build perf eval run* | End-to-end orchestration |
| Insights | analyze debug* | Diagnostics and compatibility |
| Utilities | hub cache* doctor* setting* sys | Catalog, cache, and environment |
* = coming soon
Primitives — one stage at a time
winml inspect — Discover model metadata. Prints the task, model class, input/output tensor names and shapes, and execution provider compatibility. No weights are loaded — this reads only the model configuration, making it fast and lightweight. Always run inspect first to verify a model is supported.
winml export — Convert a source model to ONNX. Takes a Hugging Face model ID (or local checkpoint) and produces a standards-compliant ONNX file with hierarchy-preserving metadata.
winml optimize — Fuse operators, simplify graphs, and prepare for target EPs. Takes an ONNX model and an optimization config (typically generated by winml analyze) and applies graph-level transformations: operator fusion, constant folding, shape inference, and EP-specific rewrites.
winml quantize — Compress to low-bit precision. Reduces model size and inference latency by converting weights and activations from FP32 to INT8 (or other low-bit formats). After quantization, the model is portable — it can run on any ONNX Runtime backend.
winml compile — Generate device-specific binaries. Takes a quantized ONNX model and produces EP-specific compiled artifacts (for example, QNN context binaries for Qualcomm NPU). This step locks the model to a specific device but delivers the lowest possible inference latency.
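To give a feel for what converting FP32 values to INT8 involves, here is a textbook symmetric per-tensor quantization sketch. It is a general illustration only: the scheme, the function names, and the choice of the range [-127, 127] are assumptions for the example, and winml quantize may use a different method (for instance calibration data or per-channel scales).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to INT8 codes.

    Returns (codes, scale) such that value is approximately code * scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; an all-zero tensor keeps scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that an accuracy check such as winml eval is meant to quantify at the model level.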
Pipeline — orchestrated workflows
winml config — Auto-detect optimal settings into a JSON config. Inspects the model and generates a complete build specification: task, I/O shapes, optimization flags, quantization parameters, and target EP settings. The config file is reviewable, editable, and version-controllable — the single source of truth for your build.
winml build — Orchestrate the full pipeline. Takes a config file and executes every stage in sequence: export, analyze, optimize, quantize, and compile. Two commands (config + build) replace eight manual steps.
winml perf — Benchmark latency, throughput, and hardware utilization. Runs inference on the target device and reports latency percentiles (p50, p90, p99), throughput (inferences per second), and optionally live hardware monitoring (CPU, RAM, NPU utilization) with the --monitor flag. Can accept a local ONNX file or a Hugging Face model ID.
winml eval — Measure model accuracy against reference datasets. Compares the output of your optimized/quantized model against the original to quantify any accuracy loss introduced by the pipeline.
winml run — End-to-end inference with pre/post processing. (Coming soon.)
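The latency percentiles and throughput that winml perf reports can be understood through a short calculation on raw timings. The sketch below uses the nearest-rank percentile definition and assumes inferences run back to back on a single stream; winml perf may compute its figures differently, and the function names are hypothetical.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Smallest rank such that at least p percent of samples are at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Summarize per-inference latencies from a sequential benchmark run."""
    total_s = sum(samples_ms) / 1000.0
    return {
        "p50": percentile(samples_ms, 50),
        "p90": percentile(samples_ms, 90),
        "p99": percentile(samples_ms, 99),
        # Assumes inferences ran one after another with no overlap.
        "throughput": len(samples_ms) / total_s,
    }
```

The gap between p50 and p99 is often more informative than the mean, since it shows how much occasional slow inferences deviate from the typical case.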
Insights — understand what is happening inside
winml analyze — Lint operators, check EP compatibility, and generate optimization config. The analyzer has two components: the Linter (like ESLint for ONNX) checks every operator against target EPs and classifies each as supported, partial, or unsupported. AutoConf detects suboptimal patterns and generates the optimization config that the optimizer consumes. Together they form the analyze-optimize loop.
winml debug — Interactive model debugging and layer-by-layer inspection. (Coming soon.)
Utilities — catalog, cache, and environment
winml catalog — Browse the curated built-in model catalog.
winml cache — Manage built model artifacts and pipeline outputs. View, clean, or selectively remove cached models and intermediate files.
winml doctor — Diagnose environment issues. Checks runtimes, execution providers, and dependencies to identify configuration problems.
winml setting — Configure WinML CLI preferences. Set default EPs, output directories, and other global options.
winml sys — System information and capability reporting. Prints detected hardware, available EPs, Python version, and installed package versions.
:rocket: Quick Start
Inspect a Model
The fastest way to get started is to inspect a model. Let's look at ResNet-50:
winml inspect -m microsoft/resnet-50
This prints the model's metadata without downloading weights:
- Task: image-classification (what the model does)
- Model class: ResNetForImageClassification (the architecture)
- Input tensors: names, data types, and shapes (e.g., pixel_values: float32 [1, 3, 224, 224])
- Output tensors: names, data types, and shapes (e.g., logits: float32 [1, 1000])
If inspect succeeds, the model is supported and you can proceed with the rest of the pipeline.
Golden rule: always inspect first. Before running export, build, perf, or any other pipeline command, verify the model is supported with winml inspect.
Build with Primitive Commands
This walkthrough builds ConvNeXT (facebook/convnext-base-224) step by step using primitive commands. ConvNeXT is a family of CNN models inspired by Vision Transformers, introduced by Meta in 2022 — it offers high accuracy while retaining the efficiency of CNNs.
Phase 1: Inspect
winml inspect -m facebook/convnext-base-224
Phase 2: Build a Portable Model
Export from PyTorch to ONNX:
winml export -m facebook/convnext-base-224 -o convnext/model.onnx -v
Analyze for EP compatibility:
winml analyze -m convnext/model.onnx --optim-config optim.json
Optimize the graph using the analyzer's config:
winml optimize -m convnext/model.onnx -c optim.json -o convnext/model_opt.onnx
Quantize to INT8:
winml quantize -m convnext/model_opt.onnx -o convnext/model_opt_int8.onnx
Phase 3: Benchmark on Device
Compile for NPU (generates device-specific binaries):
winml compile -m convnext/model_opt_int8.onnx --ep qnn -o convnext/model_compiled.onnx
Benchmark on NPU — note the latency:
winml perf -m convnext/model_compiled.onnx --ep qnn --iterations 100
Benchmark on CPU for comparison:
winml perf -m convnext/model_opt.onnx --ep cpu --iterations 100
Compare the two numbers to see the performance difference between NPU and CPU inference.
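The walkthrough above can be collected into a script so the same sequence is easy to repeat for another model. The sketch below only assembles the commands and flags shown in this section; the helper names are hypothetical, and `run_all` assumes that winml exits with a non-zero code when a stage fails.

```python
import subprocess


def primitive_commands(model_id, workdir, ep="qnn", iterations=100):
    """Return the walkthrough's commands as argument lists, in order."""
    exported = f"{workdir}/model.onnx"
    optimized = f"{workdir}/model_opt.onnx"
    quantized = f"{workdir}/model_opt_int8.onnx"
    compiled = f"{workdir}/model_compiled.onnx"
    count = str(iterations)
    return [
        ["winml", "inspect", "-m", model_id],
        ["winml", "export", "-m", model_id, "-o", exported, "-v"],
        ["winml", "analyze", "-m", exported, "--optim-config", "optim.json"],
        ["winml", "optimize", "-m", exported, "-c", "optim.json", "-o", optimized],
        ["winml", "quantize", "-m", optimized, "-o", quantized],
        ["winml", "compile", "-m", quantized, "--ep", ep, "-o", compiled],
        ["winml", "perf", "-m", compiled, "--ep", ep, "--iterations", count],
        ["winml", "perf", "-m", optimized, "--ep", "cpu", "--iterations", count],
    ]


def run_all(commands):
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        print(" ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_all(primitive_commands("facebook/convnext-base-224", "convnext"))` would replay the steps above; printing each command first keeps the run traceable.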
Build with Config + Build
Same model, different approach. Instead of running each command manually, use the config-driven pipeline. Think of it like CMake: config generates a build plan, build executes it.
Generate the build config:
winml config -m facebook/convnext-base-224 -o convnext_config.json
This creates a JSON file containing all settings for every pipeline step — task, I/O shapes, optimization flags, quantization parameters — all auto-detected from the model.
Build the model:
winml build -c convnext_config.json -m facebook/convnext-base-224 -o convnext_build/
This orchestrates the full pipeline — export, analyze, optimize, quantize, compile — all in one go. Same result as the manual steps above, but in two commands.
Benchmark the result:
winml perf -m convnext_build/model.onnx --ep qnn --iterations 100
The config file is the single source of truth for your build. Version-control it, share it with teammates, edit it to override settings, and replay builds deterministically on any machine.
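Because the config is plain JSON, edits to it are easy to review before a rebuild. The sketch below compares two config files setting by setting. It assumes only that the file is a JSON object; it does not rely on any particular key names, since the config schema is not described here, and the helper names are hypothetical.

```python
import json


def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every setting that differs."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


def load_config(path):
    """Read a config file such as the one produced by `winml config`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Running `diff_configs(load_config("old.json"), load_config("convnext_config.json"))` (where `old.json` is a hypothetical earlier copy) lists exactly which settings were overridden between two builds.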
Benchmark in One Command
The simplest way to evaluate a model — one command, zero setup:
winml perf -m facebook/convnext-base-224 --device npu --monitor
WinML CLI handles everything behind the scenes: download the model from Hugging Face, export to ONNX, optimize the graph, and run the benchmark on your NPU. The --monitor flag enables live hardware monitoring — real-time CPU utilization, RAM usage, and NPU activity alongside the latency results.
This is ideal for quick smoke tests: does the model run on this device, and how fast is it?
:arrows_counterclockwise: The BYOM Workflow
The Build Your Own Model (BYOM) workflow is the philosophy behind WinML CLI. It defines how a source model becomes a production-ready, device-optimized artifact.
The Pipeline
Source Model --> Export --> Analyze --> Optimize --> Quantize --> Compile --> Benchmark
Each arrow is a WinML CLI command. You can enter the pipeline at any stage (for example, start with a local ONNX file and skip export), exit early (stop after optimization if you do not need quantization), or loop back to repeat a stage with different settings.
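Entering the pipeline late or leaving it early amounts to picking a contiguous slice of the stage list. The sketch below models that idea; the stage names follow the diagram above, "benchmark" stands for winml perf, and the function name `select_stages` is hypothetical.

```python
# Stage order from the pipeline diagram; "benchmark" corresponds to `winml perf`.
STAGES = ("export", "analyze", "optimize", "quantize", "compile", "benchmark")


def select_stages(start="export", stop="benchmark"):
    """Return the stages from `start` through `stop`, inclusive."""
    first, last = STAGES.index(start), STAGES.index(stop)
    if first > last:
        raise ValueError(f"{start!r} comes after {stop!r} in the pipeline")
    return list(STAGES[first : last + 1])
```

For example, `select_stages(start="analyze", stop="optimize")` describes starting from a local ONNX file and stopping before quantization.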
Primitive Commands vs. Config-Driven Pipeline
| | Primitive Commands | Config-Driven Pipeline |
|---|---|---|
| Steps | One command per stage | Two steps: config + build |
| Control | Start from any stage; try different settings to fix errors or tweak performance | Repeatable, tweakable, version-controllable |
| Best for | Flexible workflow | Production-ready delivery |
| When to use | Exploring, debugging, prototyping | CI/CD, batch builds, team workflows |
| Lifecycle | "Coding" phase | Polish |
:clipboard: Built-in Models
Run winml catalog to browse the full catalog interactively.
| Model ID | Task | Architecture |
|---|---|---|
| microsoft/resnet-50 | image-classification | ResNet |
| google/vit-base-patch16-224 | image-classification | ViT |
| microsoft/swin-large-patch4-window7-224 | image-classification | Swin |
| facebook/convnext-tiny-224 | image-classification | ConvNeXT |
| rizvandwiki/gender-classification | image-classification | ViT |
| ProsusAI/finbert | text-classification | BERT |
| Intel/bert-base-uncased-mrpc | text-classification | BERT |
| cardiffnlp/twitter-roberta-base-sentiment-latest | text-classification | RoBERTa |
| dslim/bert-base-NER | token-classification | BERT |
| dbmdz/bert-large-cased-finetuned-conll03-english | token-classification | BERT |
| Babelscape/wikineural-multilingual-ner | token-classification | BERT |
| w11wo/indonesian-roberta-base-posp-tagger | token-classification | RoBERTa |
| microsoft/table-transformer-detection | object-detection | Table Transformer |
| mattmdjaga/segformer_b2_clothes | image-segmentation | SegFormer |
| nvidia/segformer-b1-finetuned-ade-512-512 | image-segmentation | SegFormer |
| nvidia/segformer-b2-finetuned-ade-512-512 | image-segmentation | SegFormer |
| nvidia/segformer-b5-finetuned-ade-640-640 | image-segmentation | SegFormer |
These models are verified against WinML CLI's full pipeline and serve as reliable starting points. You are not limited to this list — any Hugging Face model that passes winml inspect is a valid input.
For models not in this table, run winml inspect -m <model-id> to verify support before proceeding.
:warning: Scope & Limitations
What WinML CLI supports
WinML CLI targets classic deep learning models — CNNs, encoders, vision transformers, NLP classifiers, token classifiers, object detection models, and segmentation models.
Supported tasks include:
- Image classification (ResNet, ViT, Swin, ConvNeXT)
- Text classification (BERT, RoBERTa)
- Token classification / NER (BERT, RoBERTa)
- Object detection (Table Transformer)
- Image segmentation (SegFormer)
What WinML CLI does not support
LLMs and generative models are not in scope. Do not use WinML CLI with GPT, LLaMA, Phi, Mistral, Stable Diffusion, or any model with a decoder-only or sequence-to-sequence generative architecture. LLM support (with LoRA) is planned for Q3-Q4 2026.
Known constraints
- winml compile requires an NPU device. If no NPU is available, skip the compile step and use --device auto for benchmarking.
- Some models may export successfully but fail during optimization or quantization due to unsupported operator patterns. The analyzer will flag these issues.
- Performance numbers vary by device, driver version, and EP version. Always benchmark on your target hardware.
:world_map: Roadmap
| Milestone | Target | Highlights |
|---|---|---|
| 🟡 Kickoff | Q4 2025 | Internal prototype, core primitive commands |
| 🟢 Early Access | Q1 2026 | First external testers, config + build pipeline, hub catalog |
| 🔵 Public Beta | Q2 2026 | Open source, agent skills, Foundry Toolkit integration |
| 🟣 RC | Q3-Q4 2026 | LLM support (with LoRA), broader device coverage, MLIR |
Q4 2025 — Kickoff
- Primitive commands: inspect, export, optimize, quantize, compile
- QNN, OpenVINO, and VitisAI execution provider support
- Internal validation with ResNet, BERT, ViT, SegFormer families
Q1 2026 — Early Access
- Pipeline commands: config, build, perf, eval
- Analyzer with auto-configuration loop
- Built-in model catalog (winml catalog)
- Live hardware monitoring (--monitor)
Q2 2026 — Public Beta
- Open source release
- Agent-ready skills for coding assistants (Claude Code, Cursor, Copilot)
- Foundry Toolkit for VS Code integration
Q3-Q4 2026 — Release Candidate
- LLM support (decoder-only architectures with LoRA adapters)
- NvTensorRTRTX, MIGraphX, and Dml execution providers
- MLIR-based optimization backend
- Public SDK and framework APIs
:lock: Data / Telemetry
Official WinML CLI releases can collect anonymous usage telemetry to
help improve the product. Telemetry is classified as Optional. A
one-time prompt on your first run asks for consent (default: accept —
press Enter to enable, type n to decline).
Dev installs (pip install -e . or running from a source checkout)
never send telemetry.
Control — edit %USERPROFILE%\.winml\config.json:
- Set telemetry.consent to "disabled" to opt out
- Set telemetry.consent to "enabled" to opt in
- Delete the file to re-show the first-run prompt on the next run
Telemetry is automatically disabled in CI / non-TTY environments regardless of the stored decision.
See docs/Privacy.md for the full list of what is and is not collected, event schemas, CI auto-disable behavior, and storage locations.
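Editing the consent setting by hand can be replaced with a few lines of script. The sketch below is a minimal example, assuming that telemetry.consent refers to a "consent" key nested inside a "telemetry" object in the JSON file; that nesting is a guess from the dotted name, and the function name is hypothetical. Check docs/Privacy.md for the authoritative layout before relying on it.

```python
import json
import os
from pathlib import Path

# Location given above: %USERPROFILE%\.winml\config.json
DEFAULT_CONFIG = (
    Path(os.environ.get("USERPROFILE", str(Path.home()))) / ".winml" / "config.json"
)


def set_telemetry_consent(enabled, config_path=DEFAULT_CONFIG):
    """Write telemetry.consent as "enabled" or "disabled", keeping other settings."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Assumption: "telemetry.consent" is a "consent" key inside a "telemetry" object.
    config.setdefault("telemetry", {})["consent"] = "enabled" if enabled else "disabled"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling `set_telemetry_consent(False)` opts out; any other settings already stored in the file are preserved.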
:handshake: Contributions and Feedback
We welcome contributions! Please see the contribution guidelines.
For feature requests or bug reports, please file a GitHub Issue.
:balance_scale: Code of Conduct
See CODE_OF_CONDUCT.md.
:page_facing_up: License
This project is licensed under the MIT License.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
File details
Details for the file winml_cli-0.1.0.tar.gz.
File metadata
- Download URL: winml_cli-0.1.0.tar.gz
- Upload date:
- Size: 10.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 270c0076d16d3ca7cc6027621f98401945e0e6f8b3f96aaf39ce0c5fca9c07b2 |
| MD5 | 03558626aad17ab8f0c16a7464f649cd |
| BLAKE2b-256 | 5db87496315175edd500855a997c92ccea5423928eb2804a5c26fc1c7cfc8ae6 |
File details
Details for the file winml_cli-0.1.0-py3-none-any.whl.
File metadata
- Download URL: winml_cli-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ba80ce3c6420929d08f70012062012ed9057f9b4d30ccce8f1569c33de8e273 |
| MD5 | 1fec00c4762fb25d1d44abf0f34f41ea |
| BLAKE2b-256 | 815a545d92de94a1c53ee9c72494dbc85efda7f0090ad055d7ce942030e5c90f |