Performance CI/CD for on-device ML models — catch inference regressions before they ship
MLBuild
Performance CI/CD for On-Device Production ML Models
MLBuild is the missing performance layer for on-device ML CI/CD. While MLflow, DVC, and W&B track training experiments, MLBuild enforces production SLAs — automatically benchmarking inference performance, validating against thresholds, blocking regressions in CI, and generating deployment-ready reports.
Current Status
| Feature | Status |
|---|---|
| Input formats | ONNX, TFLite, CoreML |
| Backends | CoreML, TFLite, ONNX Runtime |
| Storage | Local + S3-compatible (AWS S3, R2, B2) |
| Targets | Apple Silicon, A-series, Android (arm64) |
| Platform | macOS, Linux (TFLite) |
| Command history | Local, searchable, filterable by command type |
| Performance budget | Persistent constraints in .mlbuild/budget.toml |
| Baseline management | Reserved tag with clean CLI |
| Workspace status | Quick health snapshot |
The Problem
# Your CI passes
pytest ✓
black --check ✓
mypy ✓
# But in production
Latency: 8ms --> 15ms (88% slower)
Memory: 50MB --> 120MB (140% more)
Size: 6MB --> 10MB (67% larger)
# Nobody caught it until users complained
The gap: Existing tools don't validate production performance in CI.
The Solution
# Tag your main branch baseline once
mlbuild tag create <build_id> main-mobilenet
# Add one step to your CI pipeline
mlbuild ci --model model.onnx --baseline main-mobilenet
# Output:
# MLBuild CI Report
# ──────────────────────────────────────────────────
# Model: mobilenet
# Baseline: 3f36810e (main-mobilenet)
# Candidate: b8aa1ef6 (fp16)
#
# Baseline Candidate Delta
# Latency (p50) 2.49 ms 0.74 ms -70.27%
# Size 13.39 MB 6.74 MB -49.64%
#
# Result: ✓ PASS
# Exit code: 0
# Or use the low-level gate directly
mlbuild ci-check $BASELINE_ID $CANDIDATE_ID --latency-threshold 10
# Exit code: 1 — PR blocked on regression
Catch latency AND size regressions before they reach production.
Where MLBuild Fits
MLBuild is the missing on-device performance layer in your ML CI/CD stack.
┌─────────────────────────────────────────────────────────────────┐
│ ML Training │
│ ├── Experiment Tracking ──────────────── MLflow / W&B │
│ └── Data Versioning ──────────────────── DVC │
└─────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────────▼───────────────────────────────────┐
│ On-Device Optimization MLBuild │
│ ├── Model Packaging ──────────────── mlbuild build │
│ ├── Model Import ─────────────────── mlbuild import │
│ ├── Task Detection ───────────────── automatic │
│ ├── Performance Validation ───────── mlbuild benchmark │
│ ├── Quantization Benchmarking ────── mlbuild compare-quant │
│ └── Reporting ────────────────────── mlbuild report │
└─────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────────▼───────────────────────────────────┐
│ Regression Gate MLBuild CI │
│ ✕ Bad performance → blocks deployment │
│ ├── CI Performance Gate ─────────── mlbuild ci-check │
│ └── Full CI Orchestration ───────── mlbuild ci │
└─────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────────▼───────────────────────────────────┐
│ Deployment │
│ └── Release & Ship ───────────────── GitHub Actions / K8s │
└─────────────────────────────────────────────────────────────────┘
| Feature | MLflow / W&B / DVC | MLBuild |
|---|---|---|
| Track training experiments | Yes | No (use MLflow) |
| Automated p50/p95/p99 benchmarking | Manual | Built-in |
| CI fails on latency regression | Not native | mlbuild ci-check |
| CI fails on model size regression | Not native | --size-threshold |
| Task-aware synthetic inputs | No | Auto-detected |
| NLP multi-seq-len benchmarking | No | Built-in |
| Optimization sweep (fp16 + int8) | No | mlbuild explore |
| Static INT8 with calibration data | No | --calibration-data |
| Magnitude pruning (ONNX + CoreML) | No | mlbuild optimize --pass prune |
| Output divergence checking | No | mlbuild accuracy |
| Optimization chain visualization | No | mlbuild log --tree |
| Quantization tradeoff analysis | No | mlbuild compare-quantization |
| Performance reports | No | mlbuild report |
| S3-compatible remote storage | No | Built-in |
| TFLite benchmarking | No | Built-in |
| Import pre-built models | No | mlbuild import |
MLBuild complements your existing stack — it doesn't replace it.
Installation
pip install mlbuild
mlbuild doctor
For TFLite support:
pip install "mlbuild[tflite]"
For S3 remote storage:
pip install "mlbuild[s3]"
For macOS (CoreML + TFLite full stack):
pip install "mlbuild[macos]"
For Linux / CI (TFLite only, no CoreML):
pip install "mlbuild[linux]"
Quick Start
# 1. Build and convert model
mlbuild build --model model.onnx --target apple_m1 --quantize fp16
# 1b. Or import a pre-built model
mlbuild import --model model.tflite --target android_arm64
mlbuild import --model model.mlpackage --target apple_m1 --quantize fp16
# 2. Benchmark (automatic p50/p95/p99, task auto-detected)
mlbuild benchmark <build-id>
# 3. Sweep all optimization variants automatically
mlbuild explore model.onnx --target apple_m1
# 4. Check output divergence between variants
mlbuild accuracy <baseline-id> <candidate-id>
# 5. Validate SLAs (performance + accuracy in one command)
mlbuild validate <build-id> --max-latency 10 --dataset ./imagenet-mini/
# 6. Run full CI check against registered baseline
mlbuild ci --model model.onnx --baseline main-mobilenet
# 6b. Or use low-level compare
mlbuild compare baseline candidate --threshold 5 --check-accuracy --ci
# 7. View full optimization lineage
mlbuild log --source model.onnx --tree
# 8. Generate performance report
mlbuild report <build-id> --open
# 9. Tag for production
mlbuild tag create <build-id> production
GitHub Actions Integration
- name: MLBuild CI
run: |
pip install mlbuild
# Full CI check — explore, compare, report in one command
mlbuild ci \
--model models/mobilenet.onnx \
--baseline main-mobilenet \
--latency-regression 15 \
--size-regression 10
- name: Upload CI report
uses: actions/upload-artifact@v4
if: always()
with:
name: mlbuild-report
path: .mlbuild/ci_report.json
See .github/workflows/mlbuild.yml for a complete example with PR comment posting.
Documentation
Core Commands
Build and Convert
mlbuild build --model model.onnx --target apple_m1 --quantize fp16 --name "v2.0"
mlbuild build --model model.onnx --target android_arm64 --quantize int8
Import Pre-built Models
Register an existing TFLite or CoreML model directly — no conversion required. Once imported, all MLBuild commands (benchmark, profile, compare, report, ci-check) work on it immediately.
# Import a TFLite model
mlbuild import --model model.tflite --target android_arm64
# Import a CoreML model
mlbuild import --model model.mlpackage --target apple_m1
# Import an ONNX model (benchmarked via ONNX Runtime)
mlbuild import --model model.onnx --target onnxruntime_cpu
mlbuild import --model model.onnx --target onnxruntime_gpu
# Import with metadata
mlbuild import --model model.tflite --target android_arm64 \
--quantize int8 \
--name "vendor-v2" \
--notes "Supplied by vendor, int8 quantized"
# JSON output (for CI pipelines)
mlbuild import --model model.tflite --target android_arm64 --json
Supported formats:
- `.onnx` — validated via protobuf check, runs via ONNX Runtime
- `.tflite` — validated via FlatBuffer magic bytes (TFL3/TFL2)
- `.mlpackage` — validated via Manifest.json + Data/ structure
- `.mlmodel` — legacy CoreML flat file
Format/target compatibility:
| Format | Valid Targets |
|---|---|
| `onnx` | `onnxruntime_cpu`, `onnxruntime_gpu`, `onnxruntime_ane` |
| `tflite` | `android_arm64`, `android_arm32`, `android_x86`, `raspberry_pi`, `coral_tpu`, `generic_linux` |
| `coreml` | `apple_m1`, `apple_m2`, `apple_m3`, `apple_a15`, `apple_a16`, `apple_a17` |
Imported builds are marked [imported] in mlbuild log output and tracked with "imported": true in their metadata.
Optimize
Generate optimized variants of a registered build. Supports quantization and magnitude pruning. All variants are registered as children of the source build with full lineage tracking.
Quantization
# FP16 — recompiles from ONNX graph (lower precision weights)
mlbuild optimize <build_id> --pass quantize --method fp16
# Dynamic range INT8 — weight-only, no calibration data needed
mlbuild optimize <build_id> --pass quantize --method int8
# Static INT8 — quantizes weights + activations using calibration data
mlbuild optimize <build_id> --pass quantize --method int8 \
--calibration-data ./imagenet-mini/
Calibration data formats for static INT8:
- Directory of images (`.jpg`, `.png`, `.bmp`, `.webp`) — auto-resized to model input shape, normalized to [0, 1]
- Directory of `.npy` files — one array per sample
- Single `.npz` file — named array, first axis = samples
Static and dynamic INT8 are stored as distinct builds (int8 vs int8_static) — both can coexist in the registry with different build IDs.
Note: Full static INT8 (weight + activation quantization) requires coremltools 9.1+. On 9.0, MLBuild automatically falls back to dynamic range INT8 with a clear warning — no crash, no silent misbehavior.
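A hypothetical loader for the `.npz` calibration format, sketched in a few lines of numpy (the `load_calibration_npz` helper and its signature are illustrative, not MLBuild's API):

```python
import numpy as np

def load_calibration_npz(path, max_samples=200):
    """Yield per-sample arrays from a .npz file whose first axis indexes samples."""
    archive = np.load(path)
    data = archive[archive.files[0]]          # first named array in the archive
    for i in range(min(len(data), max_samples)):
        yield data[i]

# Write 8 fake calibration samples shaped like a 224x224 RGB input, then load them
np.savez("calib.npz", samples=np.random.rand(8, 3, 224, 224).astype(np.float32))
batch = list(load_calibration_npz("calib.npz"))   # 8 arrays of shape (3, 224, 224)
```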
Pruning
Magnitude-based unstructured weight pruning. Zeros out the smallest weights by absolute value up to a target sparsity level. No retraining required.
# 50% sparsity
mlbuild optimize <build_id> --pass prune --sparsity 0.5
# 75% sparsity
mlbuild optimize <build_id> --pass prune --sparsity 0.75
Routing logic:
- `has_graph=True` → ONNX magnitude pruning → re-convert via existing build pipeline (works for CoreML and TFLite)
- `has_graph=False` + coreml → CT9 `OpMagnitudePrunerConfig` post-hoc on compiled `.mlpackage`
- `has_graph=False` + tflite → error with actionable message (Re-register using 'mlbuild build' or 'mlbuild import --graph model.onnx')
Pruning skips bias, batch norm, and small tensors (< 256 params) automatically. Sparsity level is baked into the method name (prune_0.50), so each level gets a distinct build ID.
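Conceptually, magnitude pruning reduces to a threshold on absolute weight values. A minimal numpy sketch of the idea (not MLBuild's implementation, which also handles the bias/batch-norm/small-tensor skips described above):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)                 # number of weights to zero
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0     # ties may zero slightly more than k
    return pruned

w = np.array([[0.9, -0.05, 0.3], [-0.01, 0.7, 0.02]], dtype=np.float32)
pruned = magnitude_prune(w, sparsity=0.5)
# The three smallest magnitudes (-0.05, -0.01, 0.02) are now zero.
```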
Method chaining
Pruning and quantization can be chained arbitrarily:
# Prune first, then quantize
mlbuild optimize <build_id> --pass prune --sparsity 0.5
mlbuild optimize <pruned_build_id> --pass quantize --method int8
Explore
Sweeps all optimization variants for a model in one command. Builds the fp32 baseline, generates fp16 and int8 variants, benchmarks all of them, and assigns verdicts.
# Full sweep (fp16 + int8, all backends)
mlbuild explore model.onnx --target apple_m1
# Fast mode (fp16 only, 20 benchmark runs)
mlbuild explore model.onnx --target apple_m1 --fast
# Specific backends
mlbuild explore model.onnx --backends coreml
mlbuild explore model.onnx --backends coreml,tflite
# With static INT8 calibration data
mlbuild explore model.onnx --calibration-data ./imagenet-mini/
# With output divergence check per variant
mlbuild explore model.onnx --check-accuracy --cosine-threshold 0.99
# JSON output
mlbuild explore model.onnx --output-json
Verdict logic (score-based):
score = 0.6 × (baseline_latency / variant_latency)
+ 0.4 × (baseline_size / variant_size)
score > 1.0 → candidate for recommended or aggressive
score ≤ 1.0 → skip (no net improvement on the weighted tradeoff)
- `recommended` — highest composite score (best balance of speed + size)
- `aggressive` — smallest size among remaining candidates
- `skip` — no improvement, or accuracy check failed
- `baseline` — fp32 reference
COREML
Verdict Method Size p50 Latency vs Baseline Accuracy
baseline fp32 13.39 MB 3.29ms — —
aggressive fp16 6.74 MB 3.29ms ↑0% lat —
↓50% size
recommended int8(static) 3.58 MB 2.81ms ↓14% lat ✓ 0.9999
↓73% size
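The verdict score is easy to reproduce by hand. A runnable sketch using the numbers from the example table above:

```python
def variant_score(baseline_latency, baseline_size, variant_latency, variant_size):
    """Composite explore score: > 1.0 means a net improvement on the 60/40 weighting."""
    return (0.6 * (baseline_latency / variant_latency)
            + 0.4 * (baseline_size / variant_size))

# fp16: same latency, half the size → scores above 1.0 on size alone
score_fp16 = variant_score(3.29, 13.39, 3.29, 6.74)
# int8(static): faster and much smaller → highest score, hence "recommended"
score_int8 = variant_score(3.29, 13.39, 2.81, 3.58)
```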
Accuracy
Standalone output divergence check between two builds. Runs inference on both with synthetic inputs and computes similarity metrics.
mlbuild accuracy <baseline_id> <candidate_id>
mlbuild accuracy <baseline_id> <candidate_id> --samples 64 --seed 42
mlbuild accuracy <baseline_id> <candidate_id> \
--cosine-threshold 0.99 \
--top1-threshold 0.99
Metrics:
- Cosine similarity — angle between output vectors (1.0 = identical direction)
- Mean absolute error — average per-element absolute difference
- Max absolute error — worst-case per-element difference
- Top-1 agreement — fraction of samples where both models pick the same top class (classifiers only)
Results are persisted to the registry's accuracy_checks table.
Example results on MobileNet:
fp32 → fp16: cosine=0.9999 top1=1.00 passed=True
fp32 → int8: cosine=0.9983 top1=0.97 passed=False (< 0.99 threshold)
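The four metrics can be sketched in a few lines of numpy. This computes cosine over the whole flattened batch for brevity; MLBuild's own accumulators are streaming, so treat this as an approximation of the idea rather than the tool's exact math:

```python
import numpy as np

def divergence_metrics(a: np.ndarray, b: np.ndarray) -> dict:
    """Compare two model output batches of shape [samples, classes]."""
    cos = float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    diff = np.abs(a - b)
    top1 = float(np.mean(a.argmax(axis=1) == b.argmax(axis=1)))  # classifier agreement
    return {"cosine": cos, "mae": float(diff.mean()),
            "max_abs_err": float(diff.max()), "top1_agreement": top1}

rng = np.random.default_rng(0)
base = rng.normal(size=(32, 10)).astype(np.float32)              # fp32 "outputs"
cand = base + rng.normal(scale=1e-3, size=base.shape).astype(np.float32)
m = divergence_metrics(base, cand)   # tiny perturbation → near-identical metrics
```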
Benchmark
mlbuild benchmark <build-id> --runs 100 --warmup 20 --json
mlbuild benchmark <build-id> --compute-unit CPU_ONLY
Validate SLAs
Validates a build against performance and accuracy constraints. All checks compose in a single command.
# Performance constraints only
mlbuild validate <build_id> --max-latency 10 --max-size 8
# Accuracy constraint with dataset
mlbuild validate <build_id> \
--dataset ./imagenet-mini/ \
--cosine-threshold 0.99 \
--top1-threshold 0.99
# All checks composed
mlbuild validate <build_id> \
--max-latency 10 \
--max-size 8 \
--dataset ./imagenet-mini/ \
--cosine-threshold 0.99
# CI mode (suppress output, exit codes only)
mlbuild validate <build_id> --max-latency 5 --ci
Options:
- `--max-latency` — maximum p50 latency in ms
- `--max-p95` — maximum p95 latency in ms
- `--max-memory` — maximum peak memory in MB
- `--max-size` — maximum model size in MB
- `--dataset` — calibration data for accuracy check (images dir, `.npy` dir, or `.npz`)
- `--baseline-id` — reference build for accuracy comparison (default: root build)
- `--cosine-threshold` — minimum cosine similarity (default: 0.99)
- `--top1-threshold` — minimum top-1 agreement (default: 0.99)
- `--accuracy-samples` — max calibration samples (default: 200)
If --dataset is provided but the build is the root (no baseline to compare against), accuracy check is skipped with a message rather than erroring.
Exit codes: 0 = all constraints passed, 1 = one or more violations.
Compare and Detect Regressions
# Compare with independent latency + size thresholds
mlbuild compare baseline candidate \
--threshold 5 \
--size-threshold 10 \
--metric p95 \
--ci
# Use cached benchmark results (skip re-benchmarking)
mlbuild compare baseline candidate --use-cached
# Include output divergence check
mlbuild compare baseline candidate --check-accuracy
# Dedicated CI gate
mlbuild ci-check baseline candidate
mlbuild ci-check baseline candidate --latency-threshold 10 --size-threshold 5
mlbuild ci-check baseline candidate --strict # any positive delta fails
mlbuild ci-check baseline candidate --json
Exit codes:
- `0` — no regression (safe to ship)
- `1` — regression detected (block the PR)
- `2` — error (infra failure, check logs)
CI Orchestration
Full CI check in one command — resolves baseline, explores variants, compares, enforces thresholds, and writes a structured report.
# Run full CI check against a tagged baseline
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet
# Use an existing build (skips explore — useful when builds happen earlier in pipeline)
mlbuild ci --build <build_id> --baseline main-mobilenet
# With absolute budgets (independent of baseline)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet \
--latency-budget 3.0 \
--size-budget 10.0
# With accuracy gate
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet \
--dataset ./imagenet-mini/ \
--cosine-threshold 0.99
# JSON output (for dashboards and GitHub bots)
mlbuild ci --build <build_id> --baseline main-mobilenet --json
# Fail if baseline tag not found (strict CI)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet --fail-on-missing-baseline
Tagging baselines:
# Tag a build as the main branch baseline
mlbuild tag create <build_id> main-mobilenet
# --baseline accepts tag names or build ID prefixes
mlbuild ci --baseline main-mobilenet # tag lookup
mlbuild ci --baseline 3f36810e # build ID prefix
Options:
| Flag | Description | Default |
|---|---|---|
| `--model` | ONNX model path (runs explore) | — |
| `--build` | Existing build ID (skips explore) | — |
| `--baseline` | Tag name or build ID | required |
| `--target` | Device target for explore | auto |
| `--latency-regression` | Max latency regression % | 10.0 |
| `--size-regression` | Max size regression % | 5.0 |
| `--latency-budget` | Hard latency cap in ms | none |
| `--size-budget` | Hard size cap in MB | none |
| `--dataset` | Calibration data for accuracy check | none |
| `--cosine-threshold` | Min cosine similarity | 0.99 |
| `--top1-threshold` | Min top-1 agreement | 0.99 |
| `--fail-on-missing-baseline` | Exit 1 if baseline not found | false |
| `--json` | Print JSON report to stdout | false |
CI Report — always written to .mlbuild/ci_report.json:
{
"model": "mobilenet.onnx",
"baseline": {
"tag": "main-mobilenet",
"build_id": "3f36810e...",
"latency_ms": 2.49,
"size_mb": 13.39
},
"candidate": {
"build_id": "b8aa1ef6...",
"variant": "fp16",
"parent_build_id": "3f36810e...",
"latency_ms": 0.74,
"size_mb": 6.74
},
"delta": { "latency_pct": -70.27, "size_pct": -49.64 },
"thresholds": {
"latency_regression_pct": 10.0,
"size_regression_pct": 5.0,
"latency_budget_ms": null,
"size_budget_mb": null
},
"result": "pass",
"violations": []
}
The report always stores baseline.build_id — even if the tag is later repointed, the report preserves exactly what was compared.
Exit codes: 0 = pass or skipped, 1 = regression/failure, 2 = error.
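For PR bots and dashboards, the JSON report is straightforward to consume. A hypothetical summarizer (the `summarize_report` helper is illustrative, not part of MLBuild):

```python
import json

def summarize_report(path: str) -> str:
    """Turn a ci_report.json into a one-line PR comment."""
    with open(path) as f:
        report = json.load(f)
    delta = report["delta"]
    status = "✅" if report["result"] == "pass" else "❌"
    return (f"{status} {report['model']}: "
            f"latency {delta['latency_pct']:+.1f}%, size {delta['size_pct']:+.1f}% "
            f"vs {report['baseline']['tag']}")

# Minimal report matching the schema above
sample = {
    "model": "mobilenet.onnx",
    "baseline": {"tag": "main-mobilenet", "build_id": "3f36810e"},
    "candidate": {"build_id": "b8aa1ef6", "variant": "fp16"},
    "delta": {"latency_pct": -70.27, "size_pct": -49.64},
    "result": "pass",
    "violations": [],
}
with open("ci_report.json", "w") as f:
    json.dump(sample, f)
comment = summarize_report("ci_report.json")
```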
Configuration via .mlbuild/config.toml:
[ci]
latency_regression_pct = 10
size_regression_pct = 5
latency_budget_ms = 3.0
size_budget_mb = 10.0
[ci.accuracy]
cosine_threshold = 0.99
top1_threshold = 0.99
Quantization Tradeoff Analysis
mlbuild compare-quantization fp32-build int8-build
mlbuild compare-quantization fp32-build int8-build --accuracy-samples 100
mlbuild compare-quantization fp32-build int8-build --json
Performance Report
mlbuild report <build-id>
mlbuild report <build-id> --open
mlbuild report <build-id> --output report.html
mlbuild report <build-id> --format pdf # requires: pip install weasyprint
Deep Profiling
# TFLite: full 6-feature deep profile (no device required)
mlbuild profile <build-id> --deep
# CoreML: cold start decomposition (all formats)
mlbuild profile <build-id> --deep
# Options
mlbuild profile <build-id> --deep --top 20
mlbuild profile <build-id> --deep --runs 100
mlbuild profile <build-id> --deep --int8-build <id> # TFLite: quant sensitivity
TFLite deep profiling features (--deep):
| # | Feature | Description |
|---|---|---|
| ① | Per-op timing | Real hardware timing via TFLite's built-in op profiler |
| ② | Memory flow | Activation memory at each layer boundary, peak flagged |
| ③ | Bottleneck classification | COMPUTE vs MEMORY bound per op (arithmetic intensity) |
| ④ | Cold start decomposition | Load → first inference → stable, with warmup sparkline |
| ⑤ | Quantization sensitivity | Per-layer fp32 vs int8 divergence (requires --int8-build) |
| ⑥ | Fusion detection | Fused kernels identified + missed fusion opportunities flagged |
Build History
# All builds
mlbuild log
# Specific build detail
mlbuild log <build_id>
# Filter by source model filename (substring match)
mlbuild log --source mobilenet.onnx
# Full optimization lineage tree (recursive parent-child)
mlbuild log --source mobilenet.onnx --tree
# Other filters
mlbuild log --name mobilenet
mlbuild log --format coreml
mlbuild log --task vision
mlbuild log --roots-only
mlbuild log --target apple_m1
# Export
mlbuild log --json
mlbuild log --csv builds.csv
The --tree flag renders the full optimization DAG using actual parent-child lineage. Method chaining (e.g. prune → int8) shows as nested children, not flat siblings — causality is preserved:
3f36810e mobilenet coreml fp32 13.39 MB 2.49ms
├── b8aa1ef6 coreml fp16 6.74 MB 0.74ms
├── 2921f0fa coreml int8 3.58 MB 3.15ms
├── 9df061cb coreml int8(static) 3.58 MB 2.81ms
├── 329f3b78 coreml prune(0.50) 13.39 MB 3.89ms
│ └── 3fa93712 coreml int8 3.58 MB 2.61ms
└── 0a17ce03 coreml prune(0.75) 13.39 MB 2.94ms
Method labels are human-readable: prune(0.50), int8(static) instead of raw internal strings.
Command History
A permanent log of every MLBuild command ever run. Searchable, filterable, deletable.
# Show all recent commands
mlbuild history
# Filter by command type — every command is filterable
mlbuild history --filter build
mlbuild history --filter benchmark
mlbuild history --filter validate
mlbuild history --filter baseline
mlbuild history --filter budget
mlbuild history --filter status
mlbuild history --filter import
mlbuild history --filter compare
mlbuild history --filter failed
# ...and all other commands (accuracy, ci, diff, explore, optimize, profile, etc.)
# Filter by time
mlbuild history --since yesterday
mlbuild history --since "7 days ago"
mlbuild history --since "2024-01-01"
# Filter by build ID — everything that touched a specific build
mlbuild history --build-id a3f91c2
# Limit results
mlbuild history --limit 100
# Delete one entry by ID (min 4 chars)
mlbuild history delete d58cc62f
# Clear all history (prompts for confirmation)
mlbuild history clear
History is an audit log of CLI actions — separate from build and benchmark data. Deleting a history entry never touches builds or benchmarks.
Performance Budget
Persistent performance constraints committed to git. Set once, enforced automatically by mlbuild validate and mlbuild ci. Explicit flags always override budget values.
# Set constraints once
mlbuild budget set --max-latency 10 --max-p95 15 --max-size 8
# Show current budget
mlbuild budget show
# Preview what would apply to a build without benchmarking
mlbuild budget validate <build_id>
# Update one constraint without touching others
mlbuild budget set --max-latency 5
# Remove one constraint
mlbuild budget clear --constraint max-latency
# Remove all constraints (prompts for confirmation)
mlbuild budget clear
# After budget is set, validate uses it automatically
mlbuild validate <build_id> ← uses budget
mlbuild validate <build_id> --max-latency 3 ← overrides latency, budget for rest
Budget is stored in .mlbuild/budget.toml — commit it so your whole team enforces the same constraints automatically.
Merge priority: explicit CLI flag > budget file > no constraint
Violation output shows the source of each constraint:
┃ Constraint ┃ Limit ┃ Actual ┃ Violation ┃ Source ┃
│ max_latency_ms │ 1.00 ms │ 2.66 ms │ +1.66 (166% over) │ explicit flag │
│ max_size_mb │ 8.00 MB │ 9.10 MB │ +1.10 (13.8% over) │ budget │
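The merge priority can be sketched as a small resolver that also records each constraint's source (hypothetical helper, not MLBuild's internals):

```python
def merge_constraints(cli_flags: dict, budget: dict) -> dict:
    """Resolve constraints: explicit CLI flag > budget file > no constraint."""
    merged = {}
    for key in set(cli_flags) | set(budget):
        if cli_flags.get(key) is not None:
            merged[key] = (cli_flags[key], "explicit flag")
        elif budget.get(key) is not None:
            merged[key] = (budget[key], "budget")
    return merged

budget = {"max_latency_ms": 10.0, "max_size_mb": 8.0}   # from .mlbuild/budget.toml
flags = {"max_latency_ms": 3.0}                          # --max-latency 3 on the CLI
merged = merge_constraints(flags, budget)
# max_latency_ms comes from the flag; max_size_mb falls through to the budget
```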
Baseline Management
Clean UX wrapper around mlbuild tag. Uses the reserved tag mlbuild-baseline so mlbuild ci resolves it automatically — zero CI changes required.
# Set a build as the performance baseline
mlbuild baseline set <build_id>
# Show current baseline
mlbuild baseline
mlbuild baseline show
# Show all baseline-style tags (mlbuild-baseline, main-*, production-*)
mlbuild baseline history
# Remove baseline (prompts for confirmation)
mlbuild baseline unset
The baseline integrates directly with mlbuild ci:
mlbuild ci --model model.onnx --baseline mlbuild-baseline
Prompts before overwriting an existing baseline. Use --force to skip the prompt.
Workspace Status
Quick health check of the current workspace. Reads from existing data — no new storage.
mlbuild status
mlbuild status --json
Output:
MLBuild Status Abdoulayes-MacBook-Air.local
✓ Workspace .mlbuild/
✓ Registry 26 builds | 18 benchmarks
Last build: mobilenet (coreml, 3.58 MB) — 2h ago
Last bench: p50=2.61 ms — 2h ago
✓ Baseline 3fa9371209e6 mobilenet 2.61 ms 3.58 MB
Last validate: PASSED — 52m ago
✓ Budget .mlbuild/budget.toml
Max latency (p50) 10.0 ms
Max size 8.0 MB
Version Management
mlbuild log --limit 20
mlbuild diff build-a build-b
mlbuild tag create <build-id> v1.0.0
Experiment Tracking
mlbuild experiment create "quantization-search"
mlbuild run start --experiment "quantization-search"
mlbuild run log-param quantization int8
mlbuild run log-metric latency_p50 5.6
mlbuild run end
Remote Storage
# Set up S3-compatible remote (one-time)
mlbuild remote add prod \
--backend s3 \
--bucket your-bucket \
--region us-east-1
# Push/pull/sync builds
mlbuild push <build-id>
mlbuild pull <build-id>
mlbuild sync
Supported backends: AWS S3, Cloudflare R2 (recommended — free 10 GB), Backblaze B2, any S3-compatible storage.
Task-Aware Benchmarking
MLBuild automatically detects what kind of model you're benchmarking — vision, NLP, or audio — and generates semantically correct synthetic inputs for it. No dummy zero arrays, no manual shape specification.
Automatic Task Detection
Detection runs through three tiers in order of confidence:
| Tier | Method | Formats | Confidence | CLI Behavior |
|---|---|---|---|---|
| Graph | Op/layer analysis (Conv, Attention, STFT, etc.) | ONNX, TFLite, CoreML | High | Silent |
| Name | Tensor name heuristics (`input_ids`, `pixel_values`, `mel`) | All | Medium | Warning |
| Shape | Dtype + rank heuristics (rank-4 float = vision, rank-2 int = NLP) | All | Low | Warning + zeros fallback |
# High confidence — silent, correct inputs generated automatically
mlbuild benchmark <build-id>
# Medium confidence — warning printed, benchmark proceeds
# ⚠ Task auto-detected as 'nlp' (medium confidence)
# If incorrect, re-run with: --task vision|nlp|audio
mlbuild benchmark <build-id>
# Low confidence or unknown — zeros used as safe fallback
# ⚠ Task could not be detected — running with zero tensors
mlbuild benchmark <build-id>
Override with --task
mlbuild benchmark <build-id> --task vision
mlbuild benchmark <build-id> --task nlp
mlbuild benchmark <build-id> --task audio
mlbuild profile <build-id> --task nlp
mlbuild validate <build-id> --task vision --strict-output
Task-Specific Synthetic Inputs
| Task | Inputs Generated |
|---|---|
| Vision | Float32 image tensor, NCHW layout, spatial dims resolved to 224×224 |
| NLP | int64 token IDs (random vocab up to 30k), int64 attention mask (all ones), token type IDs |
| Audio | Float32 waveform [-1, 1] or log-mel spectrogram — role inferred from tensor name/shape |
| Unknown | Zero tensors — safe fallback that never blocks CI |
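A simplified sketch of what task-aware input generation looks like. Shapes, vocab size, and the zeros fallback follow the table above; the function itself is illustrative, not MLBuild's generator:

```python
import numpy as np

def synthetic_input(task: str, shape=None, seq_len: int = 128, vocab_size: int = 30_000):
    """Generate a task-appropriate synthetic tensor (sketch of the idea)."""
    rng = np.random.default_rng(42)          # fixed seed → reproducible benchmarks
    if task == "vision":                     # float image tensor, e.g. NCHW
        return rng.random(shape, dtype=np.float32)
    if task == "nlp":                        # int64 token IDs in [0, vocab_size)
        return rng.integers(0, vocab_size, size=(1, seq_len), dtype=np.int64)
    if task == "audio":                      # float waveform scaled to [-1, 1]
        return rng.random(shape, dtype=np.float32) * 2.0 - 1.0
    return np.zeros(shape, dtype=np.float32)  # unknown task → safe zeros fallback

image = synthetic_input("vision", (1, 3, 224, 224))
tokens = synthetic_input("nlp", seq_len=64)
```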
NLP Multi-Sequence Benchmarking
NLP models are benchmarked across a sequence length ladder by default:
# Default ladder: [16, 64, 128, 256]
mlbuild benchmark <build-id> --task nlp
# seq_len=16 p50=1.2ms p95=1.4ms
# seq_len=64 p50=2.1ms p95=2.4ms
# seq_len=128 p50=3.8ms p95=4.2ms
# seq_len=256 p50=7.1ms p95=8.0ms
# Clip to model's actual max sequence length
mlbuild benchmark <build-id> --task nlp --seq-len 128
Strict Output Validation
# Soft mode (default) — warns but proceeds
mlbuild benchmark <build-id> --task nlp
# Strict mode — exits non-zero on output anomaly
mlbuild benchmark <build-id> --task nlp --strict-output
mlbuild validate <build-id> --task vision --strict-output
# Global strict mode — applies to all commands
mlbuild --strict-output benchmark <build-id> --task nlp
Optimization Workflow
A complete optimization workflow from ONNX to deployment-ready model:
# 1. Build FP32 baseline
mlbuild build --model mobilenet.onnx --target apple_m1 --name mobilenet
# 2. Sweep all variants automatically
mlbuild explore mobilenet.onnx --target apple_m1 --check-accuracy
# 3. Prune best variant and quantize the result
mlbuild optimize <fp32_id> --pass prune --sparsity 0.5
mlbuild optimize <pruned_id> --pass quantize --method int8
# 4. Validate final model against SLAs
mlbuild validate <final_id> \
--max-latency 5 \
--max-size 6 \
--dataset ./imagenet-mini/
# 5. View full lineage
mlbuild log --source mobilenet.onnx --tree
# 6. Tag for production
mlbuild tag create <final_id> production-v2
CI/CD Regression Gate
# Full CI orchestration (recommended)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet
echo "Exit: $?" # 0 = pass, 1 = fail, 2 = error
# Low-level build-to-build comparison
mlbuild ci-check $BASELINE_ID $CANDIDATE_ID
echo "Exit: $?" # 0 = pass, 1 = regression, 2 = error
# JSON output for dashboards and PR bots
mlbuild ci --build $BUILD_ID --baseline main-mobilenet --json
# {
# "result": "pass",
# "baseline": { "tag": "main-mobilenet", "build_id": "3f36810e...", "latency_ms": 2.49 },
# "candidate": { "build_id": "b8aa1ef6...", "variant": "fp16", "latency_ms": 0.74 },
# "delta": { "latency_pct": -70.27, "size_pct": -49.64 },
# "violations": []
# }
Architecture
Training Phase
├── Experiment Tracking: MLflow / W&B / Neptune
└── Data Versioning: DVC
↓
Production Phase
├── Model Building: MLBuild build
├── Model Importing: MLBuild import ← pre-built TFLite / CoreML
├── Task Detection: MLBuild (automatic) ← vision / nlp / audio
├── Optimization Sweep: MLBuild explore ← fp16 + int8 + pruning
├── Accuracy Validation: MLBuild accuracy ← output divergence
├── Performance Validation: MLBuild ci-check ← regression gate
├── Quantization Analysis: MLBuild compare-quantization
├── Reporting: MLBuild report
└── Deployment: GitHub Actions / K8s
How It Works
1. Deterministic Builds
# Content-addressed storage (Git-style)
build_id = sha256(source_hash + config_hash + env_fingerprint)
# Same inputs = Same output (byte-for-byte)
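A minimal sketch of the content-addressing idea using hashlib. The exact fields and encoding MLBuild hashes are not documented here, so the canonicalization below (sorted-key JSON for the config) is an assumption:

```python
import hashlib
import json

def build_id(source_hash: str, config: dict, env_fingerprint: str) -> str:
    """Content-addressed build ID: identical inputs always hash to the same ID."""
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()  # canonical config encoding
    ).hexdigest()
    return hashlib.sha256(
        (source_hash + config_hash + env_fingerprint).encode()
    ).hexdigest()[:12]

a = build_id("abc123", {"target": "apple_m1", "quantize": "fp16"}, "macos-14")
b = build_id("abc123", {"quantize": "fp16", "target": "apple_m1"}, "macos-14")
# Key order doesn't matter thanks to sort_keys, so a == b
```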
2. Build Lineage Tracking
Every variant stores its full ancestry:
build.parent_build_id # direct parent
build.root_build_id # original source in the chain
build.optimization_method # "fp16", "int8", "int8_static", "prune_0.50"
Identical optimization chains always produce the same build ID — deduplication is automatic.
3. Automated Benchmarking
# Runs N iterations with warmup
# Calculates p50, p95, p99, mean, std
# Measures memory RSS delta, throughput
# Outlier trimming (top/bottom 5%)
4. Task-Aware Input Generation
# Three-tier detection: graph ops → tensor names → shapes
# Task-specific synthetic inputs (never zeros for known tasks)
# NLP: multi-seq-len ladder [16, 64, 128, 256]
# Post-inference output validation with configurable strictness
5. Output Divergence Checking
# Cosine similarity — output direction preservation
# MAE / max absolute error — per-element differences
# Top-1 agreement — classifier label consistency
# Streaming accumulators — memory-efficient over large batches
# Results persisted to accuracy_checks registry table
6. Dual Regression Detection
# Independent thresholds for latency and size
latency_regression = latency_change_pct > latency_threshold
size_regression = size_change_pct > size_threshold
regression_detected = latency_regression or size_regression
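The gate logic above, as a runnable sketch with the default thresholds:

```python
def regression_detected(baseline, candidate,
                        latency_threshold_pct=10.0, size_threshold_pct=5.0):
    """Independent latency and size gates; either one tripping fails the check."""
    latency_pct = 100.0 * (candidate["latency_ms"] - baseline["latency_ms"]) / baseline["latency_ms"]
    size_pct = 100.0 * (candidate["size_mb"] - baseline["size_mb"]) / baseline["size_mb"]
    return latency_pct > latency_threshold_pct or size_pct > size_threshold_pct

base = {"latency_ms": 2.49, "size_mb": 13.39}
ok = regression_detected(base, {"latency_ms": 0.74, "size_mb": 6.74})    # improvement
bad = regression_detected(base, {"latency_ms": 3.10, "size_mb": 13.39})  # ~24% slower
```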
7. Explore Verdict Scoring
score = 0.6 * (baseline_latency / variant_latency) \
+ 0.4 * (baseline_size / variant_size)
# score > 1.0 → candidate for recommended/aggressive
# score ≤ 1.0 → skip (no net improvement on the weighted tradeoff)
Features
Build and Convert
- ONNX → CoreML conversion (Apple Silicon, A-series)
- ONNX → TFLite conversion (Android arm64)
- Quantization: FP32 / FP16 / INT8
- Deterministic builds (content-addressed)
- ONNX graph storage for re-conversion
Import Pre-built Models
- Import existing `.onnx`, `.tflite`, `.mlmodel`, `.mlpackage` files directly
- ONNX import runs via ONNX Runtime — `onnxruntime_cpu`, `onnxruntime_gpu`, `onnxruntime_ane` targets
- Format validation via protobuf check (ONNX), magic bytes (TFLite), structure checks (CoreML)
- Tier 1 task detection for all import formats — ONNX via graph ops, TFLite via FlatBuffer parsing, CoreML via coremltools spec
- Format/target compatibility enforcement
- Imported builds tracked with `[imported]` badge in `mlbuild log`
- Full MLBuild toolchain available immediately after import
Optimization
- FP16 quantization — recompilation from ONNX graph
- Dynamic range INT8 — weight-only, no calibration data needed
- Static INT8 — weights + activations quantized using representative calibration data; gracefully falls back to dynamic range on coremltools 9.0
- Magnitude pruning — global threshold-based, ONNX path works for both CoreML and TFLite, CoreML post-hoc path for imported models
- Method chaining — prune → quantize, any depth
- Distinct build IDs per optimization level (`int8_static` ≠ `int8`, `prune_0.50` ≠ `prune_0.75`)
- Deduplication — identical optimization chains reuse existing builds
Optimization Sweep
- `mlbuild explore` — single command sweeps fp16 + int8 across all backends
- Score-based verdict assignment (recommended / aggressive / skip / baseline)
- Accuracy check per variant with `--check-accuracy` — failed variants get `skip` verdict
- Calibration data support with `--calibration-data` for static INT8 in sweep
- Fast mode (`--fast`) — fp16 only, 20 benchmark runs
Accuracy / Output Divergence
- Cosine similarity, MAE, max absolute error, top-1 agreement
- Dtype-aware random input generation
- `precomputed_batch` — inputs generated once, reused across all variants in explore
- Results persisted to `accuracy_checks` registry table
- Standalone `mlbuild accuracy` command
- Integrated into `mlbuild compare --check-accuracy` and `mlbuild explore --check-accuracy`
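The four divergence metrics listed above reduce to a few lines each. A dependency-free sketch (MLBuild's own implementation in `metrics.py` may differ in details such as batching and epsilon handling):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mae(a, b):
    """Mean absolute error between two flat output vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def top1_agreement(logits_a, logits_b):
    """Fraction of samples where both models predict the same argmax class."""
    same = sum(la.index(max(la)) == lb.index(max(lb))
               for la, lb in zip(logits_a, logits_b))
    return same / len(logits_a)

ref = [0.1, 0.7, 0.2]        # e.g. fp32 baseline output
quant = [0.12, 0.66, 0.22]   # e.g. int8 variant output
print(cosine_similarity(ref, quant))   # close to 1.0
print(top1_agreement([ref], [quant]))  # 1.0 — same predicted class
```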
Task-Aware Benchmarking
- Three-tier automatic task detection (graph ops → tensor names → shapes)
- Task-specific synthetic inputs: real image tensors, token IDs + attention masks, waveforms/spectrograms
- NLP multi-sequence-length benchmarking ladder `[16, 64, 128, 256]`
- Configurable `--task` override for explicit control
- Post-inference output validation with soft/strict modes (`--strict-output`)
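For the NLP ladder, a benchmark harness only needs token IDs and an attention mask at each sequence length. A sketch of what task-aware input generation could look like; the function name and the vocabulary size are assumptions for illustration, not MLBuild's actual API:

```python
import random

SEQ_LADDER = [16, 64, 128, 256]  # sequence lengths from the ladder above

def make_nlp_inputs(seq_len, vocab_size=30522, seed=0):
    """Random token IDs plus an all-ones attention mask for one sequence.
    Seeded so repeated benchmark runs see identical inputs."""
    rng = random.Random(seed)
    token_ids = [rng.randrange(vocab_size) for _ in range(seq_len)]
    attention_mask = [1] * seq_len
    return {"input_ids": token_ids, "attention_mask": attention_mask}

for n in SEQ_LADDER:
    batch = make_nlp_inputs(n)
    assert len(batch["input_ids"]) == n
    # here: run the model once per sequence length and record latency
```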
Performance Validation
- Automated p50/p95/p99 benchmarking
- SLA enforcement (
--max-latency,--max-p95,--max-memory,--max-size) - Accuracy validation via
--dataset(calibration data), composes with performance checks - Baseline accuracy comparison with
--baseline-id(defaults to root build) - Root builds skip accuracy check gracefully rather than erroring
Deep Profiling (--deep)
- TFLite: Per-op timing (real hardware), tensor memory flow, COMPUTE/MEMORY bottleneck classification, cold start decomposition, per-layer quantization sensitivity (fp32 vs int8), op fusion detection
- CoreML: Cold start decomposition (all formats); per-layer timing, memory flow, bottleneck classification, and fusion detection (NeuralNetwork format only)
Build History and Lineage
- `mlbuild log --source` — filter builds by source model filename
- `mlbuild log --tree` — recursive parent-child DAG — causality preserved across optimization chains
- Human-readable method labels in tree: `prune(0.50)`, `int8(static)`
- Filter by name, format, task, target, date range, roots-only
- JSON and CSV export
Command History
- `mlbuild history` — permanent audit log of every CLI command ever run
- Searchable by command type, time window, build ID
- Filterable: build, benchmark, validate, compare, profile, failed
- Delete individual entries or clear all — never affects build or benchmark data
- Machine identity captured on every row — ready for cross-machine team view when cloud login lands
Performance Budget
- `mlbuild budget set/show/clear/validate` — persistent constraint management
- Stored in `.mlbuild/budget.toml` — commit to git for team-wide enforcement
- Merge logic: explicit CLI flag > budget > no constraint
- Constraint source shown in violation output (`budget` vs `explicit flag`)
- All four constraints: `max_latency_ms`, `max_p95_ms`, `max_memory_mb`, `max_size_mb`
- Applied automatically by `mlbuild validate` and `mlbuild ci`
- `budget validate <build_id>` — dry run, evaluates size immediately, flags latency as pending
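The merge precedence (explicit flag wins over the budget file, which wins over no constraint) can be sketched in a few lines. This is a simplification under assumed names; MLBuild's actual logic lives in `core/budget.py` and may differ:

```python
CONSTRAINT_KEYS = ("max_latency_ms", "max_p95_ms", "max_memory_mb", "max_size_mb")

def merge_constraints(cli_flags, budget):
    """Resolve each constraint to (value, source): explicit CLI flag first,
    then the budget file, otherwise the constraint is absent."""
    merged = {}
    for key in CONSTRAINT_KEYS:
        if cli_flags.get(key) is not None:
            merged[key] = (cli_flags[key], "explicit flag")
        elif budget.get(key) is not None:
            merged[key] = (budget[key], "budget")  # from .mlbuild/budget.toml
    return merged

budget = {"max_latency_ms": 10.0, "max_size_mb": 8.0}
flags = {"max_latency_ms": 5.0}  # user passed --max-latency 5 explicitly
resolved = merge_constraints(flags, budget)
assert resolved["max_latency_ms"] == (5.0, "explicit flag")
assert resolved["max_size_mb"] == (8.0, "budget")
assert "max_memory_mb" not in resolved  # no constraint from either source
```

Tracking the source alongside the value is what lets violation output say whether a limit came from the `budget` or an `explicit flag`.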
Baseline Management
- `mlbuild baseline set/show/unset/history` — clean UX wrapper around `mlbuild tag`
- Uses reserved tag `mlbuild-baseline` — integrates with `mlbuild ci` automatically
- Prompts before overwriting existing baseline
- `baseline history` — shows all baseline-style tags: `mlbuild-baseline`, `main-*`, `production-*`
Workspace Status
- `mlbuild status` — instant workspace health snapshot
- Shows build/benchmark counts, last build, last benchmark, last validate result
- Shows current baseline and active budget constraints
- JSON output via `--json` for scripting
Performance Reports
- Self-contained HTML (no external dependencies)
- Benchmark history table
- Related builds comparison
- Deployment recommendations
- Optional PDF export (requires weasyprint)
Remote Storage
- S3-compatible backends (AWS, R2, B2)
- Git-style push/pull/sync
- Integrity verification (SHA-256)
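Integrity verification on push/pull is conceptually a digest comparison against the recorded SHA-256. A minimal sketch, with illustrative function names rather than MLBuild's actual storage API:

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_digest):
    """True if the local artifact matches the digest stored in the registry."""
    return sha256_file(path) == expected_digest
```

Streaming in chunks matters for model artifacts, which can be hundreds of megabytes.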
CI/CD Integration
- `mlbuild ci` — full CI orchestration (explore + compare + threshold enforcement + JSON report)
- Tag-based baseline resolution — `mlbuild tag create <id> main-mobilenet`
- Baseline immutability — report stores both tag name and build ID for reproducibility
- Baseline benchmark guard — auto-benchmarks baseline if no cached latency
- Relative regression thresholds (`--latency-regression`, `--size-regression`)
- Absolute budget constraints (`--latency-budget`, `--size-budget`) independent of baseline
- Accuracy gate via `--dataset` — cosine similarity + top-1 agreement
- `--fail-on-missing-baseline` — strict mode for production pipelines
- Structured JSON report at `.mlbuild/ci_report.json` — readable by GitHub bots, dashboards, Slack
- `mlbuild ci-check` — low-level build-to-build regression gate
- Exit codes: 0 (pass/skip) / 1 (regression/fail) / 2 (error)
- GitHub Actions workflow with artifact upload and PR comment posting (`.github/workflows/mlbuild.yml`)
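The exit-code contract can be expressed as a tiny gate function. A sketch of the semantics only, assuming hypothetical benchmark dictionaries, not MLBuild's actual `ci-check` implementation:

```python
def regression_gate(baseline, candidate,
                    latency_threshold_pct=10.0, size_threshold_pct=10.0):
    """Return 0 (pass), 1 (regression), or 2 (error), per the exit-code contract."""
    try:
        lat_delta = (candidate["latency_ms"] / baseline["latency_ms"] - 1) * 100
        size_delta = (candidate["size_mb"] / baseline["size_mb"] - 1) * 100
    except (KeyError, ZeroDivisionError):
        return 2  # missing or unusable benchmark data
    if lat_delta > latency_threshold_pct or size_delta > size_threshold_pct:
        return 1  # regression detected — CI blocks the PR
    return 0      # within thresholds — PR may merge

base = {"latency_ms": 8.0, "size_mb": 6.0}
cand = {"latency_ms": 15.0, "size_mb": 10.0}  # the "Problem" scenario above
assert regression_gate(base, cand) == 1
assert regression_gate(base, base) == 0
```

Keeping 2 reserved for errors lets a pipeline distinguish "your model regressed" from "the gate itself could not run".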
Project Structure
mlbuild/
├── src/mlbuild/
│ ├── cli/
│ │ ├── commands/
│ │ │ ├── accuracy.py # mlbuild accuracy
│ │ │ ├── baseline.py # mlbuild baseline
│ │ │ ├── benchmark.py # mlbuild benchmark
│ │ │ ├── budget.py # mlbuild budget
│ │ │ ├── build.py # mlbuild build
│ │ │ ├── ci.py # mlbuild ci + ci-check
│ │ │ ├── compare.py # mlbuild compare
│ │ │ ├── compare_compute_units.py # mlbuild compare-compute-units
│ │ │ ├── compare_quantization.py # mlbuild compare-quantization
│ │ │ ├── diff.py # mlbuild diff
│ │ │ ├── doctor.py # mlbuild doctor
│ │ │ ├── experiment.py # mlbuild experiment
│ │ │ ├── explore.py # mlbuild explore
│ │ │ ├── history.py # mlbuild history
│ │ │ ├── import_cmd.py # mlbuild import
│ │ │ ├── log.py # mlbuild log
│ │ │ ├── optimize.py # mlbuild optimize
│ │ │ ├── profile.py # mlbuild profile
│ │ │ ├── pull.py # mlbuild pull
│ │ │ ├── push.py # mlbuild push
│ │ │ ├── remote.py # mlbuild remote
│ │ │ ├── report.py # mlbuild report
│ │ │ ├── run.py # mlbuild run
│ │ │ ├── status.py # mlbuild status
│ │ │ ├── sync.py # mlbuild sync
│ │ │ ├── tag.py # mlbuild tag
│ │ │ └── validate.py # mlbuild validate
│ │ └── main.py # CLI entry point
│ ├── backends/
│ │ ├── base.py # Backend base class
│ │ ├── registry.py # Backend auto-discovery
│ │ ├── coreml/ # CoreML exporter + deep profiler
│ │ ├── tflite/ # TFLite backend + deep profiler
│ │ └── onnxruntime/ # ONNX Runtime backend
│ ├── benchmark/
│ │ ├── runner.py # Benchmark runner + stats
│ │ └── device_runner.py # Device benchmark runner
│ ├── core/
│ │ ├── budget.py # Budget load/save/merge/validate
│ │ ├── accuracy/
│ │ │ ├── calibration.py # CalibrationLoader (images/npy/npz)
│ │ │ ├── checker.py # run_accuracy_check()
│ │ │ ├── config.py # AccuracyConfig, AccuracyResult
│ │ │ ├── inputs.py # InputSpec, generate_batch
│ │ │ └── metrics.py # cosine_similarity, MAE, top-1
│ │ ├── ci/
│ │ │ ├── reporter.py # CIReport + text/JSON/markdown formatters
│ │ │ ├── runner.py # CIRunner orchestration
│ │ │ └── thresholds.py # ThresholdConfig + violation evaluation
│ │ ├── environment.py # Environment fingerprinting
│ │ ├── errors.py # Error types
│ │ ├── format_detection.py # Format detection + target validation
│ │ ├── hash.py # Deterministic artifact hashing
│ │ ├── ir.py # ModelIR — format-agnostic model graph
│ │ ├── machine.py # Machine identity (UUID + hostname)
│ │ ├── task_detection.py # Three-tier task detection
│ │ ├── task_inputs.py # Task-aware synthetic input generation
│ │ ├── task_validation.py # Post-inference output validation
│ │ ├── tasks.py # Task types + arbitration + output schemas
│ │ └── types.py # Build, Benchmark, VariantResult dataclasses
│ ├── experiments/ # Experiment + run tracking
│ ├── explore/
│ │ └── explorer.py # explore(), assign_verdicts(), accuracy integration
│ ├── loaders/
│ │ ├── loader.py # Model loading entrypoint
│ │ └── onnx_loader.py # ONNX loader + ModelIR builder
│ ├── optimize/
│ │ ├── optimizer.py # optimize() + prune() entrypoints
│ │ ├── passes/
│ │ │ ├── pruning.py # PruningPass (ONNX + CoreML post-hoc)
│ │ │ └── quantization.py # QuantizationPass (fp16/int8/int8_static)
│ │ └── backends/
│ │ ├── coreml_backend.py # compile_from_graph, quantize_weights,
│ │ │ # quantize_weights_static, prune_weights
│ │ └── tflite_backend.py # quantize_from_graph
│ ├── profiling/
│ │ ├── cold_start.py # Cold start decomposition
│ │ ├── layer_profiler.py # Per-layer timing
│ │ ├── memory_profiler.py # Memory tracking
│ │ └── warmup_analyzer.py # Warmup analysis
│ ├── registry/
│ │ ├── local.py # SQLite registry (WAL mode)
│ │ └── schema.py # Schema + migrations (v9)
│ ├── storage/
│ │ ├── backend.py # Storage backend interface
│ │ ├── config.py # Remote config
│ │ ├── local.py # Local storage
│ │ └── s3.py # S3-compatible storage
│ ├── validation/
│ │ └── accuracy_validator.py # AccuracyValidator for mlbuild validate
│ └── visualization/
│ └── charts.py # Chart generation
├── tests/
├── pyproject.toml
└── README.md
vs. Existing Tools
| Feature | Custom Scripts | Profilers | MLBuild |
|---|---|---|---|
| Hardware inference benchmarking | Manual | Partial | Automated |
| Performance regression detection | Custom | Manual | Built-in |
| CI performance gate | Custom | — | Built-in |
| Cross-device testing | Manual | — | Yes |
| Performance history & tracking | — | — | Built-in |
| CI-automated per-layer profiling | Custom | Manual | Automated |
| Quantization performance benchmarking | — | Manual | Automated |
| Auto-generated task inputs | — | — | Auto-detected |
| Performance reports | — | — | HTML/PDF |
Use MLflow/W&B for training experiments. Use MLBuild for on-device inference performance.
Development
```bash
git clone https://github.com/AbdoulayeSeydi/mlbuild.git
cd mlbuild
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
pytest tests/
```
Contributing
See CONTRIBUTING.md for development setup, coding standards, and PR process.
License
MIT License — see LICENSE for details.
Roadmap
Phase 1 — Device-Connected Benchmarking (next)
- Android ADB bridge — benchmark on connected Android devices without Android Studio
- Xcode Instruments integration — real iPhone hardware profiling
Phase 2 — More Backends
- TensorRT — NVIDIA GPU inference
- Qualcomm QNN — Snapdragon NPU
Phase 3 — Cloud Benchmarking
- Remote benchmark execution on cloud hardware
File details
Details for the file mlbuild-0.3.0.tar.gz.
File metadata
- Download URL: mlbuild-0.3.0.tar.gz
- Upload date:
- Size: 302.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `674e939d75f63173f597a49ea7d0d9e2c12099cd0da4ca49fc9b98c16ca5c960` |
| MD5 | `924f089a05c1fde4104d205894e8c6e8` |
| BLAKE2b-256 | `fbf4c36ed636bd3f24ec467eb8f1b9f7333d326467d477c939813c7280d241d0` |
File details
Details for the file mlbuild-0.3.0-py3-none-any.whl.
File metadata
- Download URL: mlbuild-0.3.0-py3-none-any.whl
- Upload date:
- Size: 322.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7e0538a3846da5d8c87f3a3814ba15feb870c95d11081fa462a5a034707326e3` |
| MD5 | `74a0ba1cb31c105a371500a234afff4c` |
| BLAKE2b-256 | `a04f4d60a4b46a745a6cf2e2800d7cb662baa388f3530f51b6cdbe393c50e79e` |