Performance CI/CD for on-device ML models — catch inference regressions before they ship

MLBuild

Performance CI/CD for On-Device Production ML Models

MLBuild is the missing performance layer for on-device ML CI/CD. While MLflow, DVC, and W&B track training experiments, MLBuild enforces production SLAs — automatically benchmarking inference performance, validating against thresholds, blocking regressions in CI, and generating deployment-ready reports.

Current Status

Feature	Status
Input formats	ONNX, TFLite, CoreML
Backends	CoreML, TFLite, ONNX Runtime
Storage	Local + S3-compatible (AWS S3, R2, B2)
Targets	Apple Silicon, A-series, Android (arm64)
Platform	macOS, Linux (TFLite)
Command history	Local, searchable, filterable by every command
Performance budget	Persistent constraints in .mlbuild/budget.toml
Baseline management	Reserved tag with clean CLI
Workspace status	Quick health snapshot

The Problem

# Your CI passes
pytest              ✓
black --check       ✓
mypy                ✓

# But in production
Latency:  8ms  --> 15ms   (88% slower)
Memory:   50MB --> 120MB  (140% more)
Size:     6MB  --> 10MB   (67% larger)

# Nobody caught it until users complained

The gap: Existing tools don't validate production performance in CI.

The Solution

# Tag your main branch baseline once
mlbuild tag create <build_id> main-mobilenet

# Add one step to your CI pipeline
mlbuild ci --model model.onnx --baseline main-mobilenet

# Output:
# MLBuild CI Report
# ──────────────────────────────────────────────────
# Model:     mobilenet
# Baseline:  3f36810e (main-mobilenet)
# Candidate: b8aa1ef6 (fp16)
#
#                      Baseline     Candidate       Delta
# Latency (p50)         2.49 ms       0.74 ms     -70.27%
# Size                 13.39 MB       6.74 MB     -49.64%
#
# Result: ✓ PASS
# Exit code: 0

# Or use the low-level gate directly
mlbuild ci-check $BASELINE_ID $CANDIDATE_ID --latency-threshold 10
# Exit code: 1 — PR blocked on regression

Catch latency AND size regressions before they reach production.

Where MLBuild Fits

MLBuild is the missing on-device performance layer in your ML CI/CD stack.

┌─────────────────────────────────────────────────────────────────┐
│  ML Training                                                    │
│  ├── Experiment Tracking ──────────────── MLflow / W&B         │
│  └── Data Versioning ──────────────────── DVC                  │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  On-Device Optimization              MLBuild                    │
│  ├── Model Packaging ──────────────── mlbuild build             │
│  ├── Model Import ─────────────────── mlbuild import            │
│  ├── Task Detection ───────────────── automatic                 │
│  ├── Performance Validation ───────── mlbuild benchmark         │
│  ├── Quantization Benchmarking ─── mlbuild compare-quantization │
│  └── Reporting ────────────────────── mlbuild report            │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Regression Gate                     MLBuild CI                 │
│  ✕  Bad performance → blocks deployment                        │
│  ├── CI Performance Gate ─────────── mlbuild ci-check          │
│  └── Full CI Orchestration ───────── mlbuild ci               │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────────┐
│  Deployment                                                     │
│  └── Release & Ship ───────────────── GitHub Actions / K8s     │
└─────────────────────────────────────────────────────────────────┘

Feature	MLflow / W&B / DVC	MLBuild
Track training experiments	Yes	No (use MLflow)
Automated p50/p95/p99 benchmarking	Manual	Built-in
CI fails on latency regression	Not native	`mlbuild ci-check`
CI fails on model size regression	Not native	`--size-threshold`
Task-aware synthetic inputs	No	Auto-detected
NLP multi-seq-len benchmarking	No	Built-in
Optimization sweep (fp16 + int8)	No	`mlbuild explore`
Static INT8 with calibration data	No	`--calibration-data`
Magnitude pruning (ONNX + CoreML)	No	`mlbuild optimize --pass prune`
Output divergence checking	No	`mlbuild accuracy`
Optimization chain visualization	No	`mlbuild log --tree`
Quantization tradeoff analysis	No	`mlbuild compare-quantization`
Performance reports	No	`mlbuild report`
S3-compatible remote storage	No	Built-in
TFLite benchmarking	No	Built-in
Import pre-built models	No	`mlbuild import`

MLBuild complements your existing stack — it doesn't replace it.

Installation

pip install mlbuild
mlbuild doctor

For TFLite support:

pip install "mlbuild[tflite]"

For S3 remote storage:

pip install "mlbuild[s3]"

For macOS (CoreML + TFLite full stack):

pip install "mlbuild[macos]"

For Linux / CI (TFLite only, no CoreML):

pip install "mlbuild[linux]"

Quick Start

# 1. Build and convert model
mlbuild build --model model.onnx --target apple_m1 --quantize fp16

# 1b. Or import a pre-built model
mlbuild import --model model.tflite --target android_arm64
mlbuild import --model model.mlpackage --target apple_m1 --quantize fp16

# 2. Benchmark (automatic p50/p95/p99, task auto-detected)
mlbuild benchmark <build-id>

# 3. Sweep all optimization variants automatically
mlbuild explore model.onnx --target apple_m1

# 4. Check output divergence between variants
mlbuild accuracy <baseline-id> <candidate-id>

# 5. Validate SLAs (performance + accuracy in one command)
mlbuild validate <build-id> --max-latency 10 --dataset ./imagenet-mini/

# 6. Run full CI check against registered baseline
mlbuild ci --model model.onnx --baseline main-mobilenet

# 6b. Or use low-level compare
mlbuild compare baseline candidate --threshold 5 --check-accuracy --ci

# 7. View full optimization lineage
mlbuild log --source model.onnx --tree

# 8. Generate performance report
mlbuild report <build-id> --open

# 9. Tag for production
mlbuild tag create <build-id> production

GitHub Actions Integration

- name: MLBuild CI
  run: |
    pip install mlbuild

    # Full CI check — explore, compare, report in one command
    mlbuild ci \
      --model models/mobilenet.onnx \
      --baseline main-mobilenet \
      --latency-regression 15 \
      --size-regression 10

- name: Upload CI report
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: mlbuild-report
    path: .mlbuild/ci_report.json

See .github/workflows/mlbuild.yml for a complete example with PR comment posting.

Documentation

Core Commands

Build and Convert

mlbuild build --model model.onnx --target apple_m1 --quantize fp16 --name "v2.0"
mlbuild build --model model.onnx --target android_arm64 --quantize int8

Import Pre-built Models

Register an existing TFLite, CoreML, or ONNX model directly, with no conversion required. Once imported, all MLBuild commands (benchmark, profile, compare, report, ci-check) work on it immediately.

# Import a TFLite model
mlbuild import --model model.tflite --target android_arm64

# Import a CoreML model
mlbuild import --model model.mlpackage --target apple_m1

# Import an ONNX model (benchmarked via ONNX Runtime)
mlbuild import --model model.onnx --target onnxruntime_cpu
mlbuild import --model model.onnx --target onnxruntime_gpu

# Import with metadata
mlbuild import --model model.tflite --target android_arm64 \
  --quantize int8 \
  --name "vendor-v2" \
  --notes "Supplied by vendor, int8 quantized"

# JSON output (for CI pipelines)
mlbuild import --model model.tflite --target android_arm64 --json

Supported formats:

.onnx — validated via protobuf check, runs via ONNX Runtime
.tflite — validated via FlatBuffer magic bytes (TFL3/TFL2)
.mlpackage — validated via Manifest.json + Data/ structure
.mlmodel — legacy CoreML flat file

Format/target compatibility:

Format	Valid Targets
`onnx`	`onnxruntime_cpu`, `onnxruntime_gpu`, `onnxruntime_ane`
`tflite`	`android_arm64`, `android_arm32`, `android_x86`, `raspberry_pi`, `coral_tpu`, `generic_linux`
`coreml`	`apple_m1`, `apple_m2`, `apple_m3`, `apple_a15`, `apple_a16`, `apple_a17`

Imported builds are marked [imported] in mlbuild log output and tracked with "imported": true in their metadata.

Optimize

Generate optimized variants of a registered build. Supports quantization and magnitude pruning. All variants are registered as children of the source build with full lineage tracking.

Quantization

# FP16 — recompiles from ONNX graph (lower precision weights)
mlbuild optimize <build_id> --pass quantize --method fp16

# Dynamic range INT8 — weight-only, no calibration data needed
mlbuild optimize <build_id> --pass quantize --method int8

# Static INT8 — quantizes weights + activations using calibration data
mlbuild optimize <build_id> --pass quantize --method int8 \
  --calibration-data ./imagenet-mini/

Calibration data formats for static INT8:

Directory of images (.jpg, .png, .bmp, .webp) — auto-resized to model input shape, normalized to [0, 1]
Directory of .npy files — one array per sample
Single .npz file — named array, first axis = samples

Static and dynamic INT8 are stored as distinct builds (int8 vs int8_static) — both can coexist in the registry with different build IDs.

Note: Full static INT8 (weight + activation quantization) requires coremltools 9.1+. On 9.0, MLBuild automatically falls back to dynamic range INT8 with a clear warning — no crash, no silent misbehavior.

Pruning

Magnitude-based unstructured weight pruning. Zeros out the smallest weights by absolute value up to a target sparsity level. No retraining required.

# 50% sparsity
mlbuild optimize <build_id> --pass prune --sparsity 0.5

# 75% sparsity
mlbuild optimize <build_id> --pass prune --sparsity 0.75

Routing logic:

has_graph=True → ONNX magnitude pruning → re-convert via existing build pipeline (works for CoreML and TFLite)
has_graph=False + coreml → CT9 OpMagnitudePrunerConfig post-hoc on compiled .mlpackage
has_graph=False + tflite → Error with actionable message (Re-register using 'mlbuild build' or 'mlbuild import --graph model.onnx')

Pruning skips bias, batch norm, and small tensors (< 256 params) automatically. Sparsity level is baked into the method name (prune_0.50), so each level gets a distinct build ID.
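
As a rough illustration of the global magnitude pruning described above, the sketch below zeroes the smallest weights across all eligible tensors. It is not MLBuild's implementation: the tensor names, the keyword-based skip rule, and the plain-list weight format are assumptions made for the example, and ties at the threshold can push the achieved sparsity slightly above the target.

```python
def prune_by_magnitude(tensors, sparsity, min_params=256,
                       skip_keywords=("bias", "batch_norm")):
    """Zero the smallest-magnitude weights using one global threshold.

    tensors: dict mapping a (hypothetical) tensor name to a flat list of floats.
    Tensors with fewer than min_params values, or whose name contains a skip
    keyword, are copied through unchanged.
    """
    eligible = {
        name: w for name, w in tensors.items()
        if len(w) >= min_params and not any(k in name for k in skip_keywords)
    }
    magnitudes = sorted(abs(x) for w in eligible.values() for x in w)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return {name: list(w) for name, w in tensors.items()}
    threshold = magnitudes[n_prune - 1]  # largest magnitude that gets zeroed
    pruned = {}
    for name, w in tensors.items():
        if name in eligible:
            pruned[name] = [0.0 if abs(x) <= threshold else x for x in w]
        else:
            pruned[name] = list(w)
    return pruned


weights = {
    "conv.weight": [float(i + 1) * (-1) ** i for i in range(256)],
    "conv.bias": [0.01] * 300,      # skipped: name contains "bias"
    "tiny.weight": [0.001] * 10,    # skipped: fewer than 256 params
}
pruned = prune_by_magnitude(weights, sparsity=0.5)
zero_fraction = pruned["conv.weight"].count(0.0) / len(pruned["conv.weight"])
print(f"conv.weight sparsity: {zero_fraction:.2f}")
```

Because the threshold is computed over all eligible tensors together, individual layers can end up above or below the target sparsity even when the model as a whole matches it.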

Method chaining

Pruning and quantization can be chained arbitrarily:

# Prune first, then quantize
mlbuild optimize <build_id> --pass prune --sparsity 0.5
mlbuild optimize <pruned_build_id> --pass quantize --method int8

Explore

Sweeps all optimization variants for a model in one command. Builds the fp32 baseline, generates fp16 and int8 variants, benchmarks all of them, and assigns verdicts.

# Full sweep (fp16 + int8, all backends)
mlbuild explore model.onnx --target apple_m1

# Fast mode (fp16 only, 20 benchmark runs)
mlbuild explore model.onnx --target apple_m1 --fast

# Specific backends
mlbuild explore model.onnx --backends coreml
mlbuild explore model.onnx --backends coreml,tflite

# With static INT8 calibration data
mlbuild explore model.onnx --calibration-data ./imagenet-mini/

# With output divergence check per variant
mlbuild explore model.onnx --check-accuracy --cosine-threshold 0.99

# JSON output
mlbuild explore model.onnx --output-json

Verdict logic (score-based):

score = 0.6 × (baseline_latency / variant_latency)
      + 0.4 × (baseline_size / variant_size)

score > 1.0  → candidate for recommended or aggressive
score ≤ 1.0  → skip (no net improvement under the 0.6/0.4 weighting)

recommended — highest composite score (best balance of speed + size)
aggressive — smallest size among remaining candidates
skip — no improvement, or accuracy check failed
baseline — fp32 reference

COREML
  Verdict       Method         Size      p50 Latency   vs Baseline    Accuracy
  baseline      fp32           13.39 MB  3.29ms        —              —
  aggressive    fp16            6.74 MB  3.29ms        ↑0% lat        —
                                                       ↓50% size
  recommended   int8(static)    3.58 MB  2.81ms        ↓14% lat       ✓ 0.9999
                                                       ↓73% size
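
The scoring and verdict rules above can be sketched in a few lines. This is a minimal illustration, not MLBuild's code: the `Variant` class and function names are hypothetical, and the `"candidate"` label for variants that pass the score cut but win neither verdict is an assumption, since the rules above do not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    method: str
    size_mb: float
    latency_ms: float


def composite_score(baseline, variant):
    # 0.6 weight on the latency ratio, 0.4 on the size ratio.
    return (0.6 * (baseline.latency_ms / variant.latency_ms)
            + 0.4 * (baseline.size_mb / variant.size_mb))


def assign_verdicts(baseline, variants):
    scores = {v.method: composite_score(baseline, v) for v in variants}
    verdicts = {baseline.method: "baseline"}
    candidates = []
    for v in variants:
        if scores[v.method] > 1.0:
            candidates.append(v)
        else:
            verdicts[v.method] = "skip"
    if candidates:
        best = max(candidates, key=lambda v: scores[v.method])
        verdicts[best.method] = "recommended"
        remaining = [v for v in candidates if v is not best]
        if remaining:
            smallest = min(remaining, key=lambda v: v.size_mb)
            verdicts[smallest.method] = "aggressive"
            # Assumption: other passing variants get a neutral label.
            for v in remaining:
                verdicts.setdefault(v.method, "candidate")
    return verdicts


# Sizes and p50 latencies taken from the COREML table above.
fp32 = Variant("fp32", 13.39, 3.29)
verdicts = assign_verdicts(
    fp32,
    [Variant("fp16", 6.74, 3.29), Variant("int8(static)", 3.58, 2.81)],
)
print(verdicts)
```

With the table's numbers, int8(static) has the highest composite score and fp16 is the smallest remaining candidate, which matches the verdicts shown.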

Accuracy

Standalone output divergence check between two builds. Runs inference on both with synthetic inputs and computes similarity metrics.

mlbuild accuracy <baseline_id> <candidate_id>
mlbuild accuracy <baseline_id> <candidate_id> --samples 64 --seed 42
mlbuild accuracy <baseline_id> <candidate_id> \
  --cosine-threshold 0.99 \
  --top1-threshold 0.99

Metrics:

Cosine similarity — cosine of the angle between output vectors (1.0 = identical direction)
Mean absolute error — average per-element absolute difference
Max absolute error — worst-case per-element difference
Top-1 agreement — fraction of samples where both models pick the same top class (classifiers only)
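
For readers who want to see how these four metrics relate, here is a small pure-Python sketch. It assumes each model's outputs are available as one list of floats per sample and that cosine similarity is averaged over samples; how MLBuild aggregates across samples is not stated here, so treat that choice as an assumption.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Degenerate case: treat two all-zero vectors as identical.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def divergence_metrics(baseline_outputs, candidate_outputs):
    """Compare per-sample output vectors from two models."""
    cosines, abs_errors, agree = [], [], 0
    for ref, out in zip(baseline_outputs, candidate_outputs):
        cosines.append(cosine_similarity(ref, out))
        abs_errors.extend(abs(r - o) for r, o in zip(ref, out))
        # Top-1 agreement: both models pick the same argmax class.
        agree += ref.index(max(ref)) == out.index(max(out))
    n = len(cosines)
    return {
        "cosine": sum(cosines) / n,            # assumption: mean over samples
        "mae": sum(abs_errors) / len(abs_errors),
        "max_abs_error": max(abs_errors),
        "top1": agree / n,
    }


baseline_out = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
candidate_out = [[0.1, 0.7, 0.2], [0.3, 0.6, 0.1]]
metrics = divergence_metrics(baseline_out, candidate_out)
print(metrics)
```

The second sample flips its top class, so top-1 agreement drops to 0.5 even though the mean absolute error stays small. This is why the cosine and top-1 thresholds are checked independently.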

Results are persisted to the registry's accuracy_checks table.

Example results on MobileNet:

fp32 → fp16:  cosine=0.9999  top1=1.00   passed=True
fp32 → int8:  cosine=0.9983  top1=0.97   passed=False (top1 < 0.99 threshold)

Benchmark

mlbuild benchmark <build-id> --runs 100 --warmup 20 --json
mlbuild benchmark <build-id> --compute-unit CPU_ONLY

Validate SLAs

Validates a build against performance and accuracy constraints. All checks compose in a single command.

# Performance constraints only
mlbuild validate <build_id> --max-latency 10 --max-size 8

# Accuracy constraint with dataset
mlbuild validate <build_id> \
  --dataset ./imagenet-mini/ \
  --cosine-threshold 0.99 \
  --top1-threshold 0.99

# All checks composed
mlbuild validate <build_id> \
  --max-latency 10 \
  --max-size 8 \
  --dataset ./imagenet-mini/ \
  --cosine-threshold 0.99

# CI mode (suppress output, exit codes only)
mlbuild validate <build_id> --max-latency 5 --ci

Options:

--max-latency — maximum p50 latency in ms
--max-p95 — maximum p95 latency in ms
--max-memory — maximum peak memory in MB
--max-size — maximum model size in MB
--dataset — calibration data for accuracy check (images dir, .npy dir, or .npz)
--baseline-id — reference build for accuracy comparison (default: root build)
--cosine-threshold — minimum cosine similarity (default: 0.99)
--top1-threshold — minimum top-1 agreement (default: 0.99)
--accuracy-samples — max calibration samples (default: 200)

If --dataset is provided but the build is the root (no baseline to compare against), the accuracy check is skipped with a message rather than failing.

Exit codes: 0 = all constraints passed, 1 = one or more violations.

Compare and Detect Regressions

# Compare with independent latency + size thresholds
mlbuild compare baseline candidate \
  --threshold 5 \
  --size-threshold 10 \
  --metric p95 \
  --ci

# Use cached benchmark results (skip re-benchmarking)
mlbuild compare baseline candidate --use-cached

# Include output divergence check
mlbuild compare baseline candidate --check-accuracy

# Dedicated CI gate
mlbuild ci-check baseline candidate
mlbuild ci-check baseline candidate --latency-threshold 10 --size-threshold 5
mlbuild ci-check baseline candidate --strict   # any positive delta fails
mlbuild ci-check baseline candidate --json

Exit codes:

0 — no regression (safe to ship)
1 — regression detected (block the PR)
2 — error (infra failure, check logs)

CI Orchestration

Full CI check in one command — resolves baseline, explores variants, compares, enforces thresholds, and writes a structured report.

# Run full CI check against a tagged baseline
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet

# Use an existing build (skips explore — useful when builds happen earlier in pipeline)
mlbuild ci --build <build_id> --baseline main-mobilenet

# With absolute budgets (independent of baseline)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet \
  --latency-budget 3.0 \
  --size-budget 10.0

# With accuracy gate
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet \
  --dataset ./imagenet-mini/ \
  --cosine-threshold 0.99

# JSON output (for dashboards and GitHub bots)
mlbuild ci --build <build_id> --baseline main-mobilenet --json

# Fail if baseline tag not found (strict CI)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet --fail-on-missing-baseline

Tagging baselines:

# Tag a build as the main branch baseline
mlbuild tag create <build_id> main-mobilenet

# --baseline accepts tag names or build ID prefixes
mlbuild ci --baseline main-mobilenet   # tag lookup
mlbuild ci --baseline 3f36810e         # build ID prefix

Options:

Flag	Description	Default
`--model`	ONNX model path (runs explore)	—
`--build`	Existing build ID (skips explore)	—
`--baseline`	Tag name or build ID	required
`--target`	Device target for explore	auto
`--latency-regression`	Max latency regression %	10.0
`--size-regression`	Max size regression %	5.0
`--latency-budget`	Hard latency cap in ms	none
`--size-budget`	Hard size cap in MB	none
`--dataset`	Calibration data for accuracy check	none
`--cosine-threshold`	Min cosine similarity	0.99
`--top1-threshold`	Min top-1 agreement	0.99
`--fail-on-missing-baseline`	Exit 1 if baseline not found	false
`--json`	Print JSON report to stdout	false

CI Report — always written to .mlbuild/ci_report.json:

{
  "model": "mobilenet.onnx",
  "baseline": {
    "tag": "main-mobilenet",
    "build_id": "3f36810e...",
    "latency_ms": 2.49,
    "size_mb": 13.39
  },
  "candidate": {
    "build_id": "b8aa1ef6...",
    "variant": "fp16",
    "parent_build_id": "3f36810e...",
    "latency_ms": 0.74,
    "size_mb": 6.74
  },
  "delta": { "latency_pct": -70.27, "size_pct": -49.64 },
  "thresholds": {
    "latency_regression_pct": 10.0,
    "size_regression_pct": 5.0,
    "latency_budget_ms": null,
    "size_budget_mb": null
  },
  "result": "pass",
  "violations": []
}

The report always stores baseline.build_id — even if the tag is later repointed, the report preserves exactly what was compared.

Exit codes: 0 = pass or skipped, 1 = regression/failure, 2 = error.
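
A downstream job (a dashboard or PR bot) could re-check a report against its own thresholds. The sketch below does that for a trimmed copy of the report shown above. The field names come from that report; the violation labels and the `exit_code` helper are hypothetical, and a real pipeline would load `.mlbuild/ci_report.json` instead of an inline string.

```python
import json

# Trimmed copy of the report structure shown above.
REPORT = json.loads("""
{
  "candidate": {"latency_ms": 0.74, "size_mb": 6.74},
  "delta": {"latency_pct": -70.27, "size_pct": -49.64},
  "thresholds": {
    "latency_regression_pct": 10.0,
    "size_regression_pct": 5.0,
    "latency_budget_ms": null,
    "size_budget_mb": null
  }
}
""")


def find_violations(report):
    delta = report["delta"]
    limits = report["thresholds"]
    candidate = report["candidate"]
    violations = []
    # Relative gates: positive delta means the candidate got slower or larger.
    if delta["latency_pct"] > limits["latency_regression_pct"]:
        violations.append("latency_regression")
    if delta["size_pct"] > limits["size_regression_pct"]:
        violations.append("size_regression")
    # Absolute budgets apply only when set (null means no cap).
    if (limits["latency_budget_ms"] is not None
            and candidate["latency_ms"] > limits["latency_budget_ms"]):
        violations.append("latency_budget")
    if (limits["size_budget_mb"] is not None
            and candidate["size_mb"] > limits["size_budget_mb"]):
        violations.append("size_budget")
    return violations


def exit_code(report):
    # Mirrors the documented convention: 0 = pass, 1 = regression/failure.
    return 1 if find_violations(report) else 0


print(find_violations(REPORT), exit_code(REPORT))
```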

Configuration via .mlbuild/config.toml:

[ci]
latency_regression_pct = 10
size_regression_pct = 5
latency_budget_ms = 3.0
size_budget_mb = 10.0

[ci.accuracy]
cosine_threshold = 0.99
top1_threshold = 0.99

Quantization Tradeoff Analysis

mlbuild compare-quantization fp32-build int8-build
mlbuild compare-quantization fp32-build int8-build --accuracy-samples 100
mlbuild compare-quantization fp32-build int8-build --json

Performance Report

mlbuild report <build-id>
mlbuild report <build-id> --open
mlbuild report <build-id> --output report.html
mlbuild report <build-id> --format pdf        # requires: pip install weasyprint

Deep Profiling

# TFLite: full 6-feature deep profile (no device required)
mlbuild profile <build-id> --deep

# CoreML: cold start decomposition (all formats)
mlbuild profile <build-id> --deep

# Options
mlbuild profile <build-id> --deep --top 20
mlbuild profile <build-id> --deep --runs 100
mlbuild profile <build-id> --deep --int8-build <id>  # TFLite: quant sensitivity

TFLite deep profiling features (--deep):

#	Feature	Description
①	Per-op timing	Real hardware timing via TFLite's built-in op profiler
②	Memory flow	Activation memory at each layer boundary, peak flagged
③	Bottleneck classification	COMPUTE vs MEMORY bound per op (arithmetic intensity)
④	Cold start decomposition	Load → first inference → stable, with warmup sparkline
⑤	Quantization sensitivity	Per-layer fp32 vs int8 divergence (requires `--int8-build`)
⑥	Fusion detection	Fused kernels identified + missed fusion opportunities flagged

Build History

# All builds
mlbuild log

# Specific build detail
mlbuild log <build_id>

# Filter by source model filename (substring match)
mlbuild log --source mobilenet.onnx

# Full optimization lineage tree (recursive parent-child)
mlbuild log --source mobilenet.onnx --tree

# Other filters
mlbuild log --name mobilenet
mlbuild log --format coreml
mlbuild log --task vision
mlbuild log --roots-only
mlbuild log --target apple_m1

# Export
mlbuild log --json
mlbuild log --csv builds.csv

The --tree flag renders the full optimization DAG using actual parent-child lineage. Method chaining (e.g. prune → int8) shows as nested children, not flat siblings — causality is preserved:

3f36810e  mobilenet  coreml  fp32  13.39 MB  2.49ms
├── b8aa1ef6  coreml  fp16  6.74 MB  0.74ms
├── 2921f0fa  coreml  int8  3.58 MB  3.15ms
├── 9df061cb  coreml  int8(static)  3.58 MB  2.81ms
├── 329f3b78  coreml  prune(0.50)  13.39 MB  3.89ms
│   └── 3fa93712  coreml  int8  3.58 MB  2.61ms
└── 0a17ce03  coreml  prune(0.75)  13.39 MB  2.94ms

Method labels are human-readable: prune(0.50), int8(static) instead of raw internal strings.

Command History

A persistent log of every MLBuild command run. Entries are searchable, filterable, and deletable.

# Show all recent commands
mlbuild history

# Filter by command type — every command is filterable
mlbuild history --filter build
mlbuild history --filter benchmark
mlbuild history --filter validate
mlbuild history --filter baseline
mlbuild history --filter budget
mlbuild history --filter status
mlbuild history --filter import
mlbuild history --filter compare
mlbuild history --filter failed
# ...and all other commands (accuracy, ci, diff, explore, optimize, profile, etc.)

# Filter by time
mlbuild history --since yesterday
mlbuild history --since "7 days ago"
mlbuild history --since "2024-01-01"

# Filter by build ID — everything that touched a specific build
mlbuild history --build-id a3f91c2

# Limit results
mlbuild history --limit 100

# Delete one entry by ID (min 4 chars)
mlbuild history delete d58cc62f

# Clear all history (prompts for confirmation)
mlbuild history clear

History is an audit log of CLI actions — separate from build and benchmark data. Deleting a history entry never touches builds or benchmarks.

Performance Budget

Persistent performance constraints committed to git. Set once, enforced automatically by mlbuild validate and mlbuild ci. Explicit flags always override budget values.

# Set constraints once
mlbuild budget set --max-latency 10 --max-p95 15 --max-size 8

# Show current budget
mlbuild budget show

# Preview what would apply to a build without benchmarking
mlbuild budget validate <build_id>

# Update one constraint without touching others
mlbuild budget set --max-latency 5

# Remove one constraint
mlbuild budget clear --constraint max-latency

# Remove all constraints (prompts for confirmation)
mlbuild budget clear

# After budget is set, validate uses it automatically
mlbuild validate <build_id>                  # uses budget
mlbuild validate <build_id> --max-latency 3  # overrides latency; budget applies to the rest

Budget is stored in .mlbuild/budget.toml — commit it so your whole team enforces the same constraints automatically.

Merge priority: explicit CLI flag > budget file > no constraint

Violation output shows the source of each constraint:

┃ Constraint     ┃   Limit ┃  Actual ┃     Violation       ┃        Source ┃
│ max_latency_ms │ 1.00 ms │ 2.66 ms │ +1.66 (166% over)   │ explicit flag │
│ max_size_mb    │ 8.00 MB │ 9.10 MB │ +1.10 (13.8% over)  │ budget        │
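
The merge priority can be pictured as a per-constraint lookup. The following is a minimal sketch, assuming constraints are held in plain dicts keyed by the names in the table above; the function names are hypothetical and this is not MLBuild's internal representation.

```python
def resolve_constraints(cli_flags, budget):
    """Per constraint: explicit flag wins, then the budget file, else unset."""
    resolved = {}
    for name in sorted(set(cli_flags) | set(budget)):
        if cli_flags.get(name) is not None:
            resolved[name] = (cli_flags[name], "explicit flag")
        elif budget.get(name) is not None:
            resolved[name] = (budget[name], "budget")
    return resolved


def find_violations(resolved, measured):
    """Return (name, limit, actual, percent_over, source) for each breach."""
    rows = []
    for name, (limit, source) in resolved.items():
        actual = measured.get(name)
        if actual is not None and actual > limit:
            percent_over = (actual - limit) / limit * 100.0
            rows.append((name, limit, actual, percent_over, source))
    return rows


budget = {"max_latency_ms": 10.0, "max_size_mb": 8.0}   # from budget.toml
cli_flags = {"max_latency_ms": 1.0}                      # e.g. --max-latency 1
measured = {"max_latency_ms": 2.66, "max_size_mb": 9.10}

resolved = resolve_constraints(cli_flags, budget)
rows = find_violations(resolved, measured)
for row in rows:
    print(row)
```

Keeping the source label next to each resolved limit is what lets the violation table say whether a limit came from a flag or from the committed budget.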

Baseline Management

Clean UX wrapper around mlbuild tag. Uses the reserved tag mlbuild-baseline so mlbuild ci resolves it automatically — zero CI changes required.

# Set a build as the performance baseline
mlbuild baseline set <build_id>

# Show current baseline
mlbuild baseline
mlbuild baseline show

# Show all baseline-style tags (mlbuild-baseline, main-*, production-*)
mlbuild baseline history

# Remove baseline (prompts for confirmation)
mlbuild baseline unset

The baseline integrates directly with mlbuild ci:

mlbuild ci --model model.onnx --baseline mlbuild-baseline

mlbuild baseline set prompts before overwriting an existing baseline. Use --force to skip the prompt.

Workspace Status

Quick health check of the current workspace. Reads from existing data — no new storage.

mlbuild status
mlbuild status --json

Output:

MLBuild Status  Abdoulayes-MacBook-Air.local

  ✓ Workspace    .mlbuild/
  ✓ Registry     26 builds  |  18 benchmarks
  Last build:  mobilenet (coreml, 3.58 MB) — 2h ago
  Last bench:  p50=2.61 ms — 2h ago

  ✓ Baseline     3fa9371209e6  mobilenet  2.61 ms  3.58 MB
  Last validate: PASSED — 52m ago

  ✓ Budget       .mlbuild/budget.toml
    Max latency (p50)    10.0 ms
    Max size             8.0 MB

Version Management

mlbuild log --limit 20
mlbuild diff build-a build-b
mlbuild tag create <build-id> v1.0.0

Experiment Tracking

mlbuild experiment create "quantization-search"
mlbuild run start --experiment "quantization-search"
mlbuild run log-param quantization int8
mlbuild run log-metric latency_p50 5.6
mlbuild run end

Remote Storage

# Set up S3-compatible remote (one-time)
mlbuild remote add prod \
  --backend s3 \
  --bucket your-bucket \
  --region us-east-1

# Push/pull/sync builds
mlbuild push <build-id>
mlbuild pull <build-id>
mlbuild sync

Supported backends: AWS S3, Cloudflare R2 (recommended — free 10 GB), Backblaze B2, any S3-compatible storage.

Task-Aware Benchmarking

MLBuild automatically detects what kind of model you're benchmarking — vision, NLP, or audio — and generates semantically correct synthetic inputs for it. No dummy zero arrays, no manual shape specification.

Automatic Task Detection

Detection runs through three tiers in order of confidence:

Tier	Method	Formats	Confidence	CLI Behavior
Graph	Op/layer analysis (`Conv`, `Attention`, `STFT`, etc.)	ONNX, TFLite, CoreML	High	Silent
Name	Tensor name heuristics (`input_ids`, `pixel_values`, `mel`)	All	Medium	Warning
Shape	Dtype + rank heuristics (rank-4 float = vision, rank-2 int = NLP)	All	Low	Warning + zeros fallback
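
To make the fallback order concrete, here is a simplified sketch of the name and shape tiers only. The graph tier is omitted because it needs a parsed model. The hint strings and shape rules come from the table above; the function name, the substring matching, and the `("unknown", "none")` return value are assumptions for illustration.

```python
# Name hints taken from the table above; substring matching is deliberately naive.
NAME_HINTS = (("input_ids", "nlp"), ("pixel_values", "vision"), ("mel", "audio"))


def detect_task(inputs):
    """inputs: list of (tensor_name, dtype, shape) tuples.

    Returns (task, confidence). The name tier runs before the shape tier.
    """
    # Tier 2: tensor name heuristics (medium confidence).
    for name, _dtype, _shape in inputs:
        lowered = name.lower()
        for hint, task in NAME_HINTS:
            if hint in lowered:
                return task, "medium"
    # Tier 3: dtype + rank heuristics (low confidence).
    for _name, dtype, shape in inputs:
        if dtype.startswith("float") and len(shape) == 4:
            return "vision", "low"
        if dtype.startswith("int") and len(shape) == 2:
            return "nlp", "low"
    return "unknown", "none"


print(detect_task([("input_ids", "int64", (1, 128))]))
print(detect_task([("x", "float32", (1, 3, 224, 224))]))
print(detect_task([("x", "float32", (1, 10))]))
```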

# High confidence — silent, correct inputs generated automatically
mlbuild benchmark <build-id>

# Medium confidence — warning printed, benchmark proceeds
# ⚠  Task auto-detected as 'nlp' (medium confidence)
#    If incorrect, re-run with: --task vision|nlp|audio
mlbuild benchmark <build-id>

# Low confidence or unknown — zeros used as safe fallback
# ⚠  Task could not be detected — running with zero tensors
mlbuild benchmark <build-id>

Override with `--task`

mlbuild benchmark <build-id> --task vision
mlbuild benchmark <build-id> --task nlp
mlbuild benchmark <build-id> --task audio

mlbuild profile  <build-id> --task nlp
mlbuild validate <build-id> --task vision --strict-output

Task-Specific Synthetic Inputs

Task	Inputs Generated
Vision	Float32 image tensor, NCHW layout, spatial dims resolved to 224×224
NLP	`int64` token IDs (random vocab up to 30k), `int64` attention mask (all ones), token type IDs
Audio	Float32 waveform `[-1, 1]` or log-mel spectrogram — role inferred from tensor name/shape
Unknown	Zero tensors — safe fallback that never blocks CI

NLP Multi-Sequence Benchmarking

NLP models are benchmarked across a sequence length ladder by default:

# Default ladder: [16, 64, 128, 256]
mlbuild benchmark <build-id> --task nlp

# seq_len=16   p50=1.2ms  p95=1.4ms
# seq_len=64   p50=2.1ms  p95=2.4ms
# seq_len=128  p50=3.8ms  p95=4.2ms
# seq_len=256  p50=7.1ms  p95=8.0ms

# Clip to model's actual max sequence length
mlbuild benchmark <build-id> --task nlp --seq-len 128

Strict Output Validation

# Soft mode (default) — warns but proceeds
mlbuild benchmark <build-id> --task nlp

# Strict mode — exits non-zero on output anomaly
mlbuild benchmark <build-id> --task nlp --strict-output
mlbuild validate  <build-id> --task vision --strict-output

# Global strict mode — applies to all commands
mlbuild --strict-output benchmark <build-id> --task nlp

Optimization Workflow

A complete optimization workflow from ONNX to deployment-ready model:

# 1. Build FP32 baseline
mlbuild build --model mobilenet.onnx --target apple_m1 --name mobilenet

# 2. Sweep all variants automatically
mlbuild explore mobilenet.onnx --target apple_m1 --check-accuracy

# 3. Prune the FP32 baseline, then quantize the pruned build
mlbuild optimize <fp32_id> --pass prune --sparsity 0.5
mlbuild optimize <pruned_id> --pass quantize --method int8

# 4. Validate final model against SLAs
mlbuild validate <final_id> \
  --max-latency 5 \
  --max-size 6 \
  --dataset ./imagenet-mini/

# 5. View full lineage
mlbuild log --source mobilenet.onnx --tree

# 6. Tag for production
mlbuild tag create <final_id> production-v2

CI/CD Regression Gate

# Full CI orchestration (recommended)
mlbuild ci --model mobilenet.onnx --baseline main-mobilenet
echo "Exit: $?"   # 0 = pass, 1 = fail, 2 = error

# Low-level build-to-build comparison
mlbuild ci-check $BASELINE_ID $CANDIDATE_ID
echo "Exit: $?"   # 0 = pass, 1 = regression, 2 = error

# JSON output for dashboards and PR bots
mlbuild ci --build $BUILD_ID --baseline main-mobilenet --json
# {
#   "result": "pass",
#   "baseline": { "tag": "main-mobilenet", "build_id": "3f36810e...", "latency_ms": 2.49 },
#   "candidate": { "build_id": "b8aa1ef6...", "variant": "fp16", "latency_ms": 0.74 },
#   "delta": { "latency_pct": -70.27, "size_pct": -49.64 },
#   "violations": []
# }

Architecture

Training Phase
├── Experiment Tracking:   MLflow / W&B / Neptune
└── Data Versioning:       DVC

              ↓

Production Phase
├── Model Building:         mlbuild build
├── Model Importing:        mlbuild import          ← pre-built ONNX / TFLite / CoreML
├── Task Detection:         MLBuild (automatic)     ← vision / nlp / audio
├── Optimization Sweep:     mlbuild explore         ← fp16 + int8
├── Accuracy Validation:    mlbuild accuracy        ← output divergence
├── Performance Validation: mlbuild ci-check        ← regression gate
├── Quantization Analysis:  mlbuild compare-quantization
├── Reporting:              mlbuild report
└── Deployment:             GitHub Actions / K8s

How It Works

1. Deterministic Builds

# Content-addressed storage (Git-style)
build_id = sha256(source_hash + config_hash + env_fingerprint)
# Same inputs = Same output (byte-for-byte)
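
One way to realize a content-addressed ID like this is sketched below. The pseudocode above only names the three components, so the serialization here is an assumption: configs and environment data are hashed as canonical JSON, and the function names are hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    # Canonical JSON (sorted keys, fixed separators) so key order never matters.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compute_build_id(source_bytes, config, env):
    source_hash = hashlib.sha256(source_bytes).hexdigest()
    config_hash = fingerprint(config)
    env_fingerprint = fingerprint(env)
    combined = (source_hash + config_hash + env_fingerprint).encode("ascii")
    return hashlib.sha256(combined).hexdigest()


model_bytes = b"fake-onnx-bytes"  # stand-in for the model file contents
env = {"python": "3.11", "converter": "example-1.0"}  # illustrative values only
id_a = compute_build_id(model_bytes, {"target": "apple_m1", "quantize": "fp16"}, env)
id_b = compute_build_id(model_bytes, {"quantize": "fp16", "target": "apple_m1"}, env)
id_c = compute_build_id(model_bytes, {"target": "apple_m1", "quantize": "int8"}, env)
print(id_a[:8], id_b[:8], id_c[:8])
```

Reordering the config keys leaves the ID unchanged, while changing the quantization method produces a different ID. That is the property deduplication relies on.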

2. Build Lineage Tracking

Every variant stores its full ancestry:

build.parent_build_id      # direct parent
build.root_build_id        # original source in the chain
build.optimization_method  # "fp16", "int8", "int8_static", "prune_0.50"

Identical optimization chains always produce the same build ID — deduplication is automatic.

3. Automated Benchmarking

# Runs N iterations with warmup
# Calculates p50, p95, p99, mean, std
# Measures memory RSS delta, throughput
# Outlier trimming (top/bottom 5%)
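
The statistics listed above can be sketched as follows. This is an illustration under stated assumptions, not MLBuild's benchmark code: trimming is applied before all statistics, percentiles use linear interpolation, and the standard deviation is the population form. The function names are hypothetical.

```python
import statistics


def percentile(sorted_samples, pct):
    """Linearly interpolated percentile of an already sorted list."""
    rank = (len(sorted_samples) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_samples) - 1)
    return sorted_samples[lo] + (sorted_samples[hi] - sorted_samples[lo]) * (rank - lo)


def summarize(latencies_ms, trim_fraction=0.05):
    ordered = sorted(latencies_ms)
    k = int(len(ordered) * trim_fraction)
    # Drop the fastest and slowest k samples (top/bottom 5% by default).
    kept = ordered[k:len(ordered) - k] if k else ordered
    return {
        "p50": percentile(kept, 50),
        "p95": percentile(kept, 95),
        "p99": percentile(kept, 99),
        "mean": statistics.fmean(kept),
        "std": statistics.pstdev(kept),
        "runs_kept": len(kept),
    }


# Synthetic latencies of 1..100 ms stand in for timed inference runs.
stats = summarize([float(i) for i in range(1, 101)])
print(stats)
```

In a real harness the samples would come from timing each inference call (for example with `time.perf_counter()`) after the warmup iterations have been discarded.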

4. Task-Aware Input Generation

# Three-tier detection: graph ops → tensor names → shapes
# Task-specific synthetic inputs (never zeros for known tasks)
# NLP: multi-seq-len ladder [16, 64, 128, 256]
# Post-inference output validation with configurable strictness

5. Output Divergence Checking

# Cosine similarity — output direction preservation
# MAE / max absolute error — per-element differences
# Top-1 agreement — classifier label consistency
# Streaming accumulators — memory-efficient over large batches
# Results persisted to accuracy_checks registry table

6. Dual Regression Detection

# Independent thresholds for latency and size
latency_regression = latency_change_pct > latency_threshold
size_regression    = size_change_pct    > size_threshold
regression_detected = latency_regression or size_regression

7. Explore Verdict Scoring

score = 0.6 * (baseline_latency / variant_latency) \
      + 0.4 * (baseline_size / variant_size)
# score > 1.0 → candidate for recommended/aggressive
# score ≤ 1.0 → skip (no net improvement under the 0.6/0.4 weighting)

Features

Build and Convert

ONNX → CoreML conversion (Apple Silicon, A-series)
ONNX → TFLite conversion (Android arm64)
Quantization: FP32 / FP16 / INT8
Deterministic builds (content-addressed)
ONNX graph storage for re-conversion

Import Pre-built Models

Import existing .onnx, .tflite, .mlmodel, .mlpackage files directly
ONNX import runs via ONNX Runtime — onnxruntime_cpu, onnxruntime_gpu, onnxruntime_ane targets
Format validation via protobuf check (ONNX), magic bytes (TFLite), structure checks (CoreML)
Tier 1 task detection for all import formats — ONNX via graph ops, TFLite via FlatBuffer parsing, CoreML via coremltools spec
Format/target compatibility enforcement
Imported builds tracked with [imported] badge in mlbuild log
Full MLBuild toolchain available immediately after import

Optimization

FP16 quantization — recompilation from ONNX graph
Dynamic range INT8 — weight-only, no calibration data needed
Static INT8 — weights + activations quantized using representative calibration data; gracefully falls back to dynamic range on coremltools 9.0
Magnitude pruning — global threshold-based, ONNX path works for both CoreML and TFLite, CoreML post-hoc path for imported models
Method chaining — prune → quantize, any depth
Distinct build IDs per optimization level (int8_static ≠ int8, prune_0.50 ≠ prune_0.75)
Deduplication — identical optimization chains reuse existing builds

Optimization Sweep

mlbuild explore — single command sweeps fp16 + int8 across all backends
Score-based verdict assignment (recommended / aggressive / skip / baseline)
Accuracy check per variant with --check-accuracy — failed variants get skip verdict
Calibration data support with --calibration-data for static INT8 in sweep
Fast mode (--fast) — fp16 only, 20 benchmark runs

Accuracy / Output Divergence

Cosine similarity, MAE, max absolute error, top-1 agreement
Dtype-aware random input generation
precomputed_batch — inputs generated once, reused across all variants in explore
Results persisted to accuracy_checks registry table
Standalone mlbuild accuracy command
Integrated into mlbuild compare --check-accuracy and mlbuild explore --check-accuracy

Task-Aware Benchmarking

Three-tier automatic task detection (graph ops → tensor names → shapes)
Task-specific synthetic inputs: real image tensors, token IDs + attention masks, waveforms/spectrograms
NLP multi-sequence-length benchmarking ladder [16, 64, 128, 256]
Configurable --task override for explicit control
Post-inference output validation with soft/strict modes (--strict-output)

Performance Validation

Automated p50/p95/p99 benchmarking
SLA enforcement (--max-latency, --max-p95, --max-memory, --max-size)
Accuracy validation via --dataset (calibration data); composes with performance checks
Baseline accuracy comparison with --baseline-id (defaults to root build)
Root builds skip accuracy check gracefully rather than erroring

Deep Profiling (`--deep`)

TFLite: Per-op timing (real hardware), tensor memory flow, COMPUTE/MEMORY bottleneck classification, cold start decomposition, per-layer quantization sensitivity (fp32 vs int8), op fusion detection
CoreML: Cold start decomposition (all formats); per-layer timing, memory flow, bottleneck classification, and fusion detection (NeuralNetwork format only)

Build History and Lineage

mlbuild log --source — filter builds by source model filename
mlbuild log --tree — recursive parent-child DAG — causality preserved across optimization chains
Human-readable method labels in tree: prune(0.50), int8(static)
Filter by name, format, task, target, date range, roots-only
JSON and CSV export
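
To show how a parent-child tree can be rebuilt from the stored ancestry fields, the sketch below renders an indented tree from flat records. The record layout, the build IDs, and the function `render_tree` are assumptions for illustration; the real output of mlbuild log --tree may look different.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (build_id, parent_build_id, optimization_method); hypothetical records
BuildRecord = Tuple[str, Optional[str], Optional[str]]


def render_tree(builds: List[BuildRecord]) -> List[str]:
    """Group builds by parent, then walk depth-first from the roots."""
    children: Dict[Optional[str], List[Tuple[str, Optional[str]]]] = defaultdict(list)
    for build_id, parent_id, method in builds:
        children[parent_id].append((build_id, method))

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for build_id, method in children.get(parent_id, []):
            label = method if method else "root"
            lines.append("  " * depth + f"{build_id} [{label}]")
            walk(build_id, depth + 1)

    walk(None, 0)  # roots are the builds with no parent
    return lines


tree_lines = render_tree([
    ("a1", None, None),
    ("b2", "a1", "prune_0.50"),
    ("c3", "b2", "int8_static"),
    ("d4", "a1", "fp16"),
])
```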

Command History

mlbuild history — permanent audit log of every CLI command ever run
Searchable by command type, time window, build ID
Filterable: build, benchmark, validate, compare, profile, failed
Delete individual entries or clear all — never affects build or benchmark data
Machine identity captured on every row — ready for cross-machine team view when cloud login lands

Performance Budget

mlbuild budget set/show/clear/validate — persistent constraint management
Stored in .mlbuild/budget.toml — commit to git for team-wide enforcement
Merge logic: explicit CLI flag > budget > no constraint
Constraint source shown in violation output (budget vs explicit flag)
All four constraints: max_latency_ms, max_p95_ms, max_memory_mb, max_size_mb
Applied automatically by mlbuild validate and mlbuild ci
budget validate <build_id> — dry run, evaluates size immediately, flags latency as pending
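
The merge order described above (explicit CLI flag, then budget, then no constraint) can be sketched as follows. The function names and the dictionary layout are assumptions for illustration; only the four constraint names come from the list above.

```python
from typing import Dict, List, Optional, Tuple

CONSTRAINTS = ("max_latency_ms", "max_p95_ms", "max_memory_mb", "max_size_mb")


def merge_constraints(
    cli_flags: Dict[str, Optional[float]],
    budget: Dict[str, Optional[float]],
) -> Dict[str, Tuple[float, str]]:
    """Return {constraint: (limit, source)}; an explicit flag beats the budget."""
    merged: Dict[str, Tuple[float, str]] = {}
    for name in CONSTRAINTS:
        if cli_flags.get(name) is not None:
            merged[name] = (cli_flags[name], "explicit flag")
        elif budget.get(name) is not None:
            merged[name] = (budget[name], "budget")
        # otherwise: no constraint for this metric
    return merged


def find_violations(
    measured: Dict[str, float],
    merged: Dict[str, Tuple[float, str]],
) -> List[str]:
    """Compare measured values (keyed without the 'max_' prefix) to the limits."""
    violations = []
    for name, (limit, source) in merged.items():
        metric = name[len("max_"):]
        value = measured.get(metric)
        if value is not None and value > limit:
            violations.append(f"{metric}: {value} > {limit} (from {source})")
    return violations


merged = merge_constraints(
    cli_flags={"max_latency_ms": 5.0},
    budget={"max_latency_ms": 10.0, "max_size_mb": 8.0},
)
violations = find_violations({"latency_ms": 6.2, "size_mb": 7.5}, merged)
```

Carrying the source alongside each limit is what lets the violation message say whether the budget file or a command-line flag set the constraint.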

Baseline Management

mlbuild baseline set/show/unset/history — clean UX wrapper around mlbuild tag
Uses reserved tag mlbuild-baseline — integrates with mlbuild ci automatically
Prompts before overwriting existing baseline
baseline history — shows all baseline-style tags: mlbuild-baseline, main-*, production-*

Workspace Status

mlbuild status — instant workspace health snapshot
Shows build/benchmark counts, last build, last benchmark, last validate result
Shows current baseline and active budget constraints
JSON output via --json for scripting

Performance Reports

Self-contained HTML (no external dependencies)
Benchmark history table
Related builds comparison
Deployment recommendations
Optional PDF export (requires weasyprint)

Remote Storage

S3-compatible backends (AWS, R2, B2)
Git-style push/pull/sync
Integrity verification (SHA-256)
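
As a general illustration of SHA-256 integrity verification after a pull, the sketch below hashes a file in chunks and compares the digest with an expected value. The helper names are hypothetical, and how MLBuild stores the expected digest is not covered here.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large artifacts are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """True when the file's digest matches the expected hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())


# Demonstration with a throwaway file standing in for a pulled artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.bin"
    artifact.write_bytes(b"example artifact bytes")
    expected = hashlib.sha256(b"example artifact bytes").hexdigest()
    intact = verify_artifact(artifact, expected)

    artifact.write_bytes(b"corrupted in transit")
    corrupted = verify_artifact(artifact, expected)
```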

CI/CD Integration

mlbuild ci — full CI orchestration (explore + compare + threshold enforcement + JSON report)
Tag-based baseline resolution — mlbuild tag create <id> main-mobilenet
Baseline immutability — report stores both tag name and build ID for reproducibility
Baseline benchmark guard — auto-benchmarks baseline if no cached latency
Relative regression thresholds (--latency-regression, --size-regression)
Absolute budget constraints (--latency-budget, --size-budget) independent of baseline
Accuracy gate via --dataset — cosine similarity + top-1 agreement
--fail-on-missing-baseline — strict mode for production pipelines
Structured JSON report at .mlbuild/ci_report.json — readable by GitHub bots, dashboards, Slack
mlbuild ci-check — low-level build-to-build regression gate
Exit codes: 0 (pass/skip) / 1 (regression/fail) / 2 (error)
GitHub Actions workflow with artifact upload and PR comment posting (.github/workflows/mlbuild.yml)
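
A wrapper script can branch on the exit codes listed above. The sketch below is one way to do that from Python. The helper names are hypothetical, and the demonstration runs a stand-in Python process rather than MLBuild itself so that it works without the tool installed.

```python
import subprocess
import sys
from typing import List

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "pass or skip",
    1: "regression or failure",
    2: "error",
}


def interpret_exit_code(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(argv: List[str]) -> str:
    """Run a command without raising on non-zero exit and describe the result."""
    completed = subprocess.run(argv, check=False)
    return interpret_exit_code(completed.returncode)


# Stand-in process that exits with code 1, as a failing gate would.
demo = run_gate([sys.executable, "-c", "raise SystemExit(1)"])
```

In a pipeline, the argument list would be the real command, for example `["mlbuild", "ci", "--model", "model.onnx", "--baseline", "main-mobilenet"]`.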

Project Structure

mlbuild/
├── src/mlbuild/
│   ├── cli/
│   │   ├── commands/
│   │   │   ├── accuracy.py               # mlbuild accuracy
│   │   │   ├── baseline.py               # mlbuild baseline
│   │   │   ├── benchmark.py              # mlbuild benchmark
│   │   │   ├── budget.py                 # mlbuild budget
│   │   │   ├── build.py                  # mlbuild build
│   │   │   ├── ci.py                     # mlbuild ci + ci-check
│   │   │   ├── compare.py                # mlbuild compare
│   │   │   ├── compare_compute_units.py  # mlbuild compare-compute-units
│   │   │   ├── compare_quantization.py   # mlbuild compare-quantization
│   │   │   ├── diff.py                   # mlbuild diff
│   │   │   ├── doctor.py                 # mlbuild doctor
│   │   │   ├── experiment.py             # mlbuild experiment
│   │   │   ├── explore.py                # mlbuild explore
│   │   │   ├── history.py                # mlbuild history
│   │   │   ├── import_cmd.py             # mlbuild import
│   │   │   ├── log.py                    # mlbuild log
│   │   │   ├── optimize.py               # mlbuild optimize
│   │   │   ├── profile.py                # mlbuild profile
│   │   │   ├── pull.py                   # mlbuild pull
│   │   │   ├── push.py                   # mlbuild push
│   │   │   ├── remote.py                 # mlbuild remote
│   │   │   ├── report.py                 # mlbuild report
│   │   │   ├── run.py                    # mlbuild run
│   │   │   ├── status.py                 # mlbuild status
│   │   │   ├── sync.py                   # mlbuild sync
│   │   │   ├── tag.py                    # mlbuild tag
│   │   │   └── validate.py               # mlbuild validate
│   │   └── main.py                       # CLI entry point
│   ├── backends/
│   │   ├── base.py                       # Backend base class
│   │   ├── registry.py                   # Backend auto-discovery
│   │   ├── coreml/                       # CoreML exporter + deep profiler
│   │   ├── tflite/                       # TFLite backend + deep profiler
│   │   └── onnxruntime/                  # ONNX Runtime backend
│   ├── benchmark/
│   │   ├── runner.py                     # Benchmark runner + stats
│   │   └── device_runner.py              # Device benchmark runner
│   ├── core/
│   │   ├── budget.py                     # Budget load/save/merge/validate
│   │   ├── accuracy/
│   │   │   ├── calibration.py            # CalibrationLoader (images/npy/npz)
│   │   │   ├── checker.py                # run_accuracy_check()
│   │   │   ├── config.py                 # AccuracyConfig, AccuracyResult
│   │   │   ├── inputs.py                 # InputSpec, generate_batch
│   │   │   └── metrics.py                # cosine_similarity, MAE, top-1
│   │   ├── ci/
│   │   │   ├── reporter.py               # CIReport + text/JSON/markdown formatters
│   │   │   ├── runner.py                 # CIRunner orchestration
│   │   │   └── thresholds.py             # ThresholdConfig + violation evaluation
│   │   ├── environment.py                # Environment fingerprinting
│   │   ├── errors.py                     # Error types
│   │   ├── format_detection.py           # Format detection + target validation
│   │   ├── hash.py                       # Deterministic artifact hashing
│   │   ├── ir.py                         # ModelIR — format-agnostic model graph
│   │   ├── machine.py                    # Machine identity (UUID + hostname)
│   │   ├── task_detection.py             # Three-tier task detection
│   │   ├── task_inputs.py                # Task-aware synthetic input generation
│   │   ├── task_validation.py            # Post-inference output validation
│   │   ├── tasks.py                      # Task types + arbitration + output schemas
│   │   └── types.py                      # Build, Benchmark, VariantResult dataclasses
│   ├── experiments/                      # Experiment + run tracking
│   ├── explore/
│   │   └── explorer.py                   # explore(), assign_verdicts(), accuracy integration
│   ├── loaders/
│   │   ├── loader.py                     # Model loading entrypoint
│   │   └── onnx_loader.py                # ONNX loader + ModelIR builder
│   ├── optimize/
│   │   ├── optimizer.py                  # optimize() + prune() entrypoints
│   │   ├── passes/
│   │   │   ├── pruning.py                # PruningPass (ONNX + CoreML post-hoc)
│   │   │   └── quantization.py           # QuantizationPass (fp16/int8/int8_static)
│   │   └── backends/
│   │       ├── coreml_backend.py         # compile_from_graph, quantize_weights,
│   │       │                             # quantize_weights_static, prune_weights
│   │       └── tflite_backend.py         # quantize_from_graph
│   ├── profiling/
│   │   ├── cold_start.py                 # Cold start decomposition
│   │   ├── layer_profiler.py             # Per-layer timing
│   │   ├── memory_profiler.py            # Memory tracking
│   │   └── warmup_analyzer.py            # Warmup analysis
│   ├── registry/
│   │   ├── local.py                      # SQLite registry (WAL mode)
│   │   └── schema.py                     # Schema + migrations (v9)
│   ├── storage/
│   │   ├── backend.py                    # Storage backend interface
│   │   ├── config.py                     # Remote config
│   │   ├── local.py                      # Local storage
│   │   └── s3.py                         # S3-compatible storage
│   ├── validation/
│   │   └── accuracy_validator.py         # AccuracyValidator for mlbuild validate
│   └── visualization/
│       └── charts.py                     # Chart generation
├── tests/
├── pyproject.toml
└── README.md

vs. Existing Tools

Feature	Custom Scripts	Profilers	MLBuild
Hardware inference benchmarking	Manual	Partial	Automated
Performance regression detection	Custom	Manual	Built-in
CI performance gate	Custom	—	Built-in
Cross-device testing	Manual	—	Yes
Performance history & tracking	—	—	Built-in
CI-automated per-layer profiling	Custom	Manual	Automated
Quantization performance benchmarking	—	Manual	Automated
Auto-generated task inputs	—	—	Auto-detected
Performance reports	—	—	HTML/PDF

Use MLflow/W&B for training experiments. Use MLBuild for on-device inference performance.

Development

git clone https://github.com/AbdoulayeSeydi/mlbuild.git
cd mlbuild
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

pytest tests/

Contributing

See CONTRIBUTING.md for development setup, coding standards, and PR process.

License

MIT License — see LICENSE for details.

Roadmap

Phase 1 — Device-Connected Benchmarking (next)

Android ADB bridge — benchmark on connected Android devices without Android Studio
Xcode Instruments integration — real iPhone hardware profiling

Phase 2 — More Backends

TensorRT — NVIDIA GPU inference
Qualcomm QNN — Snapdragon NPU

Phase 3 — Cloud Benchmarking

Remote benchmark execution on cloud hardware

Built by Abdoulaye Seydi

Release history

Version	Date
0.3.0 (this version)	Mar 19, 2026
0.2.0	Mar 25, 2026
0.1.9	Mar 22, 2026
0.1.6	Feb 25, 2026
0.1.5	Feb 25, 2026
0.1.4	Feb 25, 2026
0.1.3	Feb 25, 2026
0.1.2	Feb 24, 2026
0.1.1	Feb 24, 2026
0.1.0	Feb 24, 2026

File details

Details for the file mlbuild-0.3.0.tar.gz.

File metadata

Download URL: mlbuild-0.3.0.tar.gz
Upload date: Mar 19, 2026
Size: 302.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for mlbuild-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`674e939d75f63173f597a49ea7d0d9e2c12099cd0da4ca49fc9b98c16ca5c960`
MD5	`924f089a05c1fde4104d205894e8c6e8`
BLAKE2b-256	`fbf4c36ed636bd3f24ec467eb8f1b9f7333d326467d477c939813c7280d241d0`

File details

Details for the file mlbuild-0.3.0-py3-none-any.whl.

File metadata

Download URL: mlbuild-0.3.0-py3-none-any.whl
Upload date: Mar 19, 2026
Size: 322.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for mlbuild-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e0538a3846da5d8c87f3a3814ba15feb870c95d11081fa462a5a034707326e3`
MD5	`74a0ba1cb31c105a371500a234afff4c`
BLAKE2b-256	`a04f4d60a4b46a745a6cf2e2800d7cb662baa388f3530f51b6cdbe393c50e79e`
