
ML Systems Infrastructure Modeling Engine — first-principles analytical framework for ML workloads.


[!NOTE] 📌 Early release (2026)

MLSys·im shipped with the 2026 MLSysBook refresh. The modeling platform, APIs, and lab integrations are actively iterated as we harden the simulator and teaching workflows.

Feedback: GitHub issues or pull requests.


🚀 MLSys·im: The Modeling Platform

The physics-grounded analytical simulator powering the Machine Learning Systems ecosystem.
Provides a unified "Single Source of Truth" (SSoT) for modeling systems from sub-watt microcontrollers to exaflop-scale global fleets.

🏗 The 5-Layer Analytical Stack

mlsysim implements a "Progressive Lowering" architecture, separating high-level workloads from the physical infrastructure that executes them.

| Layer | Domain | Key Components |
| --- | --- | --- |
| Layer A | Workload Representation (mlsysim.models) | FLOPs, parameters, and intensity. e.g., Llama3_70B, ResNet50 |
| Layer B | Hardware Registry (mlsysim.hardware) | Concrete specs for real-world silicon. e.g., H100, TPUv5p, Jetson |
| Layer C | Infrastructure (mlsysim.infra) | Grid profiles and datacenter sustainability. e.g., PUE, Carbon Intensity, WUE |
| Layer D | Systems & Topology (mlsysim.systems) | Fleet configurations and network fabrics. e.g., Doorbell, AutoDrive Scenarios |
| Layer E | Execution & Resolvers (mlsysim.core.solver) | The 3-tier math engine: Models, Solvers, and Optimizers (design space search). |
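
To make the Layer C quantities concrete: facility energy is IT energy multiplied by PUE, and operational emissions are facility energy multiplied by grid carbon intensity. The sketch below is plain Python, not the mlsysim.infra API. The function name is hypothetical and every input value is a placeholder assumption, not a figure from the registry.

```python
def operational_carbon_kg(it_power_kw, hours, pue, grid_gco2_per_kwh):
    """Return (facility_energy_kwh, emissions_kg) for a constant IT load.

    pue: Power Usage Effectiveness (facility energy / IT energy, >= 1.0)
    grid_gco2_per_kwh: grid carbon intensity in grams of CO2 per kWh
    """
    it_energy_kwh = it_power_kw * hours
    facility_energy_kwh = it_energy_kwh * pue
    emissions_kg = facility_energy_kwh * grid_gco2_per_kwh / 1000.0
    return facility_energy_kwh, emissions_kg


# Placeholder inputs: 100 kW IT load for 14 days, PUE 1.2, 30 gCO2/kWh.
energy_kwh, co2_kg = operational_carbon_kg(100.0, 14 * 24, 1.2, 30.0)
print(f"facility energy: {energy_kwh:.0f} kWh, emissions: {co2_kg:.1f} kg CO2")
```

Because both PUE and carbon intensity enter as plain multipliers, the same workload can differ widely in emissions depending on where it runs.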

🚀 Quick Usage: The Agent-Ready CLI

mlsysim is a first-principles analytical calculator for ML systems. It provides a terminal UI for humans and a strict JSON API for CI/CD pipelines and AI agents.

Accuracy note: mlsysim predictions are typically within 2–5× of measured performance for well-characterized workloads. For production capacity planning, always validate with benchmarks. This tool formalizes the back-of-envelope math that senior engineers do intuitively — it is not a substitute for profiling or load testing.
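
The back-of-envelope math mentioned above can be illustrated with a simple roofline bound: a step takes at least as long as the slower of its compute time and its memory-traffic time. This is a minimal sketch, not mlsysim's internal model. The function names are hypothetical, and the hardware and model figures are approximate public numbers used as assumptions.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, mem_bw, efficiency=1.0):
    """Lower-bound step time: the slower of compute and memory traffic."""
    compute_s = flops / (peak_flops * efficiency)
    memory_s = bytes_moved / mem_bw
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


# Assumed, approximate figures (not taken from the mlsysim registry):
PARAMS = 8e9            # roughly an 8B-parameter model
BYTES_PER_PARAM = 2     # fp16 weights
PEAK_FLOPS = 989e12     # approx. dense FP16 tensor peak of one H100 SXM
MEM_BW = 3.35e12        # approx. HBM bandwidth of one H100 SXM, bytes/s


def decode_step(batch_size):
    """One decode step: ~2 FLOPs per parameter per token, weights read once.

    Ignores KV-cache and activation traffic, so it is an optimistic bound.
    """
    flops = 2 * PARAMS * batch_size
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return roofline_time_s(flops, bytes_moved, PEAK_FLOPS, MEM_BW)


for batch in (1, 32, 1024):
    seconds, bound = decode_step(batch)
    print(f"batch={batch:5d}  {seconds * 1e3:7.3f} ms  ({bound}-bound)")
```

Under these assumptions, small-batch decode is limited by weight loading rather than arithmetic, which is why the estimate flips from memory-bound to compute-bound only at large batch sizes.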

1. Explore the Registry (The Zoo)

Discover built-in hardware, models, and infrastructure without reading source code:

mlsysim zoo hardware
mlsysim zoo models

2. Quick Evaluation (CLI Flags)

Evaluate the physics of a workload on a specific hardware node instantly:

mlsysim eval Llama3_8B H100 --batch-size 32

3. Deep Simulation (Infrastructure as Code)

Define your entire cluster and SLA constraints in a declarative YAML file (here, example_cluster.yaml):

# example_cluster.yaml
version: "1.0"
workload:
  name: "Llama3_70B"
  batch_size: 4096
hardware:
  name: "H100"
  nodes: 64
ops:
  region: "Quebec"
  duration_days: 14.0
constraints:
  assert:
    - metric: "performance.latency"
      max: 50.0

Then compile and evaluate the 3-lens scorecard (Feasibility, Performance, Macro):

mlsysim eval example_cluster.yaml
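
One way to read the constraints.assert entries is as dotted paths into the evaluation result, each compared against a bound. The sketch below shows that idea in plain Python. It is an assumption about the semantics, not mlsysim's implementation. The result dictionary, its values, and the helper names are hypothetical.

```python
from functools import reduce


def lookup(result, dotted_path):
    """Resolve 'a.b.c' against nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), result)


def violations(result, assertions):
    """Return (metric, observed, limit) for every rule whose 'max' is exceeded."""
    failures = []
    for rule in assertions:
        observed = lookup(result, rule["metric"])
        if "max" in rule and observed > rule["max"]:
            failures.append((rule["metric"], observed, rule["max"]))
    return failures


# Mirrors the assert entry in the YAML example; the result is made up.
assertions = [{"metric": "performance.latency", "max": 50.0}]
hypothetical_result = {"performance": {"latency": 62.5}}

print(violations(hypothetical_result, assertions))
```

Keeping each rule as a metric path plus a bound means new constraints need no code changes, only new entries in the file.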

4. CI/CD & Agentic Automation

Every command supports strict, schema-validated JSON output. If an assert constraint is violated, the CLI returns a semantic Exit Code 3.

# Export the JSON Schema for your IDE or AI Agent
mlsysim schema > schema.json

# Run an evaluation in a CI pipeline
tco=$(mlsysim --output json eval example_cluster.yaml | jq .macro.metrics.tco_usd)
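
A pipeline step can branch on the exit code rather than parsing text. The sketch below wraps a command with Python's subprocess module and maps exit code 3 to a constraint violation, as described above. Treating 0 as success and every other code as a generic error is an assumption, since only code 3 is documented here. The demo uses a stand-in command so it runs without mlsysim installed.

```python
import subprocess
import sys

CONSTRAINT_VIOLATION = 3  # documented above: an assert constraint was violated


def classify(returncode):
    """Map a process exit code to a pipeline outcome (0 and 'other' are assumed)."""
    if returncode == 0:
        return "pass"
    if returncode == CONSTRAINT_VIOLATION:
        return "constraint-violated"
    return "error"


def run_gate(cmd):
    """Run cmd, capture its stdout, and return (outcome, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return classify(proc.returncode), proc.stdout


# Stand-in for: ["mlsysim", "--output", "json", "eval", "example_cluster.yaml"]
stand_in = [sys.executable, "-c", "import sys; print('{}'); sys.exit(3)"]
outcome, stdout = run_gate(stand_in)
print(outcome)
```

Separating a violated constraint from other failures lets a pipeline report an SLA miss differently from a crash or a malformed config.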

5. Design Space Search (Optimizers)

Use the Tier 3 Engineering Engine to automatically find the optimal configuration:

mlsysim optimize parallelism example_cluster.yaml
mlsysim optimize placement example_cluster.yaml --carbon-tax 150


🛡 Stability & Integrity

Because this core powers a printed textbook, we enforce strict Invariant Verification. Every physical constant is traceable to a primary source (datasheet or paper), and dimensional integrity is enforced via pint.

⚠️ What This Tool Does Not Model

MLSysim is an analytical hardware calculator, not a production deployment simulator. The 22 walls model physical and economic constraints that bound ML system performance. Several critical production concerns are deliberately out of scope:

| Concern | Why it matters | Where to learn more |
| --- | --- | --- |
| Data drift / distribution shift | The #1 cause of production ML failures: model accuracy degrades silently as input distributions change | Sculley et al. (2015), "Hidden Technical Debt in Machine Learning Systems" |
| Model versioning & rollback | Production requires running multiple versions, A/B testing, and safe rollback | Huyen (2022), Designing Machine Learning Systems |
| Monitoring & observability | You cannot manage what you cannot measure: prediction distributions, latency percentiles, error rates | Google SRE Book (2016); Huyen (2022) |
| Feature store freshness | Stale features silently degrade real-time models (recommendations, fraud detection) | Uber Michelangelo (2017) |
| Software bugs & misconfigurations | Most outages are caused by software, not hardware | Barroso et al. (2018) |
| Human factors | Team velocity, on-call burden, and organizational alignment often dominate outcomes | Brooks (1975), The Mythical Man-Month |

Passing all 22 walls is necessary but not sufficient for a successful production deployment.

Students using this tool should understand that infrastructure physics (what mlsysim models) is one dimension of a multi-dimensional engineering challenge.

📖 How to Cite

If you use mlsysim in your research or teaching, please cite:

@software{mlsysim2026,
  author       = {Janapa Reddi, Vijay},
  title        = {{MLSysim}: A Composable Analytical Framework for Machine Learning Systems},
  year         = {2026},
  url          = {https://mlsysbook.ai/mlsysim},
  version      = {0.1.0},
  institution  = {Harvard University}
}

🛠 Installation

MLSys·im is designed to be highly modular. Install only what you need:

# Core physics engine only (fastest, smallest footprint)
pip install mlsysim

# Install with the beautiful Terminal UI & YAML support
pip install "mlsysim[cli]"

# Install with dependencies for interactive labs (Marimo, Plotly)
pip install "mlsysim[labs]"

🐍 Python API Usage

The framework is just as powerful inside a Python script or Jupyter Notebook. The SystemEvaluator provides a clean, unified entry point for full-stack analysis:

import mlsysim

# 1. Define the scenario
model = mlsysim.Models.Language.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100

# 2. Run the evaluation
evaluation = mlsysim.SystemEvaluator.evaluate(
    scenario_name="Llama-3 8B on H100",
    model_obj=model,
    hardware_obj=hardware,
    batch_size=32,
    precision="fp16",
    efficiency=0.45
)

# 3. View the beautifully formatted scorecard
print(evaluation.scorecard())

Efficiency Parameter Guide

The efficiency parameter (0.0–1.0) captures the gap between peak hardware performance and what your software stack actually achieves. Use these guidelines:

| Scenario | Efficiency | Rationale |
| --- | --- | --- |
| Training (Megatron-LM, large Transformer) | 0.40–0.55 | Well-optimized GEMM + FlashAttention |
| Training (PyTorch eager, small model) | 0.08–0.15 | Kernel launch overhead dominates |
| Inference decode, batch=1 | 0.01–0.05 | Memory-bound; compute nearly idle |
| Inference decode, batch=32+ | 0.15–0.35 | Batch amortizes weight loading |
| Inference prefill, long context | 0.30–0.50 | Compute-bound GEMM + attention |
| TinyML (TFLite Micro on ESP32) | 0.05–0.15 | Interpreter overhead, no tensor cores |
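
If you have a measured training throughput, you can back out the efficiency it implies instead of picking a value from the table. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token. The function name, the throughput figure, and the per-device peak are assumptions for illustration, not mlsysim outputs.

```python
def implied_efficiency(params, tokens_per_s, peak_flops_per_device, n_devices):
    """Fraction of aggregate peak compute actually achieved during training.

    Uses ~6 FLOPs per parameter per token (forward + backward pass).
    """
    achieved_flops_per_s = 6 * params * tokens_per_s
    return achieved_flops_per_s / (peak_flops_per_device * n_devices)


# Assumed inputs: 70B parameters, 512 accelerators at ~989 TFLOP/s dense FP16
# each, and a hypothetical measured throughput of 500,000 tokens/s.
eff = implied_efficiency(70e9, 500_000, 989e12, 512)
print(f"implied efficiency: {eff:.3f}")
```

With these placeholder numbers the result lands inside the 0.40–0.55 band listed for well-optimized large-Transformer training, which is the kind of sanity check the table is meant to support.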

Contributors

Thanks to these wonderful people for helping improve MLSys·im!

Legend: 🪲 Bug Hunter · ⚡ Code Warrior · 📚 Documentation Hero · 🎨 Design Artist · 🧠 Idea Generator · 🔎 Code Reviewer · 🧪 Test Engineer · 🛠️ Tool Builder

Vijay Janapa Reddi: 🧑‍💻 🎨 ✍️ 🧠 maintenance
Peter Koellner: 🪲 ✍️
Zeljko Hrcek: 🧑‍💻
Rocky: 🧑‍💻

Recognize a contributor: Comment on any issue or PR:

@all-contributors please add @username for code, doc, ideas, or bug
