thermal-mcp-server

A physics engine for liquid-cooled GPU systems, exposed as an AI-callable MCP server. Ask Claude to size a CDU for an NVL72 rack, optimize flow rates for an H100 cluster, or compare water versus glycol — and get first-principles answers backed by hand-validated thermal models.

Why This Exists

GPU power density has grown roughly 3× in two generations: A100 SXM at 400 W, H100 SXM at 700 W, B200 at 1,200 W. A single NVL72 rack dissipates 86.4 kW — more than most commercial HVAC systems handle. Liquid cooling is no longer optional; it is the critical path for AI compute density, and it is now the infrastructure constraint that determines whether a data center can host the next generation of hardware.

The tooling hasn't kept up. CDU sizing decisions get made with vendor charts, spreadsheets, and intuition. thermal-mcp-server is a first-principles thermal resistance model that an AI assistant can call directly — validated against published chip specs, honest about its assumptions, and designed to answer the questions that actually matter in a procurement conversation.

Demo

Ask Claude natural-language questions. The server handles the physics.

Claude answering a liquid cooling question using thermal-mcp-server

NVL72 CDU sizing — the procurement question

The question: What's the minimum CDU flow rate to keep 72 B200 GPUs below 75°C junction temperature at 1,200 W each, with 25°C supply water?

from thermal_mcp_server.physics import optimize_flow, analyze_rack
from thermal_mcp_server.schemas import OptimizeFlowRateInput, AnalyzeRackInput, Geometry

# B200-specific cold plate geometry (engineering estimate; NVIDIA does not publish)
# 60 × 0.7 mm × 1.5 mm channels, 100 mm length, 160 cm² contact area
B200_PLATE = Geometry(
    channel_count=60, channel_width_m=0.7e-3, channel_height_m=1.5e-3,
    channel_length_m=0.10, base_thickness_m=1.5e-3,
    contact_area_m2=0.016, copper_k_w_mk=385.0,
)

# Find minimum flow per cold plate
opt = optimize_flow(OptimizeFlowRateInput(
    heat_load_w=1200, max_junction_temp_c=75.0,
    inlet_temp_c=25.0, coolant="water",
    r_jc_k_per_w=0.02, r_tim_k_per_w=0.015,  # B200 estimates
    geometry=B200_PLATE,
))
min_flow_per_gpu, analysis = opt  # minimum flow (LPM) plus the full thermal analysis at that flow
print(f"Minimum flow: {min_flow_per_gpu:.1f} LPM/GPU → {min_flow_per_gpu * 72:.0f} LPM total rack flow")

# Full rack spec at that flow
rack = analyze_rack(AnalyzeRackInput(
    gpu_count=72, topology="parallel",
    heat_load_per_gpu_w=1200, total_flow_lpm=min_flow_per_gpu * 72,
    cdu_supply_temp_c=25.0, coolant="water",
    r_jc_k_per_w=0.02, r_tim_k_per_w=0.015,
    geometry=B200_PLATE,
))
Output:

Minimum flow: 9.3 LPM/GPU → 671 LPM total rack flow

Rack spec at minimum flow:
  Max junction temp:   75.0°C  (at 75°C design limit)
  Cold plate ΔP:       0.096 bar  (cold plates only; add 20–50% for manifold)
  CDU return temp:     26.9°C
  Pump power:          214 W
  Total heat load:     86.4 kW

What this tells a procurement team: Minimum CDU flow spec is 671 LPM. Cold plate ΔP headroom is tight at 0.096 bar — real system ΔP with manifold and headers will be higher. With a 3°C design margin (target 72°C), the spec rises to 777 LPM at 0.141 bar.
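
The margin figure is the same call with a lower target; nothing else changes. Re-running at 72°C reproduces the 777 LPM spec quoted above (roughly 10.8 LPM/GPU):

# Same sizing with a 3°C design margin: target 72°C instead of 75°C
opt_margin = optimize_flow(OptimizeFlowRateInput(
    heat_load_w=1200, max_junction_temp_c=72.0,
    inlet_temp_c=25.0, coolant="water",
    r_jc_k_per_w=0.02, r_tim_k_per_w=0.015,
    geometry=B200_PLATE,
))
flow_margin, _ = opt_margin
print(f"With 3°C margin: {flow_margin:.1f} LPM/GPU → {flow_margin * 72:.0f} LPM total rack flow")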

H100 SXM — validated baseline

from thermal_mcp_server.physics import analyze
from thermal_mcp_server.schemas import AnalyzeColdplateInput

result = analyze(AnalyzeColdplateInput(
    heat_load_w=700, flow_rate_lpm=8.0, inlet_temp_c=25.0, coolant="water"
))
# junction_temp_c: 70.9  —  12.1°C margin below 83°C throttle onset
# regime: transitional  (Re ≈ 3734)
# pressure_drop_pa: 16800  (0.17 bar per cold plate)

This case is validated by hand calculation: every intermediate value — Reynolds number, Nusselt number, convection coefficient, pressure drop — is independently verified in tests/test_physics_behavior.py.
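
The same hand check can be reproduced outside the test suite. The sketch below walks the chain Re → Nu → h with water properties at 25°C; the channel geometry here is an illustrative example, not the package's H100 default, so expect the numbers to differ from the Re ≈ 3734 above:

# Hand-check of the intermediate quantities the tests verify.
# Channel geometry below is an assumed example, NOT the package default.
RHO, MU, K_F, CP = 997.0, 8.9e-4, 0.607, 4180.0   # water at 25°C
n_ch, w, h = 60, 0.7e-3, 1.5e-3                   # channel count, width, height (m)

v = (8.0 / 60_000) / n_ch / (w * h)               # mean channel velocity at 8 LPM total
d_h = 2 * w * h / (w + h)                         # hydraulic diameter
re = RHO * v * d_h / MU                           # Reynolds number
pr = CP * MU / K_F                                # Prandtl number
nu = 0.023 * re**0.8 * pr**0.4 if re >= 4000 else 4.36  # ignores the transition blend
h_conv = nu * K_F / d_h                           # convection coefficient, W/m²·K
print(f"Re = {re:.0f}, Pr = {pr:.2f}, Nu = {nu:.1f}, h = {h_conv:.0f} W/m²K")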

How It Works

The physics engine models a cold plate as a 1D thermal resistance network:

T_junction = T_inlet + Q × (R_jc + R_tim + R_base + R_conv) + ΔT_coolant/2
  • R_jc / R_tim: Package resistances (chip manufacturer spec or estimate)
  • R_base: Copper base conduction (geometry + k = 385 W/m·K)
  • R_conv: Forced convection, Dittus-Boelter (turbulent) or Nu = 4.36 (laminar), linearly blended through transition (Re 2,300–4,000)
  • ΔP: Darcy-Weisbach with Blasius friction factor, same transition blend

The rack-level model stacks N single-GPU analyses in either a series topology (cumulative temperature rise along the loop) or a parallel topology (uniform inlet, flow split across plates).
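
A minimal standalone sketch of the network and correlations above, with water properties fixed at 25°C and the B200_PLATE numbers from the demo as default geometry. It illustrates the model structure only; it is not the package's internal code, so its output will differ slightly from the tool's:

# Illustrative 1D resistance network (not the package internals).
def junction_temp(q_w, flow_lpm, t_in_c, r_jc, r_tim,
                  n_ch=60, w=0.7e-3, hch=1.5e-3, length=0.10,
                  t_base=1.5e-3, area=0.016, k_cu=385.0):
    RHO, MU, K_F, CP = 997.0, 8.9e-4, 0.607, 4180.0  # water at 25°C
    flow = flow_lpm / 60_000                          # LPM → m³/s
    v = flow / n_ch / (w * hch)                       # mean channel velocity
    d_h = 2 * w * hch / (w + hch)                     # hydraulic diameter
    re = RHO * v * d_h / MU
    pr = CP * MU / K_F
    nu_turb = 0.023 * re**0.8 * pr**0.4               # Dittus-Boelter
    if re <= 2300:
        nu = 4.36                                     # laminar, constant-flux Nusselt
    elif re >= 4000:
        nu = nu_turb
    else:                                             # linear blend through transition
        frac = (re - 2300) / (4000 - 2300)
        nu = (1 - frac) * 4.36 + frac * nu_turb
    h_conv = nu * K_F / d_h
    wet_area = n_ch * 2 * (w + hch) * length          # wetted channel wall area
    r_base = t_base / (k_cu * area)                   # copper base conduction
    r_conv = 1.0 / (h_conv * wet_area)                # forced convection
    dt_coolant = q_w / (RHO * flow * CP)              # bulk coolant temperature rise
    return t_in_c + q_w * (r_jc + r_tim + r_base + r_conv) + dt_coolant / 2.0

print(f"Tj ≈ {junction_temp(1200, 9.3, 25.0, 0.02, 0.015):.1f}°C")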

flowchart LR
    A["Input\nchip power, flow,\ncoolant, geometry"] --> B["Physics Engine\nDittus-Boelter · Darcy-Weisbach\nR_total network"]
    B --> C["Output\nT_junction · ΔP\nthermal margin · pump power"]

Validation

Model outputs compared against published chip specs. All runs use water coolant and a 25°C inlet.

| Chip | TDP | Tj design ceiling | Model Tj | Margin | Notes |
|---|---|---|---|---|---|
| H100 SXM | 700 W | 83°C | 70.9°C at 8 LPM | 12.1°C | Default geometry; hand-calc validated |
| MI300X | 750 W | ~85°C (proxy) | 74.2°C at 8 LPM | ~10°C | AMD does not publish Tj_max |
| B200 NVL72 | 1,200 W | ~75°C (est.) | 75.0°C at 9.3 LPM/GPU | 0°C at limit | R_jc = 0.02 K/W est.; NVIDIA does not publish |
| Gaudi 3 OAM | 900 W (air) / 1,200 W (liquid) | ~85°C (proxy) | Requires B200-class geometry | n/a | Default H100 geometry undersized for 1,200 W |

On B200 and Gaudi 3 numbers: NVIDIA and Intel do not publish cold plate geometry or R_jc for these chips. The B200 analysis uses an engineering estimate for cold plate geometry and package resistance. Treat as indicative; real procurement sizing requires vendor data.

Chip sources: NVIDIA H100 Datasheet · NVIDIA GB200 NVL72 · SemiAnalysis B200 thermal estimates · AMD MI300X Data Sheet · Intel Gaudi 3 Product Brief

Known Limitations

These are documented explicitly because they bound what the model can and cannot tell you:

  • No manifold or header pressure losses — rack ΔP is cold-plate-only. Real system ΔP should add 20–50% for manifold losses; do not use cold-plate ΔP as the CDU pump spec without this adder (see the sketch after this list).
  • No heterogeneous racks — all GPUs assumed identical TDP, geometry, and thermal resistance. Mixed-SKU racks require per-GPU analysis.
  • Steady-state only — no transient thermal capacitance. Power-on ramps, burst workloads, and cooldown curves are not modeled.
  • Single-point fluid properties — water and glycol50 properties are fixed at 25°C nominal. No correction for temperature rise along the flow path.
  • No flow maldistribution — uniform flow assumed across all cold plates. Real parallel manifolds have ±10–30% flow variation depending on header design.
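
A quick back-of-envelope for the first and last items, using the NVL72 numbers from the demo (pure arithmetic, no package calls):

# Manifold adder: scale the cold-plate-only ΔP by the 20–50% range above.
coldplate_dp_bar = 0.096
print(f"CDU pump ΔP spec: {coldplate_dp_bar * 1.2:.3f}–{coldplate_dp_bar * 1.5:.3f} bar")

# Maldistribution: the worst-fed plate may see up to 30% less flow, so
# re-check its junction temperature at the reduced per-plate rate.
nominal_lpm, worst_split = 9.3, 0.30
print(f"Re-run analyze_coldplate at {nominal_lpm * (1 - worst_split):.1f} LPM for the starved plate")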

See docs/physics.md for the full physics documentation including equations and assumptions.

Tools

  • analyze_coldplate — Single-point thermal and hydraulic analysis. Returns junction temperature, full resistance breakdown, pressure drop, regime, and pump power.

  • compare_coolants — Side-by-side water vs. 50/50 glycol comparison at identical conditions. Quantifies the thermal and hydraulic penalty of glycol for freeze-protection applications (a DIY equivalent using two analyze calls is sketched after this list).

  • optimize_flow_rate — Binary search for minimum flow rate to meet a junction temperature target. Returns the optimal flow and full thermal analysis at that point.

  • analyze_rack — Rack-level model for N identical GPUs in series or parallel topology. Returns max junction temperature, per-GPU temperature list, total CDU flow requirement, system ΔP, pump power, and CDU return temperature.
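
compare_coolants is the dedicated tool, but the comparison can be approximated with two analyze calls at identical conditions. This sketch assumes the result exposes junction_temp_c as in the H100 example above; "glycol50" is the coolant name used in Known Limitations:

# DIY water-vs-glycol comparison via two single-point analyses.
from thermal_mcp_server.physics import analyze
from thermal_mcp_server.schemas import AnalyzeColdplateInput

conditions = dict(heat_load_w=700, flow_rate_lpm=8.0, inlet_temp_c=25.0)
water = analyze(AnalyzeColdplateInput(coolant="water", **conditions))
glycol = analyze(AnalyzeColdplateInput(coolant="glycol50", **conditions))
print(f"Glycol Tj penalty: {glycol.junction_temp_c - water.junction_temp_c:.1f}°C")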

See docs/mcp.md for full input/output schemas and field definitions.

Quick Start

Install from PyPI:

python -m venv thermal-venv
source thermal-venv/bin/activate
pip install thermal-mcp-server

Configure in your MCP client (claude_desktop_config.json):

{
  "mcpServers": {
    "thermal": {
      "command": "/absolute/path/to/thermal-venv/bin/python",
      "args": ["-m", "thermal_mcp_server"]
    }
  }
}

Use the absolute path to your venv Python. Claude Desktop does not inherit your shell's PATH.

Install from source:

git clone https://github.com/riccardovietri/thermal-mcp-server.git
cd thermal-mcp-server
python -m venv venv && source venv/bin/activate
pip install -e ".[dev]"
pytest  # 34 tests, all should pass

Usage with Claude

Once configured, ask Claude engineering questions directly:

"I have 8 H100 SXM GPUs at 700 W each. Water cooling, 8 LPM per cold plate, 25°C CDU supply. What's the junction temperature and how much thermal margin do I have?"

"What's the minimum flow rate to keep an H100 below 80°C at 30°C inlet temperature?"

"Compare water versus 50/50 glycol for a 700 W load at 8 LPM — what's the Tj penalty of switching to glycol?"

"Size a CDU for 8 H100 GPUs in a parallel manifold. I want to know total flow, system ΔP, and return water temperature."

Claude calls the relevant tool, interprets the physics output, and answers in context. The MCP layer handles validation and error reporting; the physics stays in the Python API and is independently testable.

Roadmap

  • Sensitivity and uncertainty output — a margin_c parameter for optimize_flow_rate targets Tj_p95 rather than nominal; ∂Tj/∂Q, ∂Tj/∂R_tim, and ∂Tj/∂T_inlet are available via compute_sensitivity. See examples/interactive_sizing.ipynb for worked examples, or the finite-difference sketch after this list.
  • ROI calculator — financial layer: annual cooling cost delta between air and liquid, CDU payback period, per-GPU cooling tax as % of compute cost.
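
These sensitivities can also be approximated by finite differences over the public analyze API; compute_sensitivity's exact signature is not documented here, so this sketch avoids it (junction_temp_c attribute access assumed, as above):

# Finite-difference sensitivities around the H100 baseline point.
from thermal_mcp_server.physics import analyze
from thermal_mcp_server.schemas import AnalyzeColdplateInput

def tj(q_w=700.0, t_in_c=25.0):
    result = analyze(AnalyzeColdplateInput(
        heat_load_w=q_w, flow_rate_lpm=8.0, inlet_temp_c=t_in_c, coolant="water"))
    return result.junction_temp_c

base = tj()
dtj_dq = (tj(q_w=707.0) - base) / 7.0       # °C per W, ~1% power step
dtj_dtin = (tj(t_in_c=26.0) - base) / 1.0   # °C per °C of inlet
print(f"∂Tj/∂Q ≈ {dtj_dq:.3f} °C/W, ∂Tj/∂T_inlet ≈ {dtj_dtin:.2f} °C/°C")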
