Model-aware inference memory-placement planner for single-GPU rigs — profile, plan, prove.

gpu-container

A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe.

Run the largest useful local model your machine can honestly support, with explicit placement plans, benchmark receipts, and refusal when the plan would thrash.

Architecture

Windows / WSL2 / Linux host
  └─ GPU-enabled Docker container
      └─ Inference runtime
          ├─ VRAM: hot weights, active layers, activations, KV working set
          ├─ pinned RAM: CPU-offloaded weights, MoE experts, KV spill/reuse
          └─ NVMe: mmap shards, disk offload, cold experts, cold KV
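
The three tiers above can be illustrated with a small placement sketch. This is not gpu-container's actual planner; it is a minimal greedy example that assumes each tensor group has a known size and a hypothetical "heat" score (how often it is touched per token). The names `TensorGroup`, `place`, and the sizes in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TensorGroup:
    name: str
    size_gb: float
    heat: float  # hypothetical access-frequency score; higher means hotter


def place(groups, vram_gb, pinned_gb):
    """Greedy sketch: hottest groups go to VRAM first, then pinned RAM, then NVMe."""
    free = {"vram": vram_gb, "pinned_ram": pinned_gb}
    plan = {"vram": [], "pinned_ram": [], "nvme": []}
    for g in sorted(groups, key=lambda g: g.heat, reverse=True):
        for tier in ("vram", "pinned_ram"):
            if g.size_gb <= free[tier]:
                free[tier] -= g.size_gb
                plan[tier].append(g.name)
                break
        else:
            # Nothing faster has room, so the group falls back to disk.
            plan["nvme"].append(g.name)
    return plan


# Illustrative numbers only (assumed, not measured).
groups = [
    TensorGroup("attention_and_shared_layers", 6, 1.0),
    TensorGroup("kv_working_set", 3, 0.9),
    TensorGroup("hot_experts", 8, 0.6),
    TensorGroup("cold_experts", 20, 0.1),
]
plan = place(groups, vram_gb=12, pinned_gb=16)
print(plan)
```

With a 12 GB VRAM budget and 16 GB of pinned RAM, the shared layers and KV working set land in VRAM, the hot experts in pinned RAM, and the cold experts on NVMe. A real planner would also account for activations, fragmentation, and transfer bandwidth, which this sketch ignores.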

Product Boundary

Docker           = packaging + GPU exposure
CUDA/runtime     = compute backend
Planner          = memory law
Inference engine = execution

Core Features

  1. Hardware profiler — Detect VRAM, RAM, GPU type, WSL/native Linux, NVMe speed, CUDA availability
  2. Model profiler — Detect dense vs MoE, largest layer, total weights, quantization, KV growth by context length
  3. Runtime planner — Generate launch plans for llama.cpp, vLLM, Accelerate, TensorRT-LLM, or DeepSpeed-style offload
  4. Placement receipt — Show what is in VRAM, what is in RAM, what is on disk, expected bottleneck, measured tokens/sec
  5. MoE-specialized path — Keep always-active layers on GPU, route experts to CPU/RAM, NVMe for cold fallback
  6. Routing de-risk — Measure whether a model's MoE routing is skewed enough that a per-expert cache would help, before building for it (gpu-container-concentration)
  7. Rig-safety watchdog — Poll GPU power/temperature/VRAM + host memory against configurable thresholds; an AI agent or an autonomous loop aborts a run before it endangers the machine (gpu-container-watchdog)
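
The "KV growth by context length" item in the model profiler can be made concrete with the standard sizing arithmetic for a dense key/value cache. The sketch below is an assumption about how such an estimate could be computed, not the tool's own code; `kv_cache_bytes` and the example model shape are hypothetical, and the formula ignores allocator overhead, KV quantization, and sliding-window attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Dense KV cache size: keys and values for every layer, head, and token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, fp16 cache (2 bytes per element).
for ctx in (8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx}: {size / GIB:.1f} GiB")
# context 8192: 1.0 GiB
# context 32768: 4.0 GiB
```

The cache grows linearly with context length, which is why a plan that fits at a short context can stop fitting at a long one even though the weights are unchanged.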

Key Constraint

On Windows/WSL, CUDA Unified Memory oversubscription is not a viable path. CUDA offers only limited unified-memory support on Windows/WSL: no fine-grained GPU page-fault migration and no GPU-memory oversubscription beyond physical VRAM. This product does explicit inference memory placement, not "Docker VRAM overflow."

Status

Built and working today: gpu-container-profile, gpu-container-plan, gpu-container-receipt (with the recalibration loop), gpu-container-concentration (routing de-risk), and gpu-container-watchdog (supervise a GPU job safely). llama.cpp is the integrated backend; the placement math is backend-agnostic. Start with the quickstart.

Privacy & safety

gpu-container is a local, offline tool — it makes no network calls and collects no telemetry, by default or otherwise. It reads GPU metrics (nvidia-smi / NVML) and host memory (psutil), the model config.json you supply, and the JSON files you point it at; it writes only to the output paths you specify. It does not read or transmit model weights, credentials, or tokens. Host-level actions (wsl --shutdown, docker stop, kill) run only when you explicitly opt in via the watchdog's --on-breach; the defaults never touch your machine beyond the job they supervise. Full policy: SECURITY.md.
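
The opt-in behaviour described above (report by default, act on the host only when asked) can be sketched as a small polling loop. This is not the gpu-container-watchdog implementation: `Thresholds`, `breaches`, `supervise`, the metric keys, and the default limits are all hypothetical, and `read_sample` stands in for whatever reads nvidia-smi / NVML and psutil on a real rig.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Assumed example limits; real values depend on the specific hardware.
    max_gpu_temp_c: float = 85.0
    max_gpu_power_w: float = 350.0
    max_vram_used_frac: float = 0.97
    min_host_free_gb: float = 2.0


def breaches(sample, limits):
    """Return the names of every threshold the sample violates."""
    hit = []
    if sample["gpu_temp_c"] > limits.max_gpu_temp_c:
        hit.append("gpu_temp")
    if sample["gpu_power_w"] > limits.max_gpu_power_w:
        hit.append("gpu_power")
    if sample["vram_used_frac"] > limits.max_vram_used_frac:
        hit.append("vram")
    if sample["host_free_gb"] < limits.min_host_free_gb:
        hit.append("host_memory")
    return hit


def supervise(read_sample, limits, on_breach=None, interval_s=1.0, max_polls=None):
    """Poll until a breach occurs; call on_breach only if the caller supplied one."""
    polls = 0
    while max_polls is None or polls < max_polls:
        hit = breaches(read_sample(), limits)
        if hit:
            if on_breach is not None:  # host-level action is strictly opt-in
                on_breach(hit)
            return hit
        polls += 1
        time.sleep(interval_s)
    return []


# Usage with canned samples instead of real sensors.
samples = iter([
    {"gpu_temp_c": 70, "gpu_power_w": 200, "vram_used_frac": 0.80, "host_free_gb": 12},
    {"gpu_temp_c": 91, "gpu_power_w": 200, "vram_used_frac": 0.80, "host_free_gb": 12},
])
print(supervise(lambda: next(samples), Thresholds(), interval_s=0))  # ['gpu_temp']
```

Passing the action as a callback keeps the default path side-effect free: without `on_breach`, the loop only reports which limits were exceeded.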

Built by MCP Tool Shop · MIT Licensed
