Model-aware inference memory-placement planner for single-GPU rigs — profile, plan, prove.

gpu-container

A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe.

Run the largest useful local model your machine can honestly support, with explicit placement plans, benchmark receipts, and refusal when the plan would thrash.

Architecture

Windows / WSL2 / Linux host
  └─ GPU-enabled Docker container
      └─ Inference runtime
          ├─ VRAM: hot weights, active layers, activations, KV working set
          ├─ pinned RAM: CPU-offloaded weights, MoE experts, KV spill/reuse
          └─ NVMe: mmap shards, disk offload, cold experts, cold KV
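The three tiers above suggest a simple way to think about a placement plan. The following is a minimal sketch of a greedy hottest-first placement, not gpu-container's actual planner: the `Tier` class, the `place` function, the item names, and all sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One memory tier with a capacity budget in GB (hypothetical model)."""
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def place(items, tiers):
    """Greedy placement.

    items: (name, size_gb) pairs, ordered hottest first.
    tiers: Tier objects, ordered fastest first.
    Returns {item name: tier name}; raises ValueError instead of
    silently overcommitting when nothing can hold an item.
    """
    plan = {}
    for name, size_gb in items:
        for tier in tiers:
            if size_gb <= tier.free_gb():
                tier.used_gb += size_gb
                plan[name] = tier.name
                break
        else:
            raise ValueError(f"no tier can hold {name} ({size_gb} GB)")
    return plan


# Illustrative numbers only: a 12 GB GPU, 32 GB of pinned RAM, 500 GB of NVMe.
tiers = [Tier("vram", 12.0), Tier("pinned_ram", 32.0), Tier("nvme", 500.0)]
items = [
    ("attention_and_shared_layers", 6.0),
    ("kv_working_set", 4.0),
    ("hot_experts", 8.0),
    ("cold_experts", 40.0),
]
plan = place(items, tiers)
print(plan)
```

With these made-up sizes the first two items land in VRAM, the hot experts spill to pinned RAM, and the cold experts go to NVMe. A real planner would also weigh transfer bandwidth and access frequency, which this sketch ignores.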

Product Boundary

Docker           = packaging + GPU exposure
CUDA/runtime     = compute backend
Planner          = memory law
Inference engine = execution

Core Features

  1. Hardware profiler — Detect VRAM, RAM, GPU type, WSL/native Linux, NVMe speed, CUDA availability
  2. Model profiler — Detect dense vs MoE, largest layer, total weights, quantization, KV growth by context length
  3. Runtime planner — Generate launch plans for llama.cpp, vLLM, Accelerate, TensorRT-LLM, or DeepSpeed-style offload
  4. Placement receipt — Show what is in VRAM, what is in RAM, what is on disk, expected bottleneck, measured tokens/sec
  5. MoE-specialized path — Keep always-active layers on GPU, route experts to CPU/RAM, NVMe for cold fallback
  6. Routing de-risk — Measure whether a model's MoE routing is skewed enough that a per-expert cache would help, before building for it (gpu-container-concentration)
  7. Rig-safety watchdog — Poll GPU power/temperature/VRAM + host memory against configurable thresholds; an AI agent or an autonomous loop aborts a run before it endangers the machine (gpu-container-watchdog)
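To make "KV growth by context length" from the model profiler item concrete, here is a rough sketch of the standard KV-cache size estimate for a transformer that keeps full keys and values at every layer. It is an assumption-laden illustration, not the profiler's own code: the function name `kv_cache_bytes` and the example model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical, and models with sliding-window or compressed attention will not follow this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    Assumes every layer caches full K and V for every token.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


# Hypothetical model shape; real values come from the model's config.json.
small = kv_cache_bytes(32, 8, 128, context_len=4096) / 2**30
large = kv_cache_bytes(32, 8, 128, context_len=32768) / 2**30
print(small, large)  # 0.5 4.0 (GiB)
```

The estimate is linear in context length, so an 8x longer context needs 8x the KV memory. That linear growth is why the KV working set can end up competing with weights for VRAM.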

Key Constraint

On Windows/WSL, CUDA Unified Memory oversubscription is not the path. CUDA gives Windows/WSL only limited unified-memory support: no fine-grained GPU page-fault migration and no GPU-memory oversubscription beyond physical VRAM. This product is explicit inference memory placement, not "Docker VRAM overflow."
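A tool that respects this constraint first has to know which kind of host it is on. The sketch below shows one common heuristic, checking the kernel release string for "microsoft", which WSL kernels typically include. The function names are hypothetical and this is not necessarily how gpu-container-profile detects the host.

```python
import platform


def is_wsl(kernel_release=None):
    """Heuristic: WSL kernel release strings usually contain 'microsoft'."""
    if kernel_release is None:
        kernel_release = platform.uname().release
    return "microsoft" in kernel_release.lower()


def oversubscription_ruled_out(system=None, kernel_release=None):
    """True on native Windows or WSL, where the constraint above applies.

    False only means this check did not rule it out; it is not a
    recommendation to rely on Unified Memory oversubscription.
    """
    if system is None:
        system = platform.system()
    return system == "Windows" or is_wsl(kernel_release)


print(oversubscription_ruled_out())
```

Passing `system` and `kernel_release` explicitly makes the check easy to test without running on each platform.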

Status

Built and working today: gpu-container-profile, gpu-container-plan, gpu-container-receipt (with the recalibration loop), gpu-container-concentration (routing de-risk), and gpu-container-watchdog (supervise a GPU job safely). llama.cpp is the integrated backend; the placement math is backend-agnostic. Start with the quickstart.

Privacy & safety

gpu-container is a local, offline tool — it makes no network calls and collects no telemetry, by default or otherwise. It reads GPU metrics (nvidia-smi / NVML) and host memory (psutil), the model config.json you supply, and the JSON files you point it at; it writes only to the output paths you specify. It does not read or transmit model weights, credentials, or tokens. Host-level actions (wsl --shutdown, docker stop, kill) run only when you explicitly opt in via the watchdog's --on-breach; the defaults never touch your machine beyond the job they supervise. Full policy: SECURITY.md.
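The opt-in behaviour described above can be pictured as a threshold check whose only default effect is reporting. This is a minimal sketch under stated assumptions, not the watchdog's implementation: the metric keys, limit values, and the `find_breaches` and `supervise_step` names are all hypothetical. In practice the sample would come from nvidia-smi / NVML and psutil, which the sketch deliberately does not call.

```python
def find_breaches(sample, limits):
    """Return a message for every metric in `sample` above its limit."""
    return [
        f"{key}={sample[key]} exceeds {limit}"
        for key, limit in limits.items()
        if key in sample and sample[key] > limit
    ]


def supervise_step(sample, limits, on_breach=None):
    """Check one sample; run `on_breach` only if the caller supplied it."""
    breaches = find_breaches(sample, limits)
    if breaches and on_breach is not None:
        on_breach(breaches)  # explicit opt-in, never a default action
    return breaches


# Hypothetical thresholds and one hypothetical reading.
limits = {"gpu_temp_c": 85, "gpu_power_w": 350,
          "vram_used_frac": 0.97, "host_mem_used_frac": 0.92}
sample = {"gpu_temp_c": 88, "gpu_power_w": 300,
          "vram_used_frac": 0.91, "host_mem_used_frac": 0.95}

actions = []
breaches = supervise_step(sample, limits, on_breach=actions.append)
print(breaches)
```

Here the temperature and host-memory readings breach their limits, so the supplied callback runs once. Calling `supervise_step` without `on_breach` returns the same list and takes no action.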


Built by MCP Tool Shop · MIT Licensed

