Model-aware inference memory-placement planner for single-GPU rigs — profile, plan, prove.

gpu-container

A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe.

Run the largest useful local model your machine can honestly support, with explicit placement plans, benchmark receipts, and refusal when the plan would thrash.

Architecture

Windows / WSL2 / Linux host
  └─ GPU-enabled Docker container
      └─ Inference runtime
          ├─ VRAM: hot weights, active layers, activations, KV working set
          ├─ pinned RAM: CPU-offloaded weights, MoE experts, KV spill/reuse
          └─ NVMe: mmap shards, disk offload, cold experts, cold KV
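The tiering above can be read as a priority-ordered fill: the hottest data takes the fastest tier that still has room. The following is a minimal sketch under that assumption, not gpu-container's actual planner. The `Block` type, the `place` function, the `hotness` score, and the sizes and budgets are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Tiers ordered fastest to slowest, matching the diagram above.
TIERS = ("vram", "pinned_ram", "nvme")


@dataclass(frozen=True)
class Block:
    """Hypothetical unit of placement: a layer group, expert set, or KV segment."""
    name: str
    size_gb: float
    hotness: float  # assumed score: higher means accessed more often


def place(blocks, budgets_gb):
    """Greedily assign the hottest blocks to the fastest tier with room left."""
    free = dict(budgets_gb)
    plan = {tier: [] for tier in TIERS}
    for block in sorted(blocks, key=lambda b: b.hotness, reverse=True):
        for tier in TIERS:
            if block.size_gb <= free[tier]:
                plan[tier].append(block.name)
                free[tier] -= block.size_gb
                break
        else:
            # No tier has room: fail loudly rather than overcommit.
            raise ValueError(f"no tier can hold {block.name}")
    return plan


# Illustrative numbers only, not measurements.
blocks = [
    Block("attention+norms", 6.0, hotness=1.0),
    Block("kv_working_set", 4.0, hotness=0.9),
    Block("moe_experts_warm", 20.0, hotness=0.5),
    Block("moe_experts_cold", 60.0, hotness=0.1),
]
plan = place(blocks, {"vram": 12.0, "pinned_ram": 32.0, "nvme": 500.0})
```

With these made-up sizes, the attention layers and KV working set land in VRAM, the warm experts in pinned RAM, and the cold experts on NVMe. A real planner would also have to account for transfer bandwidth between tiers, which this sketch ignores.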

Product Boundary

Docker           = packaging + GPU exposure
CUDA/runtime     = compute backend
Planner          = memory law
Inference engine = execution

Core Features

  1. Hardware profiler — Detect VRAM, RAM, GPU type, WSL/native Linux, NVMe speed, CUDA availability
  2. Model profiler — Detect dense vs MoE, largest layer, total weights, quantization, KV growth by context length
  3. Runtime planner — Generate launch plans for llama.cpp, vLLM, Accelerate, TensorRT-LLM, or DeepSpeed-style offload
  4. Placement receipt — Show what is in VRAM, what is in RAM, what is on disk, expected bottleneck, measured tokens/sec
  5. MoE-specialized path — Keep always-active layers on GPU, route experts to CPU/RAM, NVMe for cold fallback
  6. Routing de-risk — Measure whether a model's MoE routing is skewed enough that a per-expert cache would help, before building for it (gpu-container-concentration)
  7. Rig-safety watchdog — Poll GPU power/temperature/VRAM + host memory against configurable thresholds; an AI agent or an autonomous loop aborts a run before it endangers the machine (gpu-container-watchdog)
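As a rough illustration of the "KV growth by context length" item in the model profiler, the sketch below uses the standard per-sequence estimate for a transformer with grouped KV heads: one key and one value vector per layer per token. This is an assumption about how such growth can be estimated, not a description of gpu-container's own math. The function name and the model shape are hypothetical, and the estimate ignores sliding-window attention, compressed-KV schemes, and quantized caches.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Per-sequence KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


# Illustrative shape (assumed, not read from any real config.json):
# 32 layers, 8 KV heads of dimension 128, 2-byte (fp16) cache entries.
for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx:>6}: {gib:.1f} GiB")
# context   4096: 0.5 GiB
# context  32768: 4.0 GiB
# context 131072: 16.0 GiB
```

The growth is linear in context length, so a KV cache that is negligible at 4K tokens can outweigh the VRAM left over after weights at 128K, which is why the placement tiers list a KV working set, KV spill, and cold KV separately.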

Key Constraint

On Windows/WSL, CUDA Unified Memory oversubscription is not an option. CUDA offers only limited unified-memory support on Windows/WSL: no fine-grained GPU page-fault migration and no GPU-memory oversubscription beyond physical VRAM. This product performs explicit inference memory placement; it is not "Docker VRAM overflow."
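One consequence of this constraint is that physical VRAM acts as a hard cap, so a plan that does not fit has to be refused or re-placed rather than left to the driver. The sketch below shows that check in its simplest form. `PlanRefused`, `check_vram_budget`, the component names, and the 1 GB headroom are hypothetical; they are not part of gpu-container's API.

```python
class PlanRefused(Exception):
    """Raised when a plan would need more VRAM than the device physically has."""


def check_vram_budget(resident_gb, physical_vram_gb, headroom_gb=1.0):
    """Refuse any plan whose VRAM-resident set exceeds usable physical VRAM.

    resident_gb maps a (hypothetical) component name to its size in GB.
    headroom_gb is an assumed reserve for the CUDA context and fragmentation.
    Returns the spare VRAM in GB when the plan fits.
    """
    needed = sum(resident_gb.values())
    usable = physical_vram_gb - headroom_gb
    if needed > usable:
        raise PlanRefused(
            f"plan needs {needed:.1f} GB in VRAM but only {usable:.1f} GB is usable; "
            "move components to pinned RAM or NVMe explicitly"
        )
    return usable - needed


# Fits on a 12 GB card with 0.5 GB to spare (illustrative numbers).
spare = check_vram_budget({"hot_weights": 9.0, "kv_working_set": 1.5}, 12.0)
```

Raising an exception instead of silently shrinking the plan keeps the decision explicit: the caller must choose what moves to a slower tier.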

Status

Built and working today: gpu-container-profile, gpu-container-plan, gpu-container-receipt (with the recalibration loop), gpu-container-concentration (routing de-risk), and gpu-container-watchdog (supervise a GPU job safely). llama.cpp is the integrated backend; the placement math is backend-agnostic. Start with the quickstart.

Privacy & safety

gpu-container is a local, offline tool — it makes no network calls and collects no telemetry, by default or otherwise. It reads GPU metrics (nvidia-smi / NVML) and host memory (psutil), the model config.json you supply, and the JSON files you point it at; it writes only to the output paths you specify. It does not read or transmit model weights, credentials, or tokens. Host-level actions (wsl --shutdown, docker stop, kill) run only when you explicitly opt in via the watchdog's --on-breach; the defaults never touch your machine beyond the job they supervise. Full policy: SECURITY.md.
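To make the read-only footprint concrete, here is a minimal sketch of reading GPU metrics from nvidia-smi and comparing them against thresholds. It is not the gpu-container-watchdog implementation: the function names, the threshold values, and the sample line are assumptions for illustration. The nvidia-smi query fields used (`temperature.gpu`, `power.draw`, `memory.used`, `memory.total`) are standard ones, though some GPUs report `[N/A]` for power draw, which this sketch does not handle. It only reports breaches and takes no host-level action.

```python
import subprocess

QUERY = "temperature.gpu,power.draw,memory.used,memory.total"


def read_gpu_line():
    """Return one CSV line of metrics for GPU 0 from nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits", "--id=0"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def parse_gpu_line(line):
    """Parse 'temp, power, used MiB, total MiB' into a metrics dict."""
    temp_c, power_w, used_mib, total_mib = (float(x) for x in line.split(","))
    return {"temp_c": temp_c, "power_w": power_w, "vram_frac": used_mib / total_mib}


def breaches(metrics, limits):
    """Return the names of metrics that exceed their configured limit."""
    return [name for name, limit in limits.items() if metrics[name] > limit]


# Hypothetical thresholds; safe values depend on the specific card and cooling.
LIMITS = {"temp_c": 83.0, "power_w": 300.0, "vram_frac": 0.95}

# A made-up sample line in the format nvidia-smi prints for the query above.
sample = "62, 215.30, 11800, 12288"
metrics = parse_gpu_line(sample)
tripped = breaches(metrics, LIMITS)  # VRAM is about 96% full, over the 95% limit
```

On a machine with an NVIDIA driver, `parse_gpu_line(read_gpu_line())` would replace the sample line. What happens after a breach is a separate, opt-in decision, which matches the policy described above.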

Documentation


Built by MCP Tool Shop · MIT Licensed

Download files

Source Distribution

gpu_container-0.1.0.tar.gz (1.3 MB)

Built Distribution

gpu_container-0.1.0-py3-none-any.whl (65.2 kB)

File details

Details for the file gpu_container-0.1.0.tar.gz.

File metadata

  • Download URL: gpu_container-0.1.0.tar.gz
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpu_container-0.1.0.tar.gz

Algorithm    Hash digest
SHA256       b1331aac3f7a00ce6991eade97cd396de862567361b992c6d962b8cf7ec2a559
MD5          d2e205cf28340d84b03d4e19e1fc0745
BLAKE2b-256  37484794932ade86c05d05102f9a978c8a101260c3d5ad706de3bddc1e433856

Provenance

The following attestation bundles were made for gpu_container-0.1.0.tar.gz:

Publisher: release.yml on mcp-tool-shop-org/gpu-container

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gpu_container-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: gpu_container-0.1.0-py3-none-any.whl
  • Size: 65.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpu_container-0.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       65efd372c63271dbee6c0daa918b2fcf6c61f1a8d14c138036db6f4422367f6d
MD5          829bceb43e1c2760b8070900662caf12
BLAKE2b-256  357a9511b3c759073172cfd0c892ffdc9d95f6178ab00ea5e66d9b0a9cfefe6f

Provenance

The following attestation bundles were made for gpu_container-0.1.0-py3-none-any.whl:

Publisher: release.yml on mcp-tool-shop-org/gpu-container
