
Model-aware inference memory-placement planner for single-GPU rigs — profile, plan, prove.



gpu-container


A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe.

Run the largest useful local model your machine can honestly support, with explicit placement plans, benchmark receipts, and refusal when the plan would thrash.

Architecture

Windows / WSL2 / Linux host
  └─ GPU-enabled Docker container
      └─ Inference runtime
          ├─ VRAM: hot weights, active layers, activations, KV working set
          ├─ pinned RAM: CPU-offloaded weights, MoE experts, KV spill/reuse
          └─ NVMe: mmap shards, disk offload, cold experts, cold KV
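
The tiering above can be illustrated with a small greedy sketch. This is not gpu-container's planner; the `Tensor` type, the `heat` priority, the `place` function, and all sizes are hypothetical, and NVMe is treated as unbounded for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    heat: int  # hypothetical priority: higher means accessed more often


def place(tensors, vram_gb, pinned_gb):
    """Greedy sketch: hottest items first; try VRAM, then pinned RAM, else NVMe."""
    free = {"vram": vram_gb, "pinned_ram": pinned_gb}
    plan = {"vram": [], "pinned_ram": [], "nvme": []}
    for t in sorted(tensors, key=lambda t: t.heat, reverse=True):
        for tier in ("vram", "pinned_ram"):
            if t.size_gb <= free[tier]:
                free[tier] -= t.size_gb
                plan[tier].append(t.name)
                break
        else:
            # Nothing faster had room; NVMe is assumed unbounded in this sketch.
            plan["nvme"].append(t.name)
    return plan


# Hypothetical workload on a 12 GB GPU with 32 GB of pinned RAM.
tensors = [
    Tensor("attention_layers", 6.0, heat=3),
    Tensor("kv_working_set", 2.0, heat=3),
    Tensor("moe_experts_warm", 20.0, heat=2),
    Tensor("moe_experts_cold", 40.0, heat=1),
]
plan = place(tensors, vram_gb=12.0, pinned_gb=32.0)
print(plan)
```

A real planner would also weigh transfer bandwidth between tiers and refuse plans that would thrash; this sketch only shows the ordering idea.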

Product Boundary

Docker           = packaging + GPU exposure
CUDA/runtime     = compute backend
Planner          = memory law
Inference engine = execution

Core Features

  1. Hardware profiler: Detect VRAM, RAM, GPU type, WSL/native Linux, NVMe speed, CUDA availability
  2. Model profiler: Detect dense vs MoE, largest layer, total weights, quantization, KV growth by context length
  3. Runtime planner: Generate launch plans for llama.cpp, vLLM, Accelerate, TensorRT-LLM, or DeepSpeed-style offload
  4. Placement receipt: Show what is in VRAM, what is in RAM, what is on disk, expected bottleneck, measured tokens/sec
  5. MoE-specialized path: Keep always-active layers on GPU, route experts to CPU/RAM, and use NVMe as the cold fallback
  6. Routing de-risk: Measure whether a model's MoE routing is skewed enough that a per-expert cache would help, before building for it (gpu-container-concentration)
  7. Rig-safety watchdog: Poll GPU power, temperature, and VRAM plus host memory against configurable thresholds, so that an AI agent or autonomous loop can abort a run before it endangers the machine (gpu-container-watchdog)
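
To make "KV growth by context length" concrete, here is a minimal sketch of the standard KV-cache size estimate for a transformer with grouped KV heads. The function name and the example model shape are hypothetical, and this is not necessarily the formula the model profiler uses.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV-cache size: one K and one V tensor per layer (the factor 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (4096, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx}: {gib} GiB")
```

The estimate is linear in context length, which is why a plan that fits at a short context can stop fitting at a long one.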

Key Constraint

On Windows/WSL, CUDA Unified Memory oversubscription is not the path. CUDA provides only limited unified-memory support on Windows/WSL: no fine-grained GPU page-fault migration and no GPU-memory oversubscription beyond physical VRAM. This product does explicit inference memory placement; it is not "Docker VRAM overflow."
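
Because the constraint depends on the host, a planner needs to know whether it is running under WSL. The sketch below uses a common heuristic (WSL kernels typically include "microsoft" in /proc/version); it is an assumption, not gpu-container's actual detection logic, and both function names are hypothetical.

```python
from pathlib import Path


def is_wsl(proc_version_text: str) -> bool:
    """Heuristic: WSL kernel version strings usually contain 'microsoft'."""
    return "microsoft" in proc_version_text.lower()


def read_proc_version(path="/proc/version") -> str:
    """Return the kernel version string, or '' where the file does not exist."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if is_wsl(read_proc_version()):
    print("WSL detected: cap the GPU-resident budget at physical VRAM")
else:
    print("WSL not detected")
```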

Status

Built and working today: gpu-container-profile, gpu-container-plan, gpu-container-receipt (with the recalibration loop), gpu-container-concentration (routing de-risk), and gpu-container-watchdog (supervise a GPU job safely). llama.cpp is the integrated backend; the placement math is backend-agnostic. Start with the quickstart.
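
As an illustration of what a routing de-risk measurement can look like, the sketch below computes the share of routed tokens that land on the k most-used experts. The metric, the `top_k_share` name, and the trace are hypothetical; gpu-container-concentration may measure skew differently.

```python
from collections import Counter


def top_k_share(expert_ids, k):
    """Fraction of routed tokens handled by the k most frequently chosen experts."""
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


# Hypothetical routing trace over 4 experts (one expert id per routed token).
trace = [0, 0, 0, 0, 1, 1, 2, 3, 0, 1]
share = top_k_share(trace, k=2)
print(share)  # 0.8; perfectly uniform routing over 4 experts would give 0.5
```

A share well above the uniform baseline suggests a small per-expert cache could absorb most traffic; a share near the baseline suggests it would not.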

Privacy & safety

gpu-container is a local, offline tool — it makes no network calls and collects no telemetry, by default or otherwise. It reads GPU metrics (nvidia-smi / NVML) and host memory (psutil), the model config.json you supply, and the JSON files you point it at; it writes only to the output paths you specify. It does not read or transmit model weights, credentials, or tokens. Host-level actions (wsl --shutdown, docker stop, kill) run only when you explicitly opt in via the watchdog's --on-breach; the defaults never touch your machine beyond the job they supervise. Full policy: SECURITY.md.
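
For readers who want a feel for threshold-based supervision, here is a minimal sketch that parses one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output and reports which limits are exceeded. The function names and the limit values are hypothetical, and this is not the watchdog's implementation. Some GPUs report `[N/A]` for power draw, which this sketch does not handle.

```python
import subprocess

QUERY = "power.draw,temperature.gpu,memory.used,memory.total"


def read_gpu_csv():
    """Run nvidia-smi locally and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_first_gpu(csv_text):
    """Parse the first GPU's line into watts, degrees C, and VRAM fraction used."""
    power, temp, used, total = (float(x) for x in csv_text.splitlines()[0].split(","))
    return {"power_w": power, "temp_c": temp, "vram_frac": used / total}


def breaches(sample, limits):
    """Return the names of metrics that exceed their limit."""
    return [name for name, limit in limits.items() if sample[name] > limit]


# Hypothetical limits and a canned sample, so the example runs without a GPU.
LIMITS = {"power_w": 300.0, "temp_c": 83.0, "vram_frac": 0.95}
sample = parse_first_gpu("215.40, 85, 10240, 12288\n")
print(breaches(sample, LIMITS))  # ['temp_c']
```

On a machine with an NVIDIA driver, `parse_first_gpu(read_gpu_csv())` would supply a live sample instead of the canned string.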

Built by MCP Tool Shop · MIT Licensed

Download files


Source Distribution

gpu_container-0.1.2.tar.gz (1.3 MB view details)

Uploaded Source

Built Distribution


gpu_container-0.1.2-py3-none-any.whl (65.4 kB view details)

Uploaded Python 3

File details

Details for the file gpu_container-0.1.2.tar.gz.

File metadata

  • Download URL: gpu_container-0.1.2.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpu_container-0.1.2.tar.gz
Algorithm     Hash digest
SHA256        7149d77c1561f8b6c1a71aa0b67244272382b9f9816f21819b0cda65e6bae062
MD5           74002826419f130d257540be63006f96
BLAKE2b-256   d62a87f056b7e51e4d1a6e4ee0147ae27caae129b2aaacb270c20281baba6d92
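
A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only the Python standard library; the helper names are hypothetical and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for gpu_container-0.1.2.tar.gz.
EXPECTED_SHA256 = "7149d77c1561f8b6c1a71aa0b67244272382b9f9816f21819b0cda65e6bae062"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """True when the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Usage, assuming the archive sits in the current directory: `matches("gpu_container-0.1.2.tar.gz")` should return `True` for an intact download.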


Provenance

The following attestation bundles were made for gpu_container-0.1.2.tar.gz:

Publisher: release.yml on mcp-tool-shop-org/gpu-container

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gpu_container-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: gpu_container-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 65.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpu_container-0.1.2-py3-none-any.whl
Algorithm     Hash digest
SHA256        f41f2fb67dcddd7e4fd520fd8af7cceb1b355ab3f5828931b02f8246dfaf6499
MD5           eb0b0bf261f5188a75f70db857abe823
BLAKE2b-256   daa313b82c805de31a00c4d30908742c40c14a8763bcf0262580500dbf7a865c


Provenance

The following attestation bundles were made for gpu_container-0.1.2-py3-none-any.whl:

Publisher: release.yml on mcp-tool-shop-org/gpu-container

