Model-aware inference memory-placement planner for single-GPU rigs — profile, plan, prove.
A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe.
Run the largest useful local model your machine can honestly support, with explicit placement plans, benchmark receipts, and refusal when the plan would thrash.
Architecture
Windows / WSL2 / Linux host
└─ GPU-enabled Docker container
   └─ Inference runtime
      ├─ VRAM: hot weights, active layers, activations, KV working set
      ├─ pinned RAM: CPU-offloaded weights, MoE experts, KV spill/reuse
      └─ NVMe: mmap shards, disk offload, cold experts, cold KV
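The KV entries in the tiers above grow linearly with context length, which is why they may need to spill from VRAM to pinned RAM or NVMe. A minimal sketch of that arithmetic, assuming a standard transformer attention cache (one key tensor and one value tensor per layer). The function name and the example model shape are hypothetical and are not taken from gpu-container.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one sequence's KV cache (keys + values, all layers)."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (4096, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx}: {gib:.2f} GiB")
```

With this shape the cache is 0.50 GiB at a 4,096-token context and 4.00 GiB at 32,768 tokens, so a long context can displace weights that would otherwise sit in VRAM.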
Product Boundary
Docker = packaging + GPU exposure
CUDA/runtime = compute backend
Planner = memory law
Inference engine = execution
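To illustrate what "memory law" could mean in practice, here is a rough greedy sketch of tier placement: hot tensors claim the fastest tier first, and the plan is refused when nothing can hold a tensor. The `Tensor` type, the `hot` flag, and `plan_placement` are hypothetical names for illustration only. The actual planner's policy is not documented here and likely weighs more factors, such as refusing plans that would thrash.

```python
from dataclasses import dataclass

TIERS = ("vram", "pinned_ram", "nvme")  # fastest to slowest


@dataclass
class Tensor:
    name: str
    size_gb: float
    hot: bool  # hypothetical flag: True if touched on every token


def plan_placement(tensors, capacity_gb):
    """Greedy sketch: hot tensors first, each into the fastest tier with room."""
    free = dict(capacity_gb)
    placement = {}
    # Stable sort puts hot tensors ahead of cold ones, preserving input order.
    for t in sorted(tensors, key=lambda t: not t.hot):
        for tier in TIERS:
            if free.get(tier, 0.0) >= t.size_gb:
                free[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            # No tier has room: refuse rather than emit an impossible plan.
            raise MemoryError(f"no tier can hold {t.name}; refusing to plan")
    return placement


tensors = [
    Tensor("attn", 6, hot=True),
    Tensor("experts", 20, hot=False),
    Tensor("cold_experts", 40, hot=False),
]
print(plan_placement(tensors, {"vram": 8, "pinned_ram": 32, "nvme": 100}))
```

In this example the attention weights land in VRAM, the experts in pinned RAM, and the cold experts on NVMe.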
Core Features
- Hardware profiler: Detect VRAM, RAM, GPU type, WSL/native Linux, NVMe speed, CUDA availability
- Model profiler: Detect dense vs MoE, largest layer, total weights, quantization, KV growth by context length
- Runtime planner: Generate launch plans for llama.cpp, vLLM, Accelerate, TensorRT-LLM, or DeepSpeed-style offload
- Placement receipt: Show what is in VRAM, what is in RAM, what is on disk, expected bottleneck, measured tokens/sec
- MoE-specialized path: Keep always-active layers on GPU, route experts to CPU/RAM, use NVMe as cold fallback
- Routing de-risk: Measure whether a model's MoE routing is skewed enough that a per-expert cache would help, before building for it (gpu-container-concentration)
- Rig-safety watchdog: Poll GPU power/temperature/VRAM and host memory against configurable thresholds, so an AI agent or autonomous loop can abort a run before it endangers the machine (gpu-container-watchdog)
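One simple way to think about the routing de-risk feature is as a concentration measure: what share of routed tokens goes to the few busiest experts? The sketch below computes a top-k share from per-expert activation counts. It is an illustrative assumption, not the metric gpu-container-concentration actually reports, and `top_k_share` and the sample counts are made up.

```python
def top_k_share(expert_counts, k):
    """Fraction of routed tokens handled by the k most-used experts."""
    total = sum(expert_counts)
    if total == 0:
        raise ValueError("no routing events recorded")
    return sum(sorted(expert_counts, reverse=True)[:k]) / total


skewed = [500, 300, 100, 50, 30, 10, 5, 5]   # hypothetical activation counts
uniform = [125] * 8

print(top_k_share(skewed, 2))   # two experts take most of the traffic
print(top_k_share(uniform, 2))  # traffic is spread evenly
```

A high share for small k suggests that caching a few experts in VRAM would serve most tokens. A share close to k divided by the expert count suggests a per-expert cache would bring little benefit.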
Key Constraint
On Windows/WSL, CUDA Unified Memory oversubscription is not a viable path. CUDA gives Windows/WSL only limited unified-memory support: no fine-grained GPU page-fault migration and no GPU-memory oversubscription beyond physical VRAM. This product does explicit inference memory placement, not "Docker VRAM overflow."
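A planner that respects this constraint has to cap the GPU budget at physical VRAM and know when it is running under WSL. The sketch below assumes the common heuristic that WSL kernels report a release string containing "microsoft". The function names are hypothetical, and this is not necessarily how gpu-container-profile detects its environment.

```python
import platform


def is_wsl(kernel_release: str) -> bool:
    """Heuristic: WSL kernels report a release string containing 'microsoft'."""
    return "microsoft" in kernel_release.lower()


def check_gpu_budget(requested_gb: float, physical_vram_gb: float) -> None:
    """Reject any plan that would need more GPU memory than physically exists."""
    if requested_gb > physical_vram_gb:
        raise MemoryError(
            f"plan needs {requested_gb} GB on GPU but only "
            f"{physical_vram_gb} GB of VRAM exists; offload explicitly instead"
        )


print("running under WSL:", is_wsl(platform.release()))
check_gpu_budget(requested_gb=7.5, physical_vram_gb=8.0)  # passes silently
```

The budget check applies on every platform. It matters most on Windows/WSL, where there is no oversubscription to absorb a plan that overshoots.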
Status
Built and working today: gpu-container-profile, gpu-container-plan, gpu-container-receipt (with the recalibration loop), gpu-container-concentration (routing de-risk), and gpu-container-watchdog (supervise a GPU job safely). llama.cpp is the integrated backend; the placement math is backend-agnostic. Start with the quickstart.
Privacy & safety
gpu-container is a local, offline tool — it makes no network calls and collects no telemetry, by default or otherwise. It reads GPU metrics (nvidia-smi / NVML) and host memory (psutil), the model config.json you supply, and the JSON files you point it at; it writes only to the output paths you specify. It does not read or transmit model weights, credentials, or tokens. Host-level actions (wsl --shutdown, docker stop, kill) run only when you explicitly opt in via the watchdog's --on-breach; the defaults never touch your machine beyond the job they supervise. Full policy: SECURITY.md.
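The core of a watchdog like this is a comparison between a metrics sample and a set of limits. The sketch below shows only that comparison. The metric names, limits, and `find_breaches` are hypothetical and do not reflect gpu-container-watchdog's configuration schema. A real loop would fill `sample` from NVML or nvidia-smi and psutil, then act on a breach only if the user opted in.

```python
def find_breaches(sample: dict, limits: dict) -> list:
    """Return a 'metric=value > limit' string for every exceeded threshold."""
    return [
        f"{name}={sample[name]} > {limit}"
        for name, limit in limits.items()
        if name in sample and sample[name] > limit
    ]


# Hypothetical metric names and thresholds, for illustration only.
limits = {"gpu_temp_c": 85, "gpu_power_w": 300, "host_mem_pct": 92}
sample = {"gpu_temp_c": 88, "gpu_power_w": 240, "host_mem_pct": 95}

print(find_breaches(sample, limits))
# ['gpu_temp_c=88 > 85', 'host_mem_pct=95 > 92']
```

Keeping the check a pure function means the thresholds can be tested without a GPU present.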
Documentation
- docs/quickstart.md: end-to-end walkthrough (profile → plan → launch under the watchdog → receipt → recalibrate)
- docs/cli.md: the five commands (synopsis, flags, exit codes, worked examples)
- docs/architecture.md: memory-tier model, data flow, MoE expert routing, the recalibration loop
- docs/features.md: the seven core features in depth
- docs/moe-lane-architecture.md: the flagship MoE lane in depth
- docs/derisk-concentration.md: the per-expert-cache de-risk gate (routing concentration)
- docs/decisions/0001-per-expert-cache-build-vs-upstream.md: ADR-0001, consume the cache mechanism, contribute the policy
- docs/constraints.md: non-goals and the Windows/WSL CUDA Unified-Memory correction
- docs/prior-art.md: runtimes we orchestrate, and the gap this product fills
- docs/feasibility.md: feasibility assessment, research grounding, and what's confirmed live
Built by MCP Tool Shop · MIT Licensed
File details
Details for the file gpu_container-0.1.3.tar.gz.
File metadata
- Download URL: gpu_container-0.1.3.tar.gz
- Upload date:
- Size: 1.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2267f0b127b040a04c2bb783b1302754c584339e27c33473116a13b7a75f081f |
| MD5 | 4fc3d1acc7e772702b4fa768c98d4325 |
| BLAKE2b-256 | fa065c2b1f98b1537a22a3f6990a8db6d89af0649134ff74d071cbb30209c879 |
Provenance
The following attestation bundles were made for gpu_container-0.1.3.tar.gz:
Publisher: release.yml on mcp-tool-shop-org/gpu-container
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpu_container-0.1.3.tar.gz
- Subject digest: 2267f0b127b040a04c2bb783b1302754c584339e27c33473116a13b7a75f081f
- Sigstore transparency entry: 1724724957
- Sigstore integration time:
- Permalink: mcp-tool-shop-org/gpu-container@d16a0229b78b2e8dbeebd8a0f3280bc440fb1ad5
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/mcp-tool-shop-org
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d16a0229b78b2e8dbeebd8a0f3280bc440fb1ad5
- Trigger Event: release
File details
Details for the file gpu_container-0.1.3-py3-none-any.whl.
File metadata
- Download URL: gpu_container-0.1.3-py3-none-any.whl
- Upload date:
- Size: 65.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c546471c3af00c36c30daca9d20c0ff2564dd8956014e8a0eaab94b7998cb021 |
| MD5 | c60074dd231923e971c2e8994521b971 |
| BLAKE2b-256 | cd11635b0294168e4409410affdd9416ea241b1aa9e4564d1ad48c809eb953ce |
Provenance
The following attestation bundles were made for gpu_container-0.1.3-py3-none-any.whl:
Publisher: release.yml on mcp-tool-shop-org/gpu-container
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpu_container-0.1.3-py3-none-any.whl
- Subject digest: c546471c3af00c36c30daca9d20c0ff2564dd8956014e8a0eaab94b7998cb021
- Sigstore transparency entry: 1724725055
- Sigstore integration time:
- Permalink: mcp-tool-shop-org/gpu-container@d16a0229b78b2e8dbeebd8a0f3280bc440fb1ad5
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/mcp-tool-shop-org
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d16a0229b78b2e8dbeebd8a0f3280bc440fb1ad5
- Trigger Event: release