Heterogeneous compute router — auto-detect CUDA, iGPU, CPU, NPU and route ML workloads optimally
device-router
Modern laptops and workstations have multiple compute units: a discrete GPU (CUDA), an integrated GPU (iGPU/DirectML), a Neural Processing Unit (NPU), and the CPU. Most ML frameworks pick one device and stick with it. That's wasteful.
device-router detects what's available and routes each workload to the best device automatically.
Why it matters
| Workload | Best device | Why |
|---|---|---|
| Single embedding | CPU | No GPU transfer overhead (~9μs) |
| Small model (int8) | CPU (VNNI) | CPU has dedicated VNNI instructions |
| Medium model batched | iGPU | Good compute, low power |
| Large model training | CUDA GPU | Parallelism + AMP |
| ONNX inference | CPU | ONNX Runtime's default CPU execution provider is well optimized |
Install
pip install device-router
Optional dependencies:
pip install "device-router[cuda]"      # CUDA GPU detection via torch
pip install "device-router[directml]"  # iGPU detection via torch-directml
pip install "device-router[all]"       # Everything
pip install "device-router[dev]"       # pytest + numpy for development
Quick start
from device_router import DeviceRouter, RoutingStrategy
router = DeviceRouter()
router.detect() # Finds CUDA, DirectML, CPU features, NPU
# Route a workload
decision = router.route(
    model_size=1_000_000,  # parameters
    batch_size=32,
    precision="fp32",  # or "fp16", "bf16", "int8"
    strategy=RoutingStrategy.AUTO,
)
print(f"Use {decision.device} ({decision.reason})")
# → Use cuda (Medium/large model (1,000,000 params) — GPU recommended)
# System overview
overview = router.overview()
# Returns: {cuda: {...}, cpu: {...}, igpu: {...}, npu: {...}}
Routing strategies
| Strategy | Description | Use case |
|---|---|---|
| AUTO | Best guess based on model size and batch size | Default |
| LATENCY | Optimize for single-sample speed | Real-time inference |
| THROUGHPUT | Optimize for batch processing | Batch jobs |
| POWER | Prefer CPU/iGPU for efficiency | Laptops, mobile |
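One way to picture the strategies is as a device preference order per strategy. The sketch below is illustrative only and is not the package's implementation: `StrategySketch`, `PREFERENCE`, and `pick_device` are hypothetical names, and the specific orderings are assumptions derived from the table above (POWER prefers CPU/iGPU, THROUGHPUT prefers the GPU, LATENCY avoids transfer overhead).

```python
from enum import Enum


class StrategySketch(Enum):
    """Hypothetical stand-in for the package's RoutingStrategy."""
    AUTO = "auto"
    LATENCY = "latency"
    THROUGHPUT = "throughput"
    POWER = "power"


# Assumed preference orders; the real router also weighs model size and batch.
PREFERENCE = {
    StrategySketch.LATENCY: ("cpu", "cuda", "directml"),     # avoid transfer overhead
    StrategySketch.THROUGHPUT: ("cuda", "directml", "cpu"),  # favor parallel hardware
    StrategySketch.POWER: ("cpu", "directml", "cuda"),       # favor low-power devices
}
DEFAULT_ORDER = ("cuda", "directml", "cpu")  # used for AUTO in this sketch


def pick_device(strategy, available):
    """Return the first preferred device present in `available`, else "cpu"."""
    for device in PREFERENCE.get(strategy, DEFAULT_ORDER):
        if device in available:
            return device
    return "cpu"  # the CPU is always present


if __name__ == "__main__":
    print(pick_device(StrategySketch.POWER, {"cpu", "cuda"}))
```

In this sketch AUTO simply falls back to the CUDA > DirectML > CPU order; in the package itself, AUTO is described as depending on model size and batch size.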
How it works
Without any dependencies
device-router runs pure CPU detection:
- CPU architecture, core count, frequency
- Instruction set features (AVX, AVX2, AVX-512, VNNI, AMX, NEON, SSE4)
- This is enough to route small models optimally
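As a rough illustration of dependency-free detection, the following sketch gathers the architecture, core count, and instruction-set flags using only the standard library. It is not the package's code: `detect_cpu_features` is a hypothetical name, and the flag parsing assumes a Linux-style `/proc/cpuinfo`, so other platforms just get an empty feature set here.

```python
import os
import platform

# Flag names as they appear in /proc/cpuinfo on x86 ("flags") and ARM ("Features").
WANTED_FLAGS = {
    "sse4_1", "sse4_2", "avx", "avx2", "avx512f",
    "avx512_vnni", "avx_vnni", "amx_tile", "neon", "asimd",
}


def detect_cpu_features(cpuinfo_path="/proc/cpuinfo"):
    """Collect basic CPU facts without third-party dependencies."""
    info = {
        "arch": platform.machine(),    # e.g. "x86_64" or "arm64"
        "cores": os.cpu_count() or 1,  # os.cpu_count() may return None
        "features": set(),
    }
    try:
        with open(cpuinfo_path, encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                key, _, value = line.partition(":")
                if key.strip().lower() in ("flags", "features"):
                    info["features"] |= WANTED_FLAGS & set(value.split())
    except OSError:
        pass  # file missing (non-Linux): leave the feature set empty
    return info


if __name__ == "__main__":
    print(detect_cpu_features())
```

A cross-platform implementation would need other sources for the flags (for example CPUID on Windows or `sysctl` on macOS), which this sketch leaves out.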
With torch installed
Adds CUDA detection:
- GPU count, name, VRAM, compute capability
- CUDA/cuDNN version
- Enables AMP and GPU benchmarking
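The CUDA facts listed above can all be read from public PyTorch calls. The sketch below shows one way to do that while degrading gracefully when torch is missing; `detect_cuda` and the shape of the returned dictionary are assumptions for illustration, not the package's actual output format.

```python
def detect_cuda():
    """Summarize CUDA devices; report unavailable if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"available": False, "devices": []}

    if not torch.cuda.is_available():
        return {"available": False, "devices": []}

    devices = []
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        devices.append({
            "name": props.name,
            "vram_bytes": props.total_memory,
            "compute_capability": (props.major, props.minor),
        })

    return {
        "available": True,
        "devices": devices,
        "cuda_version": torch.version.cuda,               # version torch was built with
        "cudnn_version": torch.backends.cudnn.version(),  # None if cuDNN is absent
    }


if __name__ == "__main__":
    print(detect_cuda())
```

Importing torch inside the function keeps the import cost and the dependency optional, which matches the idea that the base install works without it.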
With torch-directml installed
Adds iGPU detection:
- DirectML device availability
- Enables iGPU offloading for medium workloads
Routing decision logic
ONNX model → CPU (always)
Training → CUDA (if available) or CPU
Small model (<100K params) → CPU
+ int8 + VNNI → CPU with VNNI optimization
Medium model (100K-10M) → CUDA > DirectML > CPU
Large model (>10M) + batched → CUDA with AMP
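The rules above can be read as an ordered chain of checks. The following is a minimal sketch of that chain, not the package's implementation: `Route`, `route_sketch`, and the boolean capability flags are hypothetical, and the fallback for a large model without CUDA (not stated above) is assumed to follow the same CUDA > DirectML > CPU order.

```python
from typing import NamedTuple


class Route(NamedTuple):
    device: str
    use_amp: bool
    reason: str


def route_sketch(model_size, batch_size=1, precision="fp32", *,
                 is_onnx=False, training=False,
                 has_cuda=False, has_directml=False, has_vnni=False):
    """Apply the documented rules in order; the first match wins."""
    if is_onnx:
        return Route("cpu", False, "ONNX model")
    if training:
        return Route("cuda" if has_cuda else "cpu", False, "training")
    if model_size < 100_000:
        if precision == "int8" and has_vnni:
            return Route("cpu", False, "small int8 model, VNNI available")
        return Route("cpu", False, "small model")

    # Medium and large models share the CUDA > DirectML > CPU preference.
    # Assumption: AMP is only enabled for large (>10M) batched models on CUDA.
    large = model_size > 10_000_000
    if has_cuda:
        return Route("cuda", large and batch_size > 1,
                     "large model" if large else "medium model")
    if has_directml:
        return Route("directml", False, "no CUDA, DirectML available")
    return Route("cpu", False, "no accelerator available")


if __name__ == "__main__":
    print(route_sketch(1_000_000, batch_size=32, has_cuda=True))
```

Keeping the rules as early returns makes the priority explicit: ONNX and training checks run before any size-based rule.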
API
DeviceRouter
router = DeviceRouter()
router.detect() # Scan for devices
router.overview() # Get system overview
router.route(model_size, batch_size, precision, strategy) # Route workload
router.assign("cuda") # Get torch.device for device string
RoutingDecision
decision.device # "cuda", "cpu", "directml", "npu"
decision.reason # Human-readable explanation
decision.precision # Recommended precision
decision.use_amp # Whether to use mixed precision
decision.confidence # Confidence (0-1)
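For readers who want a concrete picture of the fields, a decision object with these attributes could be modeled as a small frozen dataclass. This is a sketch under assumptions: `RoutingDecisionSketch` is a hypothetical name, and the defaults and the range check are illustrative rather than taken from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecisionSketch:
    """Hypothetical model of the documented RoutingDecision fields."""
    device: str               # "cuda", "cpu", "directml", or "npu"
    reason: str               # human-readable explanation
    precision: str = "fp32"   # recommended precision
    use_amp: bool = False     # whether to use mixed precision
    confidence: float = 1.0   # confidence in the range 0-1

    def __post_init__(self):
        # Reject values outside the documented 0-1 range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


if __name__ == "__main__":
    decision = RoutingDecisionSketch("cpu", "small model", confidence=0.9)
    print(f"Use {decision.device} ({decision.reason})")
```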
SuperInstance Mesh integration
# entry_point: superinstance.plugins
def register_device_router(registry):
    from device_router import DeviceRouter
    registry.register("devices", "router", DeviceRouter)
Running tests
pip install -e ".[dev]"
pytest tests/ -v
License
MIT
Ecosystem
Part of the SuperInstance ecosystem:
| Package | Description |
|---|---|
| plato-core | Base types + mesh registry |
| tensor-spline | SplineLinear neural compression |
| eisenstein-embed | 5-layer matching cascade |
| plato-training | Training monolith |
| device-router | Heterogeneous compute routing |
| triplet-miner | Git-powered contrastive data |
| micro-onnx | ONNX export + benchmark |