Heterogeneous compute router — auto-detect CUDA, iGPU, CPU, NPU and route ML workloads optimally

device-router

License: MIT | Python 3.10+ | Status: Beta

Modern laptops and workstations often have several compute units: a discrete GPU (CUDA), an integrated GPU (iGPU/DirectML), a Neural Processing Unit (NPU), and the CPU. Most ML frameworks pick one device and stick with it, leaving the others idle. That's wasteful.

device-router detects what's available and routes each workload to the best device automatically.

Why it matters

| Workload | Best device | Why |
| --- | --- | --- |
| Single embedding | CPU | No GPU transfer overhead (~9μs) |
| Small model (int8) | CPU (VNNI) | CPU has dedicated VNNI instructions |
| Medium model, batched | iGPU | Good compute, low power |
| Large model training | CUDA GPU | Parallelism + AMP |
| ONNX inference | CPU | ONNX Runtime is CPU-optimized |

Install

pip install device-router

Optional dependencies:

# Quote the extras so shells such as zsh do not treat [ ] as a glob pattern
pip install "device-router[cuda]"      # CUDA GPU detection via torch
pip install "device-router[directml]"  # iGPU detection via torch-directml
pip install "device-router[all]"       # Everything
pip install "device-router[dev]"       # pytest + numpy for development

Quick start

from device_router import DeviceRouter, RoutingStrategy

router = DeviceRouter()
router.detect()  # Finds CUDA, DirectML, CPU features, NPU

# Route a workload
decision = router.route(
    model_size=1_000_000,  # parameters
    batch_size=32,
    precision="fp32",      # or "fp16", "bf16", "int8"
    strategy=RoutingStrategy.AUTO,
)
print(f"Use {decision.device} ({decision.reason})")
# → Use cuda (Medium/large model (1,000,000 params) — GPU recommended)

# System overview
overview = router.overview()
# Returns: {cuda: {...}, cpu: {...}, igpu: {...}, npu: {...}}

Routing strategies

| Strategy | Description | Use case |
| --- | --- | --- |
| AUTO | Best guess based on model size & batch | Default |
| LATENCY | Optimize for single-sample speed | Real-time inference |
| THROUGHPUT | Optimize for batch processing | Batch jobs |
| POWER | Prefer CPU/iGPU for efficiency | Laptops, mobile |

How it works

Without any dependencies

With no optional dependencies installed, device-router still runs CPU-only detection:

  • CPU architecture, core count, frequency
  • Instruction set features (AVX, AVX2, AVX-512, VNNI, AMX, NEON, SSE4)

This is enough to route small models, which stay on the CPU in any case.
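
For illustration, dependency-free CPU detection can be approximated with the standard library alone. This is a minimal sketch, not device-router's actual implementation: the names `FEATURE_FLAGS`, `parse_cpu_features`, and `detect_cpu` are hypothetical, the flag names follow Linux `/proc/cpuinfo` conventions, and frequency detection is left out.

```python
import os
import platform
from pathlib import Path

# Feature label -> /proc/cpuinfo flag names that imply it (Linux naming).
# The mapping is an assumption made for this sketch.
FEATURE_FLAGS = {
    "SSE4": {"sse4_1", "sse4_2"},
    "AVX": {"avx"},
    "AVX2": {"avx2"},
    "AVX-512": {"avx512f"},
    "VNNI": {"avx512_vnni", "avx_vnni"},
    "AMX": {"amx_tile"},
    "NEON": {"neon", "asimd"},
}


def parse_cpu_features(cpuinfo_text):
    """Return the sorted feature labels found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 lists flags under "flags", ARM under "Features".
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return sorted(name for name, aliases in FEATURE_FLAGS.items() if flags & aliases)


def detect_cpu():
    """Collect architecture, logical core count, and instruction set features."""
    try:
        text = Path("/proc/cpuinfo").read_text()
    except OSError:
        text = ""  # Not Linux, or the file is unreadable: report no features.
    return {
        "arch": platform.machine(),
        "cores": os.cpu_count(),
        "features": parse_cpu_features(text),
    }


print(detect_cpu())
```

Keeping the parser separate from the file read makes it easy to test against captured `/proc/cpuinfo` text from other machines.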

With torch installed

Adds CUDA detection:

  • GPU count, name, VRAM, compute capability
  • CUDA/cuDNN version
  • Enables AMP and GPU benchmarking
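
A rough sketch of what torch-backed CUDA detection can look like. The `torch.cuda` calls are standard PyTorch APIs, but the function name `detect_cuda` and the layout of the returned dictionary are assumptions, not device-router's actual schema.

```python
def detect_cuda():
    """Describe CUDA devices via torch, or report none if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"available": False, "devices": []}

    if not torch.cuda.is_available():
        return {"available": False, "devices": []}

    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        major, minor = torch.cuda.get_device_capability(index)
        devices.append({
            "name": torch.cuda.get_device_name(index),
            "vram_gb": round(props.total_memory / 1024**3, 1),
            "compute_capability": f"{major}.{minor}",
        })

    return {
        "available": True,
        "devices": devices,
        "cuda_version": torch.version.cuda,
        "cudnn_version": torch.backends.cudnn.version(),
    }


print(detect_cuda())
```

Importing torch inside the function keeps it an optional dependency: the module still imports cleanly on machines without it.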

With torch-directml installed

Adds iGPU detection:

  • DirectML device availability
  • Enables iGPU offloading for medium workloads
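
DirectML availability can be probed in a similarly guarded way. This sketch assumes the `torch_directml` module exposes `is_available()`, `device_count()`, and `device_name()`, as current torch-directml releases do; the function name `detect_directml` and the return layout are hypothetical.

```python
def detect_directml():
    """List DirectML adapters, or report none if torch-directml is missing."""
    try:
        import torch_directml
    except ImportError:
        return {"available": False, "devices": []}

    if not torch_directml.is_available():
        return {"available": False, "devices": []}

    names = [
        torch_directml.device_name(index)
        for index in range(torch_directml.device_count())
    ]
    return {"available": True, "devices": names}


print(detect_directml())
```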

Routing decision logic

ONNX model → CPU (always)
Training → CUDA (if available) or CPU
Small model (<100K params) → CPU
  + int8 + VNNI → CPU with VNNI optimization
Medium model (100K-10M) → CUDA > DirectML > CPU
Large model (>10M) + batched → CUDA with AMP
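
The rules above can be written as a small standalone function. This is a sketch of the listed rules only, not the package's code: `sketch_route` and its keyword flags are hypothetical, and the fallback for large models that are unbatched or have no CUDA device is an assumption, since the rules do not cover it.

```python
SMALL_MAX = 100_000       # "Small model (<100K params)"
MEDIUM_MAX = 10_000_000   # "Medium model (100K-10M)"


def sketch_route(model_size, batch_size=1, precision="fp32", *, training=False,
                 onnx=False, has_cuda=False, has_directml=False, has_vnni=False):
    """Return (device, reason) following the rule list, checked top to bottom."""
    if onnx:
        return "cpu", "ONNX model"
    if training:
        if has_cuda:
            return "cuda", "training on CUDA"
        return "cpu", "training, no CUDA available"
    if model_size < SMALL_MAX:
        if precision == "int8" and has_vnni:
            return "cpu", "small int8 model, VNNI available"
        return "cpu", "small model"
    if model_size > MEDIUM_MAX and batch_size > 1 and has_cuda:
        return "cuda", "large batched model, AMP suggested"
    # Medium models, plus (by assumption) any large model not handled above:
    # prefer CUDA, then DirectML, then CPU.
    if has_cuda:
        return "cuda", "best available accelerator"
    if has_directml:
        return "directml", "best available accelerator"
    return "cpu", "no accelerator available"


print(sketch_route(1_000_000, batch_size=32, has_cuda=True))
print(sketch_route(50_000, precision="int8", has_vnni=True))
```

Checking the rules in order means the earliest matching rule wins, so an ONNX model goes to the CPU even when a CUDA device is present.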

API

DeviceRouter

router = DeviceRouter()
router.detect()                    # Scan for devices
router.overview()                   # Get system overview
router.route(model_size, batch_size, precision, strategy)  # Route workload
router.assign("cuda")               # Get torch.device for device string

RoutingDecision

decision.device       # "cuda", "cpu", "directml", "npu"
decision.reason       # Human-readable explanation
decision.precision    # Recommended precision
decision.use_amp      # Whether to use mixed precision
decision.confidence   # Confidence (0-1)
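
As a usage illustration, the fields above are enough to build a one-line log message. The helper `summarize_decision` and its `min_confidence` threshold are hypothetical, and a `SimpleNamespace` stands in for a real `RoutingDecision` so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def summarize_decision(decision, min_confidence=0.5):
    """Format a RoutingDecision-like object, flagging low-confidence routes."""
    amp = "AMP on" if decision.use_amp else "AMP off"
    note = "" if decision.confidence >= min_confidence else " [low confidence]"
    return f"{decision.device} / {decision.precision} / {amp}: {decision.reason}{note}"


# Stand-in carrying the same attribute names as RoutingDecision.
example = SimpleNamespace(
    device="cuda",
    reason="GPU recommended",
    precision="fp16",
    use_amp=True,
    confidence=0.4,
)
print(summarize_decision(example))
# → cuda / fp16 / AMP on: GPU recommended [low confidence]
```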

SuperInstance Mesh integration

# entry_point: superinstance.plugins
def register_device_router(registry):
    from device_router import DeviceRouter
    registry.register("devices", "router", DeviceRouter)

Running tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT

Ecosystem

Part of the SuperInstance ecosystem:

| Package | Description |
| --- | --- |
| plato-core | Base types + mesh registry |
| tensor-spline | SplineLinear neural compression |
| eisenstein-embed | 5-layer matching cascade |
| plato-training | Training monolith |
| device-router | Heterogeneous compute routing |
| triplet-miner | Git-powered contrastive data |
| micro-onnx | ONNX export + benchmark |
