nvinx

nginx-style workload scheduler for limited-VRAM GPU bench setups

A long inference saturating your 4 GB GPU is not dead time; it's a scheduling constraint.

When a single model saturates a small GPU for hours, the CPU, system RAM, and the wall clock are still available. Most research tooling doesn't schedule against those idle resources — it just waits. nvinx is three small, composable patterns that turn "wait" into "work" on limited-VRAM benches.

Extracted from a real single-GPU research workload (4 GB VRAM + 32 GB RAM + 8 cores). The patterns are generic; the workload is not published.

Status: v0.2.0a1 in-tree (v0.1.0 on PyPI). API may still change between minor versions; pin your dependency.


The reframe

A research bench with one small GPU has more usable compute than it appears to. During a 6-hour basecaller or folding run, the GPU is saturated — but the CPU, system RAM, and I/O channels are not. The usual response is to treat that window as a blocker and run the remaining steps serially afterward, doubling total wall-clock.

The reframe: the GPU-exclusive window is a scheduling opportunity, not dead time. CPU-only work (variant calling, report generation, data ingest, classical-ML inference) can run in parallel for free. Small-VRAM models can co-reside with each other. Models that exceed VRAM can spill to RAM via HuggingFace accelerate with a known 3–5× slowdown — which is often strictly better than not running at all.

nvinx packages three named patterns for these cases. It is pure scheduling logic: no model invocation, no memory reservation, no runtime. You call a pattern with ModelSpec inputs and a HardwareSpec envelope, you get a SchedulingPlan out, and your runtime executes the plan. This keeps nvinx testable, framework-agnostic, and composable with whatever runtime you already have.


Install

pip install nvinx

From source:

git clone https://github.com/aki1770-del/nvinx
cd nvinx
pip install -e ".[dev]"
pytest

Python ≥ 3.10. Dependencies: pyyaml only.


The three patterns

Pattern A — serial_handoff: CPU work during a GPU-exclusive window

When: one long GPU job saturates the card, and you have other work that doesn't need the GPU.

Why: the GPU-exclusive window is the CPU's foreground production window. Running CPU-only work sequentially after the GPU job wastes wall-clock.

from nvinx.catalog import HardwareSpec, ModelSpec, Residency
from nvinx.patterns import serial_handoff

hw = HardwareSpec(vram_gb=4.0, ram_gb=32.0, cpu_cores=8)

gpu_job = ModelSpec(
    name="long_inference",
    vram_gb=3.5,
    residency=Residency.GPU_EXCLUSIVE,
)

cpu_candidates = [
    ModelSpec(name="feature_extractor",  vram_gb=0.0, residency=Residency.CPU_ONLY),
    ModelSpec(name="report_writer",      vram_gb=0.0, residency=Residency.CPU_ONLY),
    ModelSpec(name="vector_index_build", vram_gb=0.0, residency=Residency.CPU_ONLY),
]

plan = serial_handoff(gpu_job, cpu_candidates, hw)

print(plan.gpu_foreground.name)         # "long_inference"
print([m.name for m in plan.cpu_parallel])
# ['feature_extractor', 'report_writer', 'vector_index_build']

The pattern refuses to run if the GPU job isn't marked GPU_EXCLUSIVE or exceeds the hardware's VRAM. CPU candidates without cpu_fallback_supported=True (and not CPU_ONLY) land in plan.unscheduled.
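
A quick sketch of both guard paths, using only the API shown above (the exact exception type raised on refusal is not documented here, so it is caught generically):

from nvinx.catalog import HardwareSpec, ModelSpec, Residency
from nvinx.patterns import serial_handoff

hw = HardwareSpec(vram_gb=4.0, ram_gb=32.0, cpu_cores=8)

# Refusal path: the foreground job is not marked GPU_EXCLUSIVE.
shared_job = ModelSpec(name="shared_job", vram_gb=3.5, residency=Residency.GPU_SHARED)
try:
    serial_handoff(shared_job, [], hw)
except Exception as exc:  # exception type unspecified upstream
    print(f"refused: {exc}")

# Unscheduled path: a GPU_SHARED candidate without cpu_fallback_supported=True
# cannot run during the exclusive window, so it lands in plan.unscheduled.
gpu_job = ModelSpec(name="long_inference", vram_gb=3.5, residency=Residency.GPU_EXCLUSIVE)
needs_gpu = ModelSpec(name="small_gpu_model", vram_gb=0.5, residency=Residency.GPU_SHARED)
plan = serial_handoff(gpu_job, [needs_gpu], hw)
print([m.name for m in plan.unscheduled])  # expected: ['small_gpu_model']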

Pattern B — fractional_coresidency: pack small-VRAM models into a shared budget

When: multiple models each small enough to fit alongside others (e.g. a classifier + an embedder + a re-ranker).

Why: modern research stacks often chain several small models. Running them sequentially on a shared GPU means repeated load/unload cycles. Packing them into one co-resident window avoids the churn and cuts pipeline latency.

from nvinx.patterns import fractional_coresidency

candidates = [
    ModelSpec(name="classifier_a", vram_gb=0.7, residency=Residency.GPU_SHARED),
    ModelSpec(name="embedder",     vram_gb=0.3, residency=Residency.GPU_SHARED),
    ModelSpec(name="reranker",     vram_gb=1.2, residency=Residency.GPU_SHARED),
    ModelSpec(
        name="oversized_model",
        vram_gb=4.0,
        residency=Residency.GPU_SHARED,
        cpu_fallback_supported=True,
    ),
]

plan = fractional_coresidency(candidates, hw)

print([m.name for m in plan.gpu_coresident])
# ['reranker', 'classifier_a', 'embedder']  # packed 2.2 / 3.5 GB after headroom

print([m.name for m in plan.cpu_parallel])
# ['oversized_model']  # falls back to CPU because it doesn't fit

Greedy bin-packing by descending vram_gb. The budget is hw.vram_gb − headroom_gb (default 0.5 GB for activations and kernel launches). GPU_EXCLUSIVE models are never co-resident; CPU_ONLY models always land in cpu_parallel.
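
For intuition, the same decision as a standalone sketch (this is not nvinx's implementation; the helper below is illustrative, and the 0.5 GB default headroom is taken from the text above):

def greedy_pack(vram_requests, vram_gb, headroom_gb=0.5):
    """Illustrative greedy bin-pack: largest request first, within the budget."""
    budget = vram_gb - headroom_gb
    packed, spilled, used = [], [], 0.0
    for name, need in sorted(vram_requests.items(), key=lambda kv: kv[1], reverse=True):
        if used + need <= budget:
            packed.append(name)
            used += need
        else:
            spilled.append(name)
    return packed, spilled, used

packed, spilled, used = greedy_pack(
    {"classifier_a": 0.7, "embedder": 0.3, "reranker": 1.2, "oversized_model": 4.0},
    vram_gb=4.0,
)
print(packed, spilled, round(used, 2))
# ['reranker', 'classifier_a', 'embedder'] ['oversized_model'] 2.2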

Pattern C — ram_overflow: spill layers from VRAM to system RAM

When: one model exceeds the VRAM budget but system RAM is plentiful, and the model supports layer offloading.

Why: a 6 GB model on a 4 GB card normally OOMs on load. With HuggingFace accelerate's device_map="auto", the layers that don't fit in VRAM run from system RAM. The tradeoff is a 3–5× slowdown — which is often strictly better than not running the model at all.

from nvinx.patterns import ram_overflow

big_model = ModelSpec(
    name="big_folder",
    vram_gb=6.0,
    residency=Residency.GPU_RAM_OVERFLOW,
    ram_overflow_supported=True,
    ram_gb_needed=10.0,
)

hint = ram_overflow(big_model, hw)

print(hint["device_map"])   # "auto"
print(hint["max_memory"])   # {0: '3.5GiB', 'cpu': '6.5GiB'}
print(hint["estimated_slowdown"])
# "3-5x vs. pure-GPU inference"

# Pass through to your runtime:
# from transformers import AutoModelForXxx
# AutoModelForXxx.from_pretrained(
#     "model-name",
#     device_map=hint["device_map"],
#     max_memory=hint["max_memory"],
# )

ram_overflow returns a hint dict (not a SchedulingPlan) because the runtime contract is different — you're passing keyword arguments to from_pretrained, not placing multiple models. This is a deliberate asymmetry.
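
The numbers in the hint above are consistent with a simple split: the GPU keeps vram_gb minus the 0.5 GB headroom, and the remainder of ram_gb_needed goes to system RAM. The splitting rule below is inferred from the example output, not documented behaviour:

vram_gb, headroom_gb, ram_gb_needed = 4.0, 0.5, 10.0
gpu_budget = vram_gb - headroom_gb      # 3.5 GiB stays on the GPU
cpu_share = ram_gb_needed - gpu_budget  # 6.5 GiB spills to system RAM

print({0: f"{gpu_budget}GiB", "cpu": f"{cpu_share}GiB"})
# {0: '3.5GiB', 'cpu': '6.5GiB'}  # matches hint["max_memory"] above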


Which pattern do I need?

  • One big GPU job running for hours, plus other work that doesn't need the GPU → Pattern A (serial_handoff)
  • Several models that each fit in VRAM alongside others → Pattern B (fractional_coresidency)
  • One model that exceeds VRAM on load → Pattern C (ram_overflow)
  • Mix: a long GPU job plus multiple small follow-on models on the same GPU → Pattern A for the long job, then Pattern B on the follow-on window
  • Model that doesn't fit and no RAM headroom → none of the patterns — buy more RAM, quantize, or switch models

The three patterns compose: a realistic day on a 4 GB bench runs Pattern A during a nightly basecaller, Pattern B across a morning of small-model inference, and Pattern C whenever a large folding model is needed. nvinx doesn't orchestrate the transitions — it gives you the placement decisions one at a time.
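
A minimal sketch of composing the patterns by hand, using only the API shown above (the window boundaries and model names are illustrative):

from nvinx.catalog import HardwareSpec, ModelSpec, Residency
from nvinx.patterns import fractional_coresidency, serial_handoff

hw = HardwareSpec(vram_gb=4.0, ram_gb=32.0, cpu_cores=8)

# Night: Pattern A around the long basecaller run.
night_plan = serial_handoff(
    ModelSpec(name="basecaller", vram_gb=3.5, residency=Residency.GPU_EXCLUSIVE),
    [ModelSpec(name="report_writer", vram_gb=0.0, residency=Residency.CPU_ONLY)],
    hw,
)

# Morning: Pattern B once the GPU-exclusive window closes.
morning_plan = fractional_coresidency(
    [
        ModelSpec(name="classifier", vram_gb=0.7, residency=Residency.GPU_SHARED),
        ModelSpec(name="embedder", vram_gb=0.3, residency=Residency.GPU_SHARED),
    ],
    hw,
)

# Your runtime executes night_plan, waits for the basecaller to finish,
# then executes morning_plan. nvinx only provides the placements.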


v0.2 (in-tree): interference prediction for Pattern B

Pattern B v0.1 packs models that fit. v0.2 adds interference prediction primitives for operators who want to know whether a packed placement will still meet its SLO before they run it.

The pain v0.2 addresses. Pattern B v0.1 says yes/no on packing — but doesn't tell you whether the packed models will interfere. Two small models that fit in VRAM may still slow each other down 3–5× under co-residency due to GPU kernel-queue contention. The operator wants:

  • "If I pack model A and model B, what latency should I expect?"
  • "Is this placement near the SLO bound?"
  • "Which model will suffer more if they co-reside?"

v0.2 solution (substrate-native). A new module nvinx.interference provides:

  • InterferenceProfile — per-model coefficients (operator-profiled on your substrate)
  • HardwareCoefficients — substrate-level coefficients (one-time per bench)
  • predict_pair_latency_queue_aware() — queue-aware formula: latency_i = act_solo_i × (1 + θ_i × partner_act / (act_solo_i + partner_act)) + scheduling_delay (a standalone sketch follows this list)
  • max_kernel_rate_score() — pre-filter heuristic (Spearman ρ ≈ 0.50 with measured slowdown on the validation bench)
  • asymmetry_predictor() — act_solo_ratio heuristic for which-suffers-more (ρ ≈ 0.72)
  • predict_pair_latency() — tiered: lookup → queue-aware → fallback
  • PairLookupEntry — per-pair measured ground truth (safety net for known high-error pairs)
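
For intuition, a standalone sketch of the queue-aware formula quoted above (illustrative only; the library's predict_pair_latency_queue_aware() signature may differ, and the full prediction also folds in HardwareCoefficients and a scheduling-delay term):

def queue_aware_latency(act_solo_ms, theta, partner_act_ms, scheduling_delay_ms=0.0):
    """Co-resident latency for one model given its partner's solo activity."""
    contention = theta * partner_act_ms / (act_solo_ms + partner_act_ms)
    return act_solo_ms * (1.0 + contention) + scheduling_delay_ms

# Placeholder numbers, chosen only to show the shape of the formula:
print(queue_aware_latency(act_solo_ms=20.0, theta=3.0, partner_act_ms=100.0))
# 70.0  (20.0 × (1 + 3.0 × 100 / 120), before any scheduling delay)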

fractional_coresidency_v2() accepts these as optional inputs and augments the plan's notes with predictions. If you don't supply profiles, it's equivalent to v0.1 fractional_coresidency (no behaviour change).

Honest scope. The queue-aware formula was validated on one substrate: a 4 GB RTX A1000 mobile bench (Ampere sm_86) with a heterogeneous transformer/LLM workload mix (ESM-2-150M at 3 sequence lengths + Qwen-0.5B + Whisper-base). On that substrate's 4-model 6-pair corpus, the formula achieves ~16% LOPO mean error; on the 5-model 10-pair extended corpus, ~25% LOPO mean. Persistent ~30% LOPO outlier on 2-small-kernel pairs (formula limit; lookup safety net handles those).

Substrate-bound. If your bench differs (different GPU, different model class, different driver), the published coefficients are not portable — you must recalibrate. See docs/calibrating-your-substrate.md for the operator workflow.

Code example.

from nvinx.catalog import HardwareSpec, ModelSpec, Residency
from nvinx.interference import HardwareCoefficients, InterferenceProfile
from nvinx.patterns import fractional_coresidency_v2

hw = HardwareSpec(vram_gb=4.0, ram_gb=32.0, cpu_cores=20)

# Substrate-level coefficients (you fit these once via the calibration workflow)
hw_coefs = HardwareCoefficients(
    idlef_polynomial=(6.42, -7.0),
    powerp_linear=(0.0,),
    nominal_freq_mhz=1530.0,
    tdp_watts=40.0,
    substrate_name="rtx_a1000_4gb",
)

# Per-model coefficients (you profile these for each model on your substrate)
profile_a = InterferenceProfile(
    name="model_a",
    kernels=1027, baseidle_ms=0.077, act_solo_ms=21.7,
    l2_saturation_pct=16.8, theta=3.91,
    architecture_class="encoder_transformer",
)
profile_b = InterferenceProfile(
    name="model_b",
    kernels=1026, baseidle_ms=0.149, act_solo_ms=114.9,
    l2_saturation_pct=37.1, theta=1.20,
    architecture_class="encoder_transformer",
)

candidates = [
    ModelSpec(name="model_a", vram_gb=0.6, residency=Residency.GPU_SHARED),
    ModelSpec(name="model_b", vram_gb=0.6, residency=Residency.GPU_SHARED),
]

plan = fractional_coresidency_v2(
    candidates, hw,
    interference_profiles={"model_a": profile_a, "model_b": profile_b},
    hw_coefs=hw_coefs,
    max_kernel_rate_threshold=50.0,
)

for note in plan.notes:
    print(note)
# Pattern B (fractional_coresidency): 2 model(s) co-resident; 1.20/3.50 GB VRAM used (+ 0.5 GB headroom).
# interference: max_kernel_rate=47.3 k/ms (2/2 placed models have profiles)
# interference: pair(model_a+model_b) pred_lat=(86.5, 141.7)ms via queue_aware; asymmetry=5.30

The placement decision is unchanged from v0.1 (greedy bin-pack). The v0.2 additions are diagnostic + advisory — they help you decide whether to accept the placement.
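
Continuing from the example above, one way to act on the advisory notes is to gate on a latency SLO. The note format is taken from the printed output above, and parsing it as a string is illustrative only:

import re

SLO_MS = 120.0
accept = True
for note in plan.notes:
    match = re.search(r"pred_lat=\(([\d.]+), ([\d.]+)\)ms", note)
    if match and max(float(match.group(1)), float(match.group(2))) > SLO_MS:
        accept = False

print("accept placement" if accept else "run the pair sequentially instead")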


Data model

Four types live in src/nvinx/catalog.py:

  • HardwareSpec(vram_gb, ram_gb, cpu_cores) — the physical envelope.
  • ModelSpec(name, vram_gb, residency, cpu_fallback_supported, ram_overflow_supported, ram_gb_needed) — a workload to schedule.
  • Residency — enum: GPU_EXCLUSIVE, GPU_SHARED, CPU_ONLY, GPU_RAM_OVERFLOW.
  • SchedulingPlan — the output: gpu_foreground, gpu_coresident, cpu_parallel, overflow, unscheduled, notes.

Every pattern takes ModelSpecs and a HardwareSpec, returns a SchedulingPlan (or a hint dict, for Pattern C). The pattern functions are pure; they never touch the GPU.


What nvinx is not

  • Not a runtime. nvinx returns placement decisions. You execute them with whatever runtime you already use (PyTorch, HuggingFace transformers, vLLM, custom).
  • Not a job queue. Pattern A assumes you already know which job is GPU-exclusive right now. Higher-level orchestration (what runs first? what triggers the next window?) is out of scope.
  • Not a profiler. ModelSpec.vram_gb is your declaration of the model's VRAM footprint. nvinx trusts it. If you don't know the footprint, measure it first; then call nvinx (a minimal measurement sketch follows this list).
  • Not NVIDIA- or NGINX-specific. The name is a portmanteau. See the disclaimer below.
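
A minimal sketch of measuring a footprint first (assumes PyTorch and a CUDA device; the toy linear layer stands in for your real model):

import torch
from nvinx.catalog import ModelSpec, Residency

torch.cuda.reset_peak_memory_stats()
model = torch.nn.Linear(4096, 4096).to("cuda")     # stand-in for your real model
batch = torch.randn(32, 4096, device="cuda")
with torch.no_grad():
    model(batch)                                   # one representative forward pass
peak_gb = torch.cuda.max_memory_allocated() / 1024**3

spec = ModelSpec(name="toy_linear", vram_gb=round(peak_gb, 2), residency=Residency.GPU_SHARED)
print(spec.vram_gb)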

Algorithmic prior art

nvinx's patterns are not novel algorithms — they are explicit, composable formulations of techniques already documented in the peer-reviewed GPU multi-tenancy and DL-inference-serving literature. The contribution of this package is to bring those techniques to the small-VRAM bioinformatics-bench audience, where they have not yet been packaged for direct use.

  • Pattern B (fractional_coresidency) — algorithmic family established by iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud (Xu et al., IEEE TPDS 2022, 10.1109/TPDS.2022.3232715) and extended by ECLIP: Energy-efficient and Practical Co-Location of ML Inference on Spatially Partitioned GPUs (ISLPED 2025, 10.1109/ISLPED65674.2025.11261793). Both papers establish bin-packing + interference-aware placement for distinct ML inference models sharing a single GPU's VRAM. nvinx's fractional_coresidency is the same algorithmic family applied to small-VRAM bioinformatics workloads (protein language models + classifiers + structure predictors) where the literature is empirically thinner.

  • Pattern A (serial_handoff) — closely related to standard CPU-GPU pipeline overlap patterns documented across genomics-acceleration literature (e.g., GenPIP nanopore pipelining, SquiggleFilter virus detection accelerator). The single-bench framing is what differs; the algorithm itself is the canonical overlap pattern.

  • Pattern C (ram_overflow) — directly invokes HuggingFace accelerate's device_map="auto" mechanism. The contribution is the declarative interface; the offload mechanism is the well-documented accelerate capability for VRAM-exceeding model loading.

If your work cites nvinx for any of these patterns, please cite the underlying algorithmic-family reference as well — nvinx is the deployment-layer formulation, not the algorithmic primary source.


Status and stability

v0.2.0a1 is alpha in-tree (v0.1.0 is the most recent PyPI release). Expect:

  • Additions in v0.2.0a1: substrate-native interference primitives (nvinx.interference), fractional_coresidency_v2, expanded test suite (29 passing).
  • Backward compat: v0.1.0 API surface is unchanged. fractional_coresidency (no profiles) returns the same plan as before; fractional_coresidency_v2 falls back to v0.1 behaviour when interference_profiles=None.
  • Future: configs/YAML-driven spec loading, a scheduler layer that composes patterns across a day's workload, more patterns as edge cases surface.
  • Pin your version. SchedulingPlan may add fields in minor releases.

Contributing — case-study YAMLs wanted

The patterns were derived from one real research workload. They'll become more robust if other people throw their workloads at them and tell us what breaks.

The highest-value contribution right now is a case-study YAML describing your bench:

  • What hardware? (VRAM, RAM, cores)
  • Which models? (name, VRAM footprint, whether they support CPU fallback / RAM overflow)
  • Which pattern did you try, and what happened?
  • What was the old wall-clock, and what did nvinx change?

Submit as a pull request to examples/case_studies/ (or open an issue with the YAML inline). No anonymization requirement — feel free to keep your workload names vague if they're sensitive.

Also welcome:

  • Bug reports with a minimal reproduction
  • New pattern proposals (open an issue first so we can discuss scope before you implement)

Star the repo if you find the reframe useful — it's the signal that tells us to keep extracting patterns from the parent workload.


Running the tests and linters

pip install -e ".[dev]"

pytest              # 29 tests (8 v0.1 patterns + 21 v0.2 interference)
ruff format --check .
ruff check .

29 tests in two files: tests/test_patterns.py (8; v0.1 patterns) and tests/test_interference.py (21; v0.2 interference primitives). CI runs on every push and PR across Python 3.10–3.12.


License

MIT. See LICENSE.


Trademark disclaimer

nvinx is not affiliated with NVIDIA Corporation or NGINX Inc. / F5, Inc. The name is a portmanteau reflecting the nginx-inspired upstream-routing metaphor applied to heterogeneous compute. If trademark concerns surface, the fallback package name is nvginx.
