PyTorch DataLoader with dynamic batch sizing guided by a pre-trained GPU memory regressor.

dynabatch

dynabatch is a drop-in batching utility for variable-length text generation workloads. It first performs Max Token Sampler/Batching by sorting inputs by length, then layers a pre-trained regressor on top to grow the batch size on shorter examples while keeping predicted memory pressure close to that of the first, hardest batch.

It is mainly built and tested for encoder-decoder machine translation style workloads, where input length is a decent proxy for output length and memory usage.

Installation

```shell
pip install dynabatch
```

When dynabatch helps

dynabatch is most useful when:

  • long examples force you to choose a conservative fixed batch size
  • that conservative batch size leaves GPU compute underutilized on the many shorter examples later in the dataset
  • your task has a reasonably predictable relation between input length and generation cost

It is generally a better fit for encoder-decoder models than for decoder-only LLMs. For decoder-only training or inference, sequence packing is often the stronger optimization because it reduces padding waste by filling token slots directly inside packed sequences. dynabatch can still help on decoder-only workloads in some cases, but it is not where I would position the library first.
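
To make the contrast concrete, here is a toy illustration of what sequence packing does; this is my own sketch, not dynabatch code. It greedily concatenates short sequences into fixed-length rows so fewer token positions are wasted on padding:

```python
# Toy sketch of sequence packing (NOT part of dynabatch): greedily place
# sequence lengths into fixed-size slots so fewer rows are padded.
def pack_sequences(lengths, slot_len):
    slots = []  # each slot is a list of sequence lengths sharing one row
    for n in sorted(lengths, reverse=True):
        for slot in slots:
            if sum(slot) + n <= slot_len:
                slot.append(n)  # fits alongside existing sequences
                break
        else:
            slots.append([n])  # needs a fresh row
    return slots

lengths = [200, 180, 60, 50, 40, 30]
packed = pack_sequences(lengths, slot_len=256)
print(len(packed), "rows instead of", len(lengths))  # 3 rows instead of 6
```

Padding-based batching would spend 6 rows of 256 tokens on these examples; packing needs only 3, which is why it is usually the stronger lever for decoder-only workloads.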

This is the common translation scenario:

  • a few very long inputs force a small safe batch size because they eat a lot of VRAM
  • once those hard batches are out of the way, later shorter batches could fit many more examples
  • increasing batch size there improves throughput and reduces wasted padding

It is less useful when the GPU is already compute-bound even at the smallest safe batch size. In that case, making the batch larger does not buy much. If you want to check that, compare:

  • dynamic_batch_mode=True
  • dynamic_batch_mode=False

If both behave similarly, dynabatch is probably not your bottleneck.

Quick Start

dynabatch_sampler is a batch sampler: pass it as DataLoader(..., batch_sampler=sampler) and omit batch_size. Make sure dataset yields items in the same order as texts, and that your tokenizer's truncation length matches max_input_token_length. See notebooks/dynabatch_inference_comparison.ipynb for a full example.

```python
from torch.utils.data import DataLoader
from dynabatch import dynabatch_sampler

sampler = dynabatch_sampler(texts, tokenizer, batch_size=32, max_input_token_length=256)
loader = DataLoader(dataset, batch_sampler=sampler, collate_fn=collate_fn)
```

Or build_dynabatch_dataloader(texts, tokenizer, batch_size=32, max_input_token_length=256) for a built-in loader.

Notebooks

Inference notebook (runs on Colab): inference comparison table

  • Ran on a Colab T4
  • Modest speedups overall
  • Bigger wins on heavy models (e.g. NLLB, Qwen): high memory use forces smaller static batches, leaving the GPU underutilized, so dynamic batching helps more
  • Faster GPUs with the same VRAM may see bigger gains than a T4

Training notebook: not yet available

More Examples

Compare dynamic vs static batching

```python
from torch.utils.data import DataLoader
from dynabatch import dynabatch_sampler

kw = dict(texts=texts, tokenizer=tokenizer, batch_size=32, max_input_token_length=256)
dynamic = DataLoader(
    dataset,
    batch_sampler=dynabatch_sampler(**kw, dynamic_batch_mode=True),
    collate_fn=collate_fn,
)
static = DataLoader(
    dataset,
    batch_sampler=dynabatch_sampler(**kw, dynamic_batch_mode=False),
    collate_fn=collate_fn,
)
```

dynamic_batch_mode=False behaves like Max Token Sampler/Batching without the regressor-driven dynamic resizing. In other words, dynabatch is:

  • Max Token Sampler/Batching
  • plus optional dynamic batch growth on top

That makes dynamic_batch_mode=False useful as a sanity check.
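
For the sanity check, a small timing harness like the one below can be used to compare the two loaders; time_loader and step_fn are my own hypothetical helpers, not part of dynabatch:

```python
import time

def time_loader(loader, step_fn):
    """Time one full pass over a DataLoader-like iterable.

    step_fn stands in for your forward/generate call; it is a
    placeholder, not a dynabatch API.
    """
    start = time.perf_counter()
    steps = 0
    for batch in loader:
        step_fn(batch)
        steps += 1
    return time.perf_counter() - start, steps

# Usage sketch with the `dynamic` and `static` loaders from above:
# t_dyn, _ = time_loader(dynamic, lambda b: model.generate(**b))
# t_sta, _ = time_loader(static, lambda b: model.generate(**b))
```

If t_dyn and t_sta come out close, your workload is likely compute-bound already and the dynamic resizing is not buying much.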

OOM-safe generation with fallback splitting

The regressor is empirical, so it can still occasionally predict a batch size that turns out too aggressive for a specific model, prompt template, GPU state, or generation setting. generate_with_oom_fallback() lets you keep the run alive by splitting only the failed batch into smaller chunks.

```python
import torch
from torch.utils.data import DataLoader
from dynabatch import dynabatch_sampler, generate_with_oom_fallback

loader = DataLoader(
    dataset,
    batch_sampler=dynabatch_sampler(texts, tokenizer, batch_size=32, max_input_token_length=256),
    collate_fn=collate_fn,
)
device = torch.device("cuda")

with torch.inference_mode():
    for batch in loader:
        generated_tokens, did_fallback = generate_with_oom_fallback(
            model, batch, min_batch_size=32, device=device, max_new_tokens=128,
        )

        if did_fallback:
            print("Fallback path used for this batch after an OOM.")
```

This is useful when you want throughput from dynamic batching without letting one occasional OOM kill a long inference run.
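
The splitting idea itself can be sketched in plain Python; run_with_fallback below is a toy stand-in for illustration only (the real helper deals with CUDA out-of-memory errors, and its chunking strategy may differ):

```python
def run_with_fallback(items, run, min_size):
    """Toy sketch: try the whole batch; on an out-of-memory error, split
    it in half and retry each half recursively, down to min_size items."""
    try:
        return run(items), False  # second value: did we fall back?
    except MemoryError:
        if len(items) <= min_size:
            raise  # cannot split further; surface the OOM
        mid = len(items) // 2
        left, _ = run_with_fallback(items[:mid], run, min_size)
        right, _ = run_with_fallback(items[mid:], run, min_size)
        return left + right, True
```

The key property is that only the failing batch pays the retry cost; all other batches keep the throughput benefit of their larger dynamic size.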

Training-style usage

For training:

  • if you want hardware-friendly sizes (2^n or 3 * 2^n), enable friendly_batch_size=True
  • if you want to avoid odd batch sizes, keep keep_batch_size_even=True (the default)
  • if you want shuffled batches, set shuffle=True
  • shuffle_keep_first_n=3 keeps the first 3 hardest batches unshuffled and shuffles only the later batches
  • keeping the earliest, hardest batches fixed lets you hit the worst memory cases early and surface OOM problems sooner

```python
from torch.utils.data import DataLoader
from dynabatch import dynabatch_sampler

train_loader = DataLoader(
    dataset,
    batch_sampler=dynabatch_sampler(
        texts,
        tokenizer,
        batch_size=16,
        max_input_token_length=256,
        friendly_batch_size=True,
        shuffle=True,
        shuffle_keep_first_n=3,
    ),
    collate_fn=collate_fn,
)
```

How It Works

  1. All texts are tokenized up front to estimate truncated token, word, and character lengths.
  2. Samples are sorted by token length from longest to shortest. This part alone is essentially Max Token Sampler/Batching.
  3. The first batch uses exactly batch_size items. This is the hardest batch and becomes the baseline.
  4. For every later batch, dynabatch builds candidate batch sizes from batch_size up to batch_size * max_batch_range.
  5. A pre-trained XGBRegressor predicts memory pressure for each candidate relative to the first batch.
  6. dynabatch chooses the largest candidate whose predicted load is less than or equal to threshold.
  7. If dynamic_batch_mode=False, steps 5 and 6 are skipped and the pipeline reduces to Max Token Sampler/Batching with a fixed batch size.
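
The candidate search in steps 4 through 6 can be sketched as follows; choose_batch_size is my own illustration and predict_load is a toy stand-in for the real pre-trained XGBRegressor:

```python
# Sketch of steps 4-6: scan candidate sizes and keep the largest one
# whose predicted load stays under the threshold.
def choose_batch_size(base_bs, max_batch_range, threshold, predict_load):
    best = base_bs  # fall back to the baseline if nothing passes
    for cand in range(base_bs, int(base_bs * max_batch_range) + 1):
        if predict_load(cand) <= threshold:
            best = cand  # largest passing candidate wins
    return best

# Toy predictor: load grows linearly with candidate size. The real
# regressor uses token/word/character statistics of both batches.
toy_predict = lambda bs: bs * 0.01
print(choose_batch_size(32, 2.0, 0.65, toy_predict))  # prints 64
```

With a stricter threshold such as 0.4, the same search would stop growing at 40, which is the conservatism knob the threshold parameter exposes.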

The important intuition for the regressor's predicted load (and hence the threshold) is:

  • around 1.0 means "about as memory heavy as the first batch"
  • below 1.0 means lighter than the first batch
  • above 1.0 means heavier than the first batch and therefore riskier

So you should choose batch_size as the largest batch of your longest inputs that safely fits on your GPU. The regressor then tries to grow from there when the later inputs get shorter.

API

dynabatch_sampler

Returns DynaBatchSampler for DataLoader(..., batch_sampler=sampler). Same sizing/shuffle kwargs as build_dynabatch_dataloader; dataset indices must match texts.

```python
dynabatch_sampler(
    texts: list[str],
    tokenizer: PreTrainedTokenizerBase,
    batch_size: int,
    max_input_token_length: int = 512,
    threshold: float = 0.65,
    max_batch_range: float = 2.0,
    shuffle: bool = False,
    shuffle_seed: int = 21,
    shuffle_keep_first_n: int = 3,
    friendly_batch_size: bool = False,
    keep_batch_size_even: bool = True,
    num_workers: int = 4,
    debug: bool = False,
    dynamic_batch_mode: bool = True,
    smooth_batches: bool = True,
    smooth_batches_max_diff: float = 0.2,
) -> DynaBatchSampler
```

build_dynabatch_dataloader

Same batching as dynabatch_sampler, returns a DataLoader with built-in collation; extra kwargs go to the tokenizer.

```python
build_dynabatch_dataloader(
    texts: list[str],
    tokenizer: PreTrainedTokenizerBase,
    batch_size: int,
    max_input_token_length: int = 512,
    threshold: float = 0.65,
    max_batch_range: float = 2.0,
    shuffle: bool = False,
    shuffle_seed: int = 21,
    shuffle_keep_first_n: int = 3,
    friendly_batch_size: bool = False,
    keep_batch_size_even: bool = True,
    num_workers: int = 4,
    debug: bool = False,
    dynamic_batch_mode: bool = True,
    smooth_batches: bool = True,
    smooth_batches_max_diff: float = 0.2,
    **tokenizer_kwargs,
) -> DataLoader
```

Parameters:

  • texts: Raw input strings.
  • tokenizer: Any tokenizer compatible with the Hugging Face tokenizer interface.
  • batch_size: Baseline batch size for the first, longest batch. In practice, set this to the largest safe batch size for your worst-case inputs.
  • max_input_token_length: Hard truncation limit used while estimating lengths and later tokenizing the batches.
  • threshold: Maximum allowed regressor prediction for a candidate batch. Roughly, 1.0 means "as memory-heavy as the first batch". Lower values are more conservative.
  • max_batch_range: Upper multiplier for candidate batch sizes relative to batch_size. With batch_size=32 and max_batch_range=2.0, dynabatch will search up to about 64.
  • shuffle: Shuffles the already-built batches. Within a batch, lengths stay similar. This is batch-level shuffling, not full random example-level mixing.
  • shuffle_seed: Seed used when shuffling.
  • shuffle_keep_first_n: Keeps the first few hardest batches in original order before shuffling the rest. For example, 3 means the first 3 longest/hardest batches remain fixed so you can detect early OOM issues quickly.
  • friendly_batch_size: Rounds chosen batch sizes down to hardware-friendly values such as powers of two or 3 * 2^n. Useful for some training setups.
  • keep_batch_size_even: If True, rounds chosen batch sizes to even numbers. Enabled by default and useful for setups that prefer even per-step microbatch sizes.
  • num_workers: Worker count for the length pre-pass (datasets.map) and for the returned DataLoader.
  • debug: Disables parallel workers for the length pass and enables verbose sampler logging.
  • dynamic_batch_mode: If True, uses the regressor to vary batch size. If False, the loader reduces to Max Token Sampler/Batching with a fixed batch size. This is the main switch for testing whether the dynamic part is actually helping your workload.
  • smooth_batches: If True, applies a smoothing pass after dynamic sizing so adjacent batches do not jump too abruptly in size.
  • smooth_batches_max_diff: Largest allowed growth between adjacent batches, as a fraction of batch_size. Example: 0.2 allows at most 0.2 * batch_size extra items per step (still bounded by the maximum batch size).
  • **tokenizer_kwargs: Extra keyword arguments forwarded to the tokenizer during collation (for example truncation=True).

The returned DataLoader yields dictionaries containing input_ids, attention_mask, texts, and any other tokenizer outputs.
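
As an aside, the friendly_batch_size rounding described above can be sketched like this; friendly_floor is my own toy helper under the stated 2^n / 3 * 2^n assumption, not the library's actual rounding code:

```python
def friendly_floor(n):
    """Toy sketch: round n DOWN to the nearest hardware-friendly value,
    i.e. the largest 2**k or 3 * 2**k that does not exceed n."""
    best = 1
    k = 1
    while k <= n:
        best = max(best, k)          # powers of two: 1, 2, 4, 8, ...
        if 3 * k <= n:
            best = max(best, 3 * k)  # 3 * 2**k: 3, 6, 12, 24, ...
        k *= 2
    return best

print([friendly_floor(x) for x in [50, 48, 47, 33, 100]])
# prints [48, 48, 32, 32, 96]
```

Rounding down (never up) keeps the chosen size at or below what the regressor already judged safe.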

Regressor Training

The training pipeline and notebook notes now live in train_regressor/readme.md.

In short:

  • the training data stores real GPU memory usage from many batch configurations
  • the target is memory usage relative to the first batch
  • the notebook trains an XGBRegressor to predict that ratio from token, word, and character statistics of the baseline batch and candidate batch
