Online Dynamic Batching (ODB) — a PyTorch DataLoader-side integration that dynamically groups sequences by length and adjusts batch sizes on-the-fly.

Online Dynamic Batching

Online Dynamic Batching (ODB) speeds up LLM/VLM training with a one-line change to your PyTorch DataLoader.

Replace your PyTorch DataLoader constructor with odb.ODBDataLoader(...) to enable online dynamic batching at the DataLoader boundary. For frameworks that own the DataLoader or Trainer, use one of the methods listed under More Integration Methods.

ODB waits until each sample has passed through the real input pipeline (tokenization, chat templates, image-token expansion, truncation, augmentation) and is ready for collation. It then forms token-budgeted batches online: short examples get larger batches and long examples get smaller ones, while your model, optimizer, attention kernels, and dataset format stay where they are.

ODB is deliberately an adapter-layer package. It does not try to make different multimodal processors, chat templates, or dataset implementations produce identical tensors. Bring your framework's existing Dataset/collator path; ODB starts once that path can emit fully processed single samples.

Figure: ODB online grouping animation.

Paper

ODB is described in the arXiv paper Online Dynamic Batching with Formal Guarantees for LLM Training.

@misc{li2026online,
  title         = {Online Dynamic Batching with Formal Guarantees for LLM Training},
  author        = {Dian Li and Zekun Wang and Yaoru Wang and Jiahong Yan},
  year          = {2026},
  eprint        = {2606.19989},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2606.19989}
}

import odb

# One-line acceleration path: replace DataLoader(...) with ODBDataLoader(...).
dataloader = odb.ODBDataLoader(
    dataset,
    token_budget=16384,
    batch_size=1,
    shuffle=True,
    num_workers=4,
    prefetch_factor=64,
    collate_fn=collate_fn,
    loss_scaling="exact",
    join=True,                  # default; set join=False only when needed
)

for batch in dataloader:
    info = odb.pop_step_info(batch, loss_scaling="exact")
    loss = model(**batch).loss
    loss = loss * info.loss_scale
    loss.backward()

Need a framework-owned DataLoader or Trainer adapter? See More Integration Methods for HuggingFace Trainer, LLaMA-Factory/LLaVA-Factory, Accelerate, and Lightning.

Why ODB

Modern training pipelines often do not know the true training length at dataset index time. A multimodal or instruction-tuning sample may change length after:

  • applying a chat template;
  • expanding images into vision tokens;
  • truncating to a cutoff;
  • adding stochastic augmentation;
  • mixing multiple data sources with different processors.

Classic fixed-size batching wastes padding. Offline length caches can help, but they need a separate preprocessing pass and can go stale when the runtime input pipeline changes. ODB moves batching to the point where real length is already observable: the DataLoader/collate boundary.
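To make the padding cost concrete, here is a minimal sketch that measures how much of a fixed-size batch is padding. The length list and batch size are made-up illustrative values, and padding_fraction is a hypothetical helper, not part of the odb package.

```python
# Illustrative only: measures padding waste for fixed-size batching.
# The lengths and batch size are made-up values, not ODB defaults.

def padding_fraction(lengths, batch_size):
    """Fraction of padded-batch tokens that are padding, for fixed-size batches."""
    real = 0
    padded = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        real += sum(batch)
        # Every sample in the batch is padded to the longest one.
        padded += max(batch) * len(batch)
    return 1.0 - real / padded

# A long-tail mix: mostly short samples with a few very long ones.
lengths = [128, 4096, 256, 192, 3072, 160, 224, 2048]

waste_arrival_order = padding_fraction(lengths, batch_size=4)
waste_length_sorted = padding_fraction(sorted(lengths), batch_size=4)

print(f"arrival order: {waste_arrival_order:.1%} padding")
print(f"length sorted: {waste_length_sorted:.1%} padding")
```

Sorting by length reduces the waste, but it needs the true lengths, which in the pipelines described above are only known after processing. A fixed sample count per batch also still pads wherever short and long samples meet.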

What You Get

  • DataLoader replacement path: use ODBDataLoader(...) when you control DataLoader construction.
  • Existing DataLoader path: use odb.apply(dataloader, ...) when a framework has already created the DataLoader.
  • DDP-ready dynamic batching: ODB aligns grouping across ranks with a small metadata exchange.
  • Default join-mode protocol: strict identity-coverage termination for final DDP training runs; set join=False only for constrained runtimes that cannot support drain-before-finish semantics.
  • Correct loss scaling: odb.pop_step_info(...) returns the current all-rank sample count and the per-rank loss multiplier.
  • Trainer integrations: PyTorch loops, HuggingFace Trainer, LLaMA-Factory-style trainers, Accelerate loops, and Lightning modules.
  • Production-shaped benchmark coverage: text, multimodal, LoRA/full FT, single-node, multi-node, oracle baselines, and high-variance production mixes.

Integration Boundary

ODB operates after your input pipeline has converted raw records into tensors:

raw data
  -> model/framework processor adapter
  -> ODB-ready single-sample tensors
  -> ODB
  -> Trainer/loop

The model/framework processor adapter is where model-specific work belongs: chat templates, tokenization, image or video processors, visual-token expansion, truncation, and label masking. Different models can use different adapters, but they should emit the same ODB contract: a single-sample tensor dict with input_ids, attention_mask, labels, and any model-required multimodal tensors.
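As a rough illustration of that contract, the sketch below checks one processed sample for the three required keys. check_sample is a hypothetical helper, not an odb API, and the equal-length check is an assumption about typical causal-LM samples. Plain lists stand in for tensors so the example runs without PyTorch.

```python
# Hypothetical helper, not part of the odb package: checks that one sample
# looks like the single-sample dict described above. Plain lists stand in
# for tensors here; real samples would hold torch tensors.

REQUIRED_KEYS = ("input_ids", "attention_mask", "labels")

def check_sample(sample):
    """Return a list of problems found in one processed sample (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in sample:
            problems.append(f"missing key: {key}")
    present = [key for key in REQUIRED_KEYS if key in sample]
    lengths = {key: len(sample[key]) for key in present}
    # Assumption: the required fields share one sequence length.
    if len(set(lengths.values())) > 1:
        problems.append(f"length mismatch: {lengths}")
    return problems

good = {
    "input_ids": [1, 15, 27, 2],
    "attention_mask": [1, 1, 1, 1],
    "labels": [-100, -100, 27, 2],   # -100 masks prompt tokens from the loss
}
bad = {"input_ids": [1, 15, 27, 2], "attention_mask": [1, 1, 1]}

print(check_sample(good))
print(check_sample(bad))
```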

The core package and non-LLaMA-Factory adapters do not import or require LLaMA-Factory. HuggingFace Trainer, Accelerate, and Lightning users should keep their own tokenizer/processor/template/collator semantics and use ODB only at the DataLoader/trainer boundary. If you need a paper-style MM-Mix reference, use the separate LLaMA-Factory-based example project.

For raw multimodal records, the framework adapter is not a replacement for a model-specific processor pipeline. Make the Dataset emit ODB-ready tensor samples first, then attach ODB to the DataLoader or Trainer.

Installation

From PyPI:

pip install online-dynamic-batching

# HuggingFace Trainer / LLaMA-Factory adapters
pip install "online-dynamic-batching[hf]"

# Accelerate or Lightning adapters
pip install "online-dynamic-batching[accelerate]"
pip install "online-dynamic-batching[lightning]"

From GitHub:

pip install "online-dynamic-batching @ git+https://github.com/online-dynamic-batching/online-dynamic-batching.git"

Local development:

git clone https://github.com/online-dynamic-batching/online-dynamic-batching.git
cd online-dynamic-batching
pip install -e ".[dev,all]"
pytest

Quick Start

Replace DataLoader Construction

Use this when you own the DataLoader code.

import odb

dataloader = odb.ODBDataLoader(
    dataset,
    token_budget=16384,
    batch_size=1,              # ODB forms the real batch dynamically
    shuffle=True,
    num_workers=4,             # ODB requires worker prefetching
    prefetch_factor=64,
    collate_fn=collate_fn,
    loss_scaling="exact",      # "none", "approx", or "exact"
    join=True,                  # default; set join=False only when needed
)

Patch An Existing DataLoader

Use this when a framework constructs the DataLoader for you.

from torch.utils.data import DataLoader
import odb

dataloader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=4,
    prefetch_factor=64,
    collate_fn=collate_fn,
)

handle = odb.apply(
    dataloader,
    token_budget=16384,
    loss_scaling="exact",
    join=True,                  # default; set join=False only when needed
)

Consume ODB Metadata Before Forward

ODB adds trainer-facing metadata to each yielded batch. Remove it before model(**batch) and use it for correct progress/loss accounting.

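emitted_samples = 0  # running all-rank sample count, for progress and stopping

for batch in dataloader:
    # Strip ODB metadata so that only model inputs remain in `batch`.
    info = odb.pop_step_info(batch, loss_scaling="exact")

    loss = model(**batch).loss
    loss = loss * info.loss_scale
    loss.backward()

    emitted_samples += info.all_samples_this_step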

info.all_samples_this_step is the all-rank emitted sample count for the current micro-step. info.loss_scale is the current-rank multiplier that makes DDP gradient averaging match the intended global sample/token weighting.
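The need for a per-rank multiplier can be seen with a little arithmetic. The sketch below assumes each rank computes a mean loss over its local samples and that DDP averages gradients across ranks. The scaling formula is one way to recover the global mean under those assumptions; it is not taken from the odb source, and the loss values are made up.

```python
# Worked arithmetic, assuming each rank computes a mean loss over its local
# samples and DDP averages gradients across ranks. The formula below is one
# way to recover the global mean; it is not taken from the odb source.

per_rank_losses = [
    [1.0, 2.0, 3.0],   # rank 0 got three short samples
    [10.0],            # rank 1 got one long sample
]
world_size = len(per_rank_losses)
all_samples = sum(len(losses) for losses in per_rank_losses)

local_means = [sum(losses) / len(losses) for losses in per_rank_losses]
global_mean = sum(sum(losses) for losses in per_rank_losses) / all_samples

# Unscaled: DDP averaging weights every rank equally.
naive = sum(local_means) / world_size

# Scaled: each rank multiplies its local mean by world_size * n_local / n_all.
scales = [world_size * len(losses) / all_samples for losses in per_rank_losses]
scaled = sum(m * s for m, s in zip(local_means, scales)) / world_size

print(naive, scaled, global_mean)
```

Without scaling, the single long sample on rank 1 counts as much as the three samples on rank 0 combined. With the multiplier, the cross-rank average equals the mean over all four samples.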

More Integration Methods

Start with ODBDataLoader(...) when you control DataLoader construction. If a framework owns the DataLoader or Trainer, choose one of these alternatives.

| Method | Best For | What ODB Handles |
| --- | --- | --- |
| Patch an existing DataLoader | Framework-created DataLoaders | odb.apply(dataloader, ...) adds ODB without changing the constructor site. |
| Enable HF Trainer | ODB-ready HuggingFace Trainer pipelines | enable_odb(...) wires DataLoader grouping, metadata, and Trainer accounting. |
| Configure an existing Trainer | Existing HuggingFace-style trainer instances needing lower-level control | configure_trainer(...) registers callbacks and loss scaling. |
| Enable LLaMA-Factory | ODB-ready LLaMA-Factory data pipelines | enable_odb(...) wires DataLoader grouping and Trainer accounting. |
| Native trainer/mixin | Framework forks or new trainers | ODBTrainerMixin consumes metadata inside compute_loss. |

HuggingFace Trainer

from transformers import Trainer
from odb.integrations.hf import enable_odb

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
dataloader = trainer.get_train_dataloader()

enable_odb(
    trainer,
    train_dataloader=dataloader,
    train_dataset=dataset,
    token_budget=16384,
    loss_scaling="exact",
    join=True,
)

trainer.train()

Native Trainer Class

from odb.integrations.hf import ODBTrainerMixin

# CustomTrainer stands for your existing Trainer subclass.
class MyTrainer(ODBTrainerMixin, CustomTrainer):
    pass

LLaMA-Factory-Style Trainers

from odb.integrations.llamafactory import enable_odb

enable_odb(
    trainer=trainer,
    train_dataloader=train_dataloader,
    training_args=training_args,
    train_dataset=train_dataset,
    token_budget=16384,
    loss_scaling="exact",
)

The LLaMA-Factory adapter is the complete one-line integration path when the LLaMA-Factory data pipeline already produces ODB-ready single-sample tensor dicts. It validates the DataLoader boundary, enables ODB grouping, and resolves Trainer accounting such as sample-budget stopping, join mode, and exact loss scaling. The HuggingFace Trainer, Accelerate, and Lightning integrations are trainer or loop adapters only; they do not yet include raw-data pipeline adapters of their own.

See docs/integration-guides for PyTorch, HuggingFace Trainer, LLaMA-Factory, Accelerate, and Lightning integration details. The 0.1.1 validation notes are summarized in docs/validation.md.

Try It Without Private Data

Run a CPU/single-GPU synthetic benchmark that compares fixed-size batching and ODB on a long-tail sequence distribution:

python examples/synthetic_benchmark.py --device auto --num-samples 2048

For a copy-paste learning path, open examples/notebooks/odb_single_gpu_demo.ipynb.

How It Works

ODB changes batching without changing your model forward path:

  1. DataLoader workers produce fully processed single samples.
  2. ODB buffers the samples and observes their true runtime lengths.
  3. Samples with similar length are grouped under a token budget.
  4. DDP ranks exchange lightweight grouping metadata.
  5. Your original collate_fn collates each dynamic group.
  6. The trainer consumes ODBStepInfo for progress and loss scaling.

The resulting step size varies in samples but is much more stable in tokens. That is the useful operating point for long-tail instruction and multimodal training.
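Steps 2 and 3 can be sketched as a greedy grouping over a buffer of observed lengths. This is a simplified single-process illustration under stated assumptions, not ODB's algorithm: it ignores DDP alignment, streaming arrival, and join mode. group_by_token_budget is a hypothetical name.

```python
# Simplified single-process sketch of token-budgeted grouping. This is not
# ODB's algorithm: it ignores DDP alignment, streaming, and join mode.

def group_by_token_budget(lengths, token_budget):
    """Greedily group sample indices so padded size stays within the budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups, current = [], []
    for idx in order:
        # Lengths are ascending, so the newest sample sets the padded width.
        padded_tokens = lengths[idx] * (len(current) + 1)
        if current and padded_tokens > token_budget:
            groups.append(current)
            current = []
        # A sample longer than the budget still gets a group of its own.
        current.append(idx)
    if current:
        groups.append(current)
    return groups

buffer_lengths = [128, 4096, 256, 192, 3072, 160, 224, 2048]
groups = group_by_token_budget(buffer_lengths, token_budget=4096)

for group in groups:
    width = max(buffer_lengths[i] for i in group)
    print(f"{len(group)} samples, {width * len(group)} padded tokens")
```

In this toy buffer the five short samples share one group while each long sample forms its own. That is the "larger batches for short examples, smaller batches for long examples" behavior described above.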

API At A Glance

odb.ODBDataLoader(dataset, token_budget=..., **dataloader_kwargs)
odb.apply(dataloader, token_budget=..., loss_scaling="exact")
odb.pop_step_info(batch, loss_scaling="exact")
odb.integrations.hf.configure_trainer(...)
odb.integrations.hf.ODBTrainerMixin
odb.integrations.hf.ODBTrainer
odb.integrations.accelerate.configure_accelerator(...)
odb.integrations.lightning.configure_lightning_module(...)

Key Parameters

| Parameter | Meaning |
| --- | --- |
| token_budget | Target maximum total input length per dynamic group. Legacy name: max_input_length. |
| loss_scaling | "none", "approx", or "exact". Use "exact" for strict token-weighted DDP loss scaling. |
| join | Enables the ODB join-mode protocol; defaults to True. Legacy name: join_mode. |
| buffer_size | Number of prefetched single samples available to the online grouping window. |
| max_patches | Optional multimodal compute cap for image-heavy workloads. |

Benchmark Snapshot

Representative 8xH20 Qwen3-VL full fine-tuning results:

| Workload | Length CV | Standard | ODB | Speedup |
| --- | --- | --- | --- | --- |
| UltraChat 200K, 8B Full FT | 0.48 | 5.77 sam/s | 10.23 sam/s | 1.77x |
| LLaVA 150K, 8B Full FT | 0.29 | 14.38 sam/s | 24.87 sam/s | 1.73x |
| ShareGPT4o 57K, 8B Full FT | 1.00 | 2.37 sam/s | 5.83 sam/s | 2.46x |
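This README does not define the Length CV column. Assuming it is the coefficient of variation of processed sequence lengths (standard deviation divided by mean), a dataset's value can be estimated as sketched below. The speedup helper simply divides ODB throughput by standard throughput. Both function names and the sample lengths are illustrative.

```python
import statistics

def length_cv(lengths):
    """Coefficient of variation: population std of lengths divided by their mean."""
    return statistics.pstdev(lengths) / statistics.fmean(lengths)

def speedup(baseline, candidate):
    """Throughput ratio, e.g. ODB sam/s over standard sam/s."""
    return candidate / baseline

uniform = [512, 512, 512, 512]        # no length variation
long_tail = [128, 160, 192, 4096]     # one very long sample

print(f"uniform CV:   {length_cv(uniform):.2f}")
print(f"long-tail CV: {length_cv(long_tail):.2f}")
print(f"UltraChat speedup: {speedup(5.77, 10.23):.2f}x")
```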

Quality is reported alongside throughput in the paper experiments. The intended claim is a better throughput-quality operating point under variable-length training, not identical optimizer-update geometry.

See docs/benchmarks.md for reporting policy and benchmark notes.

Integration Checklist

Use this as a quick audit before opening a PR in a training stack:

  • DataLoader emits one fully processed sample at a time: batch_size=1.
  • DataLoader uses worker prefetching: num_workers > 0.
  • ODB is applied after the framework has selected sampler/shuffle behavior.
  • Trainer removes ODB metadata before model forward.
  • Trainer uses info.loss_scale when DDP ranks can process different local sample/token counts.
  • Trainer progresses/stops by emitted samples when doing epoch-based training.
  • Default join=True is paired with DDP Join or the framework's equivalent uneven-input handling; use join=False only when that runtime support is not available.
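The first two checklist items can be verified mechanically. The sketch below is a hypothetical audit helper, not an odb API. It reads the batch_size and num_workers attributes that torch.utils.data.DataLoader exposes, and uses SimpleNamespace stand-ins so the example runs without PyTorch.

```python
from types import SimpleNamespace

def audit_dataloader(dataloader):
    """Return checklist violations for a DataLoader-like object."""
    issues = []
    if getattr(dataloader, "batch_size", None) != 1:
        issues.append("batch_size should be 1 so each item is a single sample")
    if getattr(dataloader, "num_workers", 0) <= 0:
        issues.append("num_workers should be > 0 to enable worker prefetching")
    return issues

# Stand-ins with the same attribute names as torch.utils.data.DataLoader.
ok = SimpleNamespace(batch_size=1, num_workers=4)
not_ok = SimpleNamespace(batch_size=8, num_workers=0)

print(audit_dataloader(ok))
print(audit_dataloader(not_ok))
```

Run a check like this on the framework-created DataLoader before attaching ODB. The remaining items concern trainer behavior and need to be reviewed in the training loop itself.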

Project Layout

src/odb/                     # core package
src/odb/integrations/        # trainer adapters
examples/                    # minimal PyTorch/HF examples and synthetic benchmarks
docs/integration-guides/     # framework-specific integration notes
docs/benchmarks.md           # benchmark reporting policy
agent-skills/                # Codex / Claude Code assisted integration skill

Build And Verify

python -m pip install -U build twine
python -m build
python -m twine check dist/*
python -m pip install dist/online_dynamic_batching-*.whl
python -c "import odb; print(odb.__version__)"
pytest

Engineering Roadmap

ODB's roadmap is focused on runtime capabilities: stronger distributed-training semantics, clearer trainer interfaces, additional batching policies, structured observability, and reproducible benchmarking. See ROADMAP.md.

Requirements

  • Python 3.9+
  • PyTorch 2.0+
  • Optional: transformers>=4.40 for HuggingFace Trainer integration

Citation

If you find ODB useful, please cite the technical report:

@techreport{odb2025,
  title = {Online Dynamic Batching: Adaptive Batch Sizing for Variable-Length Sequence Training},
  year = {2025}
}

License

Apache-2.0
