Batched multi-LoRA inference runtime for PyTorch models.

PolyLoRA

Minimal PyTorch runtime for batched LoRA inference where each row can use a different adapter.

PolyLoRA wraps an existing torch.nn.Module, replaces selected nn.Linear layers, and serves PEFT LoRA adapters from CPU, GPU, and optional disk caches.

Install

From a source checkout:

pip install .

With PEFT loading support:

pip install '.[peft]'

Usage

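from polylora import CustomLoraConfig, CustomPeftModel

# base_model is an existing torch.nn.Module; its nn.Linear layers named in target_modules are replaced.
model = CustomPeftModel(
    base_model,
    CustomLoraConfig(
        max_gpu_adapters=4,
        max_rank=16,
        target_modules=["query_proj", "key_proj", "value_proj", "dense"],
    ),
).eval()

# Register each PEFT LoRA adapter directory under the id used at inference time.
model.load_adapter_from_disk("legal", "./adapters/legal")
model.load_adapter_from_disk("finance", "./adapters/finance")

# One adapter id per batch row: row 0 uses "legal", row 1 uses "finance".
outputs = model(**batch, adapter_ids=["legal", "finance"])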

Omit adapter_ids to run the base model on every row. In a mixed batch, pass __base__ as the adapter id for rows that should skip LoRA.
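The per-row convention above can be illustrated with a small standalone sketch. The helper name resolve_adapter_ids and its error handling are assumptions for illustration only and are not part of the PolyLoRA API. It only shows how an omitted adapter_ids and explicit __base__ entries could both reduce to one id per row.

```python
BASE_ID = "__base__"


def resolve_adapter_ids(adapter_ids, batch_size, loaded_ids):
    """Return one adapter id per batch row (hypothetical helper, not PolyLoRA API)."""
    if adapter_ids is None:
        # No ids given: every row runs the base model.
        return [BASE_ID] * batch_size
    if len(adapter_ids) != batch_size:
        raise ValueError(
            f"expected {batch_size} adapter ids, got {len(adapter_ids)}"
        )
    # Assumption: ids that were never loaded are treated as an error.
    unknown = sorted(set(adapter_ids) - set(loaded_ids) - {BASE_ID})
    if unknown:
        raise KeyError(f"adapters not loaded: {unknown}")
    return list(adapter_ids)
```

With "legal" and "finance" loaded, resolve_adapter_ids(["legal", "__base__"], 2, {"legal", "finance"}) keeps the list as given, while passing None yields ["__base__", "__base__"].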

Caches

PolyLoRA uses three adapter tiers:

  • GPU cache: fixed-size adapter slots for the active batch. Slot 0 is reserved for __base__, so non-adapter rows share the same execution path.
  • CPU cache: LRU store for loaded adapter weights. GPU evictions can reload from CPU without touching disk.
  • Disk cache: optional bounded PEFT adapter directory cache. CPU misses can reload adapters from this cold layer.

This makes small hot sets fast while still allowing a larger adapter catalog than GPU memory can hold.
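The lookup order across the three tiers can be sketched in plain Python. This is a toy model under stated assumptions, not PolyLoRA's implementation: the class name TieredAdapterCache, the parameters num_adapter_slots and max_cpu_adapters, the load_from_disk callback, and the choice of LRU eviction for GPU slots are all hypothetical.

```python
from collections import OrderedDict

BASE_ID = "__base__"


class TieredAdapterCache:
    """Toy model of the GPU-slot -> CPU-LRU -> disk lookup order."""

    def __init__(self, num_adapter_slots, max_cpu_adapters, load_from_disk):
        # Slot 0 is reserved for the base model and is never evicted.
        self.gpu_slots = [BASE_ID] + [None] * num_adapter_slots
        self.gpu_lru = OrderedDict()  # adapter id -> slot index, oldest first
        self.cpu = OrderedDict()      # adapter id -> weights, oldest first
        self.max_cpu_adapters = max_cpu_adapters
        self.load_from_disk = load_from_disk  # hypothetical cold-layer loader

    def _cpu_get(self, adapter_id):
        if adapter_id in self.cpu:
            self.cpu.move_to_end(adapter_id)  # mark as recently used
        else:
            # CPU miss: reload from the cold disk layer.
            self.cpu[adapter_id] = self.load_from_disk(adapter_id)
            if len(self.cpu) > self.max_cpu_adapters:
                self.cpu.popitem(last=False)  # drop least recently used
        return self.cpu[adapter_id]

    def slot_for(self, adapter_id):
        """Return the GPU slot index that holds adapter_id, loading it if needed."""
        if adapter_id == BASE_ID:
            return 0
        if adapter_id in self.gpu_lru:
            self.gpu_lru.move_to_end(adapter_id)
            return self.gpu_lru[adapter_id]
        # GPU miss: fetch weights from CPU (or disk). A real runtime would
        # copy them into the slot's device buffers at this point.
        self._cpu_get(adapter_id)
        if None in self.gpu_slots:
            slot = self.gpu_slots.index(None)
        else:
            # Assumed policy: evict the least recently used GPU adapter.
            _, slot = self.gpu_lru.popitem(last=False)
        self.gpu_slots[slot] = adapter_id
        self.gpu_lru[adapter_id] = slot
        return slot
```

In this sketch an adapter evicted from a GPU slot stays in the CPU store, so requesting it again refills a slot without calling load_from_disk a second time. The sketch does not guard against evicting an adapter that the current batch still needs, which a real runtime would have to handle.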

Kernels

On CUDA, PolyLoRA uses Triton SGMV kernels for the LoRA A and B projections:

  • Mixed batches can contain different adapter ids, including __base__ rows.
  • Different adapters may use different ranks, up to max_rank.
  • Rank-0 rows skip adapter work, which is how base-only rows and missing layer weights are represented.
  • The B projection fuses scaling and add-back into the base linear output.
  • The implementation falls back to a PyTorch reference path on CPU or when Triton is disabled.
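The per-row behaviour listed above can be written as a dependency-free reference. This is a sketch of the arithmetic only, using plain Python lists instead of tensors; it is not the Triton kernel or PolyLoRA's PyTorch fallback. The function names are hypothetical, and scaling stands in for the per-adapter LoRA scale factor (lora_alpha / r in standard PEFT LoRA).

```python
BASE_ID = "__base__"
RANK_ZERO = ([], [], 0.0)  # no A rows, no B columns: nothing to add


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def batched_lora_linear(rows, adapter_ids, base_weight, adapters):
    """Apply y = W x + scaling * B (A x) with a possibly different adapter per row.

    adapters maps an adapter id to (A, B, scaling), where A has shape
    rank x in_features and B has shape out_features x rank.
    """
    outputs = []
    for x, adapter_id in zip(rows, adapter_ids):
        y = matvec(base_weight, x)  # base linear output
        if adapter_id == BASE_ID:
            A, B, scaling = RANK_ZERO
        else:
            A, B, scaling = adapters[adapter_id]
        if A:  # rank > 0; rank-0 rows skip the adapter work entirely
            h = matvec(A, x)      # A projection: in_features -> rank
            delta = matvec(B, h)  # B projection: rank -> out_features
            # Scaling and add-back onto the base output.
            y = [yi + scaling * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs
```

Each adapter carries its own rank here, so a rank-1 and a rank-2 adapter can share one batch with a __base__ row. The kernels described above batch this work on the GPU, whereas the loop processes one row at a time.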

Adapter Layouts

Adapters do not need to cover every wrapped layer. If a model is wrapped with a larger target_modules set and an adapter only contains LoRA weights for some of those layers, missing layers are treated as rank-0 no-ops for that adapter.

PolyLoRA rejects adapters with weights outside the configured module set, which keeps mixed adapters predictable when different adapters target different subsets of the model.
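Both rules can be summarised in a short sketch. The function plan_adapter_layout is hypothetical and uses simplified layer names; how PolyLoRA matches adapter weight keys to wrapped layers is not shown in this README.

```python
def plan_adapter_layout(wrapped_modules, adapter_ranks):
    """Map every wrapped layer to the rank this adapter provides for it.

    wrapped_modules: names of the layers wrapped at model construction.
    adapter_ranks: layer name -> LoRA rank found in the adapter.
    """
    extra = sorted(set(adapter_ranks) - set(wrapped_modules))
    if extra:
        # Weights outside the configured module set are rejected.
        raise ValueError(f"adapter targets unwrapped modules: {extra}")
    # Layers the adapter does not cover become rank-0 no-ops.
    return {name: adapter_ranks.get(name, 0) for name in wrapped_modules}
```

For a model wrapped with query_proj, key_proj, value_proj, and dense, an adapter that only carries rank-8 weights for query_proj and value_proj would map key_proj and dense to rank 0.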

Notes

  • Supports standard PEFT LoRA adapters for inference.
  • Does not support LoRA dropout, DoRA, RS-LoRA, or LoRA bias.
  • Attention masks must be right-padded (padding only at the end of each row) when enforce_right_padding=True.
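The right-padding requirement can be checked before calling the model. The sketch below is an illustration on nested Python lists of 0/1 values; the helper name is_right_padded is hypothetical, and this README does not show how PolyLoRA itself performs the check.

```python
def is_right_padded(attention_mask):
    """Return True if every row is a run of 1s followed only by 0s."""
    for row in attention_mask:
        seen_padding = False
        for value in row:
            if value == 0:
                seen_padding = True
            elif seen_padding:
                # A real token after padding: left or interior padding.
                return False
    return True
```

A mask such as [[1, 1, 1, 0], [1, 1, 0, 0]] passes, while a left-padded row such as [0, 1, 1, 1] does not.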

Development

pip install -e '.[dev]'
pytest tests
