Batched multi-LoRA inference runtime for PyTorch models.

PolyLoRA

Minimal PyTorch runtime for batched LoRA inference where each row can use a different adapter.

PolyLoRA wraps an existing torch.nn.Module, replaces selected nn.Linear layers, and serves PEFT LoRA adapters from CPU, GPU, and optional disk caches.

Install

pip install .

With PEFT loading support:

pip install '.[peft]'

Usage

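from polylora import PolyLoraConfig, PolyLoraModel

# base_model is an existing torch.nn.Module that you have already constructed.
model = PolyLoraModel(
    base_model,
    PolyLoraConfig(
        max_gpu_adapters=4,
        max_rank=16,  # adapters may use any rank up to this value
        # names of the nn.Linear layers to replace
        target_modules=["query_proj", "key_proj", "value_proj", "dense"],
    ),
).eval()

# Register PEFT LoRA adapters under the ids used at inference time.
model.load_adapter_from_disk("legal", "./adapters/legal")
model.load_adapter_from_disk("finance", "./adapters/finance")

# batch holds the base model's usual forward() inputs; adapter_ids has one id per row.
outputs = model(**batch, adapter_ids=["legal", "finance"])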

Omit adapter_ids to run the base model. Use __base__ for rows that should skip LoRA inside a mixed batch.
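
As a rough illustration of the per-row convention above, the sketch below maps a batch's adapter ids to GPU slot indices, with slot 0 standing for __base__ as described under Caches. The names rows_to_slots and slot_of are hypothetical and are not part of the PolyLoRA API; the real runtime may resolve ids differently.

```python
BASE_ID = "__base__"


def rows_to_slots(adapter_ids, batch_size, slot_of):
    """Map one adapter id per row to a slot index (hypothetical helper).

    slot_of is an assumed dict from adapter id to its current GPU slot.
    """
    if adapter_ids is None:
        # No ids given: every row runs the base model.
        adapter_ids = [BASE_ID] * batch_size
    if len(adapter_ids) != batch_size:
        raise ValueError("expected one adapter id per batch row")
    # Slot 0 is the base slot; all other ids use their assigned slot.
    return [0 if a == BASE_ID else slot_of[a] for a in adapter_ids]
```

Under these assumptions, a mixed batch such as ["legal", "__base__", "finance"] resolves to one slot index per row, and omitting the ids yields all zeros.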

Caches

PolyLoRA uses three adapter tiers:

  • GPU cache: fixed-size adapter slots for the active batch. Slot 0 is reserved for __base__, so non-adapter rows share the same execution path.
  • CPU cache: LRU store for loaded adapter weights. GPU evictions can reload from CPU without touching disk.
  • Disk cache: optional, size-bounded cache of PEFT adapter directories. On a CPU miss, adapters can be reloaded from this cold tier.

The tiering keeps a small hot set of adapters fast while still allowing an adapter catalog larger than GPU memory can hold.
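
The lookup order across the tiers can be sketched in plain Python. This is a minimal illustration and not PolyLoRA's implementation: LruTier, TieredAdapterCache, and disk_loader are hypothetical names, weights are stand-in values rather than tensors, and the GPU tier is modelled as a simple LRU instead of fixed slots with a reserved slot 0.

```python
from collections import OrderedDict

BASE_ID = "__base__"


class LruTier:
    """Bounded LRU mapping from adapter id to weights (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


class TieredAdapterCache:
    """GPU -> CPU -> disk lookup, assuming disk_loader(adapter_id) returns weights."""

    def __init__(self, gpu_slots, cpu_capacity, disk_loader):
        self.gpu = LruTier(gpu_slots)
        self.cpu = LruTier(cpu_capacity)
        self.disk_loader = disk_loader
        self.disk_reads = 0

    def fetch(self, adapter_id):
        if adapter_id == BASE_ID:
            return None  # base rows need no adapter weights
        weights = self.gpu.get(adapter_id)
        if weights is not None:
            return weights  # hot path: already resident on the GPU tier
        weights = self.cpu.get(adapter_id)
        if weights is None:
            # CPU miss: fall through to the cold tier and keep a CPU copy.
            weights = self.disk_loader(adapter_id)
            self.disk_reads += 1
            self.cpu.put(adapter_id, weights)
        self.gpu.put(adapter_id, weights)  # promote, possibly evicting another adapter
        return weights
```

With a single GPU slot, alternating between two adapters evicts one from the GPU tier on every switch, but after the first load each adapter comes back from the CPU tier without another disk read.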

Kernels

On CUDA, PolyLoRA uses Triton SGMV kernels for the LoRA A and B projections:

  • Mixed batches can contain different adapter ids, including __base__ rows.
  • Different adapters may use different ranks, up to max_rank.
  • Rank-0 rows skip adapter work, which is how base-only rows and missing layer weights are represented.
  • The B projection fuses scaling and add-back into the base linear output.
  • The implementation falls back to a PyTorch reference path on CPU or when Triton is disabled.
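
The arithmetic the kernels perform per row can be written as a small reference loop. The sketch below uses plain Python lists rather than tensors so that it runs anywhere; lora_rows, matvec, and the (A, B, scale) tuple layout are assumptions for illustration, not PolyLoRA's fallback path. The scale stands for the usual LoRA scaling factor (in PEFT, lora_alpha divided by the rank).

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_rows(base_out, inputs, adapter_ids, adapters):
    """Add each row's LoRA delta to the base linear output.

    adapters maps an adapter id to (A, B, scale), where A is rank x in_features
    and B is out_features x rank. Ids that are absent, or whose A has rank 0,
    are treated as no-ops, which covers __base__ rows.
    """
    outputs = []
    for y, x, adapter_id in zip(base_out, inputs, adapter_ids):
        entry = adapters.get(adapter_id)
        if entry is None or len(entry[0]) == 0:
            outputs.append(list(y))  # rank 0: skip adapter work
            continue
        a, b, scale = entry
        hidden = matvec(a, x)       # A projection: in_features -> rank
        delta = matvec(b, hidden)   # B projection: rank -> out_features
        # Scaling and add-back, which the fused B kernel does in one step.
        outputs.append([yi + scale * di for yi, di in zip(y, delta)])
    return outputs
```

Each row picks its own adapter, so adapters with different ranks can share one batch as long as every rank fits within max_rank.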

Adapter Layouts

Adapters do not need to cover every wrapped layer. If a model is wrapped with a larger target_modules set and an adapter only contains LoRA weights for some of those layers, missing layers are treated as rank-0 no-ops for that adapter.

PolyLoRA rejects adapters with weights outside the configured module set, which keeps mixed adapters predictable when different adapters target different subsets of the model.
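
The two layout rules above (missing layers become rank-0 no-ops, extra layers are rejected) can be expressed as a short check. This is a sketch under stated assumptions: plan_adapter_layers is a hypothetical name, and the adapter is represented as a dict from layer name to rank rather than as real PEFT weights.

```python
def plan_adapter_layers(wrapped_layers, adapter_ranks):
    """Return the rank to use for each wrapped layer (hypothetical helper).

    wrapped_layers: names of the layers the model was wrapped with.
    adapter_ranks: dict from layer name to the adapter's LoRA rank for that layer.
    """
    unknown = sorted(set(adapter_ranks) - set(wrapped_layers))
    if unknown:
        # Weights outside the configured module set are rejected.
        raise ValueError(f"adapter has weights outside the wrapped layers: {unknown}")
    # Layers the adapter does not cover get rank 0, i.e. a no-op.
    return {name: adapter_ranks.get(name, 0) for name in wrapped_layers}
```

For example, an adapter that only covers query_proj and value_proj would receive rank 0 for key_proj and dense, while an adapter carrying weights for an unwrapped layer would raise an error.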

Notes

  • Supports standard PEFT LoRA adapters for inference.
  • Does not support LoRA dropout, DoRA, RS-LoRA, or LoRA bias.
  • Attention masks must be right-padded when enforce_right_padding=True.
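
To check a batch before passing it in, a right-padding test can be sketched as follows. The function is_right_padded is a hypothetical helper, not part of PolyLoRA, and it assumes the common convention that 1 marks a real token and 0 marks padding; it works on nested lists, so a tensor mask would need converting first (for example with .tolist()).

```python
def is_right_padded(attention_mask):
    """Return True if every row is a run of 1s followed only by 0s."""
    for row in attention_mask:
        seen_pad = False
        for value in row:
            if value == 0:
                seen_pad = True
            elif seen_pad:
                # A real token after padding means the row is not right-padded.
                return False
    return True
```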

Development

pip install -e '.[dev]'
pytest tests
