Batched multi-LoRA inference runtime for PyTorch models.
PolyLoRA
Minimal PyTorch runtime for batched LoRA inference where each row can use a different adapter.
PolyLoRA wraps an existing torch.nn.Module, replaces selected nn.Linear layers, and serves PEFT LoRA adapters from CPU, GPU, and optional disk caches.
Install
pip install .
With PEFT loading support:
pip install '.[peft]'
Usage
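from polylora import CustomLoraConfig, CustomPeftModel

# base_model is an existing torch.nn.Module; target_modules names the nn.Linear layers to wrap.
model = CustomPeftModel(
    base_model,
    CustomLoraConfig(
        max_gpu_adapters=4,
        max_rank=16,
        target_modules=["query_proj", "key_proj", "value_proj", "dense"],
    ),
).eval()

# Register two PEFT LoRA adapters under the ids used at inference time.
model.load_adapter_from_disk("legal", "./adapters/legal")
model.load_adapter_from_disk("finance", "./adapters/finance")

# One adapter id per batch row: row 0 uses "legal", row 1 uses "finance".
outputs = model(**batch, adapter_ids=["legal", "finance"])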
Omit adapter_ids to run the base model. Use __base__ for rows that should skip LoRA inside a mixed batch.
Caches
PolyLoRA uses three adapter tiers:
- GPU cache: fixed-size adapter slots for the active batch. Slot 0 is reserved for __base__, so non-adapter rows share the same execution path.
- CPU cache: LRU store for loaded adapter weights. GPU evictions can reload from CPU without touching disk.
- Disk cache: optional bounded PEFT adapter directory cache. CPU misses can reload adapters from this cold layer.
This makes small hot sets fast while still allowing a larger adapter catalog than GPU memory can hold.
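The lookup order across the tiers can be illustrated with a small sketch. It is not PolyLoRA's implementation: the class name, the `slot_for` method, the `load_from_disk` callback, and the choice to count `max_gpu_adapters` as slots in addition to slot 0 are all assumptions made for illustration, and plain dictionaries stand in for device memory.

```python
from collections import OrderedDict


class TieredAdapterCache:
    """Toy three-tier lookup: GPU slots -> CPU LRU -> disk loader (hypothetical names)."""

    BASE = "__base__"

    def __init__(self, max_gpu_adapters, max_cpu_adapters, load_from_disk):
        # Assumption: slot 0 is pinned to the base model, so only slots
        # 1..max_gpu_adapters are handed out and evicted.
        self.gpu = OrderedDict()   # adapter_id -> slot index, least recently used first
        self.free_slots = list(range(max_gpu_adapters, 0, -1))
        self.cpu = OrderedDict()   # adapter_id -> weights, least recently used first
        self.max_cpu = max_cpu_adapters
        self.load_from_disk = load_from_disk
        self.disk_reads = 0

    def _ensure_on_cpu(self, adapter_id):
        if adapter_id in self.cpu:
            self.cpu.move_to_end(adapter_id)       # mark as most recently used
            return
        self.cpu[adapter_id] = self.load_from_disk(adapter_id)  # cold path
        self.disk_reads += 1
        if len(self.cpu) > self.max_cpu:
            self.cpu.popitem(last=False)           # drop the least recently used entry

    def slot_for(self, adapter_id):
        """Return the GPU slot for adapter_id, promoting it through the tiers if needed."""
        if adapter_id == self.BASE:
            return 0
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]
        self._ensure_on_cpu(adapter_id)
        if self.free_slots:
            slot = self.free_slots.pop()
        else:
            _, slot = self.gpu.popitem(last=False)  # evict the LRU adapter, reuse its slot
        self.gpu[adapter_id] = slot
        return slot


demo = TieredAdapterCache(2, 4, load_from_disk=lambda name: {"name": name})
print([demo.slot_for(a) for a in ["legal", "__base__", "finance", "medical"]])
# → [1, 0, 2, 1]  ("medical" takes over the slot of the evicted "legal")
```

In this sketch an adapter evicted from a GPU slot stays in the CPU store, so bringing it back costs no disk read.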
Kernels
On CUDA, PolyLoRA uses Triton SGMV kernels for the LoRA A and B projections:
- Mixed batches can contain different adapter ids, including __base__ rows.
- Different adapters may use different ranks, up to max_rank.
- Rank-0 rows skip adapter work, which is how base-only rows and missing layer weights are represented.
- The B projection fuses scaling and add-back into the base linear output.
- The implementation falls back to a PyTorch reference path on CPU or when Triton is disabled.
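The per-row semantics listed above can be written out as a small reference. This is a sketch of the arithmetic only, not the Triton kernels or PolyLoRA's fallback path: it loops over rows instead of using a segmented kernel, uses plain Python lists in place of tensors so it has no dependencies, and the function names and the `adapters` mapping layout are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def multi_lora_linear(rows, base_weight, adapters, adapter_ids):
    """Compute y = W x + scale * B (A x) per row, picking (A, B, scale) by adapter id.

    adapters maps an id to (A, B, scale); A is [rank, in] and B is [out, rank].
    An id missing from the mapping, such as "__base__", is treated as rank 0.
    """
    outputs = []
    for x, adapter_id in zip(rows, adapter_ids):
        y = matvec(base_weight, x)                  # base linear output
        a, b, scale = adapters.get(adapter_id, ([], [], 0.0))
        if a:                                       # rank-0 rows skip all adapter work
            hidden = matvec(a, x)                   # A projection: in -> rank
            delta = matvec(b, hidden)               # B projection: rank -> out
            y = [yi + scale * di for yi, di in zip(y, delta)]  # scale and add back
        outputs.append(y)
    return outputs
```

Because each row looks up its own `(A, B, scale)`, adapters with different ranks can share a batch, and a `__base__` row simply returns the base linear output.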
Adapter Layouts
Adapters do not need to cover every wrapped layer. If a model is wrapped with a larger target_modules set and an adapter only contains LoRA weights for some of those layers, missing layers are treated as rank-0 no-ops for that adapter.
PolyLoRA rejects adapters with weights outside the configured module set, which keeps mixed adapters predictable when different adapters target different subsets of the model.
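The two rules can be summarized in a few lines. The helper below is a hypothetical illustration, not part of the PolyLoRA API: it assumes the adapter has already been reduced to a mapping from module name to LoRA rank, fills uncovered modules with rank 0, and raises on modules that were never wrapped.

```python
def plan_adapter_layout(adapter_ranks, wrapped_modules):
    """Map every wrapped module to a rank, using 0 where the adapter has no weights.

    adapter_ranks: module name -> LoRA rank found in the adapter (assumed input shape).
    wrapped_modules: module names that were wrapped when the model was built.
    """
    unexpected = sorted(set(adapter_ranks) - set(wrapped_modules))
    if unexpected:
        raise ValueError(f"adapter has weights for unwrapped modules: {unexpected}")
    return {name: adapter_ranks.get(name, 0) for name in wrapped_modules}
```

For a model wrapped with `["query_proj", "key_proj", "value_proj", "dense"]`, an adapter that only covers `query_proj` and `value_proj` would get rank 0 for the other two modules, while an adapter carrying weights for an unlisted module would be rejected.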
Notes
- Supports standard PEFT LoRA adapters for inference.
- Does not support LoRA dropout, DoRA, RS-LoRA, or LoRA bias.
- Attention masks must be right padded when enforce_right_padding=True.
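Right padding means every mask row is a run of 1s followed only by 0s. The check below is a minimal sketch of that property on nested lists; the function name is hypothetical, and how PolyLoRA itself validates or reacts to a violating mask is not described here.

```python
def is_right_padded(attention_mask):
    """Return True if every row is a run of 1s followed only by 0s."""
    for row in attention_mask:
        seen_padding = False
        for value in row:
            if value == 0:
                seen_padding = True
            elif seen_padding:   # a real token after padding: left or interior padding
                return False
    return True
```

A mask such as `[[1, 1, 0, 0]]` passes, while `[[0, 0, 1, 1]]` (left padded) does not.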
Development
pip install -e '.[dev]'
pytest tests
Download files
File details
Details for the file polylora-0.1.0.tar.gz.
File metadata
- Download URL: polylora-0.1.0.tar.gz
- Upload date:
- Size: 18.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1f8f0a764575e93640b29c63f3d6e8734cad9aa63bbeadf455e2e3ec73f60327 |
| MD5 | d185d65287219165712f6c5e0f2dfa8c |
| BLAKE2b-256 | 5e525b8c5bbb7958912d9e5391605bb2eb9b36cd15ef5ade30081b28f5ec7c9d |
File details
Details for the file polylora-0.1.0-py3-none-any.whl.
File metadata
- Download URL: polylora-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05711141745ae2075c9935c00a5c66a43c98256200169bee0d8cd9c2f8b25834 |
| MD5 | a81cc453d1fafd3bf7a81f78539e2417 |
| BLAKE2b-256 | a7fdf306e8d85504c5a8dbdd6ce700b1663b66e3b7a9101221923481d4b8fe90 |