Python client for Anchor — PaliGemma2 multi-LoRA vision inference
Anchor
PaliGemma2 multi-LoRA serving with OpenAI-compatible API.
Load multiple LoRA adapters once. Switch between them at inference time — 216ms, no reload.
                   ┌──────────────────────────────────┐
Request            │  Anchor                          │
model="short" ───▶ │                                  │
                   │  PaliGemma2 base (VRAM)          │
                   │   ├── adapter: missing_hole      │──▶ "YES / NO"
                   │   ├── adapter: open_circuit      │
                   │   ├── adapter: short ◀───────────│  pointer swap
                   │   ├── adapter: mouse_bite        │  216ms
                   │   └── adapter: spur              │
                   └──────────────────────────────────┘
# Call the open_circuit adapter
curl https://your-anchor-endpoint/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "open_circuit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "text", "text": "Does this PCB have an open circuit defect? Answer YES or NO."}
      ]
    }],
    "max_tokens": 3
  }'
Python Client
pip install anchor-vision
from anchor_vision import AnchorClient
client = AnchorClient("https://your-anchor.run.app")
result = client.inspect("image.jpg", adapter="open_circuit")
print(result.answer) # "YES"
print(result.latency_ms) # 216
Quick Demo
# 1. Clone and build
git clone https://github.com/recursia-lab/anchor
docker build -t anchor .
# 2. Run (mount your model and adapters)
docker run --gpus all \
  -v /path/to/paligemma2:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 anchor
# 3. Query any adapter by name
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"open_circuit","messages":[{"role":"user","content":[
    {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,<b64>"}},
    {"type":"text","text":"Defect present? YES or NO."}
  ]}],"max_tokens":3}'
# → {"choices":[{"message":{"content":"YES"}}],"usage":{"latency_ms":216}}
Why Anchor
Most serving frameworks load LoRA adapters per request — fetching from disk or swapping from CPU at inference time. For production workloads where multiple fine-tuned adapters are in active use, this adds hundreds of milliseconds per request.
Anchor takes a different approach: all adapters live in GPU memory simultaneously. Switching is a pointer swap — 216ms, no disk I/O, no model reload.
| Framework | PaliGemma2 LoRA | Multi-adapter | Dynamic switch |
|---|---|---|---|
| Anchor | ✅ | ✅ all in VRAM | ✅ 216ms |
| vLLM | ✅ (since v0.7.0) | ✅ | per-request load |
| SGLang | 🚧 PR #24034 | — | — |
| Unsloth | 🚧 PR #5218 | — | fine-tune only |
| Ollama | ❌ | — | — |
| TGI / LoRAX | ❌ | — | — |
When to use Anchor: production scenarios with 2–10 adapters that all need low-latency access. When one adapter is enough, vLLM works fine.
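The difference is easy to see from the client side. The sketch below alternates adapters across consecutive requests, exactly the pattern where a per-request loader pays a reload penalty each time. The endpoint URL and image path are placeholders, and the timing loop is illustrative rather than a benchmark.
import time
from anchor_vision import AnchorClient

client = AnchorClient("https://your-anchor.run.app")

# Alternating adapters between requests incurs no reload penalty,
# since every adapter is already resident in VRAM.
for adapter in ["open_circuit", "short", "open_circuit", "mouse_bite"]:
    t0 = time.perf_counter()
    result = client.inspect("image.jpg", adapter=adapter)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    print(f"{adapter}: {result.answer} ({elapsed_ms:.0f} ms)")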
Architecture
/model ← PaliGemma2 base (bfloat16, device_map=auto)
/lora/
adapter_1/ ← PEFT LoRA adapter (loaded via load_adapter)
adapter_2/
adapter_3/
Request: model="adapter_1" → set_adapter("adapter_1") → generate() → 216ms
Request: model="adapter_2" → set_adapter("adapter_2") → generate() → 216ms
Request: model="base" → disable_adapters() → generate()
All adapters stay in VRAM. Switching is just a pointer swap — no disk I/O, no model reload.
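For concreteness, here is a minimal sketch of this switching pattern using the transformers PEFT integration (the load_adapter / set_adapter / disable_adapters calls named above). It illustrates the idea under stated assumptions and is not the actual server.py; in particular, the enable_adapters() call and the generic answer() helper are additions for the sketch.
import os
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

MODEL_PATH = os.environ.get("MODEL_PATH", "/model")
LORA_PATH = os.environ.get("LORA_PATH", "/lora")

# Base model loads once: bfloat16, sharded across available devices.
# The PaliGemma class also loads PaliGemma2 checkpoints.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load every adapter subfolder once at startup; all weights stay in VRAM.
for name in sorted(os.listdir(LORA_PATH)):
    model.load_adapter(os.path.join(LORA_PATH, name), adapter_name=name)

def answer(image, prompt, adapter="base"):
    if adapter == "base":
        model.disable_adapters()    # bare base model, no LoRA applied
    else:
        model.enable_adapters()     # assumption: re-enable after a "base" call
        model.set_adapter(adapter)  # pointer swap, no weight load
    # Note: recent transformers releases expect an explicit <image> token
    # in PaliGemma prompts; older ones insert it automatically.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, dtype=torch.bfloat16  # casts only floating-point tensors
    )
    out = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)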
Quick Start
Python (pip)
pip install anchor-vision
from anchor_vision import AnchorClient
client = AnchorClient("https://your-anchor.run.app")
# List loaded adapters
print(client.list_adapters()) # ["open_circuit", "short", "mouse_bite", ...]
# Run inference
result = client.inspect(
"image.jpg",
adapter="open_circuit",
prompt="Is there an open circuit defect? Answer YES or NO.",
)
print(result) # "YES"
LangChain
pip install 'anchor-vision[langchain]'
from anchor_vision import AnchorVisionTool
tool = AnchorVisionTool(
endpoint="https://your-anchor.run.app",
adapter="open_circuit",
prompt="Is there a defect? Answer YES or NO.",
)
result = tool.invoke({"image_path": "image.jpg"})
# → "YES"
# Drop into any LangChain agent
# agent = initialize_agent(tools=[tool], ...)
Local (GPU required)
# 1. Clone
git clone https://github.com/recursia-lab/anchor
cd anchor
# 2. Install
pip install -r requirements.txt
# 3. Place model and adapters
# /model → PaliGemma2 weights (from HuggingFace or your fine-tune)
# /lora/ → one subfolder per adapter
MODEL_PATH=/path/to/model LORA_PATH=/path/to/lora python server.py
Docker
docker build -t anchor .
docker run --gpus all \
  -v /path/to/model:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 \
  anchor
Google Cloud Run (GPU)
# Edit cloudbuild.yaml substitutions, then:
gcloud builds submit --config cloudbuild.yaml
gcloud beta run deploy anchor \
  --image YOUR_IMAGE \
  --region us-east4 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=8 --memory=32Gi \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --min-instances=0 \
  --startup-probe="tcpSocket.port=8080,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=1"
API
GET /health
{"status": "ok", "adapters": ["open_circuit", "short", "mouse_bite"]}
GET /v1/models
Lists all loaded adapters in OpenAI format.
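A quick way to exercise both read-only endpoints from Python (a sketch using the requests package; the endpoint URL is a placeholder):
import requests

base = "https://your-anchor.run.app"
print(requests.get(f"{base}/health").json())
# {"status": "ok", "adapters": ["open_circuit", "short", "mouse_bite"]}
print(requests.get(f"{base}/v1/models").json())  # OpenAI-style model list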
POST /v1/chat/completions
OpenAI-compatible. Use the model field to select an adapter.
Request:
{
  "model": "open_circuit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<b64>"}},
      {"type": "text", "text": "<your prompt>"}
    ]
  }],
  "max_tokens": 10
}
Response:
{
  "model": "open_circuit",
  "choices": [{"message": {"role": "assistant", "content": "YES"}}],
  "usage": {"prompt_tokens": 271, "completion_tokens": 1, "latency_ms": 216}
}
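Because the endpoint is OpenAI-compatible, the official openai Python client can be pointed at it directly. A sketch: the base_url is a placeholder, and api_key is a dummy value since this README does not describe authentication.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-anchor.run.app/v1", api_key="unused")

with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="open_circuit",  # the model field selects the adapter
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text",
             "text": "Does this PCB have an open circuit defect? Answer YES or NO."},
        ],
    }],
    max_tokens=3,
)
print(resp.choices[0].message.content)  # "YES" or "NO"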
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | /model | Path to PaliGemma2 base model |
| LORA_PATH | /lora | Directory of LoRA adapter subfolders |
| PORT | 8080 | HTTP port |
Performance (Google Cloud Run, NVIDIA L4)
| Metric | Value |
|---|---|
| Cold start (model load) | ~3 min |
| Adapter switch latency | 216ms |
| Concurrent adapters in VRAM | 6 (tested) |
| GPU memory (6 PCB adapters) | ~12GB / 24GB L4 |
Ecosystem
- Python client: pip install anchor-vision
- Adapters: recursia-lab/paligemma2-adapters — community LoRA adapter index
- SGLang: PR #24034 — native PaliGemma2 LoRA support (pending merge)
- Unsloth: PR #5218 — PaliGemma2 fine-tuning support (pending merge)
- vLLM: supported since v0.7.0
Roadmap
- PEFT multi-LoRA server (this repo)
- Google Cloud Run deployment
- SGLang PR (#24034)
- Unsloth PR (#5218)
- Python client (pip install anchor-vision)
- LangChain integration
- Colab quickstart notebook
- PyPI publish
- Ollama support (blocked by llama.cpp SigLIP encoder)
- AWQ quantization (2-5x speedup)
- Continuous batching
About
Built by Recursia Lab for industrial visual inspection.
PaliGemma2 is a vision-language model by Google DeepMind.