
Python client for Anchor — PaliGemma2 multi-LoRA vision inference

Project description

Anchor

PaliGemma2 multi-LoRA serving with OpenAI-compatible API.

Load multiple LoRA adapters once. Switch between them at inference time — 216ms, no reload.


                    ┌─────────────────────────────────┐
  Request           │           Anchor                │
  model="short" ───▶│                                 │
                    │  PaliGemma2 base  (VRAM)        │
                    │  ├── adapter: missing_hole  ◀─  │──▶ "YES / NO"
                    │  ├── adapter: open_circuit  ◀─  │
                    │  ├── adapter: short  ◀──────────│  pointer swap
                    │  ├── adapter: mouse_bite    ◀─  │     216ms
                    │  └── adapter: spur          ◀─  │
                    └─────────────────────────────────┘
# Call the open_circuit adapter
curl https://your-anchor-endpoint/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "open_circuit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "text", "text": "Does this PCB have an open circuit defect? Answer YES or NO."}
      ]
    }],
    "max_tokens": 3
  }'

Python Client

pip install anchor-vision

from anchor_vision import AnchorClient

client = AnchorClient("https://your-anchor.run.app")
result = client.inspect("image.jpg", adapter="open_circuit")
print(result.answer)      # "YES"
print(result.latency_ms)  # 216

Quick Demo

# 1. Clone and build
git clone https://github.com/recursia-lab/anchor
docker build -t anchor .

# 2. Run (mount your model and adapters)
docker run --gpus all \
  -v /path/to/paligemma2:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 anchor

# 3. Query any adapter by name
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"open_circuit","messages":[{"role":"user","content":[
    {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,<b64>"}},
    {"type":"text","text":"Defect present? YES or NO."}
  ]}],"max_tokens":3}'
# → {"choices":[{"message":{"content":"YES"}}],"usage":{"latency_ms":216}}

Why Anchor

Most serving frameworks load LoRA adapters per request — fetching from disk or swapping from CPU at inference time. For production workloads where multiple fine-tuned adapters are in active use, this adds hundreds of milliseconds per request.

Anchor takes a different approach: all adapters live in GPU memory simultaneously. Switching is a pointer swap — 216ms, no disk I/O, no model reload.

Framework     PaliGemma2 LoRA      Multi-adapter     Dynamic switch
Anchor        ✅                   all in VRAM       ✅ 216ms
vLLM          ✅ (since v0.7.0)                      per-request load
SGLang        🚧 PR #24034
Unsloth       🚧 PR #5218          fine-tune only
Ollama
TGI / LoRAX

When to use Anchor: production scenarios with 2–10 adapters that all need low-latency access. When one adapter is enough, vLLM works fine.

Architecture

/model          ← PaliGemma2 base (bfloat16, device_map=auto)
/lora/
  adapter_1/    ← PEFT LoRA adapter (loaded via load_adapter)
  adapter_2/
  adapter_3/

Request: model="adapter_1"  →  set_adapter("adapter_1")  →  generate()  →  216ms
Request: model="adapter_2"  →  set_adapter("adapter_2")  →  generate()  →  216ms
Request: model="base"       →  disable_adapters()         →  generate()

All adapters stay in VRAM. Switching is just a pointer swap — no disk I/O, no model reload.
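
The serving loop behind this is small. Here is a minimal sketch of the pattern, assuming the transformers PEFT integration (the load_adapter / set_adapter / disable_adapters calls named above); paths and the route() helper are illustrative, not Anchor's actual source.

# Minimal sketch of the adapter-routing pattern described above.
# Paths and the helper name are illustrative.
import os
import torch
from transformers import PaliGemmaForConditionalGeneration

MODEL_PATH = os.environ.get("MODEL_PATH", "/model")
LORA_PATH = os.environ.get("LORA_PATH", "/lora")

# Load the base model once: bfloat16, spread across available GPUs.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

# Load every /lora subfolder into GPU memory up front.
for name in sorted(os.listdir(LORA_PATH)):
    model.load_adapter(os.path.join(LORA_PATH, name), adapter_name=name)

def route(model_field: str) -> None:
    """Map the OpenAI 'model' field to an adapter before generate()."""
    if model_field == "base":
        model.disable_adapters()        # run the unmodified base weights
    else:
        model.enable_adapters()
        model.set_adapter(model_field)  # pointer swap, no reload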

Quick Start

Python (pip)

pip install anchor-vision

from anchor_vision import AnchorClient

client = AnchorClient("https://your-anchor.run.app")

# List loaded adapters
print(client.list_adapters())  # ["open_circuit", "short", "mouse_bite", ...]

# Run inference
result = client.inspect(
    "image.jpg",
    adapter="open_circuit",
    prompt="Is there an open circuit defect? Answer YES or NO.",
)
print(result)  # "YES"

LangChain

pip install 'anchor-vision[langchain]'

from anchor_vision import AnchorVisionTool

tool = AnchorVisionTool(
    endpoint="https://your-anchor.run.app",
    adapter="open_circuit",
    prompt="Is there a defect? Answer YES or NO.",
)

result = tool.invoke({"image_path": "image.jpg"})
# → "YES"

# Drop into any LangChain agent
# agent = initialize_agent(tools=[tool], ...)
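
For completeness, one hedged way to finish the agent wiring sketched in the comment above, using the classic initialize_agent API; the LLM and agent type are placeholder choices, not requirements of anchor-vision.

# Hypothetical agent wiring; the LLM and agent type are placeholders.
from langchain.agents import AgentType, initialize_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
agent = initialize_agent(
    tools=[tool],  # the AnchorVisionTool configured above
    llm=llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("Inspect image.jpg for an open circuit defect.")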

Local (GPU required)

# 1. Clone
git clone https://github.com/recursia-lab/anchor
cd anchor

# 2. Install
pip install -r requirements.txt

# 3. Place model and adapters
#    /model   → PaliGemma2 weights (from HuggingFace or your fine-tune)
#    /lora/   → one subfolder per adapter

MODEL_PATH=/path/to/model LORA_PATH=/path/to/lora python server.py
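
Once the server is up, the /health endpoint (see the API section below) confirms which adapters were discovered. A quick check, assuming the requests package is installed:

# Sanity check against a locally running server (default PORT 8080).
import requests

resp = requests.get("http://localhost:8080/health")
print(resp.json())  # {"status": "ok", "adapters": [...]}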

Docker

docker build -t anchor .
docker run --gpus all \
  -v /path/to/model:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 \
  anchor

Google Cloud Run (GPU)

# Edit cloudbuild.yaml substitutions, then:
gcloud builds submit --config cloudbuild.yaml

gcloud beta run deploy anchor \
  --image YOUR_IMAGE \
  --region us-east4 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=8 --memory=32Gi \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --min-instances=0 \
  --startup-probe="tcpSocket.port=8080,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=1"

API

GET /health

{"status": "ok", "adapters": ["open_circuit", "short", "mouse_bite"]}

GET /v1/models

Lists all loaded adapters in OpenAI format.
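
Presumably the standard OpenAI list shape, one entry per adapter; fields beyond id are illustrative:

{
  "object": "list",
  "data": [
    {"id": "open_circuit", "object": "model"},
    {"id": "short", "object": "model"}
  ]
}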

POST /v1/chat/completions

OpenAI-compatible. Use the model field to select an adapter.

Request:

{
  "model": "open_circuit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<b64>"}},
      {"type": "text", "text": "<your prompt>"}
    ]
  }],
  "max_tokens": 10
}

Response:

{
  "model": "open_circuit",
  "choices": [{"message": {"role": "assistant", "content": "YES"}}],
  "usage": {"prompt_tokens": 271, "completion_tokens": 1, "latency_ms": 216}
}
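
Because the endpoint is OpenAI-compatible, the stock openai Python client can be pointed at it directly. A sketch; the base_url and api_key values are placeholders (authentication is not covered here):

# Calling an Anchor adapter through the openai client.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-anchor.run.app/v1", api_key="unused")

with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="open_circuit",  # the model field selects the LoRA adapter
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text",
             "text": "Does this PCB have an open circuit defect? Answer YES or NO."},
        ],
    }],
    max_tokens=3,
)
print(resp.choices[0].message.content)  # "YES"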

Environment Variables

Variable     Default   Description
MODEL_PATH   /model    Path to PaliGemma2 base model
LORA_PATH    /lora     Directory of LoRA adapter subfolders
PORT         8080      HTTP port

Performance (Google Cloud Run, NVIDIA L4)

Metric                         Value
Cold start (model load)        ~3 min
Adapter switch latency         216ms
Concurrent adapters in VRAM    6 (tested)
GPU memory (6 PCB adapters)    ~12GB / 24GB L4
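
The switch latency can be spot-checked from the client side by alternating adapters so every call forces a switch, then reading latency_ms off each response. A rough sketch with the Python client from above:

# Rough client-side check of adapter-switch latency.
from anchor_vision import AnchorClient

client = AnchorClient("https://your-anchor.run.app")
for adapter in ["open_circuit", "short"] * 5:
    result = client.inspect("image.jpg", adapter=adapter)
    print(adapter, result.latency_ms)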

Ecosystem

  • Python client: pip install anchor-vision
  • Adapters: recursia-lab/paligemma2-adapters — community LoRA adapter index
  • SGLang: PR #24034 — native PaliGemma2 LoRA support (pending merge)
  • Unsloth: PR #5218 — PaliGemma2 fine-tuning support (pending merge)
  • vLLM: supported since v0.7.0

Roadmap

  • PEFT multi-LoRA server (this repo)
  • Google Cloud Run deployment
  • SGLang PR (#24034)
  • Unsloth PR (#5218)
  • Python client (pip install anchor-vision)
  • LangChain integration
  • Colab quickstart notebook
  • PyPI publish
  • Ollama support (blocked by llama.cpp SigLIP encoder)
  • AWQ quantization (2-5x speedup)
  • Continuous batching

About

Built by Recursia Lab for industrial visual inspection.

PaliGemma2 is a vision-language model by Google DeepMind.
