Python client for Anchor — PaliGemma2 multi-LoRA vision inference
Anchor
PaliGemma2 multi-LoRA serving with OpenAI-compatible API.
Load multiple LoRA adapters once. Switch between them at inference time — 216ms, no reload.
                   ┌──────────────────────────────────┐
Request            │  Anchor                          │
model="short" ───▶ │                                  │
                   │  PaliGemma2 base (VRAM)          │
                   │   ├── adapter: missing_hole      │──▶ "YES / NO"
                   │   ├── adapter: open_circuit      │
                   │   ├── adapter: short ◀───────────│  pointer swap
                   │   ├── adapter: mouse_bite        │  216ms
                   │   └── adapter: spur              │
                   └──────────────────────────────────┘
# Call the open_circuit adapter
curl https://your-anchor-endpoint/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "open_circuit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "text", "text": "Does this PCB have an open circuit defect? Answer YES or NO."}
      ]
    }],
    "max_tokens": 3
  }'
Python Client
pip install anchor-vision
from anchor_vision import AnchorClient
client = AnchorClient("https://your-anchor.run.app")
result = client.inspect("image.jpg", adapter="open_circuit")
print(result.answer) # "YES"
print(result.latency_ms) # 216
Quick Demo
# 1. Clone and build
git clone https://github.com/recursia-lab/anchor
docker build -t anchor .
# 2. Run (mount your model and adapters)
docker run --gpus all \
  -v /path/to/paligemma2:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 anchor
# 3. Query any adapter by name
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"open_circuit","messages":[{"role":"user","content":[
    {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,<b64>"}},
    {"type":"text","text":"Defect present? YES or NO."}
  ]}],"max_tokens":3}'
# → {"choices":[{"message":{"content":"YES"}}],"usage":{"latency_ms":216}}
Why Anchor
Most serving frameworks load LoRA adapters per request — fetching from disk or swapping from CPU at inference time. For production workloads where multiple fine-tuned adapters are in active use, this adds hundreds of milliseconds per request.
Anchor takes a different approach: all adapters live in GPU memory simultaneously. Switching is a pointer swap — 216ms, no disk I/O, no model reload.
| Framework | PaliGemma2 LoRA | Multi-adapter | Dynamic switch |
|---|---|---|---|
| Anchor | ✅ | ✅ all in VRAM | ✅ 216ms |
| vLLM | ✅ (since v0.7.0) | ✅ | per-request load |
| SGLang | 🚧 PR #24034 | — | — |
| Unsloth | 🚧 PR #5218 | — | fine-tune only |
| Ollama | ❌ | — | — |
| TGI / LoRAX | ❌ | — | — |
When to use Anchor: production scenarios with 2–10 adapters that all need low-latency access. When one adapter is enough, vLLM works fine.
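The difference is easy to see from the client side. The sketch below alternates adapters across consecutive requests, exactly the pattern where a per-request loader pays a reload penalty each time. The endpoint URL and image path are placeholders, and the timing loop is illustrative rather than a benchmark.
import time
from anchor_vision import AnchorClient

client = AnchorClient("https://your-anchor.run.app")

# Alternating adapters between requests incurs no reload penalty,
# since every adapter is already resident in VRAM.
for adapter in ["open_circuit", "short", "open_circuit", "mouse_bite"]:
    t0 = time.perf_counter()
    result = client.inspect("image.jpg", adapter=adapter)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    print(f"{adapter}: {result.answer} ({elapsed_ms:.0f} ms)")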
Architecture
/model ← PaliGemma2 base (bfloat16, device_map=auto)
/lora/
adapter_1/ ← PEFT LoRA adapter (loaded via load_adapter)
adapter_2/
adapter_3/
Request: model="adapter_1" → set_adapter("adapter_1") → generate() → 216ms
Request: model="adapter_2" → set_adapter("adapter_2") → generate() → 216ms
Request: model="base" → disable_adapters() → generate()
All adapters stay in VRAM. Switching is just a pointer swap — no disk I/O, no model reload.
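For concreteness, here is a minimal sketch of this switching pattern using the transformers PEFT integration (the load_adapter / set_adapter / disable_adapters calls named above). It illustrates the idea under stated assumptions and is not the actual server.py; in particular, the enable_adapters() call and the generic answer() helper are additions for the sketch.
import os
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

MODEL_PATH = os.environ.get("MODEL_PATH", "/model")
LORA_PATH = os.environ.get("LORA_PATH", "/lora")

# Base model loads once: bfloat16, sharded across available devices.
# The PaliGemma class also loads PaliGemma2 checkpoints.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load every adapter subfolder once at startup; all weights stay in VRAM.
for name in sorted(os.listdir(LORA_PATH)):
    model.load_adapter(os.path.join(LORA_PATH, name), adapter_name=name)

def answer(image, prompt, adapter="base"):
    if adapter == "base":
        model.disable_adapters()    # bare base model, no LoRA applied
    else:
        model.enable_adapters()     # assumption: re-enable after a "base" call
        model.set_adapter(adapter)  # pointer swap, no weight load
    # Note: recent transformers releases expect an explicit <image> token
    # in PaliGemma prompts; older ones insert it automatically.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, dtype=torch.bfloat16  # casts only floating-point tensors
    )
    out = model.generate(**inputs, max_new_tokens=10)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)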
Quick Start
Python (pip)
pip install anchor-vision
from anchor_vision import AnchorClient
client = AnchorClient("https://your-anchor.run.app")
# List loaded adapters
print(client.list_adapters()) # ["open_circuit", "short", "mouse_bite", ...]
# Run inference
result = client.inspect(
"image.jpg",
adapter="open_circuit",
prompt="Is there an open circuit defect? Answer YES or NO.",
)
print(result) # "YES"
LangChain
pip install 'anchor-vision[langchain]'
from anchor_vision import AnchorVisionTool
tool = AnchorVisionTool(
endpoint="https://your-anchor.run.app",
adapter="open_circuit",
prompt="Is there a defect? Answer YES or NO.",
)
result = tool.invoke({"image_path": "image.jpg"})
# → "YES"
# Drop into any LangChain agent
# agent = initialize_agent(tools=[tool], ...)
Local (GPU required)
# 1. Clone
git clone https://github.com/recursia-lab/anchor
cd anchor
# 2. Install
pip install -r requirements.txt
# 3. Place model and adapters
# /model → PaliGemma2 weights (from HuggingFace or your fine-tune)
# /lora/ → one subfolder per adapter
MODEL_PATH=/path/to/model LORA_PATH=/path/to/lora python server.py
Docker
docker build -t anchor .
docker run --gpus all \
  -v /path/to/model:/model \
  -v /path/to/lora:/lora \
  -p 8080:8080 \
  anchor
Google Cloud Run (GPU)
# Edit cloudbuild.yaml substitutions, then:
gcloud builds submit --config cloudbuild.yaml
gcloud beta run deploy anchor \
  --image YOUR_IMAGE \
  --region us-east4 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=8 --memory=32Gi \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --min-instances=0 \
  --startup-probe="tcpSocket.port=8080,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=1"
API
GET /health
{"status": "ok", "adapters": ["open_circuit", "short", "mouse_bite"]}
GET /v1/models
Lists all loaded adapters in OpenAI format.
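A quick way to exercise both read-only endpoints from Python (a sketch using the requests package; the endpoint URL is a placeholder):
import requests

base = "https://your-anchor.run.app"
print(requests.get(f"{base}/health").json())
# {"status": "ok", "adapters": ["open_circuit", "short", "mouse_bite"]}
print(requests.get(f"{base}/v1/models").json())  # OpenAI-style model list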
POST /v1/chat/completions
OpenAI-compatible. Use the model field to select an adapter.
Request:
{
  "model": "open_circuit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<b64>"}},
      {"type": "text", "text": "<your prompt>"}
    ]
  }],
  "max_tokens": 10
}
Response:
{
  "model": "open_circuit",
  "choices": [{"message": {"role": "assistant", "content": "YES"}}],
  "usage": {"prompt_tokens": 271, "completion_tokens": 1, "latency_ms": 216}
}
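Because the endpoint is OpenAI-compatible, the official openai Python client can be pointed at it directly. A sketch: the base_url is a placeholder, and api_key is a dummy value since this README does not describe authentication.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://your-anchor.run.app/v1", api_key="unused")

with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="open_circuit",  # the model field selects the adapter
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text",
             "text": "Does this PCB have an open circuit defect? Answer YES or NO."},
        ],
    }],
    max_tokens=3,
)
print(resp.choices[0].message.content)  # "YES" or "NO"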
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | /model | Path to PaliGemma2 base model |
| LORA_PATH | /lora | Directory of LoRA adapter subfolders |
| PORT | 8080 | HTTP port |
Performance (Google Cloud Run, NVIDIA L4)
| Metric | Value |
|---|---|
| Cold start (model load) | ~3 min |
| Adapter switch latency | 216ms |
| Concurrent adapters in VRAM | 6 (tested) |
| GPU memory (6 PCB adapters) | ~12GB / 24GB L4 |
Ecosystem
- Python client: pip install anchor-vision
- Adapters: recursia-lab/paligemma2-adapters — community LoRA adapter index
- SGLang: PR #24034 — native PaliGemma2 LoRA support (pending merge)
- Unsloth: PR #5218 — PaliGemma2 fine-tuning support (pending merge)
- vLLM: supported since v0.7.0
Roadmap
- PEFT multi-LoRA server (this repo)
- Google Cloud Run deployment
- SGLang PR (#24034)
- Unsloth PR (#5218)
- Python client (pip install anchor-vision)
- LangChain integration
- Colab quickstart notebook
- PyPI publish
- Ollama support (blocked by llama.cpp SigLIP encoder)
- AWQ quantization (2-5x speedup)
- Continuous batching
About
Built by Recursia Lab for industrial visual inspection.
PaliGemma2 is a vision-language model by Google DeepMind.