Multimodal demo retrieval using VLM embeddings for GUI automation

openadapt-retrieval

Overview

openadapt-retrieval provides a unified interface for creating multimodal embeddings from screenshots and task descriptions, enabling semantic demo retrieval for GUI automation agents.

Key Features:

  • Multimodal Embeddings: Embed text, images, or both into a shared vector space
  • Qwen3-VL-Embedding Support: Primary embedder using Alibaba's state-of-the-art VLM
  • Matryoshka Representation Learning (MRL): Flexible embedding dimensions (512-8192)
  • FAISS Integration: Fast similarity search with support for large demo libraries
  • Persistence: Save and load indices with embeddings and metadata
  • CLI Interface: Easy command-line access for indexing and searching

Installation

# Basic installation
pip install openadapt-retrieval

# With GPU support
pip install openadapt-retrieval[gpu]

# With CLIP fallback embedder
pip install openadapt-retrieval[clip]

# All optional dependencies
pip install openadapt-retrieval[all]

For development:

git clone https://github.com/OpenAdaptAI/openadapt-retrieval.git
cd openadapt-retrieval
uv sync --all-extras

Quick Start

Python API

from openadapt_retrieval import MultimodalDemoRetriever

# Initialize retriever
retriever = MultimodalDemoRetriever(
    embedding_dim=512,  # Use MRL for smaller storage
)

# Add demos (`demos` stands for your own collection of recordings)
for demo in demos:
    retriever.add_demo(
        demo_id=demo.id,
        task=demo.instruction,
        screenshot=demo.first_screenshot_path,
        metadata={"app": demo.app_name},
    )

# Build the index
retriever.build_index()

# Save for later use
retriever.save("/path/to/demo_index")

# Retrieve similar demos
results = retriever.retrieve(
    task="Disable Night Shift",
    screenshot="/path/to/current_screen.png",
    top_k=3,
)

for result in results:
    print(f"{result.demo_id}: {result.task} (score: {result.score:.3f})")

Using the Embedder Directly

from openadapt_retrieval.embeddings import Qwen3VLEmbedder

# Initialize embedder
embedder = Qwen3VLEmbedder(embedding_dim=512)

# Embed text only
text_emb = embedder.embed_text("Turn off Night Shift")

# Embed image only
img_emb = embedder.embed_image("/path/to/screenshot.png")

# Embed multimodal (recommended)
mm_emb = embedder.embed_multimodal(
    text="Turn off Night Shift",
    image="/path/to/screenshot.png",
)

# Compute similarity
similarity = embedder.cosine_similarity(text_emb, mm_emb)
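
For reference, cosine similarity between two embedding vectors can be written in a few lines of NumPy. This is a minimal sketch of the standard formula, not the library's own code; returning 0.0 for a zero vector is a convention chosen here for illustration.

```python
import numpy as np


def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if denom == 0.0:
        return 0.0  # illustrative convention for zero vectors
    return float(np.dot(vec1, vec2) / denom)


a = np.array([1.0, 0.0, 1.0], dtype=np.float32)
b = np.array([1.0, 1.0, 0.0], dtype=np.float32)
print(round(cosine_similarity(a, b), 3))  # 0.5
```

Scores close to 1.0 mean the two inputs point in nearly the same direction in the embedding space; scores near 0.0 mean they are unrelated.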

CLI Usage

# Embed a single image
openadapt-retrieval embed --image screenshot.png --output embedding.npy

# Embed text + image
openadapt-retrieval embed --text "Turn off Night Shift" --image screenshot.png

# Build index from directory of demos
openadapt-retrieval index --demo-dir /path/to/demos --output demo_index/

# Search the index
openadapt-retrieval search --index demo_index/ --text "disable display setting" --top-k 5

# Search with screenshot
openadapt-retrieval search --index demo_index/ --text "disable display" --image current.png --top-k 3

Architecture

Embeddings Module

openadapt_retrieval/embeddings/
├── base.py       # BaseEmbedder abstract class
├── qwen3vl.py    # Qwen3-VL-Embedding implementation
├── clip.py       # CLIP fallback (lighter weight)
└── registry.py   # get_embedder() factory

Supported Models:

Model                            Embedding Dim     VRAM    Use Case
Alibaba-NLP/Qwen3-VL-Embedding   512-8192 (MRL)    ~8GB    Primary (best quality)
openai/clip-vit-large-patch14    768               ~2GB    Fallback (lighter)

Retriever Module

openadapt_retrieval/retriever/
├── demo_retriever.py  # MultimodalDemoRetriever
├── index.py           # VectorIndex (FAISS wrapper)
└── reranker.py        # CrossEncoderReranker (optional)

Key Classes:

  • MultimodalDemoRetriever: Main interface for indexing and retrieving demos
  • VectorIndex: FAISS index wrapper with save/load support
  • CrossEncoderReranker: Optional two-stage retrieval with cross-attention
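
To make the retrieval step concrete, the sketch below performs an exhaustive top-k cosine search in plain NumPy. It is an illustration only: how VectorIndex is actually implemented is not shown in this document, and `top_k_cosine` is a hypothetical helper. A FAISS flat inner-product index over L2-normalized vectors computes the same ranking.

```python
import numpy as np


def top_k_cosine(
    query: np.ndarray, matrix: np.ndarray, k: int
) -> list[tuple[int, float]]:
    """Return (row_index, score) pairs for the k rows most similar to query."""
    # L2-normalize so that the inner product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


demo_embeddings = np.array(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32
)
query = np.array([1.0, 0.2], dtype=np.float32)
hits = top_k_cosine(query, demo_embeddings, k=2)
print([i for i, _ in hits])  # [0, 2]
```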

Storage Module

openadapt_retrieval/storage/
└── persistence.py  # EmbeddingStorage for save/load

Index Format:

demo_index/
├── index.json       # Metadata and configuration
├── embeddings.npy   # Embedding vectors (float32)
└── faiss.index      # FAISS index (optional, for large indices)
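
A layout like this can be written and read back with NumPy and the standard library. The sketch below only illustrates the two-file idea (vectors in embeddings.npy, metadata in index.json); the JSON field names `embedding_dim` and `demos` and the helpers `save_index` / `load_index` are assumptions, since the real schema is not documented here.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_index(path: Path, embeddings: np.ndarray, records: list[dict]) -> None:
    """Write vectors and metadata side by side in one directory."""
    path.mkdir(parents=True, exist_ok=True)
    np.save(path / "embeddings.npy", embeddings.astype(np.float32))
    # Hypothetical schema: the real index.json fields may differ.
    meta = {"embedding_dim": int(embeddings.shape[1]), "demos": records}
    (path / "index.json").write_text(json.dumps(meta, indent=2))


def load_index(path: Path) -> tuple[np.ndarray, list[dict]]:
    """Read both files and check that they agree with each other."""
    meta = json.loads((path / "index.json").read_text())
    embeddings = np.load(path / "embeddings.npy")
    if embeddings.shape != (len(meta["demos"]), meta["embedding_dim"]):
        raise ValueError("embeddings.npy does not match index.json")
    return embeddings, meta["demos"]


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "demo_index"
    vectors = np.random.default_rng(0).normal(size=(3, 8))
    records = [{"demo_id": f"demo-{i}", "task": "example"} for i in range(3)]
    save_index(index_dir, vectors, records)
    loaded, loaded_records = load_index(index_dir)
    print(loaded.shape, loaded.dtype)  # (3, 8) float32
```

Keeping row order in embeddings.npy aligned with the order of the metadata records is what lets a search result (a row index) be mapped back to a demo.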

Configuration

Embedding Dimensions (MRL)

Qwen3-VL-Embedding supports Matryoshka Representation Learning for flexible dimensions:

# Full dimension (best quality)
embedder = Qwen3VLEmbedder(embedding_dim=None)  # Uses 8192 for full model

# Reduced dimensions (faster search, smaller storage)
embedder = Qwen3VLEmbedder(embedding_dim=512)   # Good balance
embedder = Qwen3VLEmbedder(embedding_dim=256)   # Faster, slightly lower quality
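
With Matryoshka-style embeddings, a shorter vector is typically obtained by keeping the leading components of the full vector and re-normalizing. The sketch below shows that operation on random data and the resulting storage saving; it assumes this truncate-then-normalize scheme, which is not confirmed here as what Qwen3VLEmbedder does internally, and `truncate_embedding` is a hypothetical helper.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    if dim > vec.shape[-1]:
        raise ValueError("dim exceeds the full embedding size")
    head = vec[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / norm


# Random stand-ins for 100 full-size (8192-dim) embeddings.
full = np.random.default_rng(0).normal(size=(100, 8192)).astype(np.float32)
small = truncate_embedding(full, 512)

print(small.shape)                  # (100, 512)
print(full.nbytes // small.nbytes)  # 16
```

At float32 precision, going from 8192 to 512 dimensions cuts per-demo storage from 32 KiB to 2 KiB.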

Device Selection

# Auto-detect (CUDA > MPS > CPU)
embedder = Qwen3VLEmbedder()

# Force specific device
embedder = Qwen3VLEmbedder(device="cuda:0")
embedder = Qwen3VLEmbedder(device="mps")  # Apple Silicon
embedder = Qwen3VLEmbedder(device="cpu")
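
The CUDA > MPS > CPU priority can be reproduced with PyTorch's availability checks. This is a sketch of that ordering, not the library's own detection code; `pick_device` and `detect_device` are hypothetical names, and the fallback to "cpu" when PyTorch is not installed is a choice made for this example.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the CUDA > MPS > CPU priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may lack the MPS backend module entirely.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```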

Hardware Requirements

Qwen3-VL-Embedding

Component   Minimum                Recommended
GPU         RTX 3060 (12GB)        RTX 4090 (24GB)
VRAM        6 GB (FP16)            8 GB
RAM         16 GB                  32 GB
Storage     10 GB (model cache)    20 GB

CPU-Only Mode

On machines without a GPU, the embedder falls back to the CPU (slower but functional):

embedder = Qwen3VLEmbedder(device="cpu", embedding_dim=256)  # Smaller dim for speed

Apple Silicon (MPS)

Native support for M1/M2/M3 Macs:

embedder = Qwen3VLEmbedder(device="mps")

Performance: ~200-500ms per embedding depending on chip.

Performance

Operation            Demo Count   Time (RTX 4090)   Time (CPU)
Embed 1 demo         1            ~200ms            ~2s
Embed 100 demos      100          ~15s              ~3min
Query (text+image)   any          ~150ms            ~2s
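
These figures allow a rough estimate of how long a larger library takes to index. The sketch below extrapolates linearly from the 100-demo row, which is an assumption; real throughput depends on batch size, image resolution, and hardware. The function name is hypothetical.

```python
# Approximate seconds to embed 100 demos, taken from the table above.
SECONDS_PER_100_DEMOS = {"rtx4090": 15.0, "cpu": 180.0}


def estimate_indexing_seconds(n_demos: int, hardware: str) -> float:
    """Linear extrapolation from the 100-demo timing (an assumption)."""
    return n_demos / 100 * SECONDS_PER_100_DEMOS[hardware]


for hw in SECONDS_PER_100_DEMOS:
    minutes = estimate_indexing_seconds(5000, hw) / 60
    print(f"{hw}: about {minutes:.1f} minutes for 5000 demos")
```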

API Reference

MultimodalDemoRetriever

class MultimodalDemoRetriever:
    def __init__(
        self,
        embedding_model: str = "Alibaba-NLP/Qwen3-VL-Embedding",
        embedding_dim: int = 512,
        device: str | None = None,
        index_path: str | Path | None = None,
    ): ...

    def add_demo(
        self,
        demo_id: str,
        task: str,
        screenshot: str | Path | Image.Image | None = None,
        metadata: dict | None = None,
    ) -> None: ...

    def build_index(self, force: bool = False) -> None: ...

    def retrieve(
        self,
        task: str,
        screenshot: str | Path | Image.Image | None = None,
        top_k: int = 5,
    ) -> list[RetrievalResult]: ...

    def save(self, path: str | Path | None = None) -> None: ...
    def load(self, path: str | Path | None = None) -> None: ...

BaseEmbedder

class BaseEmbedder(ABC):
    @property
    def embedding_dim(self) -> int: ...
    @property
    def model_name(self) -> str: ...

    def embed_text(self, text: str) -> np.ndarray: ...
    def embed_image(self, image: str | Path | Image.Image) -> np.ndarray: ...
    def embed_multimodal(self, text: str, image: str | Path | Image.Image) -> np.ndarray: ...
    def embed_batch(self, inputs: list[dict]) -> np.ndarray: ...
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float: ...
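
For unit tests that should not load a multi-gigabyte model, a small stand-in exposing the same method names can be handy. The class below is hypothetical and deliberately does not subclass BaseEmbedder, since its abstract methods and constructor are not shown here. Its vectors are deterministic per input string but carry no semantic meaning.

```python
import hashlib

import numpy as np


class FakeEmbedder:
    """Hypothetical test double mirroring part of the embedder interface."""

    def __init__(self, embedding_dim: int = 16) -> None:
        self._dim = embedding_dim

    @property
    def embedding_dim(self) -> int:
        return self._dim

    @property
    def model_name(self) -> str:
        return "fake-embedder"

    def embed_text(self, text: str) -> np.ndarray:
        # Seed a generator from the text so equal inputs give equal vectors.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        vec = np.random.default_rng(seed).normal(size=self._dim)
        vec = vec.astype(np.float32)
        return vec / np.linalg.norm(vec)

    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
        return float(np.dot(vec1, vec2) / denom)


fake = FakeEmbedder(embedding_dim=16)
first = fake.embed_text("Turn off Night Shift")
second = fake.embed_text("Turn off Night Shift")
print(first.shape, bool(np.array_equal(first, second)))  # (16,) True
```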

License

MIT License - see LICENSE file for details.
