openadapt-retrieval

Build Status PyPI version Downloads License: MIT Python 3.10+

Multimodal demo retrieval using VLM embeddings for GUI automation.

Overview

openadapt-retrieval provides a unified interface for creating multimodal embeddings from screenshots and task descriptions, enabling semantic demo retrieval for GUI automation agents.

Key Features:

  • Multimodal Embeddings: Embed text, images, or both into a shared vector space
  • Qwen3-VL-Embedding Support: Primary embedder using Alibaba's state-of-the-art VLM
  • Matryoshka Representation Learning (MRL): Flexible embedding dimensions (512-8192)
  • FAISS Integration: Fast similarity search with support for large demo libraries
  • Persistence: Save and load indices with embeddings and metadata
  • CLI Interface: Easy command-line access for indexing and searching

Installation

# Basic installation
pip install openadapt-retrieval

# With GPU support
pip install openadapt-retrieval[gpu]

# With CLIP fallback embedder
pip install openadapt-retrieval[clip]

# All optional dependencies
pip install openadapt-retrieval[all]

For development:

git clone https://github.com/OpenAdaptAI/openadapt-retrieval.git
cd openadapt-retrieval
uv sync --all-extras

Quick Start

Python API

from openadapt_retrieval import MultimodalDemoRetriever, Qwen3VLEmbedder

# Initialize retriever
retriever = MultimodalDemoRetriever(
    embedding_dim=512,  # Use MRL for smaller storage
)

# Add demos (`demos` is an iterable you load from your own recording library)
for demo in demos:
    retriever.add_demo(
        demo_id=demo.id,
        task=demo.instruction,
        screenshot=demo.first_screenshot_path,
        metadata={"app": demo.app_name},
    )

# Build the index
retriever.build_index()

# Save for later use
retriever.save("/path/to/demo_index")

# Retrieve similar demos
results = retriever.retrieve(
    task="Disable Night Shift",
    screenshot="/path/to/current_screen.png",
    top_k=3,
)

for result in results:
    print(f"{result.demo_id}: {result.task} (score: {result.score:.3f})")
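
The loop above assumes `demos` is an iterable of objects exposing `id`, `instruction`, `first_screenshot_path`, and `app_name`. The package's own recording type is not shown in this README, so the `Demo` dataclass and `to_add_demo_kwargs` helper below are hypothetical stand-ins that sketch the minimum shape the snippet relies on.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Demo:
    """Hypothetical stand-in for one recorded demo (not part of openadapt-retrieval)."""

    id: str
    instruction: str
    first_screenshot_path: Path
    app_name: str


def to_add_demo_kwargs(demo: Demo) -> dict:
    """Map a Demo onto the keyword arguments passed to retriever.add_demo() above."""
    return {
        "demo_id": demo.id,
        "task": demo.instruction,
        "screenshot": demo.first_screenshot_path,
        "metadata": {"app": demo.app_name},
    }


# Illustrative values only; the paths do not need to exist for this sketch.
demos = [
    Demo("demo-001", "Turn off Night Shift", Path("demos/001/step_0.png"), "System Settings"),
    Demo("demo-002", "Enable Dark Mode", Path("demos/002/step_0.png"), "System Settings"),
]

for demo in demos:
    print(to_add_demo_kwargs(demo))
```

With a real retriever, each dictionary could be unpacked as `retriever.add_demo(**to_add_demo_kwargs(demo))`.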

Using the Embedder Directly

from openadapt_retrieval.embeddings import Qwen3VLEmbedder

# Initialize embedder
embedder = Qwen3VLEmbedder(embedding_dim=512)

# Embed text only
text_emb = embedder.embed_text("Turn off Night Shift")

# Embed image only
img_emb = embedder.embed_image("/path/to/screenshot.png")

# Embed multimodal (recommended)
mm_emb = embedder.embed_multimodal(
    text="Turn off Night Shift",
    image="/path/to/screenshot.png",
)

# Compute similarity between two of the embeddings created above
similarity = embedder.cosine_similarity(text_emb, mm_emb)
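
For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a standalone sketch in plain Python, assuming `cosine_similarity` follows this standard definition; it is not the library's implementation, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm1 * norm2)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores range from -1 (opposite) through 0 (unrelated) to 1 (same direction), so higher values mean a closer match.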

CLI Usage

# Embed a single image
openadapt-retrieval embed --image screenshot.png --output embedding.npy

# Embed text + image
openadapt-retrieval embed --text "Turn off Night Shift" --image screenshot.png

# Build index from directory of demos
openadapt-retrieval index --demo-dir /path/to/demos --output demo_index/

# Search the index
openadapt-retrieval search --index demo_index/ --text "disable display setting" --top-k 5

# Search with screenshot
openadapt-retrieval search --index demo_index/ --text "disable display" --image current.png --top-k 3
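
If the CLI needs to be driven from a script, the `search` invocation above can be assembled as an argument list. This is a minimal sketch: `build_search_command` is a hypothetical helper, and it uses only the flags shown in the commands above.

```python
from __future__ import annotations

import shlex


def build_search_command(
    index_dir: str, text: str, image: str | None = None, top_k: int = 5
) -> list[str]:
    """Assemble the argv list for `openadapt-retrieval search` (hypothetical helper)."""
    cmd = ["openadapt-retrieval", "search", "--index", index_dir, "--text", text]
    if image is not None:
        cmd += ["--image", image]
    cmd += ["--top-k", str(top_k)]
    return cmd


cmd = build_search_command("demo_index/", "disable display", image="current.png", top_k=3)
print(shlex.join(cmd))

# To actually run it (requires openadapt-retrieval to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with task descriptions that contain spaces.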

Architecture

Embeddings Module

openadapt_retrieval/embeddings/
├── base.py       # BaseEmbedder abstract class
├── qwen3vl.py    # Qwen3-VL-Embedding implementation
├── clip.py       # CLIP fallback (lighter weight)
└── registry.py   # get_embedder() factory

Supported Models:

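Model                          | Embedding Dim  | VRAM  | Use Case
-------------------------------|----------------|-------|-----------------------
Alibaba-NLP/Qwen3-VL-Embedding | 512-8192 (MRL) | ~8 GB | Primary (best quality)
openai/clip-vit-large-patch14  | 768            | ~2 GB | Fallback (lighter)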

Retriever Module

openadapt_retrieval/retriever/
├── demo_retriever.py  # MultimodalDemoRetriever
├── index.py           # VectorIndex (FAISS wrapper)
└── reranker.py        # CrossEncoderReranker (optional)

Key Classes:

  • MultimodalDemoRetriever: Main interface for indexing and retrieving demos
  • VectorIndex: FAISS index wrapper with save/load support
  • CrossEncoderReranker: Optional two-stage retrieval with cross-attention
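
Two-stage retrieval generally means a cheap vector search produces a shortlist, and a more expensive scorer re-orders only that shortlist. The sketch below illustrates the pattern in plain Python. `two_stage_retrieve` and `rerank_score` are hypothetical names, and the toy scores stand in for a real cross-encoder; this is not the package's implementation.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query_vec: list[float],
    demo_vecs: dict[str, list[float]],
    rerank_score: Callable[[str], float],
    candidates: int = 10,
    top_k: int = 3,
) -> list[str]:
    # Stage 1: rank the whole library by vector similarity and keep a shortlist.
    shortlist = sorted(
        demo_vecs, key=lambda demo_id: cosine(query_vec, demo_vecs[demo_id]), reverse=True
    )[:candidates]
    # Stage 2: re-order only the shortlist with the more expensive scorer.
    return sorted(shortlist, key=rerank_score, reverse=True)[:top_k]


demo_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_scores = {"a": 0.2, "b": 0.9, "c": 1.0}  # stand-in for cross-encoder scores
result = two_stage_retrieve([1.0, 0.0], demo_vecs, toy_scores.__getitem__, candidates=2, top_k=2)
print(result)  # ['b', 'a']
```

Demo `c` has the highest toy rerank score but never reaches stage 2 because it misses the shortlist, which is why the shortlist size matters in this design.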

Storage Module

openadapt_retrieval/storage/
└── persistence.py  # EmbeddingStorage for save/load

Index Format:

demo_index/
├── index.json       # Metadata and configuration
├── embeddings.npy   # Embedding vectors (float32)
└── faiss.index      # FAISS index (optional, for large indices)
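
Because `embeddings.npy` holds float32 vectors, its raw payload is roughly the number of demos times the embedding dimension times 4 bytes. The sketch below works through that arithmetic for an illustrative library of 10,000 demos; it ignores the small `.npy` header and the sizes of `index.json` and `faiss.index`.

```python
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_demos: int, embedding_dim: int) -> int:
    """Raw size of a (num_demos x embedding_dim) float32 matrix in bytes."""
    return num_demos * embedding_dim * BYTES_PER_FLOAT32


for dim in (512, 8192):
    size = embedding_bytes(10_000, dim)
    print(f"{dim:>5} dims x 10,000 demos = {size / 1e6:.1f} MB")
```

At 512 dimensions the payload is about 20.5 MB, against about 327.7 MB at 8192 dimensions, a 16x difference.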

Configuration

Embedding Dimensions (MRL)

Qwen3-VL-Embedding supports Matryoshka Representation Learning for flexible dimensions:

# Full dimension (best quality)
embedder = Qwen3VLEmbedder(embedding_dim=None)  # Uses 8192 for full model

# Reduced dimensions (faster search, smaller storage)
embedder = Qwen3VLEmbedder(embedding_dim=512)   # Good balance
embedder = Qwen3VLEmbedder(embedding_dim=256)   # Faster, slightly lower quality
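
With Matryoshka-trained models, a shorter embedding is commonly obtained by keeping the first components of the full vector and re-normalizing to unit length. The sketch below shows that operation in plain Python. Whether `Qwen3VLEmbedder` performs exactly this step internally is an assumption, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.0]  # toy 4-dimensional "full" embedding
short = truncate_embedding(full, 2)
print(short)
```

Re-normalizing matters because cosine and inner-product search assume comparable vector lengths after truncation.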

Device Selection

# Auto-detect (CUDA > MPS > CPU)
embedder = Qwen3VLEmbedder()

# Force specific device
embedder = Qwen3VLEmbedder(device="cuda:0")
embedder = Qwen3VLEmbedder(device="mps")  # Apple Silicon
embedder = Qwen3VLEmbedder(device="cpu")
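
The auto-detect order (CUDA, then MPS, then CPU) can be sketched as below. `pick_device` and `detect_device` are hypothetical helpers, not the package's code; the PyTorch calls are only attempted if `torch` is importable, and the sketch falls back to CPU otherwise.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the CUDA > MPS > CPU preference order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends, falling back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

The returned string can be passed as the `device` argument shown above.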

Hardware Requirements

Qwen3-VL-Embedding

Component | Minimum             | Recommended
----------|---------------------|-----------------
GPU       | RTX 3060 (12 GB)    | RTX 4090 (24 GB)
VRAM      | 6 GB (FP16)         | 8 GB
RAM       | 16 GB               | 32 GB
Storage   | 10 GB (model cache) | 20 GB

CPU-Only Mode

On machines without a GPU, the embedder falls back to the CPU, which is slower but functional:

embedder = Qwen3VLEmbedder(device="cpu", embedding_dim=256)  # Smaller dim for speed

Apple Silicon (MPS)

Native support for M1/M2/M3 Macs:

embedder = Qwen3VLEmbedder(device="mps")

Performance: ~200-500ms per embedding depending on chip.

Performance

Operation          | Demo Count | Time (RTX 4090) | Time (CPU)
-------------------|------------|-----------------|-----------
Embed 1 demo       | 1          | ~200 ms         | ~2 s
Embed 100 demos    | 100        | ~15 s           | ~3 min
Query (text+image) | any        | ~150 ms         | ~2 s
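
The 100-demo figures work out to about 150 ms per demo on an RTX 4090 and about 1.8 s per demo on CPU. The sketch below extrapolates indexing time for a larger library, assuming throughput stays linear at the 100-demo rate. That linearity is an assumption, not a measured result, and `estimate_indexing_seconds` is a hypothetical helper.

```python
# Per-demo cost derived from the 100-demo row of the table above.
SECONDS_PER_DEMO = {
    "RTX 4090": 15 / 100,    # ~15 s for 100 demos
    "CPU": 3 * 60 / 100,     # ~3 min for 100 demos
}


def estimate_indexing_seconds(num_demos: int, hardware: str) -> float:
    """Linear extrapolation of embedding time; a rough planning estimate only."""
    return num_demos * SECONDS_PER_DEMO[hardware]


for hardware in SECONDS_PER_DEMO:
    seconds = estimate_indexing_seconds(5_000, hardware)
    print(f"{hardware}: ~{seconds / 60:.1f} min for 5,000 demos")
```

Under that assumption, 5,000 demos would take roughly 12.5 minutes on the GPU and about 150 minutes on CPU.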

API Reference

MultimodalDemoRetriever

class MultimodalDemoRetriever:
    def __init__(
        self,
        embedding_model: str = "Alibaba-NLP/Qwen3-VL-Embedding",
        embedding_dim: int = 512,
        device: str | None = None,
        index_path: str | Path | None = None,
    ): ...

    def add_demo(
        self,
        demo_id: str,
        task: str,
        screenshot: str | Path | Image.Image | None = None,
        metadata: dict | None = None,
    ) -> None: ...

    def build_index(self, force: bool = False) -> None: ...

    def retrieve(
        self,
        task: str,
        screenshot: str | Path | Image.Image | None = None,
        top_k: int = 5,
    ) -> list[RetrievalResult]: ...

    def save(self, path: str | Path | None = None) -> None: ...
    def load(self, path: str | Path | None = None) -> None: ...

BaseEmbedder

class BaseEmbedder(ABC):
    @property
    def embedding_dim(self) -> int: ...
    @property
    def model_name(self) -> str: ...

    def embed_text(self, text: str) -> np.ndarray: ...
    def embed_image(self, image: str | Path | Image.Image) -> np.ndarray: ...
    def embed_multimodal(self, text: str, image: str | Path | Image.Image) -> np.ndarray: ...
    def embed_batch(self, inputs: list[dict]) -> np.ndarray: ...
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float: ...
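
When testing code that consumes an embedder, it can help to have a lightweight stand-in that needs no model download. The class below is a hypothetical test double that mirrors a few of the method names above using token hashing. It is not part of openadapt-retrieval, does not subclass `BaseEmbedder`, returns plain lists rather than `np.ndarray`, and its vectors carry no semantic meaning.

```python
import hashlib
import math


class HashingTextEmbedder:
    """Hypothetical test double mirroring part of the BaseEmbedder interface."""

    def __init__(self, embedding_dim: int = 64) -> None:
        self._dim = embedding_dim

    @property
    def embedding_dim(self) -> int:
        return self._dim

    @property
    def model_name(self) -> str:
        return "hashing-test-double"

    def embed_text(self, text: str) -> list[float]:
        # Count tokens into hash buckets, then scale to unit length.
        vec = [0.0] * self._dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self._dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    def cosine_similarity(self, vec1: list[float], vec2: list[float]) -> float:
        # Vectors are unit length (or all zero), so the dot product is the cosine.
        return sum(a * b for a, b in zip(vec1, vec2))


embedder = HashingTextEmbedder(embedding_dim=64)
query = embedder.embed_text("Turn off Night Shift")
print(embedder.cosine_similarity(query, query))
```

Because the output is deterministic, a double like this keeps unit tests fast and reproducible.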

License

MIT License - see LICENSE file for details.

