Multimodal demo retrieval using VLM embeddings for GUI automation
Project description
openadapt-retrieval
Multimodal demo retrieval using VLM embeddings for GUI automation.
Overview
openadapt-retrieval provides a unified interface for creating multimodal embeddings from screenshots and task descriptions, enabling semantic demo retrieval for GUI automation agents.
Key Features:
- Multimodal Embeddings: Embed text, images, or both into a shared vector space
- Qwen3-VL-Embedding Support: Primary embedder using Alibaba's Qwen3-VL model
- Matryoshka Representation Learning (MRL): Flexible embedding dimensions (512-8192)
- FAISS Integration: Fast similarity search with support for large demo libraries
- Persistence: Save and load indices with embeddings and metadata
- CLI Interface: Easy command-line access for indexing and searching
Installation
# Basic installation
pip install openadapt-retrieval
# With GPU support
pip install openadapt-retrieval[gpu]
# With CLIP fallback embedder
pip install openadapt-retrieval[clip]
# All optional dependencies
pip install openadapt-retrieval[all]
For development:
git clone https://github.com/OpenAdaptAI/openadapt-retrieval.git
cd openadapt-retrieval
uv sync --all-extras
Quick Start
Python API
from openadapt_retrieval import MultimodalDemoRetriever, Qwen3VLEmbedder
# Initialize retriever
retriever = MultimodalDemoRetriever(
embedding_dim=512, # Use MRL for smaller storage
)
# Add demos (from your recording library)
for demo in demos:
retriever.add_demo(
demo_id=demo.id,
task=demo.instruction,
screenshot=demo.first_screenshot_path,
metadata={"app": demo.app_name},
)
# Build the index
retriever.build_index()
# Save for later use
retriever.save("/path/to/demo_index")
# Retrieve similar demos
results = retriever.retrieve(
task="Disable Night Shift",
screenshot="/path/to/current_screen.png",
top_k=3,
)
for result in results:
print(f"{result.demo_id}: {result.task} (score: {result.score:.3f})")
Using the Embedder Directly
from openadapt_retrieval.embeddings import Qwen3VLEmbedder
# Initialize embedder
embedder = Qwen3VLEmbedder(embedding_dim=512)
# Embed text only
text_emb = embedder.embed_text("Turn off Night Shift")
# Embed image only
img_emb = embedder.embed_image("/path/to/screenshot.png")
# Embed multimodal (recommended)
mm_emb = embedder.embed_multimodal(
text="Turn off Night Shift",
image="/path/to/screenshot.png",
)
# Compute similarity
similarity = embedder.cosine_similarity(query_emb, demo_emb)
CLI Usage
# Embed a single image
openadapt-retrieval embed --image screenshot.png --output embedding.npy
# Embed text + image
openadapt-retrieval embed --text "Turn off Night Shift" --image screenshot.png
# Build index from directory of demos
openadapt-retrieval index --demo-dir /path/to/demos --output demo_index/
# Search the index
openadapt-retrieval search --index demo_index/ --text "disable display setting" --top-k 5
# Search with screenshot
openadapt-retrieval search --index demo_index/ --text "disable display" --image current.png --top-k 3
Architecture
Embeddings Module
openadapt_retrieval/embeddings/
├── base.py # BaseEmbedder abstract class
├── qwen3vl.py # Qwen3-VL-Embedding implementation
├── clip.py # CLIP fallback (lighter weight)
└── registry.py # get_embedder() factory
Supported Models:
| Model | Embedding Dim | VRAM | Use Case |
|---|---|---|---|
Alibaba-NLP/Qwen3-VL-Embedding |
512-8192 (MRL) | ~8GB | Primary (best quality) |
openai/clip-vit-large-patch14 |
768 | ~2GB | Fallback (lighter) |
Retriever Module
openadapt_retrieval/retriever/
├── demo_retriever.py # MultimodalDemoRetriever
├── index.py # VectorIndex (FAISS wrapper)
└── reranker.py # CrossEncoderReranker (optional)
Key Classes:
MultimodalDemoRetriever: Main interface for indexing and retrieving demosVectorIndex: FAISS index wrapper with save/load supportCrossEncoderReranker: Optional two-stage retrieval with cross-attention
Storage Module
openadapt_retrieval/storage/
└── persistence.py # EmbeddingStorage for save/load
Index Format:
demo_index/
├── index.json # Metadata and configuration
├── embeddings.npy # Embedding vectors (float32)
└── faiss.index # FAISS index (optional, for large indices)
Configuration
Embedding Dimensions (MRL)
Qwen3-VL-Embedding supports Matryoshka Representation Learning for flexible dimensions:
# Full dimension (best quality)
embedder = Qwen3VLEmbedder(embedding_dim=None) # Uses 8192 for full model
# Reduced dimensions (faster search, smaller storage)
embedder = Qwen3VLEmbedder(embedding_dim=512) # Good balance
embedder = Qwen3VLEmbedder(embedding_dim=256) # Faster, slightly lower quality
Device Selection
# Auto-detect (CUDA > MPS > CPU)
embedder = Qwen3VLEmbedder()
# Force specific device
embedder = Qwen3VLEmbedder(device="cuda:0")
embedder = Qwen3VLEmbedder(device="mps") # Apple Silicon
embedder = Qwen3VLEmbedder(device="cpu")
Hardware Requirements
Qwen3-VL-Embedding
| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3060 (12GB) | RTX 4090 (24GB) |
| VRAM | 6 GB (FP16) | 8 GB |
| RAM | 16 GB | 32 GB |
| Storage | 10 GB (model cache) | 20 GB |
CPU-Only Mode
For machines without GPU, the embedder falls back to CPU (slower but functional):
embedder = Qwen3VLEmbedder(device="cpu", embedding_dim=256) # Smaller dim for speed
Apple Silicon (MPS)
Native support for M1/M2/M3 Macs:
embedder = Qwen3VLEmbedder(device="mps")
Performance: ~200-500ms per embedding depending on chip.
Performance
| Operation | Demo Count | Time (RTX 4090) | Time (CPU) |
|---|---|---|---|
| Embed 1 demo | 1 | ~200ms | ~2s |
| Embed 100 demos | 100 | ~15s | ~3min |
| Query (text+image) | any | ~150ms | ~2s |
API Reference
MultimodalDemoRetriever
class MultimodalDemoRetriever:
def __init__(
self,
embedding_model: str = "Alibaba-NLP/Qwen3-VL-Embedding",
embedding_dim: int = 512,
device: str | None = None,
index_path: str | Path | None = None,
): ...
def add_demo(
self,
demo_id: str,
task: str,
screenshot: str | Path | Image.Image | None = None,
metadata: dict | None = None,
) -> None: ...
def build_index(self, force: bool = False) -> None: ...
def retrieve(
self,
task: str,
screenshot: str | Path | Image.Image | None = None,
top_k: int = 5,
) -> list[RetrievalResult]: ...
def save(self, path: str | Path | None = None) -> None: ...
def load(self, path: str | Path | None = None) -> None: ...
BaseEmbedder
class BaseEmbedder(ABC):
@property
def embedding_dim(self) -> int: ...
@property
def model_name(self) -> str: ...
def embed_text(self, text: str) -> np.ndarray: ...
def embed_image(self, image: str | Path | Image.Image) -> np.ndarray: ...
def embed_multimodal(self, text: str, image: str | Path | Image.Image) -> np.ndarray: ...
def embed_batch(self, inputs: list[dict]) -> np.ndarray: ...
def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float: ...
Related Projects
- openadapt-ml - ML engine for GUI automation
- openadapt-grounding - UI element localization
- openadapt-evals - Benchmark evaluation infrastructure
- openadapt-viewer - Dashboard visualization
References
License
MIT License - see LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file openadapt_retrieval-0.1.2.tar.gz.
File metadata
- Download URL: openadapt_retrieval-0.1.2.tar.gz
- Upload date:
- Size: 121.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fde01cc825bdd20940f59fe9d3d66d7789ea6b928650d3006011247d38a524a9
|
|
| MD5 |
57ef72019e6be27730f7dfe10b13f01f
|
|
| BLAKE2b-256 |
9b1aa482c35a90fbc444a7f2d5e4c0fc47ca13095b403d20535bfe94cf719992
|
Provenance
The following attestation bundles were made for openadapt_retrieval-0.1.2.tar.gz:
Publisher:
release.yml on OpenAdaptAI/openadapt-retrieval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
openadapt_retrieval-0.1.2.tar.gz -
Subject digest:
fde01cc825bdd20940f59fe9d3d66d7789ea6b928650d3006011247d38a524a9 - Sigstore transparency entry: 871240332
- Sigstore integration time:
-
Permalink:
OpenAdaptAI/openadapt-retrieval@d4ffec47b9d996826fb0d59d6675c01f07fa119d -
Branch / Tag:
refs/heads/main - Owner: https://github.com/OpenAdaptAI
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d4ffec47b9d996826fb0d59d6675c01f07fa119d -
Trigger Event:
push
-
Statement type:
File details
Details for the file openadapt_retrieval-0.1.2-py3-none-any.whl.
File metadata
- Download URL: openadapt_retrieval-0.1.2-py3-none-any.whl
- Upload date:
- Size: 32.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7efb8193e6c9783366b6349aaec3b9e17466b211805338c55ee4e4fbd55966b0
|
|
| MD5 |
363877a8495de54a8f78d842e018d349
|
|
| BLAKE2b-256 |
39174edf859f55292be355620f2e2989e7e317ebeafcc31f1baf6c96687ac198
|
Provenance
The following attestation bundles were made for openadapt_retrieval-0.1.2-py3-none-any.whl:
Publisher:
release.yml on OpenAdaptAI/openadapt-retrieval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
openadapt_retrieval-0.1.2-py3-none-any.whl -
Subject digest:
7efb8193e6c9783366b6349aaec3b9e17466b211805338c55ee4e4fbd55966b0 - Sigstore transparency entry: 871240335
- Sigstore integration time:
-
Permalink:
OpenAdaptAI/openadapt-retrieval@d4ffec47b9d996826fb0d59d6675c01f07fa119d -
Branch / Tag:
refs/heads/main - Owner: https://github.com/OpenAdaptAI
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d4ffec47b9d996826fb0d59d6675c01f07fa119d -
Trigger Event:
push
-
Statement type: