Skip to main content

A live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings

Project description

LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

Python PyTorch License arXiv Project Page Dataset Model

LookBench is a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. This repository provides the official evaluation code and model implementations.

๐Ÿ“ฐ News

  • [2026-01] LookBench paper released on arXiv
  • [2026-01] GR-Lite open-source model released
  • [2026-01] Initial benchmark dataset released

๐Ÿ“– Overview

LookBench addresses the limitations of existing fashion retrieval benchmarks by providing:

  • ๐Ÿ”„ Continuously Refreshing Samples: Mitigates data contamination with time-stamped, periodically updated test sets
  • ๐ŸŽฏ Diverse Retrieval Tasks: Covers single-item and multi-item retrieval across real studio, AI-generated studio, real street-look, and AI-generated street-look scenarios
  • ๐Ÿ“Š Attribute-Supervised Evaluation: Fine-grained evaluation based on 100+ fashion attributes across categories
  • ๐Ÿ† Challenging Benchmarks: Many strong baselines achieve below 60% Recall@1

Benchmark Subsets

Dataset Image Source # Retrieval Items Difficulty # Queries / Corpus
RealStudioFlat Real studio flat-lay product photos Single Easy 1,011 / 62,226
AIGen-Studio AI-generated lifestyle studio images Single Medium 192 / 59,254
RealStreetLook Real street outfit photos Multi Hard 1,000 / 61,553
AIGen-StreetLook AI-generated street outfit compositions Multi Hard 160 / 58,846

๐Ÿš€ Quick Start

Installation

# Clone the repository
git clone https://github.com/SerendipityOneInc/look-bench.git
cd look-bench

# Install dependencies
pip install -r requirements.txt

Load Dataset from Hugging Face

The LookBench dataset is hosted on Hugging Face and can be loaded directly:

from datasets import load_dataset

# Load the entire LookBench dataset
dataset = load_dataset("srpone/look-bench")

# Access different subsets
real_studio = dataset['real_studio_flat']  # Easy: single-item retrieval
aigen_studio = dataset['aigen_studio']     # Medium: AI-generated studio images
real_street = dataset['real_streetlook']   # Hard: multi-item outfit retrieval
aigen_street = dataset['aigen_streetlook'] # Hard: AI-generated street looks

# Each subset has query and gallery splits
query_data = dataset['real_studio_flat']['query']
gallery_data = dataset['real_studio_flat']['gallery']

print(f"Query samples: {len(query_data)}")
print(f"Gallery samples: {len(gallery_data)}")

Quick Evaluation

import torch
from manager import ConfigManager, ModelManager

# Load model
config_manager = ConfigManager('configs/config.yaml')
model_manager = ModelManager(config_manager)

model, _ = model_manager.load_model('clip')
transform = model_manager.get_transform('clip')

# Extract features from an image
sample = dataset['real_studio_flat']['query'][0]
image_tensor = transform(sample['image']).unsqueeze(0)

if torch.cuda.is_available():
    model = model.cuda()
    image_tensor = image_tensor.cuda()

with torch.no_grad():
    features = model(image_tensor)

print(f"Feature shape: {features.shape}")

Run Full Evaluation

# Run evaluation with default configuration
python main.py

# Run with specific model
python main.py --pipeline evaluation --model clip

# Use custom configuration
python main.py --config configs/config.yaml

Example Scripts & Notebooks

We provide both Python scripts and Google Colab notebooks for easy experimentation:

๐Ÿ““ Colab Notebooks (Run in Browser)

๐Ÿ Python Scripts (Run Locally)

# Run examples locally
python examples/01_quickstart.py
python examples/02_model_evaluation.py
python examples/03_custom_model.py

๐Ÿ—๏ธ Architecture

look-bench/
โ”œโ”€โ”€ main.py                 # Main entry point (config-driven)
โ”œโ”€โ”€ manager.py              # Configuration, model, and data managers
โ”œโ”€โ”€ runner/                 # Pipeline execution framework
โ”‚   โ”œโ”€โ”€ base_pipeline.py   # Base pipeline class
โ”‚   โ”œโ”€โ”€ evaluator.py       # Core evaluation logic
โ”‚   โ”œโ”€โ”€ pipeline.py        # Pipeline registry
โ”‚   โ”œโ”€โ”€ evaluation_pipeline.py      # Standard evaluation pipeline
โ”‚   โ””โ”€โ”€ feature_extraction_pipeline.py  # Feature extraction pipeline
โ”œโ”€โ”€ models/                 # Model implementations and registry
โ”‚   โ”œโ”€โ”€ base.py            # Base model interface
โ”‚   โ”œโ”€โ”€ registry.py        # Model registration system
โ”‚   โ”œโ”€โ”€ factory.py         # Model factory
โ”‚   โ”œโ”€โ”€ clip_model.py      # CLIP model
โ”‚   โ”œโ”€โ”€ siglip_model.py    # SigLIP model
โ”‚   โ””โ”€โ”€ dinov2_model.py    # DINOv2 model
โ”œโ”€โ”€ datasets/               # Dataset loading (BEIR-style)
โ”‚   โ”œโ”€โ”€ base.py            # Base dataset implementation
โ”‚   โ””โ”€โ”€ registry.py        # Dataset registry
โ”œโ”€โ”€ metrics/                # Evaluation metrics
โ”‚   โ”œโ”€โ”€ rank.py            # Recall@K
โ”‚   โ”œโ”€โ”€ mrr.py             # Mean Reciprocal Rank
โ”‚   โ”œโ”€โ”€ ndcg.py            # Normalized Discounted Cumulative Gain
โ”‚   โ””โ”€โ”€ map.py             # Mean Average Precision
โ”œโ”€โ”€ configs/                # Configuration files
โ”‚   โ””โ”€โ”€ config.yaml        # Main configuration
โ””โ”€โ”€ utils/                  # Utilities and logging

๐ŸŽฏ Supported Models

Model Architecture Input Size Embedding Dim Framework
CLIP Vision Transformer 224ร—224 512 PyTorch
SigLIP Vision Transformer 224ร—224 768 PyTorch
DINOv2 Vision Transformer 224ร—224 768 PyTorch
GR-Lite Vision Transformer 336ร—336 1024 PyTorch

โš™๏ธ Configuration

Edit configs/config.yaml to configure models and evaluation settings:

# Pipeline configuration
pipeline:
  name: "evaluation"  # evaluation, feature_extraction
  model: "clip"
  dataset: "fashion200k"
  args: {}

# Model configuration
clip:
  enabled: true
  model_name: "openai/clip-vit-base-patch16"
  input_size: 224
  embedding_dim: 512
  device: "cuda"

# Evaluation settings
evaluation:
  metric: "recall"
  top_k: [1, 5, 10, 20]
  l2norm: true

๐Ÿ“Š Evaluation Metrics

LookBench supports multiple evaluation metrics:

  • Recall@K: Top-K retrieval accuracy (K=1, 5, 10, 20)
  • MRR: Mean Reciprocal Rank
  • NDCG@K: Normalized Discounted Cumulative Gain
  • MAP: Mean Average Precision

Fine-Grained Evaluation

All metrics are computed with attribute-level matching:

  • Fine Recall@1: Requires exact category and all attributes to match
  • Coarse Recall@1: Only requires category to match
  • nDCG@K: Graded relevance based on attribute overlap

๐Ÿ”ง Advanced Usage

Custom Model Integration

LookBench makes it easy to integrate your own models using the registry pattern. Here's a quick example:

from models.base import BaseModel
from models.registry import register_model
import torch.nn as nn
from torchvision import models, transforms

@register_model("resnet50", metadata={
    "description": "ResNet-50 for fashion retrieval",
    "framework": "PyTorch",
    "input_size": 224,
    "embedding_dim": 2048
})
class ResNet50Model(BaseModel):
    @classmethod
    def load_model(cls, model_name: str, model_path: str = None):
        model = models.resnet50(pretrained=True)
        model = nn.Sequential(*list(model.children())[:-1])  # Remove FC layer
        
        # Wrapper to flatten output
        class Wrapper(nn.Module):
            def __init__(self, backbone):
                super().__init__()
                self.backbone = backbone
            def forward(self, x):
                return self.backbone(x).squeeze(-1).squeeze(-1)
        
        return Wrapper(model), cls()
    
    @classmethod
    def get_transform(cls, input_size: int = 224):
        return transforms.Compose([
            transforms.Resize((input_size, input_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                               std=[0.229, 0.224, 0.225])
        ])

Then add your model to configs/config.yaml:

resnet50:
  enabled: true
  model_name: "resnet50"
  model_path: null  # or path to your weights
  input_size: 224
  embedding_dim: 2048
  device: "cuda"

For complete examples, see examples/03_custom_model.py

Custom Pipeline

Create custom evaluation pipelines:

from runner.base_pipeline import BasePipeline
from runner.pipeline import register_pipeline

@register_pipeline("custom_pipeline")
class CustomPipeline(BasePipeline):
    def get_pipeline_name(self) -> str:
        return "custom_pipeline"
    
    def run(self, **kwargs):
        # Your custom logic here
        model_name = kwargs.get('model_name', 'clip')
        dataset_type = kwargs.get('dataset_type', 'fashion200k')
        
        # Load model and data
        model, _ = self.model_manager.load_model(model_name)
        # ... your evaluation logic
        
        return {"status": "success", "results": results}

๐Ÿ“ˆ Results

Fine Recall@1 Performance

Our GR-Lite model achieves state-of-the-art performance on LookBench. Fine Recall@1 requires exact category and all attributes to match:

Model Resolution / Emb. AIGen-StreetLook AIGen-Studio RealStreetLook RealStudioFlat Overall
GR-Pro (Ours) 336 / 1024 63.67 54.88 44.75 51.55 49.80
GR-Lite (Ours, Open) 336 / 1024 62.47 52.08 43.84 51.70 49.18
Marqo-FashionSigLIP 224 / 768 66.27 58.53 42.43 51.86 49.44
Marqo-FashionCLIP 224 / 512 63.22 54.93 41.87 51.68 48.63
SigLIP2-B/16 384 / 768 57.83 54.97 39.35 49.12 46.10
SigLIP2-L/16 384 / 1024 51.89 48.57 35.91 44.78 41.86
PP-ShiTuV2 224 / 512 30.06 33.69 32.77 43.22 37.17
DINOv3-ViT-L 224 / 1024 20.24 27.66 26.27 39.85 31.83
DINOv2-ViT-L 224 / 1024 24.29 25.05 22.99 37.66 29.57
CLIP-L/14 336 / 768 25.28 25.95 21.09 40.35 30.08
CLIP-B/16 224 / 512 17.86 13.75 16.80 34.75 24.36

Coarse Recall@1 Performance

Coarse Recall@1 only requires category match (more lenient):

Model Resolution / Emb. AIGen-StreetLook AIGen-Studio RealStreetLook RealStudioFlat Overall
GR-Pro (Ours) 336 / 1024 92.50 92.75 79.82 94.16 87.93
GR-Lite (Ours, Open) 336 / 1024 88.75 90.16 76.76 92.68 85.54
Marqo-FashionSigLIP 224 / 768 90.00 93.78 73.39 88.63 82.77
Marqo-FashionCLIP 224 / 512 84.38 87.05 75.33 88.72 82.68
SigLIP2-B/16 384 / 768 86.25 90.67 72.17 88.33 81.62
SigLIP2-L/16 384 / 1024 80.62 90.67 68.20 84.97 78.12
CLIP-L/14 336 / 768 46.88 56.48 45.26 76.85 59.91
CLIP-B/16 224 / 512 35.62 32.12 33.54 67.26 48.11

nDCG@5 Performance

nDCG@5 evaluates ranking quality with graded relevance based on attribute overlap:

Model Resolution / Emb. AIGen-StreetLook AIGen-Studio RealStreetLook RealStudioFlat Overall
GR-Pro (Ours) 336 / 1024 63.67 54.88 44.75 51.55 49.80
GR-Lite (Ours, Open) 336 / 1024 62.47 52.08 43.84 51.70 49.18
Marqo-FashionSigLIP 224 / 768 66.27 58.53 42.43 51.86 49.44
Marqo-FashionCLIP 224 / 512 63.22 54.93 41.87 51.68 48.63
SigLIP2-B/16 384 / 768 57.83 54.97 39.35 49.12 46.10

See our paper for complete results including MRR and additional models.

๐Ÿ“„ Citation

If you use LookBench in your research, please cite:

@article{gao2026lookbench,
  title={LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval}, 
  author={Chao Gao and Siqiao Xue and Yimin Peng and Jiwen Fu and Tingyi Gu and Shanshan Li and Fan Zhou},
  year={2026},
  url={https://arxiv.org/abs/2601.14706}, 
  journal={arXiv preprint arXiv:2601.14706},
}

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

The GR-Lite model weights are distributed under the DINOv3 License as they are derived from Meta's DINOv3 model.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

look_bench-0.2.0.tar.gz (39.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

look_bench-0.2.0-py3-none-any.whl (48.8 kB view details)

Uploaded Python 3

File details

Details for the file look_bench-0.2.0.tar.gz.

File metadata

  • Download URL: look_bench-0.2.0.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for look_bench-0.2.0.tar.gz
Algorithm Hash digest
SHA256 748716e76b5574c5bc9cd44dfe3683dce8708ffbdb43314d94eaea27dedb0528
MD5 2e7394858dfd88db0809cc8f52dc290c
BLAKE2b-256 18892513f7397aa1168468aaf238f10b2babd61ac9d13cc87fee40888f2bb1fd

See more details on using hashes here.

Provenance

The following attestation bundles were made for look_bench-0.2.0.tar.gz:

Publisher: python-publish.yml on SerendipityOneInc/look-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file look_bench-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: look_bench-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 48.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for look_bench-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a509b3f7a1efb447d962c70bc7560c99a8c15a37a7c0352ce5384fefa9b8021a
MD5 ed067f9d864170a915513ecc01c56d9a
BLAKE2b-256 f996282dd8ffc73d07eec9b3ff77ee2a348bb6813b6888b0af8e246e13a043bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for look_bench-0.2.0-py3-none-any.whl:

Publisher: python-publish.yml on SerendipityOneInc/look-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page