LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

LookBench is a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. This repository provides the official evaluation code and model implementations.

📖 Overview

LookBench addresses the limitations of existing fashion retrieval benchmarks by providing:

  • 🔄 Continuously Refreshing Samples: Mitigates data contamination with time-stamped, periodically updated test sets
  • 🎯 Diverse Retrieval Tasks: Covers single-item and multi-item retrieval across real studio, AI-generated studio, real street-look, and AI-generated street-look scenarios
  • 📊 Attribute-Supervised Evaluation: Fine-grained evaluation based on 100+ fashion attributes across categories
  • 🏆 Challenging Benchmarks: Many strong baselines achieve below 60% Recall@1

Benchmark Subsets

Dataset | Image Source | # Retrieval Items | Difficulty | # Queries / Corpus
RealStudioFlat | Real studio flat-lay product photos | Single | Easy | 1,011 / 62,226
AIGen-Studio | AI-generated lifestyle studio images | Single | Medium | 192 / 59,254
RealStreetLook | Real street outfit photos | Multi | Hard | 1,000 / 61,553
AIGen-StreetLook | AI-generated street outfit compositions | Multi | Hard | 160 / 58,846

🚀 Quick Start

Installation

Option 1: Install from PyPI (Recommended)

pip install look-bench

Option 2: Install from Source

# Clone the repository
git clone https://github.com/SerendipityOneInc/look-bench.git
cd look-bench

# Install in development mode
pip install -e .

# Or install dependencies only
pip install -r requirements.txt

Optional: Install with Examples Support

For running example notebooks and scripts that require matplotlib:

# Quote the requirement so shells such as zsh do not expand the brackets
pip install "look-bench[examples]"

Load Dataset from Hugging Face

The LookBench dataset is hosted on Hugging Face and can be loaded directly:

Option 1: Using look-bench utility (Recommended)

from look_bench.utils import load_lookbench_dataset

# Load a specific config
dataset = load_lookbench_dataset("real_studio_flat")

# Access query and gallery splits
query_data = dataset['query']
gallery_data = dataset['gallery']

print(f"Query samples: {len(query_data)}")
print(f"Gallery samples: {len(gallery_data)}")

Option 2: Using Hugging Face datasets directly

from datasets import load_dataset

# Load a specific config
dataset = load_dataset("srpone/look-bench", "real_studio_flat")

# Access query and gallery splits
query_data = dataset['query']
gallery_data = dataset['gallery']

print(f"Query samples: {len(query_data)}")
print(f"Gallery samples: {len(gallery_data)}")

Quick Evaluation

import torch
from manager import ConfigManager, ModelManager

# Load model
config_manager = ConfigManager('configs/config.yaml')
model_manager = ModelManager(config_manager)

model, _ = model_manager.load_model('clip')
transform = model_manager.get_transform('clip')

# Extract features from an image
# `dataset` is the "real_studio_flat" config loaded in the previous section
sample = dataset['query'][0]
image_tensor = transform(sample['image']).unsqueeze(0)

if torch.cuda.is_available():
    model = model.cuda()
    image_tensor = image_tensor.cuda()

with torch.no_grad():
    features = model(image_tensor)

print(f"Feature shape: {features.shape}")
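
Once features are available for the queries and the gallery, retrieval reduces to a nearest-neighbour search. The following is a minimal, dependency-free sketch of that step, assuming embeddings are L2-normalised and compared by dot product (that is, cosine similarity). The helper names `l2_normalize` and `rank_gallery` and the toy vectors are illustrative assumptions, not part of the look-bench API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_gallery(query, gallery, top_k=5):
    """Return the indices of the top_k gallery vectors by cosine similarity."""
    q = l2_normalize(query)
    scores = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        scores.append((sum(a * b for a, b in zip(q, g)), idx))
    # Sort by descending similarity; ties are broken by the lower gallery index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_k]]

# Toy 3-dimensional embeddings standing in for real model features.
query_vec = [1.0, 0.0, 0.0]
gallery_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.1, 0.0],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_gallery(query_vec, gallery_vecs, top_k=3)
print(ranking)  # prints [1, 2, 0]
```

On real data the same ranking would normally be computed with batched matrix multiplication on the GPU; the pure-Python loop here only makes the logic explicit.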

Run Full Evaluation

# Run evaluation with default configuration
python main.py

# Run with specific model
python main.py --pipeline evaluation --model clip

# Use custom configuration
python main.py --config configs/config.yaml

Example Scripts & Notebooks

We provide both Python scripts and Google Colab notebooks for easy experimentation:

📓 Colab Notebooks (Run in Browser)

🐍 Python Scripts (Run Locally)

# Run examples locally
python examples/00_data_exploration.py
python examples/01_load_grlite_model.py
python examples/02_model_evaluation.py
python examples/03_custom_model.py

🏗️ Architecture

look-bench/
├── main.py                 # Main entry point (config-driven)
├── manager.py              # Configuration, model, and data managers
├── runner/                 # Pipeline execution framework
│   ├── base_pipeline.py   # Base pipeline class
│   ├── evaluator.py       # Core evaluation logic
│   ├── pipeline.py        # Pipeline registry
│   ├── evaluation_pipeline.py      # Standard evaluation pipeline
│   └── feature_extraction_pipeline.py  # Feature extraction pipeline
├── models/                 # Model implementations and registry
│   ├── base.py            # Base model interface
│   ├── registry.py        # Model registration system
│   ├── factory.py         # Model factory
│   ├── clip_model.py      # CLIP model
│   ├── siglip_model.py    # SigLIP model
│   └── dinov2_model.py    # DINOv2 model
├── datasets/               # Dataset loading (BEIR-style)
│   ├── base.py            # Base dataset implementation
│   └── registry.py        # Dataset registry
├── metrics/                # Evaluation metrics
│   ├── rank.py            # Recall@K
│   ├── mrr.py             # Mean Reciprocal Rank
│   ├── ndcg.py            # Normalized Discounted Cumulative Gain
│   └── map.py             # Mean Average Precision
├── configs/                # Configuration files
│   └── config.yaml        # Main configuration
└── utils/                  # Utilities and logging

🎯 Supported Models

Model | Architecture | Input Size | Embedding Dim | Framework
CLIP | Vision Transformer | 224×224 | 512 | PyTorch
SigLIP | Vision Transformer | 224×224 | 768 | PyTorch
DINOv2 | Vision Transformer | 224×224 | 768 | PyTorch
GR-Lite | Vision Transformer | 336×336 | 1024 | PyTorch

⚙️ Configuration

Edit configs/config.yaml to configure models and evaluation settings:

# Pipeline configuration
pipeline:
  name: "evaluation"  # evaluation, feature_extraction
  model: "clip"
  dataset: "fashion200k"
  args: {}

# Model configuration
clip:
  enabled: true
  model_name: "openai/clip-vit-base-patch16"
  input_size: 224
  embedding_dim: 512
  device: "cuda"

# Evaluation settings
evaluation:
  metric: "recall"
  top_k: [1, 5, 10, 20]
  l2norm: true

📊 Evaluation Metrics

LookBench supports multiple evaluation metrics:

  • Recall@K: Top-K retrieval accuracy (K=1, 5, 10, 20)
  • MRR: Mean Reciprocal Rank
  • NDCG@K: Normalized Discounted Cumulative Gain
  • MAP: Mean Average Precision
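
As a rough illustration of how these metrics behave, the sketch below computes Recall@K, MRR, and MAP for a handful of toy rankings. It assumes Recall@K is the hit rate (a query counts as correct if any relevant item appears in the top K), which matches the "Top-K retrieval accuracy" wording above. The function names and gallery IDs are hypothetical and are not taken from the `metrics/` package.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top k results, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Sum of precision@rank at each relevant hit, divided by the number of relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Each entry: (ranked gallery IDs for one query, set of relevant gallery IDs).
runs = [
    (["g3", "g1", "g7"], {"g1"}),        # first relevant item at rank 2
    (["g2", "g5", "g9"], {"g2", "g9"}),  # relevant items at ranks 1 and 3
]
n = len(runs)
recall_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in runs) / n
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / n
print(f"Recall@1={recall_at_1:.3f} MRR={mrr:.3f} MAP={mean_ap:.3f}")
# prints Recall@1=0.500 MRR=0.750 MAP=0.667
```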

Fine-Grained Evaluation

All metrics are computed with attribute-level matching:

  • Fine Recall@1: Requires exact category and all attributes to match
  • Coarse Recall@1: Only requires category to match
  • nDCG@K: Graded relevance based on attribute overlap
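
The distinction between fine and coarse matching, and the idea of graded relevance, can be sketched as follows. This is an assumption-laden illustration: items are represented as dicts with a `category` string and an `attributes` set, and the graded gain is taken to be the Jaccard overlap of the attribute sets (zero when categories differ). The exact gain definition used by LookBench is given in the paper and may differ.

```python
import math

def coarse_match(query, item):
    """Coarse match: only the category has to agree."""
    return query["category"] == item["category"]

def fine_match(query, item):
    """Fine match: category and the full attribute set have to agree."""
    return coarse_match(query, item) and query["attributes"] == item["attributes"]

def graded_relevance(query, item):
    """Assumed gain: Jaccard overlap of attributes, 0 if categories differ."""
    if not coarse_match(query, item):
        return 0.0
    union = query["attributes"] | item["attributes"]
    if not union:
        return 1.0
    return len(query["attributes"] & item["attributes"]) / len(union)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(query, ranked_items, k):
    """DCG of the top k divided by the DCG of the best possible ordering."""
    gains = [graded_relevance(query, item) for item in ranked_items]
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

query = {"category": "dress", "attributes": {"red", "sleeveless", "midi"}}
ranked = [
    {"category": "dress", "attributes": {"red", "sleeveless", "maxi"}},  # coarse hit only
    {"category": "dress", "attributes": {"red", "sleeveless", "midi"}},  # fine hit
    {"category": "shirt", "attributes": {"red"}},                        # miss
]
fine_r1 = fine_match(query, ranked[0])      # False: one attribute differs
coarse_r1 = coarse_match(query, ranked[0])  # True: category agrees
ndcg2 = ndcg_at_k(query, ranked, k=2)
print(fine_r1, coarse_r1, round(ndcg2, 4))
```

In this toy ranking the top result counts toward Coarse Recall@1 but not Fine Recall@1, while nDCG still gives partial credit for the two shared attributes.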

🔧 Advanced Usage

Custom Model Integration

LookBench makes it easy to integrate your own models using the registry pattern. Here's a quick example:

from models.base import BaseModel
from models.registry import register_model
import torch.nn as nn
from torchvision import models, transforms

@register_model("resnet50", metadata={
    "description": "ResNet-50 for fashion retrieval",
    "framework": "PyTorch",
    "input_size": 224,
    "embedding_dim": 2048
})
class ResNet50Model(BaseModel):
    @classmethod
    def load_model(cls, model_name: str, model_path: str = None):
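        # ImageNet weights; the `weights` argument replaces the deprecated pretrained=True
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)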
        model = nn.Sequential(*list(model.children())[:-1])  # Remove FC layer
        
        # Wrapper to flatten output
        class Wrapper(nn.Module):
            def __init__(self, backbone):
                super().__init__()
                self.backbone = backbone
            def forward(self, x):
                return self.backbone(x).squeeze(-1).squeeze(-1)
        
        return Wrapper(model), cls()
    
    @classmethod
    def get_transform(cls, input_size: int = 224):
        return transforms.Compose([
            transforms.Resize((input_size, input_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                               std=[0.229, 0.224, 0.225])
        ])

Then add your model to configs/config.yaml:

resnet50:
  enabled: true
  model_name: "resnet50"
  model_path: null  # or path to your weights
  input_size: 224
  embedding_dim: 2048
  device: "cuda"

For complete examples, see examples/03_custom_model.py
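
For readers unfamiliar with the registry pattern, the following standalone sketch shows how a decorator-based registry can map a string key to a model class. It is not the code in `models/registry.py`: the names `MODEL_REGISTRY`, `register`, and `lookup` are hypothetical, and the real `register_model` may store or validate entries differently.

```python
MODEL_REGISTRY = {}  # hypothetical module-level store: name -> class and metadata

def register(name, metadata=None):
    """Class decorator that records a model class under a string key."""
    def decorator(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"Model '{name}' is already registered")
        MODEL_REGISTRY[name] = {"cls": cls, "metadata": dict(metadata or {})}
        return cls  # return the class unchanged so it can still be used directly
    return decorator

def lookup(name):
    """Return the registered class, with a readable error for unknown names."""
    try:
        return MODEL_REGISTRY[name]["cls"]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY)) or "<none>"
        raise KeyError(f"Unknown model '{name}'. Registered: {known}") from None

@register("toy", metadata={"embedding_dim": 4})
class ToyModel:
    @classmethod
    def load_model(cls, model_name, model_path=None):
        # A real implementation would build and return a network here.
        return cls()

model = lookup("toy").load_model("toy")
print(type(model).__name__, MODEL_REGISTRY["toy"]["metadata"])
```

The benefit of this design is that a config entry such as `model: "resnet50"` can be resolved to a class at runtime without the evaluation code importing every model module by name.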

Custom Pipeline

Create custom evaluation pipelines:

from runner.base_pipeline import BasePipeline
from runner.pipeline import register_pipeline

@register_pipeline("custom_pipeline")
class CustomPipeline(BasePipeline):
    def get_pipeline_name(self) -> str:
        return "custom_pipeline"
    
    def run(self, **kwargs):
        # Your custom logic here
        model_name = kwargs.get('model_name', 'clip')
        dataset_type = kwargs.get('dataset_type', 'fashion200k')
        
        # Load model and data
        model, _ = self.model_manager.load_model(model_name)
        # ... your evaluation logic
        results = {}  # placeholder: fill with the metrics your logic computes
        
        return {"status": "success", "results": results}

📈 Results

Fine Recall@1 Performance

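Among the models evaluated, GR-Pro achieves the highest overall Fine Recall@1 on LookBench (49.80), while the open GR-Lite model reaches 49.18, on par with the strongest fashion-specific baselines. Fine Recall@1 requires the category and all attributes to match exactly: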

Model | Resolution / Emb. | AIGen-StreetLook | AIGen-Studio | RealStreetLook | RealStudioFlat | Overall
GR-Pro (Ours) | 336 / 1024 | 63.67 | 54.88 | 44.75 | 51.55 | 49.80
GR-Lite (Ours, Open) | 336 / 1024 | 62.47 | 52.08 | 43.84 | 51.70 | 49.18
Marqo-FashionSigLIP | 224 / 768 | 66.27 | 58.53 | 42.43 | 51.86 | 49.44
Marqo-FashionCLIP | 224 / 512 | 63.22 | 54.93 | 41.87 | 51.68 | 48.63
SigLIP2-B/16 | 384 / 768 | 57.83 | 54.97 | 39.35 | 49.12 | 46.10
SigLIP2-L/16 | 384 / 1024 | 51.89 | 48.57 | 35.91 | 44.78 | 41.86
PP-ShiTuV2 | 224 / 512 | 30.06 | 33.69 | 32.77 | 43.22 | 37.17
DINOv3-ViT-L | 224 / 1024 | 20.24 | 27.66 | 26.27 | 39.85 | 31.83
DINOv2-ViT-L | 224 / 1024 | 24.29 | 25.05 | 22.99 | 37.66 | 29.57
CLIP-L/14 | 336 / 768 | 25.28 | 25.95 | 21.09 | 40.35 | 30.08
CLIP-B/16 | 224 / 512 | 17.86 | 13.75 | 16.80 | 34.75 | 24.36

Coarse Recall@1 Performance

Coarse Recall@1 only requires category match (more lenient):

Model | Resolution / Emb. | AIGen-StreetLook | AIGen-Studio | RealStreetLook | RealStudioFlat | Overall
GR-Pro (Ours) | 336 / 1024 | 92.50 | 92.75 | 79.82 | 94.16 | 87.93
GR-Lite (Ours, Open) | 336 / 1024 | 88.75 | 90.16 | 76.76 | 92.68 | 85.54
Marqo-FashionSigLIP | 224 / 768 | 90.00 | 93.78 | 73.39 | 88.63 | 82.77
Marqo-FashionCLIP | 224 / 512 | 84.38 | 87.05 | 75.33 | 88.72 | 82.68
SigLIP2-B/16 | 384 / 768 | 86.25 | 90.67 | 72.17 | 88.33 | 81.62
SigLIP2-L/16 | 384 / 1024 | 80.62 | 90.67 | 68.20 | 84.97 | 78.12
CLIP-L/14 | 336 / 768 | 46.88 | 56.48 | 45.26 | 76.85 | 59.91
CLIP-B/16 | 224 / 512 | 35.62 | 32.12 | 33.54 | 67.26 | 48.11

nDCG@5 Performance

nDCG@5 evaluates ranking quality with graded relevance based on attribute overlap:

Model | Resolution / Emb. | AIGen-StreetLook | AIGen-Studio | RealStreetLook | RealStudioFlat | Overall
GR-Pro (Ours) | 336 / 1024 | 63.67 | 54.88 | 44.75 | 51.55 | 49.80
GR-Lite (Ours, Open) | 336 / 1024 | 62.47 | 52.08 | 43.84 | 51.70 | 49.18
Marqo-FashionSigLIP | 224 / 768 | 66.27 | 58.53 | 42.43 | 51.86 | 49.44
Marqo-FashionCLIP | 224 / 512 | 63.22 | 54.93 | 41.87 | 51.68 | 48.63
SigLIP2-B/16 | 384 / 768 | 57.83 | 54.97 | 39.35 | 49.12 | 46.10

See our paper for complete results including MRR and additional models.

📄 Citation

If you use LookBench in your research, please cite:

@article{gao2026lookbench,
  title={LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval}, 
  author={Chao Gao and Siqiao Xue and Yimin Peng and Jiwen Fu and Tingyi Gu and Shanshan Li and Fan Zhou},
  year={2026},
  url={https://arxiv.org/abs/2601.14706}, 
  journal={arXiv preprint arXiv:2601.14706},
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

The GR-Lite model weights are distributed under the DINOv3 License as they are derived from Meta's DINOv3 model.
