GenArena Arena Evaluation - VLM-based pairwise image generation evaluation

GenArena

A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments show that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far above the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
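To make the "pairwise comparison and Elo ranking" idea concrete, here is a minimal sketch of the textbook Elo update applied to a list of battle outcomes. It is an illustration only: the K-factor of 32, the initial rating of 1000, the model names, and the function names are all assumptions, and GenArena's actual rating procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, model_a: str, model_b: str, outcome: float,
               k: float = 32.0, initial: float = 1000.0) -> None:
    """Update ratings in place. outcome is 1.0 (A wins), 0.5 (tie), 0.0 (B wins).

    k and initial are illustrative defaults, not GenArena's settings.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))


# Hypothetical battle log: (model_a, model_b, outcome for model_a).
battles = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings: dict = {}
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

# Highest rating first.
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Because each update adds to one model exactly what it subtracts from the other, the sum of all ratings stays constant; only the relative ordering carries information.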

Quick Start

Installation

pip install genarena

Or install from source:

git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .

Initialize Arena

Download benchmark data and official arena data with one command:

genarena init --arena_dir ./arena --data_dir ./data

This downloads:

  • Benchmark Parquet data from rhli/genarena (HuggingFace)
  • Official arena data (model outputs + battle logs) from rhli/genarena-battlefield

Environment Setup

Set your VLM API credentials:

export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"

For multi-endpoint support (load balancing and failover), use comma-separated values:

export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
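As a rough illustration of how comma-separated values like these could be consumed, the sketch below parses both variables and rotates through them round-robin. The pairing policy (cycling URLs and keys independently) and the helper names are assumptions made for this example; the README does not say how GenArena matches keys to endpoints or how it handles failover.

```python
import itertools
import os
from typing import Iterator, Mapping, Optional


def parse_csv_env(name: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Split a comma-separated environment variable into a clean list."""
    env = os.environ if env is None else env
    raw = env.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def rotate_endpoints(urls: list, keys: list) -> Iterator[tuple]:
    """Yield (base_url, api_key) pairs forever, round-robin.

    Hypothetical policy: URLs and keys are cycled independently,
    so the two lists do not need to have the same length.
    """
    if not urls or not keys:
        raise ValueError("at least one base URL and one API key are required")
    return zip(itertools.cycle(urls), itertools.cycle(keys))


# Example with an explicit mapping instead of the real environment.
example_env = {
    "OPENAI_BASE_URLS": "https://api1.example.com/v1,https://api2.example.com/v1",
    "OPENAI_API_KEYS": "key1,key2,key3",
}
urls = parse_csv_env("OPENAI_BASE_URLS", example_env)
keys = parse_csv_env("OPENAI_API_KEYS", example_env)
first_four = list(itertools.islice(rotate_endpoints(urls, keys), 4))
```

A real client would also need to skip an endpoint after a failed request; that part is omitted here.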

Run Evaluation

genarena run --arena_dir ./arena --data_dir ./data

View Leaderboard

genarena leaderboard --arena_dir ./arena --subset basic

Check Status

genarena status --arena_dir ./arena --data_dir ./data

Running Your Own Experiments

Directory Structure

To add your own model for evaluation, organize outputs in the following structure:

arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <model_name>/
                ├── 000000.png
                ├── 000001.png
                └── ...

For example:

arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
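If you produce outputs from your own script, a small helper can build paths that follow this layout. This is a sketch, not part of GenArena: the function names are hypothetical, and the six-digit zero-padded file names are inferred from the listing above.

```python
from pathlib import Path


def model_output_dir(arena_dir: str, subset: str, github_id: str,
                     model_name: str, date: str) -> Path:
    """Return <arena_dir>/<subset>/models/<GithubID>_<modelName>_<yyyymmdd>/<model_name>."""
    exp_name = f"{github_id}_{model_name}_{date}"
    return Path(arena_dir) / subset / "models" / exp_name / model_name


def image_path(output_dir: Path, index: int) -> Path:
    """Return the file path for sample `index`, e.g. 000001.png (padding inferred)."""
    return output_dir / f"{index:06d}.png"


out_dir = model_output_dir("arena", "basic", "johndoe", "MyNewModel", "20260205")
first_image = image_path(out_dir, 0)
```

Calling `out_dir.mkdir(parents=True, exist_ok=True)` before saving images creates the nested directories in one step.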

Generate Images with Diffgentor

Use Diffgentor to batch generate images for evaluation:

# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data

# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/

Run Battles for New Models

genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205

GenArena automatically detects new models and schedules battles against existing models.
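Conceptually, scheduling battles for a new model amounts to enumerating the pairs that have not been judged yet. The sketch below shows one way to do that; it is an assumption-laden illustration with hypothetical names, and it does not reflect how GenArena actually samples, orders, or limits battles.

```python
from itertools import combinations, product


def schedule_battles(existing_models: list, new_models: list,
                     played: frozenset = frozenset()) -> list:
    """List unplayed pairs: each new model vs. each existing model,
    plus new models against each other (an assumption of this sketch).

    `played` holds frozenset({a, b}) entries for pairs already judged.
    """
    candidates = list(product(new_models, existing_models))
    candidates += list(combinations(new_models, 2))
    return [pair for pair in candidates if frozenset(pair) not in played]


# Hypothetical model names.
existing = ["ModelA", "ModelB"]
todo = schedule_battles(existing, ["YourModel"])
resumed = schedule_battles(
    existing, ["YourModel"], played=frozenset({frozenset({"YourModel", "ModelA"})})
)
```

Tracking played pairs as unordered sets means a resumed run does not repeat a battle just because the two models appear in the opposite order.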

Submit to Official Leaderboard

Coming Soon: The genarena submit command will allow you to submit your evaluation results to the official GenArena leaderboard via GitHub PR.

The workflow will be:

  1. Run evaluation locally with genarena run
  2. Upload results to your HuggingFace repository
  3. Submit via genarena submit which creates a PR for review

Documentation

Document               Description
Quick Start            Installation and basic usage guide
Architecture           System design and key concepts
CLI Reference          Complete command-line interface documentation
Experiment Management  How to organize and manage experiments
FAQ                    Frequently asked questions

Citation

TBD

License

Apache License 2.0
