GenArena Arena Evaluation - VLM-based pairwise image generation evaluation

GenArena

A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments show that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far above the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
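To make the "pairwise comparison plus Elo ranking" idea concrete, here is a minimal sketch of the textbook Elo update applied to a list of pairwise outcomes. It is an illustration only: this README does not specify GenArena's actual rating procedure, and the K-factor of 32, the initial rating of 1000, and the model names are assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from GenArena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical battle log: (model_a, model_b, score for model_a).
battles = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings = {}
for a, b, score_a in battles:
    # Every model starts at an assumed initial rating of 1000.
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = elo_update(ra, rb, score_a)

# Print models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rating:.1f}")
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant, and only the relative order of the models carries meaning.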

Quick Start

Installation

pip install genarena

Or install from source:

git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .

Initialize Arena

Download benchmark data and official arena data with one command:

genarena init --arena_dir ./arena --data_dir ./data

This downloads:

  • Benchmark Parquet data from rhli/genarena (Hugging Face)
  • Official arena data (model outputs + battle logs) from rhli/genarena-battlefield (Hugging Face)

Environment Setup

Set your VLM API credentials:

export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"

For multi-endpoint support (load balancing and failover), use comma-separated values:

export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
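As a rough illustration of how comma-separated endpoint variables can be consumed, the sketch below parses the four variables named above and rotates through them round-robin. How GenArena itself pairs keys with URLs is not documented here, so cycling each list independently is an assumption, and the helper names (split_csv, load_endpoints, round_robin) are hypothetical.

```python
import itertools
import os


def split_csv(value):
    """Split a comma-separated string, dropping blanks and surrounding whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_endpoints(env=None):
    """Collect base URLs and API keys, preferring the plural variables when set."""
    env = os.environ if env is None else env
    urls = split_csv(env.get("OPENAI_BASE_URLS", "")) or split_csv(env.get("OPENAI_BASE_URL", ""))
    keys = split_csv(env.get("OPENAI_API_KEYS", "")) or split_csv(env.get("OPENAI_API_KEY", ""))
    return urls, keys


def round_robin(urls, keys):
    """Yield (base_url, api_key) pairs forever, cycling each list independently.

    Independent cycling is an assumption for this sketch, not GenArena's
    documented behavior.
    """
    return zip(itertools.cycle(urls), itertools.cycle(keys))


# Demo with an explicit mapping instead of the real environment.
demo_env = {
    "OPENAI_BASE_URLS": "https://api1.example.com/v1,https://api2.example.com/v1",
    "OPENAI_API_KEYS": "key1,key2,key3",
}
urls, keys = load_endpoints(demo_env)
for base_url, api_key in itertools.islice(round_robin(urls, keys), 4):
    print(base_url, api_key)
```

Calling load_endpoints() with no argument reads the real process environment, so the same function works after the export commands above.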

Run Evaluation

genarena run --arena_dir ./arena --data_dir ./data

View Leaderboard

genarena leaderboard --arena_dir ./arena --subset basic

Check Status

genarena status --arena_dir ./arena --data_dir ./data

Running Your Own Experiments

Directory Structure

To add your own model for evaluation, organize outputs in the following structure:

arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <modelName>/
                ├── 000000.png
                ├── 000001.png
                └── ...

For example:

arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
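Before running battles, it can help to build the output path programmatically and check the image files. The sketch below follows the layout shown above; the helper names are hypothetical, and the gap check assumes that images are numbered consecutively from 000000, which the listing suggests but does not state.

```python
from pathlib import Path


def model_output_dir(arena_dir, subset, github_id, model_name, date_yyyymmdd):
    """Build <arena_dir>/<subset>/models/<GithubID>_<modelName>_<yyyymmdd>/<modelName>."""
    exp_name = f"{github_id}_{model_name}_{date_yyyymmdd}"
    return Path(arena_dir) / subset / "models" / exp_name / model_name


def find_gaps(output_dir):
    """Return indices missing from a 000000.png, 000001.png, ... sequence.

    Assumes consecutive numbering from zero (an assumption of this sketch).
    Files whose stem is not purely numeric are ignored.
    """
    present = sorted(
        int(p.stem) for p in Path(output_dir).glob("*.png") if p.stem.isdigit()
    )
    if not present:
        return []
    return sorted(set(range(present[-1] + 1)) - set(present))


out_dir = model_output_dir("arena", "basic", "johndoe", "MyNewModel", "20260205")
print(out_dir.as_posix())
print(find_gaps(out_dir))  # empty list if the directory is absent or complete
```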

Generate Images with Diffgentor

Use Diffgentor to batch generate images for evaluation:

# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data

# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/

Run Battles for New Models

genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205

GenArena automatically detects new models and schedules battles against existing models.
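Conceptually, scheduling battles for a new model means pairing it with each existing model on each benchmark sample. The sketch below shows that idea in its simplest form. It is not GenArena's scheduler, which may sample pairs, randomize presentation order, or skip battles already logged; the function and model names are hypothetical.

```python
from itertools import product


def schedule_battles(existing_models, new_models, sample_ids):
    """Pair every new model with every existing model on every sample.

    Returns a list of (new_model, existing_model, sample_id) tuples.
    """
    return [
        (new, old, sample_id)
        for new, old in product(new_models, existing_models)
        for sample_id in sample_ids
    ]


# Hypothetical arena state: two existing models, one newly added model.
plan = schedule_battles(
    existing_models=["baseline_a", "baseline_b"],
    new_models=["YourModel"],
    sample_ids=[0, 1, 2],
)
print(len(plan))
```

The number of battles grows as (new models) × (existing models) × (samples), which is why a real scheduler would typically cap or sample the pairs.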

Submit to Official Leaderboard

Coming Soon: The genarena submit command will allow you to submit your evaluation results to the official GenArena leaderboard via GitHub PR.

The workflow will be:

  1. Run evaluation locally with genarena run
  2. Upload results to your Hugging Face repository
  3. Submit via genarena submit, which creates a PR for review

Documentation

Document               Description
Quick Start            Installation and basic usage guide
Architecture           System design and key concepts
CLI Reference          Complete command-line interface documentation
Experiment Management  How to organize and manage experiments
FAQ                    Frequently asked questions

Citation

TBD

License

Apache License 2.0
