GenArena Arena Evaluation - VLM-based pairwise image generation evaluation

GenArena

A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that uses a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments show that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, compared with 0.36 for pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
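
To make the "pairwise comparison plus Elo ranking" idea concrete, here is a minimal sketch of the textbook Elo update applied to a list of battle outcomes. It is an illustration only: the function names, the starting rating of 1000, and the K-factor of 32 are assumptions, and GenArena's actual rating computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    The K-factor default is an assumed value, not GenArena's setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_models(battles, initial: float = 1000.0, k: float = 32.0):
    """Replay battles in order and return (model, rating) sorted best first.

    battles is an iterable of (model_a, model_b, score_a) tuples.
    """
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judge verdicts: (model_a, model_b, score for model_a).
    demo = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]
    for name, rating in rank_models(demo):
        print(f"{name}: {rating:.1f}")
```

Because each update adds to one rating exactly what it subtracts from the other, the sum of all ratings stays constant, which is why only the relative ordering is meaningful.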

Quick Start

Installation

pip install genarena

Or install from source:

git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .

Initialize Arena

Download benchmark data and official arena data with one command:

genarena init --arena_dir ./arena --data_dir ./data

This downloads:

  • Benchmark Parquet data from rhli/genarena (HuggingFace)
  • Official arena data (model outputs + battle logs) from rhli/genarena-battlefield

Environment Setup

Set your VLM API credentials:

export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"

For multi-endpoint support (load balancing and failover), use comma-separated values:

export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
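
As a rough sketch of what load balancing and failover over these variables could look like, the Python below parses the comma-separated values and tries endpoints in turn. How GenArena actually pairs keys with URLs is not documented here; the round-robin pairing, the `load_endpoints` and `call_with_failover` helpers, and the `request` callable are all hypothetical.

```python
import os


def _split(value):
    """Split a comma-separated string, dropping empty entries and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_endpoints(env=None):
    """Return a list of (base_url, api_key) pairs from environment settings.

    Assumption: when the URL and key counts differ, both lists are cycled
    round-robin so every URL and every key is used at least once.
    """
    env = os.environ if env is None else env
    urls = _split(env.get("OPENAI_BASE_URLS", ""))
    keys = _split(env.get("OPENAI_API_KEYS", ""))
    # Fall back to the single-endpoint variables when the plural ones are unset.
    if not urls and env.get("OPENAI_BASE_URL"):
        urls = [env["OPENAI_BASE_URL"]]
    if not keys and env.get("OPENAI_API_KEY"):
        keys = [env["OPENAI_API_KEY"]]
    if not urls or not keys:
        raise ValueError("no API endpoint or key configured")
    count = max(len(urls), len(keys))
    return [(urls[i % len(urls)], keys[i % len(keys)]) for i in range(count)]


def call_with_failover(endpoints, request, start=0):
    """Try each endpoint once, beginning at index start; return the first success.

    request is a hypothetical callable taking (base_url, api_key).
    """
    last_error = None
    for offset in range(len(endpoints)):
        base_url, api_key = endpoints[(start + offset) % len(endpoints)]
        try:
            return request(base_url, api_key)
        except Exception as exc:  # any failure moves on to the next endpoint
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```

Varying `start` between calls (for example, incrementing it per request) spreads load across endpoints, while the loop provides failover when one of them is down.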

Run Evaluation

genarena run --arena_dir ./arena --data_dir ./data

View Leaderboard

genarena leaderboard --arena_dir ./arena --subset basic

Check Status

genarena status --arena_dir ./arena --data_dir ./data

Running Your Own Experiments

Directory Structure

To add your own model for evaluation, organize outputs in the following structure:

arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <model_name>/
                ├── 000000.png
                ├── 000001.png
                └── ...

For example:

arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
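
The layout above can be built and checked programmatically. The following is a small sketch, not part of GenArena: `model_output_dir` and `parse_exp_name` are hypothetical helpers, and the regular expression assumes the GitHub ID contains no underscore and the date is a valid `yyyymmdd` value.

```python
import re
from datetime import datetime
from pathlib import Path

# <GithubID>_<modelName>_<yyyymmdd>; the model name may itself contain underscores.
EXP_NAME = re.compile(r"^(?P<github_id>[^_]+)_(?P<model>.+)_(?P<date>\d{8})$")


def model_output_dir(arena_dir, subset, github_id, model_name, date):
    """Build arena_dir/<subset>/models/<GithubID>_<modelName>_<yyyymmdd>/<model_name>."""
    exp_name = f"{github_id}_{model_name}_{date}"
    return Path(arena_dir) / subset / "models" / exp_name / model_name


def parse_exp_name(exp_name):
    """Split an experiment directory name into (github_id, model, date).

    Raises ValueError if the name does not match the pattern or the date
    is not a real calendar date.
    """
    match = EXP_NAME.match(exp_name)
    if match is None:
        raise ValueError(f"unexpected experiment name: {exp_name!r}")
    datetime.strptime(match["date"], "%Y%m%d")  # ValueError on invalid dates
    return match["github_id"], match["model"], match["date"]


if __name__ == "__main__":
    path = model_output_dir("arena", "basic", "johndoe", "MyNewModel", "20260205")
    print(path.as_posix())
    print(parse_exp_name(path.parent.name))
```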

Generate Images with Diffgentor

Use Diffgentor to batch generate images for evaluation:

# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data

# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/

Run Battles for New Models

genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205

GenArena automatically detects new models and schedules battles against existing models.
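
Conceptually, scheduling battles for new models amounts to pairing each new model with every model already in the arena. The sketch below illustrates that idea only; `schedule_battles` is a hypothetical helper, the optional new-versus-new pairing is an assumption, and GenArena's real scheduler may sample or prioritize pairs differently.

```python
from itertools import combinations


def schedule_battles(existing, new, include_new_vs_new=False):
    """Return a list of (model_a, model_b) pairs to evaluate.

    Every new model is paired with every existing model. Pairing new models
    with each other is optional and is an assumption of this sketch.
    """
    existing_models = sorted(set(existing))
    # Ignore anything listed as new that is already in the arena.
    new_models = sorted(set(new) - set(existing_models))
    pairs = [(n, e) for n in new_models for e in existing_models]
    if include_new_vs_new:
        pairs.extend(combinations(new_models, 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical model names for illustration.
    for pair in schedule_battles(["ModelA", "ModelB"], ["YourModel"]):
        print(pair)
```

With `m` existing models and `n` new ones, this produces `m * n` pairs, so adding a single model costs one battle series per existing model rather than a full round-robin over the whole arena.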

Submit to Official Leaderboard

Coming Soon: The genarena submit command will allow you to submit your evaluation results to the official GenArena leaderboard via GitHub PR.

The workflow will be:

  1. Run evaluation locally with genarena run
  2. Upload results to your HuggingFace repository
  3. Submit via genarena submit which creates a PR for review

Documentation

Document               Description
Quick Start            Installation and basic usage guide
Architecture           System design and key concepts
CLI Reference          Complete command-line interface documentation
Experiment Management  How to organize and manage experiments
FAQ                    Frequently asked questions

Citation

TBD

License

Apache License 2.0
