GenArena Arena Evaluation - VLM-based pairwise image generation evaluation
GenArena
A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.
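To make the "pairwise comparison and Elo ranking" idea concrete, here is a minimal sketch of a textbook Elo update driven by pairwise outcomes. The K-factor of 32, the initial rating of 1000, and the model names are illustrative assumptions; this is not GenArena's actual rating code, which may use a different scheme.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0, initial=1000.0):
    """Update ratings in place. outcome: 1.0 = A wins, 0.5 = tie, 0.0 = B wins.

    k and initial are assumed values for illustration only.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))


# Hypothetical battle log: (model_a, model_b, outcome for model_a)
battles = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings = {}
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)

# Highest rating first
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Each update is zero-sum: whatever one model gains, its opponent loses, so the total rating mass stays constant.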
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments show that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
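For readers unfamiliar with the Spearman correlation quoted above, the sketch below computes it for two small score lists using the classic rank-difference formula, which is valid when there are no ties. The scores are made-up illustrative values, not GenArena or LMArena data.

```python
def ranks(values):
    """Rank values from 1 (highest) to n (lowest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical scores for the same four models from two leaderboards
judge_scores = [1100, 1050, 1000, 950]
reference_scores = [1200, 1100, 1150, 1000]

rho = spearman(judge_scores, reference_scores)
```

Here the two leaderboards disagree only on the order of the second and third models, so the correlation is high but not perfect.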
Quick Start
Installation
pip install genarena
Or install from source:
git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .
Initialize Arena
Download benchmark data and official arena data with one command:
genarena init --arena_dir ./arena --data_dir ./data
This downloads:
- Benchmark Parquet data from rhli/genarena (HuggingFace)
- Official arena data (model outputs + battle logs) from rhli/genarena-battlefield
Environment Setup
Set your VLM API credentials:
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"
For multi-endpoint support (load balancing and failover), use comma-separated values:
export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
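As an illustration of how comma-separated endpoint lists like these could be consumed, the sketch below parses both variables and rotates through them round-robin. How GenArena actually pairs keys with URLs, and how it handles failover, is not documented here; the helper names `parse_csv_env` and `endpoint_cycle` and the independent-cycling behavior are assumptions.

```python
import itertools
import os


def parse_csv_env(name, env=None):
    """Split a comma-separated environment variable into a clean list."""
    env = os.environ if env is None else env
    raw = env.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def endpoint_cycle(env=None):
    """Yield (base_url, api_key) pairs forever, round-robin.

    Assumption: URLs and keys are cycled independently, so lists of
    different lengths (e.g. 2 URLs and 3 keys) are both fully used.
    """
    urls = parse_csv_env("OPENAI_BASE_URLS", env)
    keys = parse_csv_env("OPENAI_API_KEYS", env)
    if not urls or not keys:
        raise ValueError("OPENAI_BASE_URLS and OPENAI_API_KEYS must be set")
    return zip(itertools.cycle(urls), itertools.cycle(keys))


# Example with an explicit mapping instead of the real environment
example_env = {
    "OPENAI_BASE_URLS": "https://api1.example.com/v1,https://api2.example.com/v1",
    "OPENAI_API_KEYS": "key1,key2,key3",
}
first_four = list(itertools.islice(endpoint_cycle(example_env), 4))
```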
Run Evaluation
genarena run --arena_dir ./arena --data_dir ./data
View Leaderboard
genarena leaderboard --arena_dir ./arena --subset basic
Check Status
genarena status --arena_dir ./arena --data_dir ./data
Running Your Own Experiments
Directory Structure
To add your own model for evaluation, organize outputs in the following structure:
arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <model_name>/
                ├── 000000.png
                ├── 000001.png
                └── ...
For example:
arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
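The naming convention above can be sketched as a small path helper. The function names are hypothetical and not part of the genarena package; the six-digit zero-padded image names are inferred from the `000000.png` and `000001.png` entries in the tree.

```python
from datetime import date
from pathlib import Path


def model_output_dir(arena_dir, subset, github_id, model_name, day):
    """Build <arena_dir>/<subset>/models/<GithubID>_<modelName>_<yyyymmdd>/<model_name>."""
    exp_name = f"{github_id}_{model_name}_{day:%Y%m%d}"
    return Path(arena_dir) / subset / "models" / exp_name / model_name


def image_name(index):
    """Zero-padded six-digit PNG file name, e.g. 1 -> 000001.png."""
    return f"{index:06d}.png"


out_dir = model_output_dir("arena", "basic", "johndoe", "MyNewModel", date(2026, 2, 5))
first_image = out_dir / image_name(0)
```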
Generate Images with Diffgentor
Use Diffgentor to batch generate images for evaluation:
# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data
# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/
Run Battles for New Models
genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205
GenArena automatically detects new models and schedules battles against existing models.
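Conceptually, scheduling battles for a new model means pairing it with every existing model on every benchmark sample. The sketch below shows that pairing only; it is an assumption about the general idea, not GenArena's scheduler, which may also sample pairs, swap image positions, or skip battles already logged.

```python
def schedule_battles(existing_models, new_models, sample_ids):
    """Return (new_model, existing_model, sample_id) for every combination.

    Assumption: new models are only matched against existing ones,
    not against each other.
    """
    return [
        (new, old, sample)
        for new in new_models
        for old in existing_models
        for sample in sample_ids
    ]


# Hypothetical model names and sample indices
pending = schedule_battles(["ModelA", "ModelB"], ["YourModel"], [0, 1])
```

With two existing models and two samples, one new model yields four pending battles.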
Submit to Official Leaderboard
Coming Soon: The genarena submit command will allow you to submit your evaluation results to the official GenArena leaderboard via a GitHub PR.
The workflow will be:
- Run evaluation locally with genarena run
- Upload results to your HuggingFace repository
- Submit via genarena submit, which creates a PR for review
Documentation
| Document | Description |
|---|---|
| Quick Start | Installation and basic usage guide |
| Architecture | System design and key concepts |
| CLI Reference | Complete command-line interface documentation |
| Experiment Management | How to organize and manage experiments |
| FAQ | Frequently asked questions |
Citation
TBD
License
Apache License 2.0
File details
Details for the file genarena-0.1.0.tar.gz.
File metadata
- Download URL: genarena-0.1.0.tar.gz
- Upload date:
- Size: 177.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ccb2d0f7446f90a1fb5a1ba15718ebf453779ec4d9590d827d4a627332b44f3c |
| MD5 | 5ad479d7f4e0aa2051fd9c69c1d126af |
| BLAKE2b-256 | 57dd24b8068e81462b6e32166a5f8e5074d69b73607ea7d8e5eff5651380c6aa |
Provenance
The following attestation bundles were made for genarena-0.1.0.tar.gz:
Publisher: publish.yml on ruihanglix/genarena
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: genarena-0.1.0.tar.gz
- Subject digest: ccb2d0f7446f90a1fb5a1ba15718ebf453779ec4d9590d827d4a627332b44f3c
- Sigstore transparency entry: 919260742
- Sigstore integration time:
- Permalink: ruihanglix/genarena@58b243a522b69e6050654198b66ecca06fae1230
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ruihanglix
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@58b243a522b69e6050654198b66ecca06fae1230
- Trigger Event: push
File details
Details for the file genarena-0.1.0-py3-none-any.whl.
File metadata
- Download URL: genarena-0.1.0-py3-none-any.whl
- Upload date:
- Size: 176.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60a80d3ed84b5aed5352f5762610e16cf9abe0af73e542a2d77220fbcb236980 |
| MD5 | 5f3347df5f4fe18562b2d8fb5d5f4841 |
| BLAKE2b-256 | 9d79efda73a4915534d67132db8a42d5c06cf9396b548d85e2de08fa4379aa6f |
Provenance
The following attestation bundles were made for genarena-0.1.0-py3-none-any.whl:
Publisher: publish.yml on ruihanglix/genarena
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: genarena-0.1.0-py3-none-any.whl
- Subject digest: 60a80d3ed84b5aed5352f5762610e16cf9abe0af73e542a2d77220fbcb236980
- Sigstore transparency entry: 919260751
- Sigstore integration time:
- Permalink: ruihanglix/genarena@58b243a522b69e6050654198b66ecca06fae1230
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ruihanglix
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@58b243a522b69e6050654198b66ecca06fae1230
- Trigger Event: push