# widget2code-bench
Benchmark evaluation for widget code generation — 12 quality metrics across layout, legibility, perceptual, style, and geometry.
## Installation
```shell
# 1. Install PyTorch with CUDA support first (skip if CPU-only)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# 2. Install widget2code-bench
pip install widget2code-bench
```
Note: PyPI only ships CPU-only PyTorch. To use `--cuda`, you must install PyTorch from the official index before installing this package.
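To confirm which PyTorch build you ended up with before passing `--cuda`, a quick check (plain Python, no widget2code-bench APIs involved):

```python
# Check whether the installed PyTorch build can actually use --cuda.
try:
    import torch
    has_cuda = torch.cuda.is_available()
except ImportError:  # torch not installed yet
    has_cuda = False

print("GPU available: pass --cuda" if has_cuda else "CPU-only: omit --cuda")
```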
## Usage

### Single image mode
Evaluate one GT-prediction pair. Prints JSON results to stdout; no files are saved.
```shell
widget2code-bench \
  --gt_image /path/to/gt.png \
  --pred_image /path/to/pred.png \
  --cuda
```
### Batch mode
Evaluate all matched pairs in directories.
```shell
widget2code-bench \
  --gt_dir /path/to/GT \
  --pred_dir /path/to/predictions \
  --pred_name output.png \
  --cuda
```
### Directory Structure (batch mode)
- GT dir: flat image files with 4-digit IDs in filenames (e.g. `gt_0001.png`)
- Pred dir: subfolders with 4-digit IDs in their names, each containing a `--pred_name` file
```
gt_dir/                  pred_dir/
  gt_0001.png              image_0001/
  gt_0002.png                output.png
  ...                      image_0002/
                             output.png
```
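The matching rule can be pictured with a short sketch. This is not the package's actual code; `pair_by_id` and its matching behavior are illustrative, modeled on the layout described above (GT filenames and prediction subfolder names are paired by their shared 4-digit ID):

```python
import re
from pathlib import Path

def pair_by_id(gt_dir, pred_dir, pred_name="output.png"):
    """Pair GT images with prediction subfolders by 4-digit ID (illustrative)."""
    id_re = re.compile(r"\d{4}")
    gts = {}
    for p in Path(gt_dir).iterdir():
        m = id_re.search(p.name)
        if m:
            gts[m.group(0)] = p
    pairs, missing = {}, []
    for gid, gt in sorted(gts.items()):
        subdirs = [d for d in Path(pred_dir).iterdir()
                   if d.is_dir() and gid in d.name]
        pred = subdirs[0] / pred_name if subdirs else None
        if pred is not None and pred.exists():
            pairs[gid] = (gt, pred)
        else:
            missing.append(gid)  # no subfolder, or pred_name file absent
    return pairs, missing
```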
## Options
| Flag | Default | Description |
|---|---|---|
| `--gt_image` | — | Single GT image path |
| `--pred_image` | — | Single prediction image path |
| `--gt_dir` | — | GT directory (flat image files) |
| `--pred_dir` | — | Prediction directory (subfolders) |
| `--pred_name` | `output.png` | Prediction filename inside each subfolder |
| `--output_dir` | `{pred_dir}/.analysis` | Statistics output directory |
| `--workers` | 4 | Parallel threads |
| `--cuda` | off | Enable GPU |
| `--skip_eval` | off | Skip evaluation; only regenerate statistics xlsx files from existing evaluation.json |
| `--minimal` | off | Skip per-metric visualization PNGs (default: verbose with viz) |
## Output (batch mode)

### Per-sample outputs
Every matched pair writes one `evaluation.json` plus (by default) a full per-metric visualization set into its sample folder:
```
<pred_dir>/
  image_0001/
    output.png
    evaluation/
      evaluation.json        # 12 metrics
      viz/
        MarginAsymmetry.png
        ContentAspectDiff.png
        AreaRatioDiff.png
        TextJaccard.png
        ContrastDiff.png
        ContrastLocalDiff.png
        PaletteDistance.png
        Vibrancy.png
        PolarityConsistency.png
        ssim.png
        lp.png
        geo_score.png
```
Each viz PNG shows the GT and Pred intermediates (left/middle) and the formula, intermediate values, and final score (right), so you can see exactly how the metric was computed.

Pass `--minimal` to skip the `viz/` directory (much faster, ~10x less disk).
### Missing-prediction handling
The evaluator always produces all four fill modes. When a GT image has no matching prediction:

- Existing subfolder, prediction file missing → fill results go in that folder's `evaluation/`
- No subfolder at all → the evaluator creates `pred_dir/fill_<id>/evaluation/`
In either case it writes:

```
evaluation/
  evaluation_black.json   # GT vs all-black image
  evaluation_white.json   # GT vs all-white image
```

The zero fill is not a per-sample file; it is a worst-case contribution (LPIPS = 1.0, all other metrics = 0) used only when aggregating the combined summary.
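The zero-fill aggregation can be sketched as follows. `zero_fill_mean` is a hypothetical helper written for illustration, not the package's implementation; it only encodes the stated rule that missing predictions contribute the worst-case value:

```python
def zero_fill_mean(scores, n_missing, metric="ssim"):
    """Aggregate one metric with worst-case fill for missing predictions.

    Illustrative sketch: LPIPS ('lp') is a distance, so its worst case
    is 1.0; every other metric is higher-is-better, so worst case is 0.
    """
    worst = 1.0 if metric == "lp" else 0.0
    values = list(scores) + [worst] * n_missing
    return round(sum(values) / len(values), 2)
```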
### Aggregate outputs (`.analysis/`)
```
<pred_dir>/.analysis/
  metrics_stats.json            # per-metric quartiles/mean/std over matched pairs
  metrics.xlsx                  # 4-row combined summary (raw/black/white/zero)
  raw/<run>-raw-<ver>.xlsx      # single-row summary per mode
  black/<run>-black-<ver>.xlsx
  white/<run>-white-<ver>.xlsx
  zero/<run>-zero-<ver>.xlsx
```
| Mode | Description |
|---|---|
| `raw` | Matched pairs only (missing skipped) |
| `black` | Missing preds scored against an all-black image |
| `white` | Missing preds scored against an all-white image |
| `zero` | Missing preds contribute the worst-case value (LPIPS = 1.0, others = 0) |
All numeric values are rounded to 2 decimals. The combined `metrics.xlsx` has a two-level header grouping metrics by category (Layout / Legibility / Style / Perceptual / Geometry) plus SuccessRate (ratio, count); per-mode xlsx files use flat single-level headers.

All metrics are higher-is-better except `lp` (LPIPS), which is a distance (lower is better).
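The kind of per-metric summary `metrics_stats.json` is described to contain (quartiles, mean, std) can be computed with the standard library. `metric_stats` below is an illustrative sketch of that computation, not the package's code, and the exact quartile method the package uses is an assumption:

```python
import statistics

def metric_stats(values):
    """Quartiles/mean/std summary for one metric over matched pairs
    (illustrative; quartile method is an assumption)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": round(statistics.mean(values), 2),
        "std": round(statistics.stdev(values), 2),
        "q1": round(q1, 2),
        "median": round(median, 2),
        "q3": round(q3, 2),
    }
```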
## License
Apache-2.0