# SiFR Benchmark

**How well do AI agents understand web UI?**

A benchmark comparing SiFR vs HTML vs AXTree vs Screenshots across complex websites.
> ⚠️ This is an example run, not a definitive study. The benchmark is fully reproducible: run it yourself on your sites, your models, your use cases. We show our results; you verify on yours.
## Results

Tested on 10 high-complexity sites: Amazon, YouTube, Reddit, eBay, Walmart, Airbnb, Yelp, IMDB, ESPN, GitHub.

All formats were tested under the same 400 KB token budget for a fair comparison.
| Format | Accuracy | Tokens (avg) |
|---|---|---|
| SiFR | 71.7% | 102K |
| Screenshot | 27.0% | 38K |
| Raw HTML | 11.4% | 122K |
| AXTree | 1.5% | 6K |
SiFR is 2.7x more accurate than screenshots and 6.3x more accurate than raw HTML.
## Per-Site Breakdown

| Site | SiFR | Screenshot | HTML | AXTree |
|---|---|---|---|---|
| GitHub | 🏆 100% | 0% | ❌ | 0% |
| YouTube | 🏆 100% | 64% | 0% | 0% |
| Amazon | 🏆 85.7% | 22.9% | ❌ | 0% |
| Walmart | 🏆 85.7% | 13.3% | 11.4% | 0% |
| | 🏆 83.3% | 36% | ❌ | 0% |
| Yelp | 🏆 62.5% | 57.1% | ❌ | 0% |
| ESPN | 🏆 57.1% | 11.4% | 22.9% | 0% |
| IMDB | 🏆 50% | 16% | ❌ | 16.7% |
| eBay | 🏆 28.6% | 26.7% | 11.4% | 0% |
SiFR wins on 9 out of 9 sites where it ran successfully.
## What is SiFR?

**Structured Interface Format for Representation**: a compact format optimized for LLM understanding of web UI.

```
a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  attrs: {href: "/cart/add", class: "btn-primary"}
  salience: high
```
Key advantages:

- **Actionable IDs**: every element gets a unique ID (`a015`, `btn003`)
- **Salience scoring**: high/medium/low importance ranking
- **Structured for LLMs**: optimized for "find element → take action" tasks
- **Model-agnostic**: works with any LLM that can read text
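A record in this shape can be read mechanically. Below is a minimal parser sketch for a single element record, assuming the simple `key: value` layout shown above; the function name, the `attrs`-free sample record, and the parsing rules are illustrative, and the real on-disk grammar may differ.

```python
# Hypothetical parser for one SiFR element record. Field names (tag, text,
# box, salience) come from the README example; everything else is assumed.
import ast

def parse_sifr_element(block: str) -> tuple[str, dict]:
    """Parse 'a015:\\n  tag: a\\n  ...' into (element_id, fields)."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    elem_id = lines[0].rstrip(":")
    fields = {}
    for line in lines[1:]:
        key, _, value = line.strip().partition(": ")
        if key == "box":
            # box is a list of ints: [x, y, width, height]
            fields[key] = ast.literal_eval(value)
        else:
            fields[key] = value.strip('"')
    return elem_id, fields

record = '''a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  salience: high'''

elem_id, fields = parse_sifr_element(record)
print(elem_id, fields["text"], fields["salience"])  # a015 Add to Cart high
```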
## Installation

```shell
pip install sifr-benchmark
```
### Prerequisites

1. **Element-to-LLM Chrome Extension**: captures pages in SiFR format. Load unpacked from `element-to-llm-chrome/`.

2. **API keys**:

   ```shell
   export OPENAI_API_KEY=sk-...
   export ANTHROPIC_API_KEY=sk-ant-...  # optional
   ```

3. **Playwright** (for automated capture):

   ```shell
   playwright install chromium
   ```
## Quick Start

### Full Benchmark

Capture → Generate Ground Truth → Test, all in one command:

```shell
sifr-bench full-benchmark-e2llm https://www.amazon.com https://www.youtube.com \
  -e /path/to/element-to-llm-extension \
  -s 400
```
Options:

- `-e, --extension`: path to the E2LLM extension (required)
- `-s, --target-size`: token budget in KB for ALL formats (default: 400)
- `-m, --models`: models to test (default: `gpt-4o-mini`)
- `-v, --verbose`: show detailed output
### Other Commands

```shell
# List all benchmark runs
sifr-bench list-runs

# Compare multiple runs
sifr-bench compare benchmark_runs/run_1 benchmark_runs/run_2

# Validate SiFR files
sifr-bench validate examples/

# Show help
sifr-bench info
```
## How It Works

### 1. Capture

The extension captures 4 formats simultaneously:

- **SiFR**: structured format with salience scoring
- **HTML**: raw rendered DOM (`outerHTML`)
- **AXTree**: Playwright accessibility tree
- **Screenshot**: full-page PNG
### 2. Ground Truth Generation

GPT-4o Vision analyzes the screenshot plus the SiFR capture to generate agent tasks:

- Click: "Click the Sign In button" → `a003`
- Input: "Enter search query" → `input001`
- Locate: "Find the main heading" → `h1001`
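The generated tasks can be pictured as records like the following. The field names here are illustrative; the exact schema written to `ground-truth/*.json` may differ.

```python
# Hypothetical shape of the auto-generated ground-truth tasks, mirroring
# the three task types above. Field names are assumptions, not the schema.
tasks = [
    {"type": "click",  "instruction": "Click the Sign In button", "expected_id": "a003"},
    {"type": "input",  "instruction": "Enter search query",       "expected_id": "input001"},
    {"type": "locate", "instruction": "Find the main heading",    "expected_id": "h1001"},
]
print(len(tasks))  # 3
```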
### 3. Benchmark

Each format is tested with the same token budget, same model, same prompts:

```
Task: "Click on the shopping cart icon"
Expected: a015

SiFR response:       a015              ✓
HTML response:       none              ✗
Screenshot response: cart icon (no ID) ✗
```
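Scoring in this style reduces to exact-match on element IDs: a response counts only if it names the expected ID. A minimal sketch, with illustrative function and data shapes (not the benchmark's internal API):

```python
# Exact-match accuracy over a set of tasks, as described above. The task
# IDs and dictionaries here are illustrative assumptions.
def accuracy(responses: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of tasks where the model returned exactly the expected ID."""
    correct = sum(
        1 for task_id, want in expected.items()
        if responses.get(task_id, "").strip() == want
    )
    return correct / len(expected) if expected else 0.0

expected = {"t1": "a015", "t2": "input001", "t3": "h1001"}
responses = {"t1": "a015", "t2": "none", "t3": "h1001"}
print(accuracy(responses, expected))  # 2 of 3 tasks correct
```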
## Methodology Notes

**Run it yourself.** This benchmark exists so you can test on your own sites and models. Our results are one data point; your results on your use case matter more.

- **Equal token budget**: all formats are truncated to the same size (400 KB by default) for a fair comparison.
- **Ground truth is auto-generated**: GPT-4o Vision creates the tasks. For production use, consider human verification.
- **AXTree's near-zero accuracy is a real finding**: many agent frameworks rely on accessibility trees. This shows why that can be problematic.
- **7 tasks per site**: practical, not academic. When did you last need 2,000 clicks on one page?
## Why Raw HTML Fails

```
Amazon HTML:      909 KB original
After truncation: 400 KB (loses 56% of content)
Result:           0% accuracy (critical elements gone)

Amazon SiFR:      613 KB original
After truncation: 400 KB (loses 35% of content)
Result:           85.7% accuracy (structure survives)
```

HTML is verbose. When you truncate it, you lose random chunks. SiFR is pre-compressed with salience scoring, so important elements survive truncation.
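The difference can be sketched as follows: naive truncation cuts at the budget no matter what falls past the cut, while salience-first truncation spends the budget on important elements before anything else. The element shapes and the character-length budget below are illustrative assumptions, not the benchmark's actual compression code.

```python
# Naive truncation vs. salience-first truncation (illustrative sketch).
def truncate_naive(text: str, budget: int) -> str:
    """Cut at the budget; whatever lies past the cut is simply lost."""
    return text[:budget]

def truncate_by_salience(elements: list[dict], budget: int) -> list[dict]:
    """Keep high-salience elements first, so important ones survive."""
    order = {"high": 0, "medium": 1, "low": 2}
    kept, used = [], 0
    for el in sorted(elements, key=lambda e: order[e["salience"]]):
        size = len(el["text"])
        if used + size <= budget:
            kept.append(el)
            used += size
    return kept

elements = [
    {"text": "x" * 300, "salience": "low"},       # filler content
    {"text": "Add to Cart", "salience": "high"},  # the element that matters
    {"text": "y" * 300, "salience": "medium"},
]
survivors = truncate_by_salience(elements, budget=320)
print([e["salience"] for e in survivors])  # high and medium fit; low is dropped
```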
## Output Format

```
Benchmark Results: Combined (10 sites)
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ Format     ┃ Accuracy ┃ Tokens  ┃ Latency  ┃ Status ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ sifr       │ 71.7%    │ 101,683 │ 30,221ms │ ✅     │
│ screenshot │ 27.0%    │ 38,074  │ 7,942ms  │ ⚠️     │
│ html_raw   │ 11.4%    │ 122,190 │ 35,901ms │ ⚠️     │
│ axtree     │ 1.5%     │ 6,044   │ 2,034ms  │ ⚠️     │
└────────────┴──────────┴─────────┴──────────┴────────┘
```
Status:

- ✅ Success (accuracy ≥ 50%)
- ⚠️ Warning (accuracy < 50%)
- ❌ Failed (accuracy = 0%)
## Run Directory Structure

Each benchmark creates an isolated run:

```
benchmark_runs/run_20251206_210357/
├── captures/
│   ├── sifr/*.sifr
│   ├── html/*.html
│   ├── axtree/*.json
│   └── screenshots/*.png
├── ground-truth/*.json
├── results/
│   ├── raw_results.json
│   └── summary.json
└── run_meta.json
```
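A run laid out this way is easy to post-process. Below is a hypothetical reader that picks the best-scoring format from `summary.json`, assuming that file maps format names to objects with an `accuracy` field; the actual schema may differ, so check a real run before relying on this.

```python
# Hypothetical post-processing of a run directory. The summary.json schema
# used here (format name -> {"accuracy": ...}) is an assumption.
import json
import tempfile
from pathlib import Path

def best_format(run_dir: str) -> str:
    """Return the format with the highest accuracy in a run's summary."""
    summary = json.loads(Path(run_dir, "results", "summary.json").read_text())
    return max(summary, key=lambda fmt: summary[fmt]["accuracy"])

# Demo against a synthetic run directory with the layout shown above.
with tempfile.TemporaryDirectory() as run_dir:
    results = Path(run_dir, "results")
    results.mkdir()
    (results / "summary.json").write_text(json.dumps({
        "sifr": {"accuracy": 0.717},
        "screenshot": {"accuracy": 0.270},
    }))
    winner = best_format(run_dir)

print(winner)  # sifr
```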
## Tested Models

Default: `gpt-4o-mini`

The benchmark supports any OpenAI or Anthropic model. Run with different models:

```shell
sifr-bench full-benchmark-e2llm ... -m gpt-4o
sifr-bench full-benchmark-e2llm ... -m claude-sonnet
```
## Contributing

- **Add test sites**: run the benchmark on more URLs
- **Improve ground truth**: manually verify tasks
- **New models**: add support in `models.py`
- **Bug reports**: open an issue
## Citation

```bibtex
@misc{sifr2025,
  title={SiFR: Structured Interface Format for AI Web Agents},
  author={SiFR Contributors},
  year={2025},
  url={https://github.com/Alechko375/sifr-benchmark}
}
```
## License

MIT