Benchmark for evaluating LLM understanding of web UI: SiFR vs HTML vs AXTree vs Screenshots
SiFR Benchmark
How well do AI agents understand web UI?
Benchmark comparing SiFR vs HTML vs AXTree vs Screenshots across complex websites.
⚠️ This is an example run, not a definitive study. The benchmark is fully reproducible — run it yourself on your sites, your models, your use cases.
Results
Tested on Amazon with a 300 KB budget (`-s 300`) in compound mode (understand → act).
| Format | Understand | Act | Combined | Tokens |
|---|---|---|---|---|
| SiFR | 100% | 25% | 25% | 173K |
| HTML | 100% | 0% | 0% | 194K |
| AXTree | 100% | 25% | 25% | 27K |
| Screenshot | 75% | 0% | 0% | 51K |
Key insight: With HTML, the model understands the page perfectly but can't act on it. With a screenshot, it sees the page but has no element IDs to target. Only SiFR and AXTree support both understanding AND action.
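To make the three score columns concrete, here is a minimal sketch of how such rates could be computed from per-task outcomes. It assumes a compound task counts toward Combined only when both the understanding step and the action step succeed; the `TaskResult` type and `score` function are hypothetical names for illustration, not part of the `sifr-benchmark` API.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one compound task (hypothetical structure)."""
    understood: bool  # the understanding question was answered correctly
    acted: bool       # the correct element was chosen for the action


def score(results: list[TaskResult]) -> dict[str, float]:
    """Return understand/act/combined rates as fractions in [0, 1]."""
    n = len(results)
    if n == 0:
        return {"understand": 0.0, "act": 0.0, "combined": 0.0}
    return {
        "understand": sum(r.understood for r in results) / n,
        "act": sum(r.acted for r in results) / n,
        # Assumption: a task is a combined success only if both steps pass.
        "combined": sum(r.understood and r.acted for r in results) / n,
    }


# Four made-up tasks: all understood, only one acted on correctly.
example = [
    TaskResult(understood=True, acted=True),
    TaskResult(understood=True, acted=False),
    TaskResult(understood=True, acted=False),
    TaskResult(understood=True, acted=False),
]
print(score(example))
```

Under this assumption, a format can score 100% on Understand and still score 0% on Combined if it never produces a usable element reference.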
Budget Matters
| Budget | SiFR Combined | HTML Combined | Winner |
|---|---|---|---|
| 300KB | 25% | 0% | SiFR |
| 100KB | 0% | 50% | HTML |
- Large pages (300KB+): SiFR wins — structure survives truncation
- Small pages (100KB): HTML wins — less overhead, more content
What is SiFR?
Structured Interface Format for Representation — JSON format optimized for LLM understanding of web UI.
{
"id": "a015",
"tag": "a",
"text": "Add to Cart",
"bbox": [500, 300, 120, 40],
"children": []
}
Key advantages:
- Actionable IDs: Every element gets a unique ID (`a015`, `btn003`)
- Bounding boxes: Pixel-perfect positions for design tasks
- Structured JSON: LLMs understand JSON natively
- Hierarchical: Parent-child relationships preserved
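As a rough illustration of why IDs plus bounding boxes are actionable, the sketch below walks a SiFR-like tree, indexes elements by ID, and derives a click point. It assumes `bbox` is `[x, y, width, height]`, which the example above suggests but does not state; the wrapping `div001` node and the helper names are made up for this example.

```python
from typing import Any, Iterator

Node = dict[str, Any]


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def index_by_id(root: Node) -> dict[str, Node]:
    """Map every element ID in the tree to its node."""
    return {n["id"]: n for n in walk(root) if "id" in n}


def center(node: Node) -> tuple[float, float]:
    """Return the center of the node's box, assuming [x, y, width, height]."""
    x, y, w, h = node["bbox"]
    return (x + w / 2, y + h / 2)


# Hypothetical page: a container holding the "Add to Cart" link shown above.
page = {
    "id": "div001",
    "tag": "div",
    "text": "",
    "bbox": [0, 0, 1280, 800],
    "children": [
        {"id": "a015", "tag": "a", "text": "Add to Cart",
         "bbox": [500, 300, 120, 40], "children": []},
    ],
}

elements = index_by_id(page)
target = elements["a015"]
print(target["text"], center(target))
```

An agent that answers with `a015` can then be mapped back to a concrete element and a click coordinate without any further DOM parsing.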
Installation
pip install sifr-benchmark
Prerequisites
- Element-to-LLM Chrome Extension — captures pages in SiFR format
- API Keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-... # optional
- Playwright
playwright install chromium
Quick Start
Full Benchmark
sifr-bench full-benchmark-e2llm https://www.amazon.com \
-e /path/to/element-to-llm-extension \
-s 300 \
--mode compound
Benchmark Modes
🤖 Compound Tasks (AI Agents)
Understanding → Action pairs for autonomous agents.
sifr-bench full-benchmark-e2llm https://amazon.com -e /path/to/ext --mode compound
Tasks:
- "Which product has the highest rating?" → "Click on it"
- "Find items under $50" → "Add to cart"
- "What's the top news story?" → "Open comments"
👨‍💻 Dev Tasks (Frontend Developers)
Selectors, accessibility, structure analysis.
sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode dev
Tasks:
- "What's a stable selector for the login button?" → `btn042`
- "Which images are missing alt text?" → 3 images
- "List all form inputs on the page" → email, password, submit
- "Find buttons without aria-labels" → `btn005`, `btn012`
Why SiFR wins for devs:
- Stable IDs vs fragile CSS selectors
- Element inventory built-in
- No DOM parsing needed
🎨 Design Tasks (UI/UX Designers)
Spacing, typography, consistency checks.
sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode design
Tasks:
- "What's the height of the hero section?" → ~500px
- "Are all cards the same width?" → Yes, 4 columns
- "How many button variants exist?" → 3 styles
- "What's the gap between nav items?" → 24px
Why SiFR wins for designers:
- `bbox` provides exact pixel measurements
- Can calculate spacing mathematically
- No visual estimation needed
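The spacing calculation mentioned above can be sketched in a few lines. This example assumes each `bbox` is `[x, y, width, height]` and that the nav items sit in a single row; the coordinates are invented for illustration and `horizontal_gaps` is a hypothetical helper, not a function shipped with the benchmark.

```python
def horizontal_gaps(bboxes: list[list[float]]) -> list[float]:
    """Return the horizontal gap between each pair of neighbouring boxes.

    Boxes are sorted left to right first; each gap is the next box's left
    edge minus the current box's right edge (x + width).
    """
    ordered = sorted(bboxes, key=lambda b: b[0])
    return [nxt[0] - (cur[0] + cur[2]) for cur, nxt in zip(ordered, ordered[1:])]


# Made-up bounding boxes for three nav items in one row.
nav_items = [
    [100, 20, 60, 24],
    [184, 20, 80, 24],
    [288, 20, 56, 24],
]

gaps = horizontal_gaps(nav_items)
is_consistent = len(set(gaps)) == 1  # True when every gap is identical
print(gaps, is_consistent)
```

The same subtraction works for vertical spacing by using `y` and `height` instead of `x` and `width`.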
🔄 Combined Mode
Run all task types at once.
sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode combined -v
Options
| Option | Description | Default |
|---|---|---|
| `-e, --extension` | Path to E2LLM extension | required |
| `-s, --target-size` | Token budget in KB | 400 |
| `-m, --models` | Models to test | gpt-4o-mini |
| `--mode` | Task type: compound/dev/design/combined | compound |
| `-v, --verbose` | Show detailed output | false |
Multi-Model Comparison
sifr-bench full-benchmark-e2llm https://amazon.com \
-e /path/to/ext \
-s 300 \
-m gpt-4o-mini,gpt-4o,claude-haiku
Supported Models
| Model | Alias | Vision |
|---|---|---|
| GPT-4o | `gpt-4o` | ✅ |
| GPT-4o Mini | `gpt-4o-mini` | ✅ |
| GPT-4 Turbo | `gpt-4-turbo` | ✅ |
| Claude Sonnet 4 | `claude-sonnet` | ✅ |
| Claude Haiku 4.5 | `claude-haiku` | ✅ |
| Claude Opus 4 | `claude-opus` | ✅ |
Output Examples
Compound Tasks
Understanding + Action Results: amazon.com
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ Format ┃ Understand ┃ Act ┃ Combined ┃ Tokens ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sifr │ 100% │ 25% │ 25% │ 172,794 │
│ html_raw │ 100% │ 0% │ 0% │ 194,367 │
│ axtree │ 100% │ 25% │ 25% │ 27,223 │
│ screenshot │ 75% │ 0% │ 0% │ 51,162 │
└────────────┴────────────┴─────┴──────────┴─────────┘
Dev Tasks
Developer Tasks: stripe.com
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
┃ Format ┃ Selector ┃ A11y ┃ Structure ┃ Overall ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
│ sifr │ 80% │ 60% │ 100% │ 75% │
│ html_raw │ 40% │ 80% │ 60% │ 55% │
│ axtree │ 20% │ 100% │ 80% │ 60% │
│ screenshot │ 0% │ 40% │ 40% │ 25% │
└────────────┴──────────┴──────┴───────────┴─────────┘
Design Tasks
Design Tasks: stripe.com
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Format ┃ Spacing ┃ Typography ┃ Consistency ┃ Overall ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
│ sifr │ 90% │ 60% │ 70% │ 75% │
│ screenshot │ 70% │ 80% │ 60% │ 70% │
│ html_raw │ 20% │ 40% │ 50% │ 35% │
│ axtree │ 10% │ 30% │ 40% │ 25% │
└────────────┴─────────┴────────────┴─────────────┴─────────┘
Other Commands
# List all benchmark runs
sifr-bench list-runs
# Compare multiple runs
sifr-bench compare benchmark_runs/run_1 benchmark_runs/run_2
# Validate SiFR files
sifr-bench validate examples/
# Show help
sifr-bench info
Run Directory Structure
benchmark_runs/run_20251208_093517/
├── captures/
│ ├── sifr/*.sifr
│ ├── html/*.html
│ ├── axtree/*.json
│ └── screenshots/*.png
├── ground-truth/*.json
├── results/
│ ├── raw_results.json
│ └── summary.json
└── run_meta.json
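For post-processing a run, a small script along these lines could inventory the captures and load the summary. It relies only on the directory layout shown above; the JSON schema of `summary.json` is not documented here, so the sketch returns it as parsed without assuming any keys. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any

# Glob patterns relative to <run_dir>/captures, taken from the layout above.
CAPTURE_PATTERNS = {
    "sifr": "sifr/*.sifr",
    "html": "html/*.html",
    "axtree": "axtree/*.json",
    "screenshot": "screenshots/*.png",
}


def list_captures(run_dir: Path) -> dict[str, list[str]]:
    """Return the sorted capture file names found for each format."""
    captures = run_dir / "captures"
    return {
        fmt: sorted(p.name for p in captures.glob(pattern))
        for fmt, pattern in CAPTURE_PATTERNS.items()
    }


def load_summary(run_dir: Path) -> Any:
    """Parse results/summary.json; its schema is not assumed here."""
    with (run_dir / "results" / "summary.json").open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `list_captures(Path("benchmark_runs/run_20251208_093517"))` would show which formats were actually captured for that run, which is useful when one format is missing from the results.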
Why Each Format Fails
| Format | Understand | Act | Why |
|---|---|---|---|
| SiFR | ✅ JSON structure | ✅ Has IDs | Best of both worlds |
| HTML | ✅ Full content | ❌ No stable IDs | Can read, can't click |
| AXTree | ✅ Semantic | ⚠️ Own IDs | IDs don't match page |
| Screenshot | ✅ Visual | ❌ No IDs at all | Sees but can't act |
Use Cases
For AI Agent Developers
- Test agent accuracy before deployment
- Compare different LLM backends
- Benchmark against baselines
For Frontend Developers
- Generate stable test selectors
- Audit accessibility issues
- Analyze component structure
For UI/UX Designers
- Verify design system consistency
- Check spacing and typography
- Audit visual hierarchy
Contributing
- Add test sites: Run benchmark on more URLs
- Improve ground truth: Manual verification
- New models: Add support in `models.py`
- Bug reports: Open an issue
License
MIT