
SiFR Benchmark

How well do AI agents understand web UI?

Benchmark comparing SiFR vs HTML vs AXTree vs Screenshots across complex websites.

⚠️ This is an example run, not a definitive study. The benchmark is fully reproducible — run it yourself on your sites, your models, your use cases.

Results

Tested on Amazon with 300KB token budget, compound tasks (understand → act).

| Format     | Understand | Act | Combined | Tokens |
|------------|-----------:|----:|---------:|-------:|
| SiFR       | 100%       | 25% | 25%      | 173K   |
| HTML       | 100%       | 0%  | 0%       | 194K   |
| AXTree     | 100%       | 25% | 25%      | 27K    |
| Screenshot | 75%        | 0%  | 0%       | 51K    |

Key insight: HTML understands perfectly but can't act. Screenshot sees the page but has no element IDs. Only SiFR and AXTree can both understand AND act.

Budget Matters

| Budget | SiFR Combined | HTML Combined | Winner |
|--------|--------------:|--------------:|--------|
| 300KB  | 25%           | 0%            | SiFR   |
| 100KB  | 0%            | 50%           | HTML   |

  • Large pages (300KB+): SiFR wins — structure survives truncation
  • Small pages (100KB): HTML wins — less overhead, more content
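The structural claim behind the 300KB result can be illustrated with a toy pruning function: instead of cutting bytes mid-tag the way a naive HTML truncation does, a SiFR tree can drop leaf subtrees until it fits the budget, so whatever survives is still valid JSON. This is an illustrative sketch, not the benchmark's actual truncation code; the field names follow the SiFR example below.

```python
import json

def prune_to_budget(node: dict, budget: int) -> dict:
    """Drop leaf subtrees, deepest-first, until the serialized tree
    fits the byte budget. Unlike a raw byte cut on HTML, the result
    is always parseable JSON with the hierarchy intact."""
    node = json.loads(json.dumps(node))  # work on a deep copy
    while len(json.dumps(node)) > budget:
        parent = _last_parent(node)
        if parent is None:               # only a childless root remains
            break
        parent["children"].pop()         # remove one leaf subtree
    return node

def _last_parent(node: dict):
    """Return the last node in document order that still has children."""
    result = node if node.get("children") else None
    for child in node.get("children", []):
        deeper = _last_parent(child)
        if deeper is not None:
            result = deeper
    return result
```

Truncating raw HTML at the same byte offset routinely leaves unclosed tags and half-written attributes, which is one plausible reading of why HTML's act score collapses at large budgets.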

What is SiFR?

Structured Interface Format for Representation — a JSON format optimized for LLM understanding of web UI.

{
  "id": "a015",
  "tag": "a",
  "text": "Add to Cart",
  "bbox": [500, 300, 120, 40],
  "children": []
}

Key advantages:

  • Actionable IDs: Every element gets a unique ID (a015, btn003)
  • Bounding boxes: Pixel-perfect positions for design tasks
  • Structured JSON: LLMs understand JSON natively
  • Hierarchical: Parent-child relationships preserved
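Because every node carries a unique ID, a SiFR document can be flattened into a lookup table in one pass. A minimal sketch, using the field names from the JSON example above:

```python
def index_by_id(node: dict, table=None) -> dict:
    """Flatten a SiFR tree into {id: node}, so an agent's answer
    (e.g. "a015") resolves to a full element in O(1)."""
    if table is None:
        table = {}
    table[node["id"]] = node
    for child in node.get("children", []):
        index_by_id(child, table)
    return table
```

With the index in hand, `index_by_id(doc)["a015"]["bbox"]` turns a model's one-token answer into a concrete on-page target.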

Installation

pip install sifr-benchmark

Prerequisites

  1. Element-to-LLM Chrome Extension — captures pages in SiFR format

  2. API Keys

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...  # optional
  3. Playwright

playwright install chromium

Quick Start

sifr-bench full-benchmark-e2llm https://www.amazon.com \
  -e /path/to/element-to-llm-extension \
  -s 300 \
  --mode compound \
  -v

How It Works

Single Session Architecture

The benchmark runs in a single page session — no reload between capture and verification:

┌─────────────────────────────────────────────────────────┐
│                    SINGLE PAGE SESSION                   │
├─────────────────────────────────────────────────────────┤
│  1. Load page         → page.goto(url)                  │
│  2. Capture formats   → SiFR, HTML, AXTree, Screenshot  │
│  3. Generate tasks    → GPT-4o vision                   │
│  4. Query LLM         → understand + act                │
│  5. Verify on page    → Playwright trial click          │
│  6. Next URL          → repeat                          │
└─────────────────────────────────────────────────────────┘

Why this matters: Dynamic pages (carousels, recommendations, A/B tests) change on reload. Single session ensures the element IDs from capture match the actual page during verification.

Verification Pipeline

Act success is measured by functional testing, not text matching:

LLM Response    →    Resolve ID    →    Trial Click    →    Success?
   "a012"       →    "#product-1"   →    click(trial)   →    ✓/✗

Verification stages:

  1. Parse — extract element ID from response
  2. Resolve — ID → CSS selector (via SiFR data)
  3. Find — selector → element on page
  4. Visible — element is visible?
  5. Click — element is clickable?
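The five stages above can be sketched as a pure function over mock page state; the real pipeline runs them against a live Playwright page, but the control flow is the same shape. Everything here — the ID regex, the selector field, the page dict — is illustrative, not the benchmark's actual code.

```python
import re

def verify_action(response: str, sifr_index: dict, page: dict):
    """Run the five verification stages; return (success, failed_stage).
    `page` stands in for live browser state: selector -> {visible, clickable}."""
    m = re.search(r"\b([a-z]+\d+)\b", response)        # 1. parse the element ID
    if not m:
        return False, "parse"
    node = sifr_index.get(m.group(1))                  # 2. resolve ID -> selector
    if node is None or "selector" not in node:
        return False, "resolve"
    el = page.get(node["selector"])                    # 3. find element on page
    if el is None:
        return False, "find"
    if not el.get("visible"):                          # 4. visibility check
        return False, "visible"
    if not el.get("clickable"):                        # 5. trial click
        return False, "click"
    return True, None
```

Returning the failed stage is what makes the per-task diagnostics possible — a `visible` failure and a `find` failure point at very different problems.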

Use --debug to see exactly where verification fails.

Benchmark Modes

🤖 Compound Tasks (AI Agents)

Understanding → Action pairs for autonomous agents.

sifr-bench full-benchmark-e2llm https://amazon.com -e /path/to/ext --mode compound

Tasks:

  • "Which product has the highest rating?" → "Click on it"
  • "Find items under $50" → "Add to cart"
  • "What's the top news story?" → "Open comments"

👨‍💻 Dev Tasks (Frontend Developers)

Selectors, accessibility, structure analysis.

sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode dev

Tasks:

  • "What's a stable selector for the login button?" → btn042
  • "Which images are missing alt text?" → 3 images
  • "List all form inputs on the page" → email, password, submit
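An audit like the alt-text task reduces to a tree walk once the page is captured. A hedged sketch — it assumes SiFR exposes element attributes under an "attrs" key, which is a hypothetical field name not shown in the format example above:

```python
def images_missing_alt(node: dict, found=None) -> list:
    """Collect IDs of img nodes without alt text.
    Assumes attributes live under node["attrs"] (hypothetical field name)."""
    if found is None:
        found = []
    if node.get("tag") == "img" and not node.get("attrs", {}).get("alt"):
        found.append(node["id"])
    for child in node.get("children", []):
        images_missing_alt(child, found)
    return found
```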

🎨 Design Tasks (UI/UX Designers)

Spacing, typography, consistency checks.

sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode design

Tasks:

  • "What's the height of the hero section?" → ~500px
  • "Are all cards the same width?" → Yes, 4 columns
  • "How many button variants exist?" → 3 styles
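Consistency checks like the card-width question fall out of the bbox data directly. A sketch assuming the bbox ordering is [x, y, width, height] — consistent with the example earlier, but not confirmed by it:

```python
def consistent_widths(cards: list, tolerance: int = 2) -> bool:
    """True if all card bboxes share (roughly) one width.
    Assumes bbox = [x, y, width, height]; tolerance absorbs sub-pixel rounding."""
    widths = [c["bbox"][2] for c in cards]
    return max(widths) - min(widths) <= tolerance
```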

🔄 Combined Mode

Run all task types at once.

sifr-bench full-benchmark-e2llm https://stripe.com -e /path/to/ext --mode combined -v

Options

| Option | Description | Default |
|--------|-------------|---------|
| -e, --extension | Path to E2LLM extension | required |
| -s, --target-size | Token budget in KB | 400 |
| -m, --models | Models to test (comma-separated) | gpt-4o-mini |
| --mode | Task type: compound/dev/design/combined | compound |
| -v, --verbose | Show per-task results | false |
| --debug | Enable verification logging | false |

Multi-Model Comparison

sifr-bench full-benchmark-e2llm https://amazon.com \
  -e /path/to/ext \
  -s 300 \
  -m gpt-4o-mini,gpt-4o,claude-haiku

Supported Models

| Model | Alias |
|-------|-------|
| GPT-4o | gpt-4o |
| GPT-4o Mini | gpt-4o-mini |
| GPT-4 Turbo | gpt-4-turbo |
| Claude Sonnet 4 | claude-sonnet |
| Claude Haiku 4.5 | claude-haiku |
| Claude Opus 4 | claude-opus |

Output Examples

Compound Tasks

Understanding + Action Results: amazon.com
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ Format     ┃ Understand ┃ Act ┃ Combined ┃  Tokens ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sifr       │       100% │ 25% │      25% │ 172,794 │
│ html_raw   │       100% │  0% │       0% │ 194,367 │
│ axtree     │       100% │ 25% │      25% │  27,223 │
│ screenshot │        75% │  0% │       0% │  51,162 │
└────────────┴────────────┴─────┴──────────┴─────────┘

Verbose Output (-v)

━━━ https://amazon.com ━━━
  Loading page...
  ✓ Captured (SiFR: 287KB)
  ✓ 4 tasks
  Running benchmark...
    cmp_01 [sifr]: U✅ A✅ | Shop gifts by cate... → a001
    cmp_02 [sifr]: U✅ A❌ | Popular products... → a012
      ↳ visible: Element not visible (hidden or off-screen)
    cmp_03 [html_raw]: U✅ A❌ | Wireless Earbuds... → .product-card
      ↳ find: Element not found on page

Debug Output (--debug)

14:23:01 [sifr.verification] [SiFR] Resolved a012 → #product-link-xyz
14:23:01 [sifr.verification] [Verify] Found 1 element(s)
14:23:01 [sifr.verification] [Verify] Not visible: #product-link-xyz
14:23:01 [sifr.verification] [sifr] FAIL: ✗ [visible] Element not visible | id=a012 → sel=#product-link-xyz → found=1

Run Directory Structure

benchmark_runs/run_20251208_093517/
├── captures/
│   ├── sifr/*.sifr
│   ├── html/*.html
│   ├── axtree/*.json
│   └── screenshots/*.png
├── ground-truth/*.json
├── results/
│   ├── raw_results.json      # Full results with verification details
│   └── summary.json
└── run_meta.json             # Includes "single_session": true

Why Each Format Fails

| Format | Understand | Act | Why |
|--------|------------|-----|-----|
| SiFR | ✅ JSON structure | ✅ Has IDs | Best of both worlds |
| HTML | ✅ Full content | ❌ No stable IDs | Can read, can't click |
| AXTree | ✅ Semantic | ⚠️ Own IDs | IDs don't match page |
| Screenshot | ✅ Visual | ❌ No IDs at all | Sees but can't act |

Other Commands

# List all benchmark runs
sifr-bench list-runs

# Validate SiFR files
sifr-bench validate examples/

# Show help
sifr-bench info

Use Cases

For AI Agent Developers

  • Test agent accuracy before deployment
  • Compare different LLM backends
  • Benchmark against baselines

For Frontend Developers

  • Generate stable test selectors
  • Audit accessibility issues
  • Analyze component structure

For UI/UX Designers

  • Verify design system consistency
  • Check spacing and typography
  • Audit visual hierarchy

Contributing

  • Add test sites: Run benchmark on more URLs
  • Improve ground truth: Manual verification
  • New models: Add support in models.py
  • Bug reports: Open an issue

License

MIT
