SiFR Benchmark

How well do AI agents understand web UI?

Benchmark comparing SiFR vs HTML vs AXTree vs Screenshots across 10 complex websites.

Results

Tested on 10 high-complexity sites: Amazon, YouTube, Reddit, eBay, Walmart, Airbnb, Yelp, IMDB, ESPN, GitHub.

Format       Accuracy   Tokens (avg)   Latency (avg)
SiFR         64.6%      25,512         7.5 s
Screenshot   21.5%      37,765         8.0 s
Raw HTML      4.7%      32,879         8.3 s
AXTree        3.0%       5,289         1.9 s

SiFR is 3x more accurate than screenshots and 14x more accurate than raw HTML.

Per-Site Breakdown

Site      SiFR       Screenshot   HTML    AXTree
GitHub    🏆 100%    0%           0%      0%
YouTube   🏆 100%    53.3%        0%      0%
Walmart   🏆 85.7%   30%          11.4%   0%
Reddit    🏆 83.3%   0%           0%      0%
eBay      🏆 71.4%   13.3%        0%      14.3%
Amazon    🏆 66.7%   25.7%        0%      0%
Airbnb    🏆 57.1%   0%           34.3%   0%
Yelp      🤝 50%     50%          0%      12.5%
ESPN      🏆 42.9%   0%           0%      0%
IMDB      0%         🏆 45%       0%      0%

SiFR wins outright on 8 of 10 sites, ties with screenshots on Yelp, and loses only on IMDB.

What is SiFR?

Structured Interface Format for Representation: a compact format optimized for LLM understanding of web UI.

a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  attrs: {href: "/cart/add", class: "btn-primary"}
  salience: high

Key advantages:

  • Compact: 10-20x smaller than raw HTML
  • Actionable IDs: Every element has a unique ID (a015, btn003)
  • Salience scoring: High/medium/low importance ranking
  • LLM-native: Structured for AI comprehension
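
The record above is YAML-like, so a tiny stdlib-only parser is enough for experiments. A minimal sketch (it handles only the scalar fields shown in the example; `attrs` and the full SiFR grammar are omitted):

```python
import re

def parse_sifr(text):
    """Parse a SiFR capture (YAML-like, two-space indent) into a dict
    keyed by element ID. Sketch only: covers the scalar fields from the
    example above; `box` is converted to a list of ints."""
    elements, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):           # top level: element ID like a015
            current = line.strip().rstrip(":")
            elements[current] = {}
        else:                                   # indented: field of current element
            key, _, value = line.strip().partition(":")
            value = value.strip().strip('"')
            if key == "box":
                value = [int(n) for n in re.findall(r"-?\d+", value)]
            elements[current][key] = value
    return elements

sample = """a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  salience: high"""

parsed = parse_sifr(sample)
print(parsed["a015"]["text"], parsed["a015"]["box"])
```

The element IDs double as action targets, so a parsed capture maps directly onto the benchmark's click/input/locate tasks.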

Installation

pip install sifr-benchmark

Prerequisites

  1. Element-to-LLM Chrome Extension - captures pages in SiFR format

  2. API Keys

    export OPENAI_API_KEY=sk-...
    export ANTHROPIC_API_KEY=sk-ant-...  # optional
    
  3. Playwright (for automated capture)

    playwright install chromium
    

Quick Start

Full Benchmark (Recommended)

Capture → Generate Ground Truth → Test, all in one command:

sifr-bench full-benchmark-e2llm https://www.amazon.com https://www.youtube.com \
  -e /path/to/element-to-llm-extension \
  -s 400

Options:

  • -e, --extension - Path to E2LLM extension (required)
  • -s, --target-size - SiFR budget in KB (default: 100, max: 380)
  • -m, --models - Models to test (default: gpt-4o-mini)
  • -v, --verbose - Show detailed output

Other Commands

# List all benchmark runs
sifr-bench list-runs

# Compare multiple runs
sifr-bench compare benchmark_runs/run_1 benchmark_runs/run_2

# Validate SiFR files
sifr-bench validate examples/

# Show help
sifr-bench info

How It Works

1. Capture (E2LLM Extension)

The extension captures 4 formats simultaneously:

  • SiFR - Structured format with salience scoring
  • HTML - Raw rendered DOM (outerHTML)
  • AXTree - Playwright accessibility tree
  • Screenshot - Full-page PNG
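
For comparison, the AXTree capture is Playwright's accessibility snapshot, a nested role/name tree. An illustrative fragment (the page and node names are invented for this example):

```json
{
  "role": "WebArea",
  "name": "Example Shop",
  "children": [
    {"role": "link", "name": "Add to Cart"},
    {"role": "textbox", "name": "Search"}
  ]
}
```

The snapshot carries roles and accessible names but no stable element IDs, which is why matching it against ID-based ground truth is hard.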

2. Ground Truth Generation

GPT-4o Vision analyzes the screenshot + SiFR to generate tasks:

  • Click tasks: "Click the Sign In button" → a003
  • Input tasks: "Enter search query" → input001
  • Locate tasks: "Find the main heading" → h1001
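
The generated tasks are stored under ground-truth/*.json. The exact schema is internal to the benchmark, so the record shape below is an illustrative assumption:

```json
[
  {"type": "click",  "question": "Click the Sign In button", "expected": "a003"},
  {"type": "input",  "question": "Enter search query",       "expected": "input001"},
  {"type": "locate", "question": "Find the main heading",    "expected": "h1001"}
]
```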

3. Benchmark

Each format is tested against the same ground truth:

Question: "Click on the shopping cart icon"
Expected: a015
SiFR response: a015 ✓
HTML response: none ✗
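
Scoring then reduces to exact-match accuracy over element IDs. A minimal sketch (the function and data shapes are assumptions for illustration, not the benchmark's actual API):

```python
def accuracy(tasks, answers):
    """Fraction of tasks where the model's answer exactly matches the
    expected element ID. `tasks` is a list of {"expected": id} records;
    `answers` maps task index to the ID string the model returned."""
    if not tasks:
        return 0.0
    correct = sum(
        1 for i, task in enumerate(tasks)
        if answers.get(i, "").strip().lower() == task["expected"].lower()
    )
    return correct / len(tasks)

tasks = [{"expected": "a015"}, {"expected": "input001"}, {"expected": "h1001"}]
answers = {0: "a015", 1: "none", 2: "H1001"}   # one wrong answer
print(f"{accuracy(tasks, answers):.1%}")       # 66.7%
```

Case-insensitive matching is a judgment call here; strict matching would only lower every format's score uniformly.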

Output Format

        Benchmark Results: Combined (10 sites)
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Format     ┃ Accuracy ┃ Tokens ┃ Latency ┃ Status ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ sifr       │    64.6% │ 25,512 │ 7,511ms │   ✅   │
│ screenshot │    21.5% │ 37,765 │ 8,039ms │   ⚠️   │
│ html_raw   │     4.7% │ 32,879 │ 8,332ms │   ⚠️   │
│ axtree     │     3.0% │  5,289 │ 1,876ms │   ⚠️   │
└────────────┴──────────┴────────┴─────────┴────────┘

Status icons:

  • ✅ Success (accuracy ≥ 50%)
  • ⚠️ Warning (accuracy < 50%)
  • ❌ Failed (accuracy = 0%)

Run Directory Structure

Each benchmark creates an isolated run:

benchmark_runs/run_20251206_182941/
├── captures/
│   ├── sifr/*.sifr
│   ├── html/*.html
│   ├── axtree/*.json
│   └── screenshots/*.png
├── ground-truth/*.json
├── results/
│   ├── raw_results.json
│   └── summary.json
└── run_meta.json
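
Given that layout, per-run results can be read straight from results/summary.json. A sketch using a synthetic run directory (the summary's keys here are an assumption, not the benchmark's documented schema):

```python
import json
import tempfile
from pathlib import Path

def load_summary(run_dir):
    """Read results/summary.json from a benchmark run directory."""
    return json.loads((Path(run_dir) / "results" / "summary.json").read_text())

# Build a synthetic run directory so the example is self-contained.
run = Path(tempfile.mkdtemp()) / "run_demo"
(run / "results").mkdir(parents=True)
(run / "results" / "summary.json").write_text(
    json.dumps({"sifr": {"accuracy": 0.646, "tokens": 25512}})
)

summary = load_summary(run)
print(summary["sifr"]["accuracy"])  # 0.646
```

Because each run is isolated, `sifr-bench compare` can diff any two such directories without shared state.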

Key Findings

  1. SiFR dominates complex sites - 100% on GitHub and YouTube, over 80% on Walmart and Reddit
  2. Screenshots struggle with dense UI - vision models cannot reliably identify specific elements
  3. Raw HTML is nearly unusable - too large and too noisy, with no semantic structure for LLMs
  4. AXTree IDs don't match - its own ID scheme is incompatible with the ground-truth element IDs

Why Did IMDB Fail?

IMDB has the largest DOM of the ten sites (706 KB SiFR, 2,171 KB HTML). Truncating the capture to the 97 KB budget removes critical elements, which highlights the need for smarter budgeting in the E2LLM extension.

Tested Models

  • GPT-4o-mini (default)
  • GPT-4o
  • Claude 3.5 Sonnet
  • Claude 3 Haiku

Contributing

  • Add test sites: run the benchmark on more URLs
  • Improve ground truth: manually verify the generated tasks
  • New models: add support in models.py

Citation

@misc{sifr2025,
  title={SiFR: Structured Interface Format for AI Web Agents},
  author={SiFR Contributors},
  year={2025},
  url={https://github.com/Alechko375/sifr-benchmark}
}

License

MIT
