
Benchmark for evaluating LLM understanding of web UI: SiFR vs HTML vs AXTree vs Screenshots

Project description

SiFR Benchmark

How well do AI agents understand web UI?

Benchmark comparing SiFR vs HTML vs AXTree vs Screenshots across complex websites.

โš ๏ธ This is an example run, not a definitive study. The benchmark is fully reproducible โ€” run it yourself on your sites, your models, your use cases. We show our results; you verify on yours.

Results

Tested on 10 high-complexity sites: Amazon, YouTube, Reddit, eBay, Walmart, Airbnb, Yelp, IMDB, ESPN, GitHub.

All formats tested with equal 400KB token budget for fair comparison.

Format       Accuracy   Tokens (avg)
SiFR         71.7%      102K
Screenshot   27.0%      38K
Raw HTML     11.4%      122K
AXTree       1.5%       6K

SiFR is 2.7x more accurate than screenshots and 6.3x more accurate than raw HTML.

Per-Site Breakdown

Site       SiFR       Screenshot   HTML    AXTree
GitHub     🏆 100%     0%           n/a     0%
YouTube    🏆 100%     64%          0%      0%
Amazon     🏆 85.7%    22.9%        n/a     0%
Walmart    🏆 85.7%    13.3%        11.4%   0%
Reddit     🏆 83.3%    36%          n/a     0%
Yelp       🏆 62.5%    57.1%        n/a     0%
ESPN       🏆 57.1%    11.4%        22.9%   0%
IMDB       🏆 50%      16%          n/a     16.7%
eBay       🏆 28.6%    26.7%        11.4%   0%

SiFR wins on 9 out of 9 sites where it ran successfully.

What is SiFR?

Structured Interface Format for Representation: a compact format optimized for LLM understanding of web UI.

a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  attrs: {href: "/cart/add", class: "btn-primary"}
  salience: high

Key advantages:

  • Actionable IDs: Every element gets a unique ID (a015, btn003)
  • Salience scoring: High/medium/low importance ranking
  • Structured for LLMs: Optimized for "find element → take action" tasks
  • Model-agnostic: Works with any LLM that can read text
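To make the record format above concrete, here is a minimal, illustrative parser that turns one SiFR record into a Python dict. The actual SiFR grammar may be richer; this sketch handles only the fields shown in the example.

```python
import re

# Minimal, illustrative parser for a single SiFR record like the example
# above. The real SiFR grammar may differ; this handles only the fields shown.
def parse_sifr_record(text):
    lines = [line.strip() for line in text.strip().splitlines()]
    rec = {"id": lines[0].rstrip(":")}  # header line, e.g. "a015:"
    for line in lines[1:]:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "box":
            # "[500, 300, 120, 40]" -> [500, 300, 120, 40]
            rec["box"] = [int(n) for n in re.findall(r"-?\d+", value)]
        elif key == "attrs":
            # '{href: "/cart/add", class: "btn-primary"}' -> dict
            rec["attrs"] = {
                k.strip(): v.strip().strip('"')
                for k, v in (pair.split(":", 1)
                             for pair in value.strip("{}").split(","))
            }
        else:
            rec[key] = value.strip('"')
    return rec

record = parse_sifr_record('''
a015:
  tag: a
  text: "Add to Cart"
  box: [500, 300, 120, 40]
  attrs: {href: "/cart/add", class: "btn-primary"}
  salience: high
''')
```

An agent can then look up `record["id"]` and `record["salience"]` directly, which is what makes the "find element → take action" loop cheap.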

Installation

pip install sifr-benchmark

Prerequisites

  1. Element-to-LLM Chrome Extension (captures pages in SiFR format)

    • Load unpacked from element-to-llm-chrome/
  2. API Keys

    export OPENAI_API_KEY=sk-...
    export ANTHROPIC_API_KEY=sk-ant-...  # optional
    
  3. Playwright (for automated capture)

    playwright install chromium
    

Quick Start

Full Benchmark

Capture → Generate Ground Truth → Test, all in one command:

sifr-bench full-benchmark-e2llm https://www.amazon.com https://www.youtube.com \
  -e /path/to/element-to-llm-extension \
  -s 400

Options:

  • -e, --extension: Path to E2LLM extension (required)
  • -s, --target-size: Token budget in KB for ALL formats (default: 400)
  • -m, --models: Models to test (default: gpt-4o-mini)
  • -v, --verbose: Show detailed output

Other Commands

# List all benchmark runs
sifr-bench list-runs

# Compare multiple runs
sifr-bench compare benchmark_runs/run_1 benchmark_runs/run_2

# Validate SiFR files
sifr-bench validate examples/

# Show help
sifr-bench info

How It Works

1. Capture

The extension captures 4 formats simultaneously:

  • SiFR: Structured format with salience scoring
  • HTML: Raw rendered DOM (outerHTML)
  • AXTree: Playwright accessibility tree
  • Screenshot: Full-page PNG

2. Ground Truth Generation

GPT-4o Vision analyzes screenshot + SiFR to generate agent tasks:

  • Click: "Click the Sign In button" → a003
  • Input: "Enter search query" → input001
  • Locate: "Find the main heading" → h1001
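The generated tasks can be pictured as small records with a task type, a natural-language instruction, and the expected element ID. The field names below are assumptions for illustration, not the benchmark's actual JSON schema:

```python
# Illustrative shape of auto-generated task records. Field names are
# assumptions for illustration, not the benchmark's actual schema.
tasks = [
    {"type": "click",  "instruction": "Click the Sign In button", "expected_id": "a003"},
    {"type": "input",  "instruction": "Enter search query",       "expected_id": "input001"},
    {"type": "locate", "instruction": "Find the main heading",    "expected_id": "h1001"},
]

VALID_TYPES = {"click", "input", "locate"}

def validate_task(task):
    """Check a task record has the three required fields and a known type."""
    return (task.get("type") in VALID_TYPES
            and bool(task.get("instruction"))
            and bool(task.get("expected_id")))

all_valid = all(validate_task(t) for t in tasks)
```

Keeping the expected answer as an element ID is what lets every format be graded with the same exact-match check.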

3. Benchmark

Each format tested with same token budget, same model, same prompts:

Task: "Click on the shopping cart icon"
Expected: a015

SiFR response: a015 ✓
HTML response: none ✗
Screenshot response: cart icon (no ID) ✗
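The grading step above amounts to an exact-match comparison of the model's answer against the expected element ID. A minimal sketch (the real benchmark's answer extraction and normalization may be more involved):

```python
# Minimal exact-match scorer for the evaluation step above. The real
# benchmark's answer normalization may be more involved than this.
def score_response(response, expected_id):
    """Return True iff the model's answer names the expected element ID."""
    if response is None:
        return False
    return response.strip().lower() == expected_id.strip().lower()

def accuracy(results):
    """Fraction of (response, expected) pairs that match exactly."""
    if not results:
        return 0.0
    return sum(score_response(r, e) for r, e in results) / len(results)

trials = [
    ("a015", "a015"),               # SiFR: correct ID
    ("none", "a015"),               # HTML: element lost to truncation
    ("cart icon (no ID)", "a015"),  # Screenshot: no actionable ID to return
]
acc = accuracy(trials)  # one of three correct
```

Note that a screenshot answer like "cart icon" scores zero even when the model visually found the element, because there is no ID an agent could act on.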

Methodology Notes

Run it yourself. This benchmark exists so you can test on your own sites and models. Our results are one data point; your results on your use case matter more.

  • Equal token budget: All formats truncated to same size (400KB default). Fair comparison.

  • Ground truth is auto-generated: GPT-4o Vision creates tasks. For production, consider human verification.

  • AXTree's 1.5% is a real finding: Many agent frameworks rely on accessibility trees. This result shows why that can be problematic.

  • 7 tasks per site: Practical, not academic. When did you last need 2000 clicks on one page?

Why Raw HTML Fails

Amazon HTML: 909KB original
After truncation: 400KB (loses 56% of content)
Result: 0% accuracy; critical elements gone

Amazon SiFR: 613KB original
After truncation: 400KB (loses 35% of content)
Result: 85.7% accuracy; structure survives

HTML is verbose. When you truncate it, you lose random chunks. SiFR is pre-compressed with salience scoring, so important elements survive truncation.
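The effect of salience-aware truncation can be sketched as: serialize high-salience elements first, cut at the byte budget, and critical elements survive. This is a toy model, not SiFR's actual compressor:

```python
# Toy model of salience-aware truncation. Not SiFR's actual compressor:
# it only shows why ordering elements by importance before cutting at a
# byte budget preserves critical elements, while cutting in raw document
# order may drop them.
SALIENCE_RANK = {"high": 0, "medium": 1, "low": 2}

def serialize(elements, budget):
    """Emit elements in salience order, stopping at the byte budget."""
    out, used = [], 0
    for el in sorted(elements, key=lambda e: SALIENCE_RANK[e["salience"]]):
        line = f'{el["id"]}: {el["text"]}\n'
        if used + len(line) > budget:
            break
        out.append(line)
        used += len(line)
    return "".join(out)

elements = [
    {"id": "div001", "text": "cookie banner boilerplate " * 4, "salience": "low"},
    {"id": "a015",   "text": "Add to Cart",                    "salience": "high"},
    {"id": "nav003", "text": "footer links " * 4,              "salience": "low"},
]

compact = serialize(elements, budget=40)
# The high-salience "Add to Cart" element survives a tight budget,
# while the low-salience boilerplate is what gets cut.
```

Truncating the same elements in raw document order would spend the whole budget on the cookie banner and drop "Add to Cart" entirely, which is the failure mode described above for raw HTML.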

Output Format

        Benchmark Results: Combined (10 sites)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Format     โ”ƒ Accuracy โ”ƒ  Tokens โ”ƒ  Latency โ”ƒ Status โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ sifr       โ”‚    71.7% โ”‚ 101,683 โ”‚ 30,221ms โ”‚   โœ…   โ”‚
โ”‚ screenshot โ”‚    27.0% โ”‚  38,074 โ”‚  7,942ms โ”‚   โš ๏ธ   โ”‚
โ”‚ html_raw   โ”‚    11.4% โ”‚ 122,190 โ”‚ 35,901ms โ”‚   โš ๏ธ   โ”‚
โ”‚ axtree     โ”‚     1.5% โ”‚   6,044 โ”‚  2,034ms โ”‚   โš ๏ธ   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Status:

  • ✅ Success (accuracy ≥ 50%)
  • ⚠️ Warning (accuracy < 50%)
  • ❌ Failed (accuracy = 0%)

Run Directory Structure

Each benchmark creates an isolated run:

benchmark_runs/run_20251206_210357/
├── captures/
│   ├── sifr/*.sifr
│   ├── html/*.html
│   ├── axtree/*.json
│   └── screenshots/*.png
├── ground-truth/*.json
├── results/
│   ├── raw_results.json
│   └── summary.json
└── run_meta.json

Tested Models

Default: gpt-4o-mini

The benchmark supports any OpenAI or Anthropic model. Run with different models:

sifr-bench full-benchmark-e2llm ... -m gpt-4o
sifr-bench full-benchmark-e2llm ... -m claude-sonnet

Contributing

  • Add test sites: Run benchmark on more URLs
  • Improve ground truth: Manual verification of tasks
  • New models: Add support in models.py
  • Bug reports: Open an issue

Citation

@misc{sifr2025,
  title={SiFR: Structured Interface Format for AI Web Agents},
  author={SiFR Contributors},
  year={2025},
  url={https://github.com/Alechko375/sifr-benchmark}
}

License

MIT

Download files

Download the file for your platform.

Source Distribution

sifr_benchmark-0.1.22.tar.gz (43.9 kB)

Uploaded Source

Built Distribution


sifr_benchmark-0.1.22-py3-none-any.whl (32.4 kB)

Uploaded Python 3

File details

Details for the file sifr_benchmark-0.1.22.tar.gz.

File metadata

  • Download URL: sifr_benchmark-0.1.22.tar.gz
  • Upload date:
  • Size: 43.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sifr_benchmark-0.1.22.tar.gz
Algorithm Hash digest
SHA256 e6631442b5b11882718a02e4c82b63b94b43b8d1b5e78344fc15365c9f9e26e7
MD5 d6115926bc28d34776d54c3f0f48ebb2
BLAKE2b-256 4552c53afd6049eeeeb4765dbc77685d096ae4968430cf99ae5ae146bb5eccca


Provenance

The following attestation bundles were made for sifr_benchmark-0.1.22.tar.gz:

Publisher: benchmark.yml on Alechko375/sifr-benchmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sifr_benchmark-0.1.22-py3-none-any.whl.

File metadata

File hashes

Hashes for sifr_benchmark-0.1.22-py3-none-any.whl
Algorithm Hash digest
SHA256 32c2ada15ce0b712f175c7c8a6d743b1ff75115414b0c15b9a5d5e4cf89b5dfe
MD5 1f3dda3b46969484995577b4d7fc5b62
BLAKE2b-256 37940251884dca404f5e46bbdf264d56cd26e89c53cb9e7a50d8628bec38cf9f


Provenance

The following attestation bundles were made for sifr_benchmark-0.1.22-py3-none-any.whl:

Publisher: benchmark.yml on Alechko375/sifr-benchmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
