SoM (Set-of-Mark) detection pipeline for macOS — Apple Vision + Florence-2 on MLX

uitag

A Set-of-Mark (SoM) detection pipeline for macOS that transforms screenshots into structured, annotated element maps. Built for Apple Silicon using Apple Vision Framework and Florence-2 on MLX.

Figure: uitag output on a 1920x1080 screenshot. 151 numbered elements detected in ~0.8s: text labels and rectangles (Apple Vision); icons and buttons (Florence-2).

Why This Exists

Vision language models under 10B parameters cannot reliably detect individual UI elements on complex professional screenshots. They collapse to a single full-screen bounding box. This was validated empirically across Florence-2, PTA-1, and other MIT-licensed detection models during a benchmark of 14+ models.

uitag solves this by combining two complementary detection systems and preprocessing the image before any VLM ever sees it:

- Apple Vision Framework detects text labels and rectangular UI elements natively on the ANE, at effectively no cost
- Florence-2 detects non-text elements (icons, buttons, images) via open-vocabulary detection on tiled image quadrants

The result is a numbered element map (SoM annotation) and a structured JSON manifest that downstream agents can consume directly.

Pipeline Architecture

Screenshot (1920x1080)
    |
    v
[1] Apple Vision (Swift binary)
    |  VNRecognizeTextRequest + VNDetectRectanglesRequest
    |  ~189ms (fast) / ~980ms (accurate)
    v
[2] Object-Aware Tiling
    |  Split into 4 quadrants, cut lines avoid bounding boxes
    v
[3] Florence-2 (mlx_vlm, per quadrant)
    |  <OD> detection on each tile, ~160ms/quadrant
    v
[4] Merge + Deduplicate
    |  IoU-based overlap removal, source priority ranking
    v
[5] SoM Annotation
    |  Numbered markers + colored bounding boxes
    v
[6] JSON Manifest
    |  Element list with coordinates, labels, sources, timing
    v
Output: annotated.png + manifest.json

End-to-end on a 1920x1080 VS Code screenshot (~151 UI elements detected):

- ~0.8s with fast OCR (Florence-2 ~650ms + Vision ~189ms)
- ~1.6s with accurate OCR (Florence-2 ~650ms + Vision ~980ms)
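
Steps [2] and [3] imply two pieces of bookkeeping: cropping four overlapping tiles around the chosen cut lines, and shifting each per-tile detection back into full-image coordinates before the merge. The sketch below illustrates that arithmetic only. The function names are hypothetical, and it assumes each tile extends `overlap` pixels past the cut line, which may not match how uitag applies its `--overlap` value.

```python
Rect = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def quadrant_tiles(width: int, height: int, cut_x: int, cut_y: int,
                   overlap: int = 50) -> list[Rect]:
    """Return four crop rectangles around (cut_x, cut_y).

    Assumption: every tile reaches `overlap` pixels across the cut line,
    clamped to the image bounds.
    """
    xs = [(0, min(width, cut_x + overlap)), (max(0, cut_x - overlap), width)]
    ys = [(0, min(height, cut_y + overlap)), (max(0, cut_y - overlap), height)]
    # Order: top-left, top-right, bottom-left, bottom-right.
    return [(x0, y0, x1, y1) for (y0, y1) in ys for (x0, x1) in xs]


def to_image_coords(box_in_tile: Rect, tile: Rect) -> Rect:
    """Shift a box detected inside a tile back into full-image coordinates."""
    left, top = tile[0], tile[1]
    x0, y0, x1, y1 = box_in_tile
    return (x0 + left, y0 + top, x1 + left, y1 + top)


if __name__ == "__main__":
    tiles = quadrant_tiles(1920, 1080, cut_x=960, cut_y=540)
    for tile in tiles:
        print(tile)
    # A box found in the bottom-right tile, expressed in full-image pixels.
    print(to_image_coords((10, 20, 30, 40), tiles[3]))
```

Because adjacent tiles share a strip around each cut, an element inside that strip can be reported by more than one tile, which is why a deduplication pass follows.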

Quick Start

# Install from PyPI
pip install uitag

# Run on a screenshot
uitag screenshot.png --output-dir out/

Development Setup

git clone https://github.com/swaylenhayes/uitag.git
cd uitag
uv pip install -e ".[dev]"
uv run pytest  # 55 fast tests

Output

Two files are produced:

- screenshot-som.png: the original image with numbered SoM annotations overlaid
- screenshot-manifest.json: structured element data:

{
  "image_width": 1920,
  "image_height": 1080,
  "element_count": 151,
  "elements": [
    {
      "som_id": 1,
      "label": "File",
      "bbox": {"x": 48, "y": 0, "width": 26, "height": 16},
      "confidence": 1.0,
      "source": "vision_text"
    }
  ],
  "timing_ms": {
    "vision_ms": 189.3,
    "florence_total_ms": 648.2,
    "florence_backend": "mlx"
  }
}

CLI Options

uitag <image> [options]

Options:
  -o, --output-dir DIR    Output directory (default: current directory)
  --task TASK             Florence-2 task token (default: <OD>)
  --overlap N             Quadrant overlap in pixels (default: 50)
  --iou FLOAT             IoU dedup threshold (default: 0.5)
  --fast                  Use fast OCR (5x faster, noisier text)
  --backend BACKEND       Detection backend: auto (default), coreml, mlx

Requirements

- macOS (Apple Vision Framework is macOS-only)
- Apple Silicon (MLX requires Metal)
- Python 3.10+
- Florence-2 model: mlx-community/Florence-2-base-ft-4bit (~159MB, downloaded automatically on first run)

Backend System

uitag supports pluggable detection backends via the DetectionBackend protocol:

- MLX (default): Florence-2 inference on the GPU via Metal. ~160ms per quadrant on an M2 Max.
- CoreML: DaViT vision encoder on the Apple Neural Engine, decoder on the GPU. Useful when the GPU is contended by other workloads. Requires a converted model (python tools/convert_davit_coreml.py).

# Use default MLX backend
uitag screenshot.png

# Use CoreML backend (ANE offload)
uitag screenshot.png --backend coreml
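
For readers unfamiliar with protocol-based plugins, the idea can be sketched with `typing.Protocol`. Everything below is hypothetical: the class is deliberately named `DetectionBackendSketch` because the members of uitag's real `DetectionBackend` protocol are not documented here, and the `name` attribute and `detect` signature are assumptions made for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DetectionBackendSketch(Protocol):
    """Hypothetical stand-in for uitag's DetectionBackend; real members may differ."""

    name: str

    def detect(self, image_path: str, task: str = "<OD>") -> list[dict]:
        """Return detections for one image or tile as a list of dicts."""
        ...


class StubBackend:
    """Toy backend that satisfies the sketch structurally, without inheriting."""

    name = "stub"

    def detect(self, image_path: str, task: str = "<OD>") -> list[dict]:
        # A real backend would run Florence-2 here; the stub finds nothing.
        return []


def run(backend: DetectionBackendSketch, image_path: str) -> list[dict]:
    """Callers depend only on the protocol, so backends are interchangeable."""
    return backend.detect(image_path)


if __name__ == "__main__":
    backend = StubBackend()
    print(isinstance(backend, DetectionBackendSketch), run(backend, "tile.png"))
```

The appeal of a protocol over a base class is that a new backend only needs matching attributes and methods; it does not have to import or subclass anything from the package.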

Research Background

uitag emerged from a structured research effort evaluating detection approaches for a UI agent operating on macOS:

Model survey (14+ models): Evaluated detection models across HuggingFace, academic sources, and commercial options. AGPL-licensed models (Screen2AX, OmniParser, YOLO variants) were excluded — the target product ships under MIT. Florence-2, PTA-1, and Florence-2-large were shortlisted.

Benchmark findings:

- Florence-2-base-ft-4bit: 133ms warm inference, 159MB RAM, effective 4-bit quantization
- Florence-2-large-ft-4bit: eliminated due to degenerate output (repeating <s> tokens) at 4-bit quantization
- PTA-1 (UI-specialized Florence-2 fine-tune): viable but 3x RAM (quantization achieved only 14-bit vs 4-bit target)

Critical discovery: All sub-10B detection models tested produced single full-screen bounding boxes on complex screenshots but worked correctly on tiled inputs. This is a model capacity limitation, not a tuning problem: 7 configurations of frequency/repetition penalties were tested with no improvement. Tiling is therefore architecturally required.

Object-aware tiling: Naive quadrant splits bisect UI elements at cut boundaries. uitag searches outward from the midpoint to find cut lines that avoid intersecting any detected bounding box, falling back to the midpoint with extra overlap padding when no clean gap exists.
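
The outward search described above can be sketched as follows. This is an illustration under stated assumptions, not uitag's implementation: the `find_cut` name, the `max_shift` search limit, and the boolean "clean" flag are all hypothetical, and the amount of extra overlap padding applied on fallback is left to the caller because the source does not specify it.

```python
from typing import Iterable

Box = tuple[int, int, int, int]  # (x, y, width, height), as in the manifest bbox


def crosses(cut: int, start: int, length: int) -> bool:
    """True when a cut at `cut` passes strictly through [start, start + length]."""
    return start < cut < start + length


def find_cut(boxes: Iterable[Box], size: int, axis: int,
             max_shift: int) -> tuple[int, bool]:
    """Search outward from the midpoint for a cut line that crosses no box.

    axis=0 looks for a vertical cut (an x coordinate), axis=1 for a
    horizontal cut (a y coordinate). Returns (position, clean); clean is
    False when no gap was found and the midpoint is used as a fallback.
    """
    spans = [(b[axis], b[axis + 2]) for b in boxes]
    mid = size // 2
    for shift in range(max_shift + 1):
        candidates = (mid,) if shift == 0 else (mid - shift, mid + shift)
        for cand in candidates:
            if 0 < cand < size and not any(crosses(cand, s, n) for s, n in spans):
                return cand, True
    return mid, False


if __name__ == "__main__":
    # One box straddles the vertical midline of a 1920-pixel-wide image.
    boxes = [(900, 100, 200, 50)]
    cut, clean = find_cut(boxes, size=1920, axis=0, max_shift=200)
    print(cut, clean)
    # With a tight search limit there is no clean gap, so the caller
    # would widen the tile overlap around the midpoint instead.
    print(find_cut(boxes, size=1920, axis=0, max_shift=10))
```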

Design Decisions

Decision	Rationale
Apple Vision + Florence-2 hybrid	Complementary strengths: Vision handles text/rectangles (free, ANE), Florence-2 handles open-vocabulary objects
4-quadrant tiling	Simple, effective — keeps element count per tile manageable for sub-10B models
Object-aware cut placement	Prevents element fragmentation at tile boundaries
Pre-compiled Swift binary	Saves ~230ms JIT startup per Vision invocation
IoU dedup with source priority	Vision text > Vision rect > Florence-2 (higher priority sources kept on overlap)
No confidence score gating	VLM confidence scores correlate poorly with actual accuracy (~0.55 AUROC = near random)
MLX default backend	1.25x faster than CoreML on idle GPU; CoreML available for GPU-contended workflows
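
The IoU dedup with source priority listed in the table can be illustrated with a short sketch. It is an assumption-laden example rather than uitag's code: only the "vision_text" source name appears in the manifest example, so "vision_rect" and "florence2" are placeholder names, and treating an IoU at or above the threshold as a duplicate is a guess at the comparison used.

```python
from dataclasses import dataclass


@dataclass
class Element:
    x: float
    y: float
    width: float
    height: float
    source: str


# Lower number = higher priority. "vision_text" appears in the manifest
# example; the other two source names are assumed for illustration.
PRIORITY = {"vision_text": 0, "vision_rect": 1, "florence2": 2}


def iou(a: Element, b: Element) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def dedupe(elements: list[Element], threshold: float = 0.5) -> list[Element]:
    """Visit elements in priority order and drop any that overlap a kept one."""
    kept: list[Element] = []
    ordered = sorted(elements, key=lambda e: PRIORITY.get(e.source, len(PRIORITY)))
    for cand in ordered:
        if all(iou(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    found = [
        Element(0, 0, 10, 10, "florence2"),    # overlaps the text box below
        Element(1, 1, 10, 10, "vision_text"),
        Element(100, 100, 5, 5, "florence2"),  # isolated, always kept
    ]
    for e in dedupe(found):
        print(e.source, e.x, e.y)
```

Sorting by priority first means the higher-priority detection is already in the kept list when its lower-priority duplicate is examined, so no pairwise tie-breaking logic is needed.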

Tests

# Fast tests (no model loading required)
pytest

# All tests including model-dependent ones
pytest --run-slow

55 fast tests covering: location token parsing, quadrant splitting, IoU computation, merge deduplication, SoM rendering, manifest generation, schema validation, Apple Vision integration, backend protocol, backend selection, and encoder bridge conversion.

License

MIT

Project details

Maintainers

swaylenhayes

Release history

- 0.6.0: Apr 9, 2026
- 0.5.1: Mar 31, 2026
- 0.5.0: Mar 29, 2026
- 0.4.1: Mar 7, 2026
- 0.4.0: Mar 7, 2026
- 0.3.1: Mar 3, 2026
- 0.3.0: Feb 28, 2026 (this version)

Download files

Source Distribution

uitag-0.3.0.tar.gz (32.0 kB)

Uploaded Feb 28, 2026 Source

Built Distribution

uitag-0.3.0-py3-none-any.whl (27.4 kB)

Uploaded Feb 28, 2026 Python 3

File details

Details for the file uitag-0.3.0.tar.gz.

File metadata

Download URL: uitag-0.3.0.tar.gz
Upload date: Feb 28, 2026
Size: 32.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for uitag-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`50fab26fc83d75e0cee7590a6c3c5225d76359dfd080cf13fff68e47f91a387f`
MD5	`5d4124ab612e22848addb2e8b2f72006`
BLAKE2b-256	`212b362253b8bae320b2af80ac6f6563a23c375c49848b4e0e758d298367c0ac`

Provenance

The following attestation bundles were made for uitag-0.3.0.tar.gz:

Publisher: publish.yml on swaylenhayes/uitag

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: uitag-0.3.0.tar.gz
- Subject digest: 50fab26fc83d75e0cee7590a6c3c5225d76359dfd080cf13fff68e47f91a387f
- Sigstore transparency entry: 1004938578
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: swaylenhayes/uitag@6da4d2cf81966c2d1ffd5ae85a00467ce9bace09
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/swaylenhayes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6da4d2cf81966c2d1ffd5ae85a00467ce9bace09
- Trigger Event: push

File details

Details for the file uitag-0.3.0-py3-none-any.whl.

File metadata

Download URL: uitag-0.3.0-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 27.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for uitag-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`355978acfdee1baa890fd9660afdabc40c219e2855126e7d6ba67efeab1e2437`
MD5	`0f13a511a82e372453e27cca4a6c1d7c`
BLAKE2b-256	`a23b130ade78d85bb93eac53bc3364a3f3aaeee83af4369f9dd9066451fa1cb0`

Provenance

The following attestation bundles were made for uitag-0.3.0-py3-none-any.whl:

Publisher: publish.yml on swaylenhayes/uitag

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: uitag-0.3.0-py3-none-any.whl
- Subject digest: 355978acfdee1baa890fd9660afdabc40c219e2855126e7d6ba67efeab1e2437
- Sigstore transparency entry: 1004938580
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: swaylenhayes/uitag@6da4d2cf81966c2d1ffd5ae85a00467ce9bace09
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/swaylenhayes
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6da4d2cf81966c2d1ffd5ae85a00467ce9bace09
- Trigger Event: push
