Computer Vision and OCR library for detecting and analyzing UI elements

Som (Set-of-Mark) is the visual grounding component of the Computer-Use Agent (CUA) framework that powers Cua. It detects and analyzes UI elements in screenshots. Optimized for Apple Silicon with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition.

Features

  • Optimized for Apple Silicon with MPS acceleration
  • Icon detection using YOLO with multi-scale processing
  • Text recognition using EasyOCR (GPU-accelerated)
  • Automatic hardware detection (MPS → CUDA → CPU)
  • Smart detection parameters tuned for UI elements
  • Detailed visualization with numbered annotations
  • Performance benchmarking tools
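
The hardware fallback order listed above (MPS → CUDA → CPU) can be sketched with PyTorch's availability checks. This is a minimal illustration, not Som's actual code: `pick_device` and `detect_device` are hypothetical names, and the sketch assumes PyTorch is the backend and falls back to CPU when it is not installed.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name, following MPS -> CUDA -> CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available accelerators (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing to accelerate
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(mps_ok, torch.cuda.is_available())


print(detect_device())
```

Keeping the priority logic in a pure function makes the order easy to test without any GPU present.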

System Requirements

  • Recommended: macOS with Apple Silicon

    • Uses Metal Performance Shaders (MPS)
    • Multi-scale detection enabled
    • ~0.4s average detection time
  • Supported: Any Python 3.11+ environment

    • Falls back to CPU if no GPU available
    • Single-scale detection on CPU
    • ~1.3s average detection time

Installation

# From PyPI
pip install cua-som

# From a source checkout, using PDM (recommended)
pdm install

# From a source checkout, using pip (editable install)
pip install -e .

Quick Start

from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,    # Confidence threshold
    iou_threshold=0.1,    # Overlap threshold
    use_ocr=True,         # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")
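
Once the elements are available, a common next step is to summarize them. The sketch below counts elements per type and collects text content above a confidence floor. It relies only on the `type`, `confidence`, and `content` attributes used in the Quick Start; `summarize` is a hypothetical helper, the `"text"` type string is an assumption (the Quick Start only names `"icon"`), and the stand-in objects replace a real parse result so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(elements, min_confidence=0.0):
    """Count elements per type and collect text content above a confidence floor."""
    kept = [e for e in elements if e.confidence >= min_confidence]
    counts = Counter(e.type for e in kept)
    texts = [e.content for e in kept if e.type == "text"]  # "text" is assumed
    return counts, texts


# Stand-in objects so the sketch runs without a model; with Som installed,
# pass result.elements from parser.parse(...) instead.
demo = [
    SimpleNamespace(type="icon", confidence=0.91, content=None),
    SimpleNamespace(type="text", confidence=0.88, content="Settings"),
    SimpleNamespace(type="text", confidence=0.42, content="Cancel"),
]
counts, texts = summarize(demo, min_confidence=0.5)
print(dict(counts), texts)
```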

Configuration

Detection Parameters

Box Threshold (0.3)

Controls the confidence threshold for accepting detections. At the default threshold of 0.3:

High Confidence:          Low Confidence:
+----------------+        +----------------+
|                |        |  +--------+    |
|   Confident    |        |  |Unsure? |    |
|   Detection    |        |  +--------+    |
|   (✓ Accept)   |        |  (? Reject)    |
|                |        |                |
+----------------+        +----------------+
conf = 0.85               conf = 0.02
  • Higher values (0.3) yield more precise but fewer detections
  • Lower values (0.01) catch more potential icons but increase false positives
  • Default is 0.3, chosen to balance precision and recall
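
The effect of the box threshold can be illustrated with a plain confidence filter. This is a sketch, not Som's implementation: `filter_by_confidence` and the dictionary layout are hypothetical, and whether the real comparison is inclusive (`>=`) is an assumption.

```python
def filter_by_confidence(detections, box_threshold=0.3):
    """Keep detections whose confidence is at least box_threshold."""
    return [d for d in detections if d["confidence"] >= box_threshold]


detections = [
    {"label": "confident", "confidence": 0.85},
    {"label": "unsure", "confidence": 0.02},
]

strict = filter_by_confidence(detections, box_threshold=0.3)
loose = filter_by_confidence(detections, box_threshold=0.01)
print(len(strict), len(loose))  # prints "1 2"
```

With the default of 0.3 only the 0.85 detection survives; lowering the threshold to 0.01 also admits the 0.02 detection.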

IOU Threshold (0.1)

Controls how overlapping detections are merged:

IOU = Intersection Area / Union Area

Low Overlap (Keep Both):   High Overlap (Merge):
+----------+               +----------+
|     Box1 |               |  Box1    |
|          |      vs.      |+-----+   |
+----------+               ||Box2 |   |
    +----------+           |+-----+   |
    |   Box2   |           +----------+
    |          |
    +----------+
IOU ≈ 0.05 (Keep Both)     IOU ≈ 0.7 (Merge)
  • Lower values (0.1) more aggressively remove overlapping boxes
  • Higher values (0.5) allow more overlapping detections
  • Default is 0.1 to handle densely packed UI elements
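
The IOU formula and the effect of the threshold can be worked through in a few lines. The sketch below uses plain greedy non-maximum suppression as a stand-in; Som's actual merging logic may differ. The `(x1, y1, x2, y2)` box layout and the function names `iou` and `nms` are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it overlaps every already-kept box by at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 100, 100), (50, 0, 150, 100), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]

print(round(iou(boxes[0], boxes[1]), 3))       # 5000 / 15000 -> 0.333
print(nms(boxes, scores, iou_threshold=0.1))   # [0, 2]: the overlapping box is dropped
print(nms(boxes, scores, iou_threshold=0.5))   # [0, 1, 2]: the overlap is tolerated
```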

OCR Configuration

  • Engine: EasyOCR

    • Primary choice for all platforms
    • Fast initialization and processing
    • Built-in English language support
    • GPU acceleration when available
  • Settings:

    • Timeout: 5 seconds
    • Confidence threshold: 0.5
    • Paragraph mode: Disabled
    • Language: English only
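
The settings above can be pictured as a small post-processing step around the OCR call. The sketch below is an assumption-laden illustration, not Som's code: `OCR_SETTINGS`, `filter_ocr`, and `run_with_timeout` are hypothetical names, and a fake OCR function stands in for EasyOCR so the example runs without a model. It assumes results shaped like EasyOCR's default `(bbox, text, confidence)` tuples.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Values taken from the settings list; the dictionary itself is illustrative.
OCR_SETTINGS = {"languages": ["en"], "paragraph": False,
                "min_confidence": 0.5, "timeout_s": 5.0}


def filter_ocr(results, min_confidence=OCR_SETTINGS["min_confidence"]):
    """Keep (bbox, text, confidence) tuples at or above the confidence threshold."""
    return [r for r in results if r[2] >= min_confidence]


def run_with_timeout(fn, timeout_s=OCR_SETTINGS["timeout_s"]):
    """Run fn() in a worker thread; return [] if it exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return []
    finally:
        executor.shutdown(wait=False)  # do not block on a slow worker


# Fake OCR call standing in for something like
# easyocr.Reader(["en"]).readtext(image, paragraph=False)
def fake_ocr():
    box = [[0, 0], [80, 0], [80, 20], [0, 20]]
    return [(box, "Settings", 0.93), (box, "S3tt1ngs", 0.31)]


texts = [text for _, text, _ in filter_ocr(run_with_timeout(fake_ocr))]
print(texts)  # ['Settings']
```

Note that a thread-based timeout only stops waiting for the result; the worker itself keeps running until it finishes.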

Performance

Hardware Acceleration

MPS (Metal Performance Shaders)

  • Multi-scale detection (640px, 1280px, 1920px)
  • Test-time augmentation enabled
  • Half-precision (FP16)
  • Average detection time: ~0.4s
  • Best for production use when available

CPU

  • Single-scale detection (1280px)
  • Full-precision (FP32)
  • Average detection time: ~1.3s
  • Reliable fallback option
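
The per-device settings listed above can be summarized as a lookup table. This is a sketch for orientation only: `DetectionProfile`, `PROFILES`, and `profile_for` are hypothetical names. Two values are assumptions rather than documented facts: test-time augmentation is shown as disabled on CPU, and devices without a documented profile (such as CUDA) fall back to the CPU profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionProfile:
    scales: tuple                  # input sizes in pixels
    half_precision: bool           # FP16 if True, FP32 otherwise
    test_time_augmentation: bool


PROFILES = {
    # Multi-scale, FP16, test-time augmentation enabled
    "mps": DetectionProfile((640, 1280, 1920), True, True),
    # Single-scale, FP32; augmentation off is an assumption
    "cpu": DetectionProfile((1280,), False, False),
}


def profile_for(device: str) -> DetectionProfile:
    """Look up settings; unknown devices use the CPU profile (an assumption)."""
    return PROFILES.get(device, PROFILES["cpu"])


print(profile_for("mps"))
print(profile_for("cpu"))
```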

Example Output Structure

examples/output/
├── {timestamp}_no_ocr/
│   ├── annotated_images/
│   │   └── screenshot_analyzed.png
│   ├── screen_details.txt
│   └── summary.json
└── {timestamp}_ocr/
    ├── annotated_images/
    │   └── screenshot_analyzed.png
    ├── screen_details.txt
    └── summary.json
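
Given this layout, the run directories can be walked with the standard library. The sketch below loads every `summary.json` it finds; `load_summaries` is a hypothetical helper, and nothing is assumed about the JSON contents beyond the file being valid JSON.

```python
import json
from pathlib import Path


def load_summaries(root="examples/output"):
    """Map each run directory name to its parsed summary.json, if present."""
    summaries = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return summaries  # nothing has been generated yet
    for run_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        summary_file = run_dir / "summary.json"
        if summary_file.is_file():
            summaries[run_dir.name] = json.loads(
                summary_file.read_text(encoding="utf-8")
            )
    return summaries


for name, summary in load_summaries().items():
    mode = "without OCR" if name.endswith("_no_ocr") else "with OCR"
    print(name, mode, type(summary).__name__)
```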

Development

Test Data

  • Place test screenshots in examples/test_data/
  • Not tracked in git to keep repository size manageable
  • Default test image: test_screen.png (1920x1080)

Running Tests

# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none

# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
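
For timing a call outside the example script, a repeated-run loop like the one below is usually enough. This is a generic sketch and not the logic of `omniparser_examples.py`: `benchmark` is a hypothetical helper, and the lambda stands in for a call such as `parser.parse(image)`.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and report wall-clock statistics in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Placeholder workload; replace with the call being measured.
stats = benchmark(lambda: sum(range(100_000)), runs=5)
print(f"{stats['runs']} runs, mean {stats['mean_s']:.4f}s")
```

The first run often includes model loading and cache warm-up, so reporting the minimum alongside the mean helps separate steady-state cost from startup cost.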

License

MIT License - See LICENSE file for details.
