
Computer Vision and OCR library for detecting and analyzing UI elements


Som (Set-of-Mark) is a visual grounding component for the Computer-Use Agent (CUA) framework that powers Cua. It detects and analyzes UI elements in screenshots by combining YOLO-based icon detection with EasyOCR text recognition, and it is optimized for Apple Silicon Macs using Metal Performance Shaders (MPS).

Features

  • Optimized for Apple Silicon with MPS acceleration
  • Icon detection using YOLO with multi-scale processing
  • Text recognition using EasyOCR (GPU-accelerated when available)
  • Automatic hardware detection (MPS → CUDA → CPU)
  • Smart detection parameters tuned for UI elements
  • Detailed visualization with numbered annotations
  • Performance benchmarking tools
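
The MPS → CUDA → CPU fallback order listed above can be illustrated with a minimal sketch. It assumes the backends are probed through PyTorch; `select_device` and `detect_device` are hypothetical helper names for illustration, not part of the Som API.

```python
def select_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name, following the MPS -> CUDA -> CPU order."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the MPS backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
    return select_device(mps_ok, torch.cuda.is_available())


print(detect_device())
```

Keeping the priority logic in a pure function makes the ordering easy to check without any GPU present.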

System Requirements

  • Recommended: macOS with Apple Silicon

    • Uses Metal Performance Shaders (MPS)
    • Multi-scale detection enabled
    • ~0.4s average detection time
  • Supported: Any Python 3.11+ environment

    • Falls back to CPU if no GPU available
    • Single-scale detection on CPU
    • ~1.3s average detection time

Installation

# Using PDM (recommended)
pdm install

# Using pip
pip install -e .

Quick Start

from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,  # Minimum confidence for accepting a detection
    iou_threshold=0.1,  # Overlap (IOU) above which boxes are merged
    use_ocr=True,       # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")

Configuration

Detection Parameters

Box Threshold (0.3)

Controls the confidence threshold for accepting detections:

Above threshold (0.3):    Below threshold (0.3):
+----------------+        +----------------+
|                |        |  +---------+   |
|   Confident    |        |  | Unsure? |   |
|   Detection    |        |  +---------+   |
|   (✓ Accept)   |        |   (✗ Reject)   |
|                |        |                |
+----------------+        +----------------+
conf = 0.85               conf = 0.02
  • Higher values (0.3) yield more precise but fewer detections
  • Lower values (0.01) catch more potential icons but increase false positives
  • Default is 0.3, chosen to balance precision and recall
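
As a rough illustration of how the threshold trades precision for recall, the sketch below filters a few made-up detections at both settings. The `Detection` class and `filter_by_confidence` function are hypothetical stand-ins, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float


def filter_by_confidence(detections, box_threshold=0.3):
    """Keep detections whose confidence is at or above box_threshold."""
    return [d for d in detections if d.confidence >= box_threshold]


# Made-up candidates for illustration only
candidates = [
    Detection("settings icon", 0.85),
    Detection("faint glyph", 0.12),
    Detection("noise", 0.02),
]

strict = filter_by_confidence(candidates, box_threshold=0.3)
loose = filter_by_confidence(candidates, box_threshold=0.01)
print(len(strict), len(loose))  # 1 3
```

At 0.3 only the confident detection survives; at 0.01 all three pass, including the likely false positive.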

IOU Threshold (0.1)

Controls how overlapping detections are merged:

IOU = Intersection Area / Union Area

Low Overlap (Keep Both):   High Overlap (Merge):
+----------+               +----------+
|   Box1   |               |  Box1    |
|          |     vs.       | +------+ |
+----------+               | | Box2 | |
    +----------+           | +------+ |
    |   Box2   |           +----------+
    |          |
    +----------+
IOU ≈ 0.05 (Keep Both)     IOU ≈ 0.7 (Merge)
  • Lower values (0.1) more aggressively remove overlapping boxes
  • Higher values (0.5) allow more overlapping detections
  • Default is 0.1 to handle densely packed UI elements
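
The IOU formula and the effect of the threshold can be sketched as follows. This uses standard greedy non-maximum suppression, which drops the lower-confidence box of an overlapping pair; how Som actually merges boxes may differ. The function names and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(detections, iou_threshold=0.1):
    """Greedy NMS: visit boxes by descending confidence, drop any that overlap a kept box."""
    kept = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, conf))
    return kept


detections = [
    ((0, 0, 100, 100), 0.90),
    ((10, 10, 100, 100), 0.60),  # IOU with the first box = 0.81
    ((95, 95, 200, 200), 0.80),  # IOU with the first box is about 0.001
]

print(len(suppress_overlaps(detections, iou_threshold=0.1)))  # 2
print(len(suppress_overlaps(detections, iou_threshold=0.9)))  # 3
```

With the 0.1 default the nested box is removed, while a permissive 0.9 threshold keeps all three.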

OCR Configuration

  • Engine: EasyOCR

    • Primary choice for all platforms
    • Fast initialization and processing
    • Built-in English language support
    • GPU acceleration when available
  • Settings:

    • Timeout: 5 seconds
    • Confidence threshold: 0.5
    • Paragraph mode: Disabled
    • Language: English only
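
For orientation, the settings above roughly correspond to the following direct use of EasyOCR. This is a sketch under stated assumptions, not Som's internal code: `OCR_SETTINGS`, `filter_ocr_results`, and `run_easyocr` are hypothetical names, and the 5-second timeout is listed but not enforced here.

```python
OCR_SETTINGS = {
    "languages": ["en"],    # English only
    "paragraph": False,     # paragraph mode disabled
    "min_confidence": 0.5,  # confidence threshold
    "timeout_seconds": 5,   # listed for reference; not enforced in this sketch
}


def filter_ocr_results(results, min_confidence=OCR_SETTINGS["min_confidence"]):
    """Keep (bbox, text, confidence) tuples at or above min_confidence."""
    return [r for r in results if r[2] >= min_confidence]


def run_easyocr(image_path, use_gpu=True):
    """Run EasyOCR on an image file and apply the confidence filter."""
    import easyocr  # imported lazily so the filter works without EasyOCR installed

    reader = easyocr.Reader(OCR_SETTINGS["languages"], gpu=use_gpu)
    results = reader.readtext(image_path, paragraph=OCR_SETTINGS["paragraph"])
    return filter_ocr_results(results)


# Made-up results in the shape EasyOCR returns: (bounding box, text, confidence)
sample = [
    ([[0, 0], [50, 0], [50, 12], [0, 12]], "File", 0.97),
    ([[60, 0], [90, 0], [90, 12], [60, 12]], "Ed1t", 0.31),
]
print([text for _, text, _ in filter_ocr_results(sample)])  # ['File']
```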

Performance

Hardware Acceleration

MPS (Metal Performance Shaders)

  • Multi-scale detection (640px, 1280px, 1920px)
  • Test-time augmentation enabled
  • Half-precision (FP16)
  • Average detection time: ~0.4s
  • Best for production use when available

CPU

  • Single-scale detection (1280px)
  • Full-precision (FP32)
  • Average detection time: ~1.3s
  • Reliable fallback option
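
The two profiles above can be summarized as data. The sketch below is illustrative only: `DetectionProfile` and `profile_for` are hypothetical names, disabling test-time augmentation on CPU is an assumption (the list above only states it is enabled on MPS), and falling back to the CPU profile for any other device is also an assumption, since CUDA settings are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionProfile:
    scales: tuple          # input sizes in pixels
    half_precision: bool   # True for FP16, False for FP32
    augment: bool          # test-time augmentation
    approx_seconds: float  # average detection time quoted above


PROFILES = {
    "mps": DetectionProfile((640, 1280, 1920), True, True, 0.4),
    # augment=False on CPU is an assumption, not stated in the list above
    "cpu": DetectionProfile((1280,), False, False, 1.3),
}


def profile_for(device: str) -> DetectionProfile:
    """Return the profile for a device, defaulting to the CPU profile (assumption)."""
    return PROFILES.get(device, PROFILES["cpu"])


speedup = PROFILES["cpu"].approx_seconds / PROFILES["mps"].approx_seconds
print(f"MPS is roughly {speedup:.1f}x faster than CPU")
```

From the quoted averages, MPS comes out a little over three times faster than the CPU fallback.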

Example Output Structure

examples/output/
├── {timestamp}_no_ocr/
│   ├── annotated_images/
│   │   └── screenshot_analyzed.png
│   ├── screen_details.txt
│   └── summary.json
└── {timestamp}_ocr/
    ├── annotated_images/
    │   └── screenshot_analyzed.png
    ├── screen_details.txt
    └── summary.json
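
A small helper for walking this layout might look like the sketch below. `list_runs` is a hypothetical function, and it only builds paths from the file names shown in the tree; it makes no assumption about the contents of `summary.json` or `screen_details.txt`.

```python
from pathlib import Path


def list_runs(output_root):
    """Map each run directory name to the artifact paths expected inside it."""
    runs = {}
    for run_dir in sorted(Path(output_root).iterdir()):
        if not run_dir.is_dir():
            continue
        runs[run_dir.name] = {
            "summary": run_dir / "summary.json",
            "details": run_dir / "screen_details.txt",
            "images": sorted((run_dir / "annotated_images").glob("*.png")),
        }
    return runs


# Only runs if the example output directory is present
if Path("examples/output").is_dir():
    for name, artifacts in list_runs("examples/output").items():
        print(name, len(artifacts["images"]), "annotated image(s)")
```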

Development

Test Data

  • Place test screenshots in examples/test_data/
  • Not tracked in git to keep repository size manageable
  • Default test image: test_screen.png (1920x1080)

Running Tests

# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none

# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
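
To time parsing from your own code rather than through the example script, a minimal timing loop could look like this. It is a generic sketch, not the logic of `omniparser_examples.py`; `benchmark` and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in for a real `parser.parse(...)` call.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and return per-run wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_parse():
    # Stand-in for parser.parse(image, ...); replace with the real call.
    time.sleep(0.01)


times = benchmark(fake_parse, runs=5)
print(f"mean={statistics.mean(times):.3f}s  min={min(times):.3f}s  max={max(times):.3f}s")
```

The first run of a real parser typically includes model loading, so it may be worth discarding it or reporting it separately.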

License

MIT License - See LICENSE file for details.
