cua-som

Computer Vision and OCR library for detecting and analyzing UI elements

Project description

Som (Set-of-Mark) is the visual grounding component of the Computer-Use Agent (CUA) framework that powers Cua. It detects and analyzes UI elements in screenshots. Optimized for Apple Silicon Macs with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition to analyze both the icons and the text on screen.

Features

Optimized for Apple Silicon with MPS acceleration
Icon detection using YOLO with multi-scale processing
Text recognition using EasyOCR (GPU-accelerated)
Automatic hardware detection (MPS → CUDA → CPU)
Smart detection parameters tuned for UI elements
Detailed visualization with numbered annotations
Performance benchmarking tools

System Requirements

Recommended: macOS with Apple Silicon
- Uses Metal Performance Shaders (MPS)
- Multi-scale detection enabled
- ~0.4s average detection time
Supported: Any Python 3.11+ environment
- Falls back to CPU if no GPU available
- Single-scale detection on CPU
- ~1.3s average detection time

Installation

# Using PDM (recommended)
pdm install

# Using pip
pip install -e .

Quick Start

from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,    # Confidence threshold
    iou_threshold=0.1,    # Overlap threshold
    use_ocr=True,         # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")

Configuration

Detection Parameters

Box Threshold (0.3)

Controls the confidence threshold for accepting detections:

Both detections evaluated at box_threshold = 0.3:

High confidence:          Low confidence:
+----------------+        +----------------+
|                |        |  +--------+    |
|   Confident    |        |  |Unsure? |    |
|   Detection    |        |  +--------+    |
|   (✓ Accept)   |        |  (? Reject)    |
|                |        |                |
+----------------+        +----------------+
conf = 0.85               conf = 0.02

Higher values (0.3) yield more precise but fewer detections
Lower values (0.01) catch more potential icons but increase false positives
Default is 0.3 for optimal precision/recall balance
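
The effect of the box threshold can be sketched in a few lines of plain Python. This is only an illustration, not cua-som's internal code: the Detection class and filter_by_confidence function are hypothetical names, and treating the threshold as inclusive (>=) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a detected element; not a cua-som class."""
    label: str
    confidence: float


def filter_by_confidence(detections, box_threshold=0.3):
    """Keep detections whose confidence is at least box_threshold.

    Assumption: the comparison is inclusive; the library may differ.
    """
    return [d for d in detections if d.confidence >= box_threshold]


candidates = [
    Detection("close button", 0.85),
    Detection("faint glyph", 0.02),
    Detection("toolbar icon", 0.31),
]

# The default threshold drops the uncertain candidate.
strict = filter_by_confidence(candidates, box_threshold=0.3)
# A very low threshold keeps everything, including likely false positives.
loose = filter_by_confidence(candidates, box_threshold=0.01)

print([d.label for d in strict])
print([d.label for d in loose])
```

With the sample data, the 0.3 threshold keeps two of the three candidates, while 0.01 keeps all three.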

IOU Threshold (0.1)

Controls how overlapping detections are merged:

IOU = Intersection Area / Union Area

Low Overlap (Keep Both):   High Overlap (Merge):
+----------+              +----------+
|     Box1 |              |  Box1   |
|          |     vs.      |+-----+  |
+----------+              ||Box2 |  |
    +----------+          |+-----+  |
    |   Box2   |          +----------+
    |          |
    +----------+
IOU ≈ 0.05 (Keep Both)    IOU ≈ 0.7 (Merge)

Lower values (0.1) more aggressively remove overlapping boxes
Higher values (0.5) allow more overlapping detections
Default is 0.1 to handle densely packed UI elements
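
To make the IOU formula concrete, here is a minimal sketch that computes IOU for axis-aligned boxes and applies greedy non-maximum suppression. It assumes boxes are (x1, y1, x2, y2) tuples and that overlapping detections are resolved by keeping the higher-scoring box; the function names are hypothetical and the library's actual merging strategy may differ.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IOU with every already-kept box is at most iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 100, 100), (10, 10, 90, 90), (95, 95, 195, 195)]
scores = [0.9, 0.8, 0.7]

print(iou(boxes[0], boxes[1]))                 # nested box, high overlap
print(suppress_overlaps(boxes, scores, 0.1))   # nested box is dropped
print(suppress_overlaps(boxes, scores, 0.7))   # all boxes survive
```

The nested box overlaps the first one with an IOU of 0.64, so it is removed at the default threshold of 0.1 but kept at 0.7, which matches the "lower values remove more" behavior described above.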

OCR Configuration

Engine: EasyOCR
- Primary choice for all platforms
- Fast initialization and processing
- Built-in English language support
- GPU acceleration when available
Settings:
- Timeout: 5 seconds
- Confidence threshold: 0.5
- Paragraph mode: Disabled
- Language: English only
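
The settings above map roughly onto EasyOCR's public API as follows. This is a sketch under assumptions, not the library's implementation: read_text and filter_ocr_results are hypothetical helpers, the confidence filter is assumed to be inclusive, and the 5-second timeout is not shown.

```python
def filter_ocr_results(results, min_confidence=0.5):
    """Keep (bbox, text, confidence) entries at or above min_confidence."""
    return [
        (bbox, text, conf)
        for bbox, text, conf in results
        if conf >= min_confidence
    ]


def read_text(image_path, use_gpu=True):
    """Run EasyOCR in English with paragraph mode disabled.

    Requires the easyocr package; it is imported lazily so that the
    filter above can be used and tested without it.
    """
    import easyocr

    reader = easyocr.Reader(["en"], gpu=use_gpu)
    return filter_ocr_results(reader.readtext(image_path, paragraph=False))


# Sample data in the shape EasyOCR returns: (corner points, text, confidence).
sample = [
    ([[0, 0], [40, 0], [40, 12], [0, 12]], "File", 0.97),
    ([[50, 0], [90, 0], [90, 12], [50, 12]], "Ed1t", 0.32),
]
kept = filter_ocr_results(sample)
print([text for _, text, _ in kept])
```

Only the high-confidence "File" entry survives the 0.5 cutoff in this sample.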

Performance

Hardware Acceleration

MPS (Metal Performance Shaders)

Multi-scale detection (640px, 1280px, 1920px)
Test-time augmentation enabled
Half-precision (FP16)
Average detection time: ~0.4s
Best for production use when available

CPU

Single-scale detection (1280px)
Full-precision (FP32)
Average detection time: ~1.3s
Reliable fallback option
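
The MPS → CUDA → CPU fallback order can be sketched as below, assuming a PyTorch backend. The function names and the DOCUMENTED_SETTINGS table are hypothetical; the table only restates the figures listed above and has no entry for CUDA because this page does not describe CUDA settings.

```python
def choose_device(mps_available, cuda_available):
    """Apply the documented preference order: MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Probe PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent in old PyTorch builds
    mps_ok = mps is not None and mps.is_available()
    return choose_device(mps_ok, torch.cuda.is_available())


# Settings as listed in this section (hypothetical structure, not library code).
DOCUMENTED_SETTINGS = {
    "mps": {"scales": (640, 1280, 1920), "half_precision": True, "tta": True},
    "cpu": {"scales": (1280,), "half_precision": False},
}

device = detect_device()
print(device, DOCUMENTED_SETTINGS.get(device, "not documented here"))
```

Keeping the preference logic in a pure function makes it easy to check the ordering without any GPU present.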

Example Output Structure

examples/output/
├── {timestamp}_no_ocr/
│   ├── annotated_images/
│   │   └── screenshot_analyzed.png
│   ├── screen_details.txt
│   └── summary.json
└── {timestamp}_ocr/
    ├── annotated_images/
    │   └── screenshot_analyzed.png
    ├── screen_details.txt
    └── summary.json
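
Given that layout, a small helper can enumerate benchmark runs and their annotated images. This sketch relies only on the directory names shown above; list_runs is a hypothetical helper, the timestamp format used in the demo is made up, and the contents of summary.json are not assumed.

```python
import tempfile
from pathlib import Path


def list_runs(output_dir):
    """Return one record per run directory ending in _no_ocr or _ocr."""
    runs = []
    for run_dir in sorted(Path(output_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        # Check _no_ocr first, because it also ends with _ocr.
        if run_dir.name.endswith("_no_ocr"):
            mode = "no_ocr"
        elif run_dir.name.endswith("_ocr"):
            mode = "ocr"
        else:
            continue
        images = sorted(p.name for p in (run_dir / "annotated_images").glob("*.png"))
        runs.append({
            "name": run_dir.name,
            "mode": mode,
            "summary": run_dir / "summary.json",
            "images": images,
        })
    return runs


# Demo on a throwaway directory that mimics the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("20250319_120000_no_ocr", "20250319_120500_ocr"):
        image_dir = Path(tmp) / name / "annotated_images"
        image_dir.mkdir(parents=True)
        (image_dir / "screenshot_analyzed.png").touch()
    runs = list_runs(tmp)

print([(r["name"], r["mode"], r["images"]) for r in runs])
```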

Development

Test Data

Place test screenshots in examples/test_data/
Not tracked in git to keep repository size manageable
Default test image: test_screen.png (1920x1080)

Running Tests

# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none

# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr

License

MIT License - See LICENSE file for details.

Release history

0.1.4: Feb 10, 2026
0.1.3: Apr 15, 2025
0.1.2: Apr 15, 2025
0.1.1: Mar 19, 2025 (this version)
0.1.0: Mar 11, 2025

Download files

Source Distribution

cua_som-0.1.1.tar.gz (19.0 kB), uploaded Mar 19, 2025 (Source)

Built Distribution

cua_som-0.1.1-py3-none-any.whl (19.9 kB), uploaded Mar 19, 2025 (Python 3)

File details

Details for the file cua_som-0.1.1.tar.gz.

File metadata

Download URL: cua_som-0.1.1.tar.gz
Upload date: Mar 19, 2025
Size: 19.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for cua_som-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f2818f4da20aa147520ace00bfc9cd1dd1099b45e0e2a36457a23ae9b41fa91a`
MD5	`a141fb456b221d27f591034acb5c45c1`
BLAKE2b-256	`9a3535613bf53f31944c2d7b9963a07b152263314e165294c6f50dad5e564fe6`


File details

Details for the file cua_som-0.1.1-py3-none-any.whl.

File metadata

Download URL: cua_som-0.1.1-py3-none-any.whl
Upload date: Mar 19, 2025
Size: 19.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for cua_som-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7bebb7dbed110fe0306658f2c9023cd9743c0758ac81b3f2c2b7cc2bdc5234ad`
MD5	`b2a1218460d7e200d686c06fc7b946ff`
BLAKE2b-256	`f43d21cabff1a8b3bc746691f04ff400f2a37069b3e283cd179fd90e91ac1fda`

