Computer Vision and OCR library for detecting and analyzing UI elements

Som (Set-of-Mark) is the visual grounding component of the Computer-Use Agent (Cua) framework. It detects and analyzes UI elements in screenshots. Optimized for Apple Silicon with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition, so both the icons and the text in a screenshot are covered.

Features

  • Optimized for Apple Silicon with MPS acceleration
  • Icon detection using YOLO with multi-scale processing
  • Text recognition using EasyOCR (GPU-accelerated)
  • Automatic hardware detection (MPS → CUDA → CPU)
  • Smart detection parameters tuned for UI elements
  • Detailed visualization with numbered annotations
  • Performance benchmarking tools
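
The MPS → CUDA → CPU fallback order listed above can be sketched as follows. This is a minimal illustration, not Som's actual implementation: it assumes PyTorch is the underlying backend, and the helper names `pick_device` and `detect_device` are hypothetical.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name, following the MPS -> CUDA -> CPU order."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends (assumption: PyTorch is the backend).

    Falls back to "cpu" when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(mps_ok, torch.cuda.is_available())


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the ordering in a pure function separates the priority rule from the hardware probe, which makes the rule easy to check on any machine.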

System Requirements

  • Recommended: macOS with Apple Silicon
    • Uses Metal Performance Shaders (MPS)
    • Multi-scale detection enabled
    • ~0.4s average detection time
  • Supported: Any Python 3.11+ environment
    • Falls back to CPU if no GPU available
    • Single-scale detection on CPU
    • ~1.3s average detection time

Installation

# Using PDM (recommended)
pdm install

# Using pip
pip install -e .

Quick Start

from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,    # Confidence threshold
    iou_threshold=0.1,    # Overlap threshold
    use_ocr=True,         # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")
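
To make the role of `iou_threshold` more concrete, here is a sketch of how an overlap threshold is commonly applied in greedy non-maximum suppression. It is an assumption that Som uses the threshold in exactly this way; the functions `iou` and `suppress_overlaps` and the `(x1, y1, x2, y2)` box layout are hypothetical and independent of the `som` package.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(detections, iou_threshold=0.1):
    """Greedy NMS over (box, confidence) pairs.

    Visits detections from highest to lowest confidence and keeps one
    only if its IoU with every already-kept box is at most the threshold.
    """
    kept = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, conf))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((5, 0, 15, 10), 0.8),    # overlaps the first box (IoU = 1/3)
    ((20, 20, 30, 30), 0.5),  # disjoint from both
]
kept = suppress_overlaps(detections, iou_threshold=0.1)
```

Under this interpretation, a low threshold such as 0.1 is strict: even modest overlap causes the lower-confidence box to be dropped.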

Development

Test Data

  • Place test screenshots in examples/test_data/
  • Not tracked in git to keep repository size manageable
  • Default test image: test_screen.png (1920x1080)

Running Tests

# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none

# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
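
For timing a detection call from your own code, a small harness along the following lines may be useful. This is a generic sketch and not the logic of `omniparser_examples.py`; the `benchmark` helper and its return keys are hypothetical. The `runs` argument mirrors the `--runs` flag above.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and report wall-clock timing statistics in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. lambda: parser.parse(image).
    stats = benchmark(lambda: sum(range(100_000)), runs=5)
    print(f"mean={stats['mean_s']:.4f}s min={stats['min_s']:.4f}s max={stats['max_s']:.4f}s")
```

The first run often includes model loading or cache warm-up, so comparing the minimum against the mean helps show whether the average is skewed by a slow first call.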

License

MIT License - See LICENSE file for details.
