Computer Vision and OCR library for detecting and analyzing UI elements
Som (Set-of-Mark) is the visual grounding component of the Computer-Use Agent (Cua) framework. It detects and analyzes UI elements in screenshots. Optimized for Apple Silicon Macs with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition to provide comprehensive UI element analysis.
Features
- Optimized for Apple Silicon with MPS acceleration
- Icon detection using YOLO with multi-scale processing
- Text recognition using EasyOCR (GPU-accelerated)
- Automatic hardware detection (MPS → CUDA → CPU)
- Smart detection parameters tuned for UI elements
- Detailed visualization with numbered annotations
- Performance benchmarking tools
System Requirements
- Recommended: macOS with Apple Silicon
  - Uses Metal Performance Shaders (MPS)
  - Multi-scale detection enabled
  - ~0.4s average detection time
- Supported: Any Python 3.11+ environment
  - Falls back to CPU if no GPU is available
  - Single-scale detection on CPU
  - ~1.3s average detection time
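The MPS → CUDA → CPU fallback order can be sketched as follows. This is a minimal illustration, not Som's actual implementation: the function names `pick_device` and `detect_device` are hypothetical, and the sketch assumes PyTorch is the backend being probed.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name in MPS -> CUDA -> CPU order."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends (hypothetical helper).

    Falls back to "cpu" when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(mps_ok, torch.cuda.is_available())


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the priority logic in `pick_device` separate from the hardware probe makes the fallback order easy to check on any machine.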
Installation
# Using PDM (recommended)
pdm install
# Using pip
pip install -e .
Quick Start
from som import OmniParser
from PIL import Image
# Initialize parser
parser = OmniParser()
# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,  # Confidence threshold
    iou_threshold=0.1,  # Overlap threshold
    use_ocr=True        # Enable text detection
)
# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")
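To give a feel for what an overlap threshold such as `iou_threshold` controls, here is a small sketch of intersection-over-union (IoU) and greedy non-maximum suppression, a common way to drop overlapping detections. It is an illustration only: the `(x1, y1, x2, y2)` box layout and the helper names `iou` and `suppress_overlaps` are assumptions, and Som may apply its threshold differently.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(detections, iou_threshold=0.1):
    """Greedy non-maximum suppression over (confidence, box) pairs.

    Boxes are visited from highest to lowest confidence; a box is kept only
    if its IoU with every already-kept box is at or below the threshold.
    """
    kept = []
    for conf, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((conf, box))
    return kept


if __name__ == "__main__":
    detections = [
        (0.9, (0, 0, 10, 10)),
        (0.8, (5, 0, 15, 10)),    # overlaps the first box
        (0.7, (20, 20, 30, 30)),  # disjoint from both
    ]
    for conf, box in suppress_overlaps(detections, iou_threshold=0.1):
        print(conf, box)
```

Under this scheme, a lower threshold removes more of the overlapping boxes, so a value like 0.1 keeps only detections that barely overlap.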
Docs
Development
Test Data
- Place test screenshots in examples/test_data/
- Not tracked in git to keep repository size manageable
- Default test image: test_screen.png (1920x1080)
Running Tests
# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none
# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
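For timing detection from your own code, a multi-run average can be measured along these lines. This is a rough sketch rather than the logic of the example script: the `benchmark` helper is hypothetical and times any zero-argument callable.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real detection call when Som is installed.
    stats = benchmark(lambda: sum(range(100_000)), runs=5)
    print(f"mean={stats['mean_s']:.4f}s over {stats['runs']} runs")
```

With the parser from Quick Start, the callable could be something like `lambda: parser.parse(image, use_ocr=False)`. The first run often includes model warm-up, so reporting the minimum alongside the mean helps when comparing configurations.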
License
MIT License - See LICENSE file for details.
File details
Details for the file cua_som-0.1.4.tar.gz.
File metadata
- Download URL: cua_som-0.1.4.tar.gz
- Upload date:
- Size: 30.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b45df20034aa42dae3c3f6c515668e375e593d6ef4917c6dd323f3154393133e |
| MD5 | 5117bfd70ee1132c1c1220c0cfa1e1a7 |
| BLAKE2b-256 | b533fd98e620bb39a495c81bf2862f7e5ea7eff24b08a6f079e06f50c915c83f |
File details
Details for the file cua_som-0.1.4-py3-none-any.whl.
File metadata
- Download URL: cua_som-0.1.4-py3-none-any.whl
- Upload date:
- Size: 32.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 72620adfa337df3d1caaac6f9916cea6adc500096984cbac4239b9fc8e012d9d |
| MD5 | f0db49c56c71b2f4c215b0e98d67f3de |
| BLAKE2b-256 | c0d9670e38c01c68003acf2a408a3b51ca0328046524b40c8f4268fb6f681cb7 |