Computer Vision and OCR library for detecting and analyzing UI elements
Som (Set-of-Mark) is a visual grounding component for the Computer-Use Agent (CUA) framework that powers Cua. It detects and analyzes UI elements in screenshots. Optimized for Apple Silicon Macs with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition to cover both graphical and textual UI elements.
Features
- Optimized for Apple Silicon with MPS acceleration
- Icon detection using YOLO with multi-scale processing
- Text recognition using EasyOCR (GPU-accelerated)
- Automatic hardware detection (MPS → CUDA → CPU)
- Smart detection parameters tuned for UI elements
- Detailed visualization with numbered annotations
- Performance benchmarking tools
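The hardware fallback order listed above (MPS → CUDA → CPU) can be sketched as follows. This is a minimal illustration and not necessarily how Som implements it: `pick_device` and `detect_device` are hypothetical names, and PyTorch is assumed to be the backend being probed.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name, following the MPS -> CUDA -> CPU order."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for usable backends; fall back to CPU if it is missing."""
    try:
        import torch  # assumption: PyTorch is the underlying framework
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(mps_ok, torch.cuda.is_available())


if __name__ == "__main__":
    print(detect_device())
```

Keeping the ordering in a pure function makes the priority easy to test without any GPU present.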
System Requirements
- Recommended: macOS with Apple Silicon
  - Uses Metal Performance Shaders (MPS)
  - Multi-scale detection enabled
  - ~0.4s average detection time
- Supported: Any Python 3.11+ environment
  - Falls back to CPU if no GPU available
  - Single-scale detection on CPU
  - ~1.3s average detection time
Installation
# Using PDM (recommended)
pdm install
# Using pip
pip install -e .
Quick Start
from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,  # Confidence threshold
    iou_threshold=0.1,  # Overlap threshold
    use_ocr=True        # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")
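As a follow-up to the loop above, the results can be grouped by element type. The sketch below relies only on the `type` and `content` attributes shown in the Quick Start; `summarize_elements` is a hypothetical helper, and the sample objects are stand-ins rather than real parser output.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements per type and collect the recognized text strings."""
    counts = Counter(elem.type for elem in elements)
    texts = [elem.content for elem in elements if elem.type == "text"]
    return dict(counts), texts


# Stand-in objects mimicking the attributes used in the Quick Start.
sample = [
    SimpleNamespace(type="icon", content=None, confidence=0.91),
    SimpleNamespace(type="text", content="Settings", confidence=0.88),
    SimpleNamespace(type="text", content="Cancel", confidence=0.79),
]
counts, texts = summarize_elements(sample)
print(counts)  # {'icon': 1, 'text': 2}
print(texts)   # ['Settings', 'Cancel']
```

With a real run, `result.elements` would be passed in place of `sample`.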
Configuration
Detection Parameters
Box Threshold (0.3)
Controls the confidence threshold for accepting detections:
High Threshold (0.3):       Low Threshold (0.01):
+----------------+          +----------------+
|                |          |   +--------+   |
|   Confident    |          |   |Unsure? |   |
|   Detection    |          |   +--------+   |
|   (✓ Accept)   |          |   (? Reject)   |
|                |          |                |
+----------------+          +----------------+
conf = 0.85                 conf = 0.02
- Higher values (0.3) yield more precise but fewer detections
- Lower values (0.01) catch more potential icons but increase false positives
- Default is 0.3, chosen to balance precision and recall
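The effect of the box threshold can be illustrated with a small filter. This is a sketch under assumptions: the dictionary layout and the `filter_by_confidence` name are hypothetical, and whether Som compares with `>=` or `>` is not stated in this document.

```python
def filter_by_confidence(detections, box_threshold=0.3):
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in detections if d["confidence"] >= box_threshold]


# Illustrative detections; the values are made up for the example.
detections = [
    {"label": "icon", "confidence": 0.85},
    {"label": "icon", "confidence": 0.12},
    {"label": "icon", "confidence": 0.02},
]
strict = filter_by_confidence(detections, box_threshold=0.3)
loose = filter_by_confidence(detections, box_threshold=0.01)
print(len(strict), len(loose))  # 1 3
```

Raising the threshold trades recall for precision: only the 0.85 detection survives at 0.3, while all three pass at 0.01.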
IOU Threshold (0.1)
Controls how overlapping detections are merged:
IOU = Intersection Area / Union Area
Low Overlap (Keep Both):     High Overlap (Merge):
+----------+                 +----------+
|  Box1    |                 |  Box1    |
|          |       vs.       |+-----+   |
+----------+                 ||Box2 |   |
    +----------+             |+-----+   |
    |  Box2    |             +----------+
    |          |
    +----------+
IOU ≈ 0.05 (Keep Both)       IOU ≈ 0.7 (Merge)
- Lower values (0.1) more aggressively remove overlapping boxes
- Higher values (0.5) allow more overlapping detections
- Default is 0.1 to handle densely packed UI elements
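To make the IOU formula and threshold concrete, here is a minimal sketch using greedy non-maximum suppression. It is an assumption that Som resolves overlaps this way; the `(x1, y1, x2, y2)` box format and the function names `iou` and `suppress_overlaps` are chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IOU with every already-kept box is at or below the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(round(iou(boxes[0], boxes[1]), 2))                    # 0.68
print(suppress_overlaps(boxes, scores, iou_threshold=0.1))  # [0, 2]
print(suppress_overlaps(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]
```

The first two boxes overlap with IOU ≈ 0.68, so the lower-scoring one is dropped at the default 0.1 but retained at 0.7.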
OCR Configuration
- Engine: EasyOCR
  - Primary choice for all platforms
  - Fast initialization and processing
  - Built-in English language support
  - GPU acceleration when available
- Settings:
  - Timeout: 5 seconds
  - Confidence threshold: 0.5
  - Paragraph mode: Disabled
  - Language: English only
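The settings above map roughly onto EasyOCR calls as sketched below. This is an illustration, not Som's internal code: `run_easyocr` and `filter_ocr_results` are hypothetical helpers, the 5-second timeout is not modeled, and the sample data is hand-written in the `(bbox, text, confidence)` shape that `readtext` returns.

```python
def filter_ocr_results(results, min_confidence=0.5):
    """Keep (bbox, text, confidence) tuples at or above min_confidence."""
    return [r for r in results if r[2] >= min_confidence]


def run_easyocr(image_path, use_gpu=False):
    """Run EasyOCR in English with paragraph mode off.

    Requires `pip install easyocr`; imported lazily so the filter above
    can be used without it.
    """
    import easyocr

    reader = easyocr.Reader(["en"], gpu=use_gpu)
    return reader.readtext(image_path, paragraph=False)


# Hand-written sample standing in for real OCR output.
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "File", 0.97),
    ([[90, 0], [150, 0], [150, 20], [90, 20]], "Ed1t", 0.31),
]
confident = filter_ocr_results(sample)
print([text for _, text, _ in confident])  # ['File']
```

On a real screenshot, `filter_ocr_results(run_easyocr("screenshot.png"))` would apply the same 0.5 cutoff.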
Performance
Hardware Acceleration
MPS (Metal Performance Shaders)
- Multi-scale detection (640px, 1280px, 1920px)
- Test-time augmentation enabled
- Half-precision (FP16)
- Average detection time: ~0.4s
- Best for production use when available
CPU
- Single-scale detection (1280px)
- Full-precision (FP32)
- Average detection time: ~1.3s
- Reliable fallback option
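Average detection times like those quoted above can be measured with a small timing helper. This is a generic sketch, not the project's benchmarking tool; `average_runtime` is a hypothetical name and the workload is a stand-in.

```python
import time
from statistics import mean


def average_runtime(fn, runs=5):
    """Call fn() `runs` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stand-in workload; swap in the parser.parse(...) call from the Quick Start.
avg = average_runtime(lambda: sum(range(100_000)), runs=5)
print(f"average: {avg:.4f}s")
```

The first call on a GPU backend often includes model warm-up, so discarding it or increasing `runs` may give a more representative figure.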
Example Output Structure
examples/output/
├── {timestamp}_no_ocr/
│   ├── annotated_images/
│   │   └── screenshot_analyzed.png
│   ├── screen_details.txt
│   └── summary.json
└── {timestamp}_ocr/
    ├── annotated_images/
    │   └── screenshot_analyzed.png
    ├── screen_details.txt
    └── summary.json
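Given the layout above, the per-run summaries can be collected with a short script. This sketch assumes only the directory structure shown; `load_summaries` is a hypothetical helper, and no particular schema for `summary.json` is assumed.

```python
import json
from pathlib import Path


def load_summaries(output_root="examples/output"):
    """Map each run directory name to its parsed summary.json contents."""
    root = Path(output_root)
    summaries = {}
    if not root.is_dir():
        return summaries  # nothing has been generated yet
    for summary_path in sorted(root.glob("*/summary.json")):
        with summary_path.open(encoding="utf-8") as fh:
            summaries[summary_path.parent.name] = json.load(fh)
    return summaries


if __name__ == "__main__":
    for run_name, summary in load_summaries().items():
        print(run_name, type(summary).__name__)
```

Run directories without a `summary.json` are skipped, so partially written runs do not cause errors.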
Development
Test Data
- Place test screenshots in examples/test_data/
- Not tracked in git to keep repository size manageable
- Default test image: test_screen.png (1920x1080)
Running Tests
# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none
# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
License
MIT License - See LICENSE file for details.