MATA | Model-Agnostic Task Architecture
A task-centric, model-agnostic framework for computer vision that provides a YOLO-like UX on top of modular, license-safe backends.
MATA focuses on stable task contracts and pluggable runtimes, allowing you to switch between different models and frameworks without changing your code.
🎯 Key Features
- Universal Model Loading: llama.cpp-style loading - use any model by HuggingFace ID, file path, or alias
- Multi-Task Support: Detection, classification, segmentation, depth estimation, OCR (text extraction), vision-language models, and multi-modal pipelines
- Zero-Shot Capabilities: CLIP (classify), GroundingDINO/OWL-ViT (detect), SAM/SAM3 (segment) - no training required
- Vision-Language Models: Image captioning, VQA, and visual understanding with Qwen3-VL and more
- Multi-Format Runtime: PyTorch ✅ | ONNX Runtime ✅ | TorchScript ✅ | Torchvision ✅ | TensorRT (planned)
- Graph System (v1.6): Multi-task workflows with mata.infer(), parallel execution, conditional branching, and video tracking
- Object Tracking (v1.8+): mata.track() for video/stream tracking with vendored ByteTrack and BotSort, persistent track IDs, trajectory trails, CSV/JSON export, and appearance-based ReID (v1.9.2)
- OCR / Text Extraction (v1.9): mata.run("ocr", ...) extracts printed and handwritten text using GOT-OCR2, TrOCR, EasyOCR, PaddleOCR, or Tesseract, with per-region confidence and bounding boxes
- Valkey/Redis Result Storage (v1.9): persist any result to Valkey/Redis with result.save("valkey://host/key") or via ValkeyStore/ValkeyLoad graph nodes, enabling distributed pipelines and cross-process result sharing
- Validation & Evaluation: mata.val() computes mAP/accuracy/depth metrics against COCO, ImageNet, or custom datasets
- Export & Visualization: Save as JSON/CSV/image overlays/crops with dual backends (PIL/matplotlib)
- Task-First API: Specify what you want (detect, segment, classify, depth, ocr, vlm), not which model to use
- Model-Agnostic: Swap models without changing code - all models implement the same task contracts
- Config Aliases: Define shortcuts for commonly used models in YAML config files
- License-Safe: Apache 2.0 licensed with clear separation of components
- Auto-Discovery: Automatic architecture detection from model IDs and file formats
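To make the config-alias feature concrete, here is a minimal sketch of how an alias might resolve to a model source plus extra kwargs. The `resolve_alias` helper and the parsed-config dict are illustrative (not MATA's actual loader code), though the layout mirrors the `.mata/models.yaml` examples later in this README:

```python
# Parsed form of a .mata/models.yaml file (as yaml.safe_load would return it).
# Illustrative sketch only -- not MATA's real alias-resolution implementation.
CONFIG = {
    "models": {
        "detect": {
            "rtdetr-fast": {"source": "PekingU/rtdetr_r18vd", "threshold": 0.4},
        },
        "track": {
            "highway-cam": {"source": "facebook/detr-resnet-50", "tracker": "botsort"},
        },
    }
}

def resolve_alias(task: str, name: str) -> tuple[str, dict]:
    """Return (model_source, extra_kwargs) for a task-scoped alias.

    Falls back to treating `name` as a direct model ID when no alias exists.
    """
    entry = CONFIG.get("models", {}).get(task, {}).get(name)
    if entry is None:
        return name, {}  # not an alias: pass the name through unchanged
    entry = dict(entry)  # copy so we don't mutate the config
    return entry.pop("source"), entry

source, kwargs = resolve_alias("detect", "rtdetr-fast")
```

The point is that aliases are pure indirection: swapping the `source` in YAML changes the model without touching calling code.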
📦 Installation
CPU-Only Installation (Smaller, works everywhere)
git clone https://github.com/datamata-io/mata.git
cd mata
# Install PyTorch CPU version first
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# Then install MATA
pip install -e .
GPU Installation (Faster, requires NVIDIA GPU + CUDA)
git clone https://github.com/datamata-io/mata.git
cd mata
# Install PyTorch with CUDA 12.1 support (check your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Then install MATA
pip install -e .
Check your CUDA version: run nvidia-smi to see your CUDA version, then choose the matching PyTorch wheel index:
- CUDA 11.8: https://download.pytorch.org/whl/cu118
- CUDA 12.1: https://download.pytorch.org/whl/cu121
- CUDA 12.4: https://download.pytorch.org/whl/cu124
- CUDA 12.6: https://download.pytorch.org/whl/cu126
- CUDA 13.0: https://download.pytorch.org/whl/cu130
- CPU only: https://download.pytorch.org/whl/cpu
📖 Need help choosing? See INSTALLATION.md for:
- Detailed GPU vs CPU comparison
- How to switch between GPU and CPU
- Troubleshooting GPU installation issues
- Verifying your installation
See also PyTorch Get Started for more options.
Development Installation
pip install -e ".[dev]"
Optional Dependencies
# For ONNX model support
pip install onnxruntime # CPU
pip install onnxruntime-gpu # GPU
# For publication-quality visualizations
pip install matplotlib
# For Valkey/Redis result storage
pip install datamata[valkey] # valkey-py client (recommended)
pip install datamata[redis] # redis-py client (alternative)
🚀 Quick Start
Object Detection
import mata
# Transformer-based detection (RT-DETR)
result = mata.run("detect", "image.jpg",
model="PekingU/rtdetr_r18vd",
threshold=0.4)
# CNN-based detection (Apache 2.0, torchvision)
result = mata.run("detect", "image.jpg",
model="torchvision/retinanet_resnet50_fpn",
threshold=0.4)
# Access detections
for det in result.instances:
    print(f"{det.label_name}: {det.score:.2f} at {det.bbox}")
Image Classification
import mata
# Standard classification
result = mata.run("classify", "image.jpg",
model="microsoft/resnet-50")
print(f"Top prediction: {result.top1.label_name} ({result.top1.score:.2%})")
# Access top-5 predictions
for cls in result.top5:
    print(f"{cls.label_name}: {cls.score:.2%}")
Instance & Panoptic Segmentation
import mata
# Instance segmentation (countable objects)
result = mata.run("segment", "image.jpg",
model="facebook/mask2former-swin-tiny-coco-instance",
threshold=0.5)
# Panoptic segmentation (instances + stuff)
result = mata.run("segment", "image.jpg",
model="facebook/mask2former-swin-tiny-coco-panoptic",
segment_mode="panoptic")
# Filter results
instances = result.get_instances() # Countable objects
stuff = result.get_stuff() # Background regions
Depth Estimation
import mata
result = mata.run("depth", "image.jpg",
model="depth-anything/Depth-Anything-V2-Small-hf",
normalize=True)
# Save with colormap
result.save("depth.png", colormap="magma")
Multi-Task Graph System (New in v1.6)
Execute complex multi-task vision workflows with strongly-typed graphs:
import mata
from mata.nodes import Detect, Filter, PromptBoxes, Fuse
# Load models
detector = mata.load("detect", "IDEA-Research/grounding-dino-tiny")
segmenter = mata.load("segment", "facebook/sam-vit-base")
# Run multi-task graph
result = mata.infer(
image="image.jpg",
graph=[
Detect(using="detector", text_prompts="cat . dog", out="dets"),
Filter(src="dets", score_gt=0.3, out="filtered"),
PromptBoxes(using="segmenter", dets="filtered", out="masks"),
Fuse(dets="filtered", masks="masks", out="final"),
],
providers={"detector": detector, "segmenter": segmenter}
)
# Access results
print(f"Detected {len(result.final.detections)} objects")
print(f"Segmented {len(result.final.masks)} instances")
Run independent tasks in parallel for 1.5-3x speedup:
from mata.nodes import Detect, Classify, EstimateDepth, Fuse
from mata.core.graph import Graph
result = mata.infer(
image="scene.jpg",
graph=Graph("scene_analysis").parallel([
Detect(using="detector", out="dets"),
Classify(using="classifier", text_prompts=["indoor", "outdoor"], out="cls"),
EstimateDepth(using="depth", out="depth"),
]).then(
Fuse(dets="dets", classification="cls", depth="depth", out="scene")
),
providers={
"detector": mata.load("detect", "facebook/detr-resnet-50"),
"classifier": mata.load("classify", "openai/clip-vit-base-patch32"),
"depth": mata.load("depth", "depth-anything/Depth-Anything-V2-Small-hf"),
}
)
Use pre-built presets for common workflows:
from mata.presets import grounding_dino_sam, full_scene_analysis
# One-liner detection + segmentation
result = mata.infer("image.jpg", grounding_dino_sam(), providers={...})
Backward Compatible: All existing mata.load() and mata.run() APIs work unchanged.
See Graph System Guide | API Reference | Cookbook | Examples
Real-World Industry Scenarios
MATA includes 20 ready-to-use scenarios for industrial applications:
| Industry | Scenarios | Example |
|---|---|---|
| Manufacturing | Defect detection, assembly verification, component inspection | from mata.presets import defect_detect_classify |
| Retail | Shelf analysis, product search, stock assessment | from mata.presets import shelf_product_analysis |
| Autonomous Driving | Distance estimation, scene analysis, traffic tracking | from mata.presets import road_scene_analysis |
| Security | Crowd monitoring, suspicious object detection | from mata.presets import crowd_monitoring |
| Agriculture | Disease classification, aerial crop analysis, pest detection | from mata.presets import aerial_crop_analysis |
| Healthcare | ROI segmentation, report generation, pathology triage | See examples |
See Real-World Scenarios Guide | Example Scripts
Object Tracking (New in v1.8)
Track objects across video frames using ByteTrack or BotSort, with persistent track IDs:
import mata
# Track objects in a video file (returns list[VisionResult])
results = mata.track(
"video.mp4",
model="facebook/detr-resnet-50",
tracker="botsort", # or "bytetrack"
conf=0.3,
save=True,
show_track_ids=True,
)
for frame_idx, result in enumerate(results):
    for inst in result.instances:
        print(f"Frame {frame_idx}: Track #{inst.track_id} "
              f"{inst.label_name} ({inst.score:.2f}) @ {inst.bbox}")
Memory-efficient streaming for long videos and RTSP streams:
# stream=True returns a generator, so memory stays constant regardless of video length
for result in mata.track("rtsp://camera/stream",
                         model="facebook/detr-resnet-50",
                         stream=True):
    active = [i for i in result.instances if i.track_id is not None]
    print(f"Active tracks: {len(active)}")
Low-level persistent tracking (persist=True):
import cv2
import mata
tracker = mata.load("track", "facebook/detr-resnet-50", tracker="bytetrack")
cap = cv2.VideoCapture("video.mp4")
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    result = tracker.update(frame, persist=True)
    # result.instances have .track_id set
cap.release()
Graph-based tracking pipelines:
from mata.nodes import Detect, Filter, Track, Annotate
graph = [
Detect(using="detr", out="dets"),
Filter(src="dets", labels=["person", "car"], out="filtered"),
Track(using="botsort", dets="filtered", out="tracks"),
Annotate(using="drawer", dets="tracks", show_track_ids=True, out="annotated"),
]
result = mata.infer(graph=graph, video="video.mp4", providers={...})
Supported Source Types
| Source | Example | Notes |
|---|---|---|
| Video file | "video.mp4" | .mp4, .avi, .mkv, .mov, .wmv |
| RTSP stream | "rtsp://..." | Live camera feeds |
| HTTP stream | "http://..." | IP cameras, web streams |
| Webcam | 0 (int) | Local camera by device index |
| Image directory | "frames/" | Sorted by filename |
| Single image | "image.jpg" | Returns 1-frame result |
| numpy array | np.ndarray | Direct frame input |
| PIL Image | Image.open(...) | Direct frame input |
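Dispatching on the source types in the table above might look roughly like this. `classify_source` is a hypothetical helper for illustration; MATA's real dispatcher may differ in details such as which extensions it accepts:

```python
from pathlib import Path

def classify_source(src):
    """Rough dispatch over the tracking source types (illustrative sketch)."""
    if isinstance(src, int):
        return "webcam"          # device index, e.g. 0
    if isinstance(src, str):
        if src.startswith("rtsp://"):
            return "rtsp"        # live camera feed
        if src.startswith(("http://", "https://")):
            return "http"        # IP camera / web stream
        suffix = Path(src).suffix.lower()
        if suffix in {".mp4", ".avi", ".mkv", ".mov", ".wmv"}:
            return "video"       # video file
        if suffix in {".jpg", ".jpeg", ".png"}:
            return "image"       # single image, 1-frame result
        return "directory"       # folder of frames, sorted by filename
    return "frame"               # numpy array / PIL image passed directly
```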
ByteTrack vs BotSort
| Feature | ByteTrack | BotSort |
|---|---|---|
| Algorithm | Two-stage IoU association | IoU + Global Motion Compensation (GMC) |
| Camera motion | ❌ No | ✅ Sparse optical flow compensation |
| Speed | Faster | Slightly slower |
| Accuracy | Good | Better (especially for panning cameras) |
| Default | No | Yes (MATA default, matches Ultralytics) |
| ReID | ❌ No | ✅ v1.9.2 (reid_model= kwarg) |
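The "two-stage IoU association" row can be made concrete with a short sketch. Both helpers below are illustrative stand-ins, not the vendored tracker code; the confidence thresholds are example values:

```python
def iou(a, b):
    """IoU of two xyxy boxes -- the overlap measure both trackers associate on."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def split_by_confidence(dets, high=0.6, low=0.1):
    """ByteTrack's core idea: associate high-confidence detections first, then
    try to recover lost tracks with the low-confidence remainder instead of
    discarding it outright."""
    first_pass = [d for d in dets if d["score"] >= high]
    second_pass = [d for d in dets if low <= d["score"] < high]
    return first_pass, second_pass
```

BotSort layers global motion compensation (and, from v1.9.2, appearance ReID) on top of the same IoU association.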
Configuration via YAML:
# .mata/models.yaml
models:
  track:
    highway-cam:
      source: "facebook/detr-resnet-50"
      tracker: botsort
      tracker_config:
        track_high_thresh: 0.6
        track_buffer: 60
        frame_rate: 30
tracker = mata.load("track", "highway-cam")
Appearance-Based ReID (New in v1.9.2)
Enable appearance re-identification with BotSort to recover track IDs after occlusion or re-entry:
# Pass any HuggingFace image encoder (ViT, CLIP, OSNet, etc.) as a ReID model
results = mata.track(
"video.mp4",
model="facebook/detr-resnet-50",
tracker="botsort",
reid_model="openai/clip-vit-base-patch32", # appearance encoder
conf=0.3,
save=True,
)
ONNX models are also supported for production deployment:
# Use a local .onnx ReID model for low-latency inference
results = mata.track(
"video.mp4",
model="facebook/detr-resnet-50",
reid_model="osnet_x1_0.onnx", # local ONNX ReID model
)
ReID can also be declared in a config alias:
# .mata/models.yaml
models:
  track:
    smart-cam:
      source: "facebook/detr-resnet-50"
      tracker: botsort
      reid_model: "openai/clip-vit-base-patch32"
      tracker_config:
        track_high_thresh: 0.6
        appearance_thresh: 0.25
tracker = mata.load("track", "smart-cam") # ReID loaded automatically
Cross-camera re-identification via Valkey:
from mata.trackers import ReIDBridge
# Camera 1: publish embeddings
bridge = ReIDBridge("valkey://localhost:6379", camera_id="cam-1")
results = mata.track("rtsp://cam1/stream", model="detr",
                     reid_model="openai/clip-vit-base-patch32",
                     reid_bridge=bridge, stream=True)
# Camera 2: query the nearest identity
bridge2 = ReIDBridge("valkey://localhost:6379", camera_id="cam-2")
# Embeddings from cam-1 are queryable cross-camera with cosine similarity
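The cross-camera matching described above boils down to cosine similarity between appearance embeddings. The helpers below are a self-contained sketch of that idea, not the actual `ReIDBridge` API:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (appearance match score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_identity(query, gallery):
    """Return (track_id, similarity) of the closest stored embedding.

    `gallery` maps track IDs to embeddings published by another camera.
    """
    return max(((tid, cosine_similarity(query, emb)) for tid, emb in gallery.items()),
               key=lambda pair: pair[1])
```

In practice the gallery would live in Valkey and the embeddings would come from the configured ReID encoder; the math is the same.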
See Tracking Examples | Quick Reference
OCR / Text Extraction (New in v1.9)
Extract text from images and documents using any of five backends:
import mata
# EasyOCR: 80+ languages, polygon bounding boxes
result = mata.run("ocr", "document.jpg", model="easyocr")
print(result.full_text)
for region in result.regions:
    print(f"{region.text!r} (confidence: {region.score:.2f}) @ {region.bbox}")
# PaddleOCR: multilingual, strong on non-Latin scripts
result = mata.run("ocr", "document.jpg", model="paddleocr", lang="zh")
# Tesseract: classic open-source engine
result = mata.run("ocr", "scan.tiff", model="tesseract", lang="eng")
# HuggingFace GOT-OCR2: state-of-the-art end-to-end OCR
result = mata.run("ocr", "document.jpg",
                  model="stepfun-ai/GOT-OCR-2.0-hf")
# HuggingFace TrOCR: best for single text-line crops
result = mata.run("ocr", "line_crop.png",
                  model="microsoft/trocr-base-handwritten")
Export results in multiple formats:
result.save("output.txt") # plain text
result.save("output.csv") # CSV: text, score, x1, y1, x2, y2
result.save("output.json") # structured JSON
result.save("overlay.png") # image with bbox overlays
# Filter by confidence
high_conf = result.filter_by_score(0.85)
Pipeline: detect text regions, then OCR each crop:
from mata.nodes import Detect, Filter, ExtractROIs, OCR, Fuse
graph = (
Detect(using="detector", out="dets")
>> Filter(src="dets", label_in=["sign", "license_plate"], out="filtered")
>> ExtractROIs(src_dets="filtered", out="rois")
>> OCR(using="ocr_engine", src="rois", out="ocr_result")
>> Fuse(out="final", dets="filtered", ocr="ocr_result")
)
result = mata.infer(graph, image="street.jpg", providers={
"detector": mata.load("detect", "facebook/detr-resnet-50"),
"ocr_engine": mata.load("ocr", "easyocr"),
})
Vision-Language Understanding
import mata
# Image description
result = mata.run("vlm", "image.jpg",
model="Qwen/Qwen3-VL-2B-Instruct",
prompt="Describe this image in detail.")
print(result.text)
# Visual question answering
result = mata.run("vlm", "image.jpg",
model="Qwen/Qwen3-VL-2B-Instruct",
prompt="How many people are in this image?")
print(f"Answer: {result.text}")
# Domain-specific analysis with system prompt
result = mata.run("vlm", "product.jpg",
model="Qwen/Qwen3-VL-2B-Instruct",
prompt="Describe any defects you see.",
system_prompt="You are a manufacturing quality inspector. Be precise and technical.")
# Structured output parsing (v1.5.4+)
result = mata.run("vlm", "image.jpg",
model="Qwen/Qwen3-VL-2B-Instruct",
prompt="List all objects you see.",
output_mode="detect")
for entity in result.entities:
    print(f"{entity.label}: {entity.score:.2f}")
# Auto-promotion: VLM bboxes → Instance objects (v1.5.4+)
# Enables direct comparison with spatial detection models
result = mata.run("vlm", "image.jpg",
                  model="Qwen/Qwen3-VL-2B-Instruct",
                  prompt="Detect objects with bboxes in JSON format.",
                  output_mode="detect",
                  auto_promote=True)  # Bboxes in JSON → Instance objects
for instance in result.instances:
    print(f"{instance.label_name}: {instance.bbox} ({instance.score:.2f})")
# Multi-image comparison (v1.5.4+)
result = mata.run("vlm", "image1.jpg",
model="Qwen/Qwen3-VL-2B-Instruct",
images=["image2.jpg"],
prompt="What's different between these images?")
print(result.text)
📊 Available Detection Models
Transformer Models (HuggingFace)
- RT-DETR (PekingU/rtdetr_*): Fast, anchor-free transformer (recommended)
- DETR (facebook/detr-*): Original detection transformer
- Grounding DINO (IDEA-Research/grounding-dino-*): Zero-shot detection
CNN Models (Torchvision - Apache 2.0)
- RetinaNet (torchvision/retinanet_resnet50_fpn): Fast, single-stage (~40 FPS)
- Faster R-CNN (torchvision/fasterrcnn_resnet50_fpn_v2): High accuracy (~25 FPS)
- FCOS (torchvision/fcos_resnet50_fpn): Anchor-free detection (~30 FPS)
- SSD (torchvision/ssd300_vgg16): Very fast, mobile-friendly (~60 FPS)
| Model | Type | mAP (COCO) | Speed (RTX 3080) | License |
|---|---|---|---|---|
| RT-DETR R18 | Transformer | 40.7 | ~50 FPS | Apache 2.0 |
| RetinaNet | CNN | 39.8 | ~40 FPS | Apache 2.0 |
| Faster R-CNN V2 | CNN | 42.2 | ~25 FPS | Apache 2.0 |
| DETR ResNet-50 | Transformer | 42.0 | ~20 FPS | Apache 2.0 |
Universal Model Loading
MATA features llama.cpp-style universal loading - load any model by path or ID:
import mata
# Load from HuggingFace (auto-detects model type)
detector = mata.load("detect", "facebook/detr-resnet-50")
# Load from local ONNX file
detector = mata.load("detect", "model.onnx", threshold=0.4)
# Load from config alias (setup in ~/.mata/models.yaml)
detector = mata.load("detect", "rtdetr-fast")
# Run inference
result = detector.predict("image.jpg")
🎯 Zero-Shot Capabilities
Perform vision tasks without training on specific classes - just provide text prompts.
Zero-Shot Classification (CLIP)
Classify images into runtime-defined categories:
import mata
# Basic zero-shot classification
result = mata.run("classify", "image.jpg",
model="openai/clip-vit-base-patch32",
text_prompts=["cat", "dog", "bird", "car"])
print(f"Prediction: {result.top1.label_name} ({result.top1.score:.2%})")
# Use ensemble templates for +2-5% accuracy
result = mata.run("classify", "image.jpg",
model="openai/clip-vit-base-patch32",
text_prompts=["cat", "dog", "bird"],
template="ensemble") # 7 prompt variations
When to use CLIP:
- Novel categories not in ImageNet
- Dynamic category lists (user-defined)
- Rapid prototyping without training
- Domain-specific classification
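The `template="ensemble"` option above works by scoring several phrasings per label and averaging. A minimal sketch of the expansion step, with made-up templates (MATA's actual set of 7 variations may differ):

```python
# Example templates -- illustrative only, not MATA's exact ensemble set
TEMPLATES = [
    "a photo of a {}",
    "a close-up photo of a {}",
    "a blurry photo of a {}",
]

def build_prompts(labels):
    """Expand each label into several phrasings; CLIP logits are then averaged
    over the variants, which typically buys a few points of accuracy."""
    return {label: [t.format(label) for t in TEMPLATES] for label in labels}
```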
Zero-Shot Detection (GroundingDINO / OWL-ViT)
Detect objects using text descriptions:
import mata
# Text-prompted object detection
result = mata.run("detect", "image.jpg",
model="IDEA-Research/grounding-dino-tiny",
text_prompts="cat . dog . person . car", # Dot-separated
threshold=0.25)
# Multiple phrases
result = mata.run("detect", "image.jpg",
model="google/owlv2-base-patch16",
text_prompts="red apple . green apple . banana")
When to use text-prompted detection:
- Detecting objects not in COCO dataset (80 classes)
- Attribute-specific detection ("red car", "open door")
- Rare object detection without training data
Zero-Shot Segmentation (SAM / SAM3)
Segment any object with prompts:
import mata
# Point-based segmentation (click coordinates)
result = mata.run("segment", "image.jpg",
model="facebook/sam-vit-base",
point_prompts=[(320, 240, 1)], # (x, y, foreground)
threshold=0.8)
# Box-based segmentation (region of interest)
result = mata.run("segment", "image.jpg",
model="facebook/sam-vit-base",
box_prompts=[(50, 50, 400, 400)]) # (x1, y1, x2, y2)
# Text-based segmentation (SAM3 - 270K+ concepts)
result = mata.run("segment", "image.jpg",
model="facebook/sam3",
text_prompts="cat",
threshold=0.5)
When to use SAM:
- Interactive segmentation workflows
- Novel object segmentation (not in training sets)
- Class-agnostic mask generation
- Combined with detection pipelines
🎭 Segmentation Tasks
MATA supports four segmentation modes: instance, panoptic, semantic, and zero-shot (SAM).
Instance, Panoptic, and Semantic Segmentation
import mata
# Instance segmentation (countable objects)
result = mata.run(
"segment",
"image.jpg",
model="facebook/mask2former-swin-tiny-coco-instance",
threshold=0.5
)
# Panoptic segmentation (instances + stuff)
result = mata.run(
"segment",
"image.jpg",
model="facebook/mask2former-swin-tiny-coco-panoptic",
segment_mode="panoptic"
)
# Semantic segmentation (per-pixel classification)
result = mata.run(
"segment",
"image.jpg",
model="nvidia/segformer-b0-finetuned-ade-512-512"
)
# Filter results
instances = result.get_instances() # Countable objects
stuff = result.get_stuff() # Background regions
Zero-Shot Segmentation (SAM - Segment Anything)
New in v1.5.1: SAM enables prompt-based segmentation without training on specific classes.
import mata
# Point prompt (click on object)
result = mata.run(
"segment",
"image.jpg",
model="facebook/sam-vit-base",
point_prompts=[(320, 240, 1)], # (x, y, foreground)
threshold=0.8 # Filter low-quality masks
)
# Box prompt (region of interest)
result = mata.run(
"segment",
"image.jpg",
model="facebook/sam-vit-base",
box_prompts=[(50, 50, 400, 400)] # (x1, y1, x2, y2)
)
# Combined prompts for refinement
result = mata.run(
"segment",
"image.jpg",
model="facebook/sam-vit-base",
point_prompts=[(320, 240, 1), (100, 100, 0)], # Foreground + background
box_prompts=[(50, 50, 450, 450)],
threshold=0.85
)
# SAM returns 3 masks per prompt - use the best one
best_mask = result.masks[0] # Already sorted by IoU score
print(f"IoU: {best_mask.score:.3f}, Area: {best_mask.area} pixels")
SAM Features:
- ✅ Class-agnostic (segments any object with prompts)
- ✅ Multi-mask output (3 predictions per prompt with IoU scores)
- ✅ Point prompts (foreground/background clicks)
- ✅ Box prompts (rectangular regions)
- ✅ Combined prompting for refinement
When to use SAM:
- Segmenting novel objects (not in COCO/ADE20K datasets)
- Interactive segmentation workflows
- When class labels are not needed
- Prompt-based mask generation
SAM Models:
- facebook/sam-vit-base: Fast, good quality (recommended for prototyping)
- facebook/sam-vit-large: Slower, better quality
- facebook/sam-vit-huge: Slowest, best quality
See examples/segment/sam_segment.py for comprehensive SAM examples.
📚 Available Models
Object Detection
| Model | HuggingFace ID | Runtime Support | Description |
|---|---|---|---|
| DETR | facebook/detr-resnet-50 | PyTorch ✅ ONNX ✅ | Original transformer detector |
| Conditional DETR | microsoft/conditional-detr-resnet-50 | PyTorch ✅ | Fast convergence |
| GroundingDINO (zero) | IDEA-Research/grounding-dino-tiny | PyTorch ✅ | Text-prompted detection |
| OWL-ViT v2 (zero) | google/owlv2-base-patch16 | PyTorch ✅ | Open-vocabulary detection |
Image Classification
| Model | HuggingFace ID | Runtime Support | Description |
|---|---|---|---|
| ResNet | microsoft/resnet-50 | PyTorch ✅ ONNX ✅ TorchScript ✅ | Classic CNN architecture |
| ViT | google/vit-base-patch16-224 | PyTorch ✅ ONNX ✅ TorchScript ✅ | Vision transformer |
| ConvNeXt | facebook/convnext-base-224 | PyTorch ✅ ONNX ✅ | Modern CNN |
| EfficientNet | google/efficientnet-b0 | PyTorch ✅ ONNX ✅ | Efficient scaling |
| Swin | microsoft/swin-base-patch4-window7-224 | PyTorch ✅ | Hierarchical transformer |
| CLIP (zero) | openai/clip-vit-base-patch32 | PyTorch ✅ | Zero-shot classification |
Instance & Panoptic Segmentation
| Model | HuggingFace ID | Mode | Description |
|---|---|---|---|
| Mask2Former | facebook/mask2former-swin-tiny-coco-instance | Instance | High-quality instance masks |
| Mask2Former | facebook/mask2former-swin-tiny-coco-panoptic | Panoptic | Instance + stuff regions |
| MaskFormer | facebook/maskformer-swin-tiny-ade | Instance | Unified segmentation |
| SAM (zero) | facebook/sam-vit-base | Prompt-based | Point/box prompts |
| SAM3 (zero) | facebook/sam3 | Text prompts | 270K+ concepts |
Depth Estimation
| Model | HuggingFace ID | Description |
|---|---|---|
| Depth Anything V2 | depth-anything/Depth-Anything-V2-Small-hf | Fast, good quality (recommended) |
| Depth Anything V2 | depth-anything/Depth-Anything-V2-Base-hf | Balanced speed/quality |
| Depth Anything V1 | LiheYoung/depth-anything-small-hf | Original version |
Vision-Language Models
| Model | HuggingFace ID | Description |
|---|---|---|
| Qwen3-VL 2B | Qwen/Qwen3-VL-2B-Instruct | Fast, chat-capable VLM (recommended for dev) |
All models support:
- Single-image and batch inference
- PIL Images, file paths, or numpy arrays as input
- Configurable thresholds and parameters
- GPU and CPU execution
- Consistent output formats (xyxy bbox, RLE masks, etc.)
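"Consistent output formats" means, for example, that boxes are always xyxy. Converting to the xywh convention some tools expect is a two-liner; these helpers are generic utilities, not part of MATA's API:

```python
def xyxy_to_xywh(box):
    """Convert an (x1, y1, x2, y2) box to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def xywh_to_xyxy(box):
    """Convert an (x, y, width, height) box back to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)
```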
🏗️ Architecture
MATA separates concerns into layers:
User Code (mata.run() / mata.load() / mata.infer())
        │
UniversalLoader (auto-detection chain)
        │
Task Adapters (HuggingFace / ONNX / TorchScript / PyTorch)
        │                              │
VisionResult (single-task)        Graph System (multi-task, v1.6)
├── instances: List[Instance]     ├── Artifact type system
├── entities: List[Entity]        ├── Task nodes & providers
└── text: str (VLM responses)     ├── Parallel/conditional execution
        │                         └── MultiResult output
Runtime Layer (PyTorch / ONNX Runtime / TorchScript)
        │
Export & Visualization (JSON/CSV/Image/Crops)
Multi-Modal Pipeline Support:
GroundingDINO (text → bbox) + SAM (bbox → mask) = Instance Segmentation
VLM (semantic) → GroundingDINO (spatial) → SAM (masks) = VLM-Grounded Segmentation
Key Concepts
- Task: What you want to accomplish (detect, segment, classify, depth, vlm, pipeline)
- Adapter: Model-specific implementation of a task contract
- UniversalLoader: Auto-detects model source (HuggingFace ID, local file, config alias)
- VisionResult: Unified result type supporting multi-modal outputs
- Instance: Spatial detections/segments (bbox + mask + embedding)
- Entity: Semantic labels from VLM output (label + score + attributes)
- Graph (v1.6): Directed acyclic graph of typed nodes with auto-wiring
- Node: Typed processing unit (Detect, Filter, Fuse, etc.)
- Provider: Protocol-based model wrapper with capability detection
- MultiResult: Bundled output from multi-task graph execution
- Runtime: Execution backend (PyTorch, ONNX Runtime, TorchScript, TensorRT planned)
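The Adapter/Provider idea, that any model implementing a task contract is interchangeable, can be sketched with a `typing.Protocol`. This is an illustrative model of the concept, not MATA's actual `Provider` definition:

```python
from typing import Any, List, Protocol, runtime_checkable

@runtime_checkable
class DetectProvider(Protocol):
    """Hypothetical task contract: anything with this predict() shape
    can back the 'detect' task, regardless of framework."""
    def predict(self, image: Any) -> List[dict]: ...

class DummyDetector:
    """Stand-in adapter: structurally satisfies the contract above."""
    def predict(self, image):
        return [{"label": "cat", "score": 0.9, "bbox": (0, 0, 10, 10)}]
```

Because the check is structural, swapping a HuggingFace adapter for an ONNX one never touches calling code, which is the model-agnostic guarantee in the feature list.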
💾 Export & Visualization
Save Results in Multiple Formats
import mata
result = mata.run("detect", "image.jpg")
# Auto-format detection based on file extension
result.save("output.json") # JSON with detections
result.save("output.csv") # CSV table format
result.save("output.png") # Image with bbox overlays
# Export detection crops
result.save("crops/", crop_dir="crops/") # Individual object images
Visualization Options
from mata import visualize_segmentation
# Segmentation visualization with custom settings
result = mata.run("segment", "image.jpg",
model="facebook/mask2former-swin-tiny-coco-instance")
vis = visualize_segmentation(
result,
"image.jpg",
alpha=0.6, # Mask transparency
show_boxes=True, # Show bounding boxes
show_labels=True, # Show class labels
show_scores=True, # Show confidence scores
backend="matplotlib" # or "pil" (default)
)
vis.show()
vis.save("segmentation_output.png")
Depth Visualization with Colormaps
result = mata.run("depth", "image.jpg",
model="depth-anything/Depth-Anything-V2-Small-hf",
normalize=True)
# Save with different colormaps
result.save("depth_magma.png", colormap="magma") # Hot colors
result.save("depth_viridis.png", colormap="viridis") # Blue-green
result.save("depth_plasma.png", colormap="plasma") # Purple-yellow
Auto-Image Storage
MATA automatically stores the input image path in result metadata:
result = mata.run("detect", "my_image.jpg")
result.save("output.png")  # ✅ auto-uses the stored input path
# No need to pass the image path again!
🧪 Testing
# Run all tests (4047+ total)
pytest tests/ -v
# Run with coverage (target: >80%)
pytest --cov=mata --cov-report=html
# Run specific test suites
pytest tests/test_classify_adapter.py -v # Classification (35 tests)
pytest tests/test_segment_adapter.py -v # Segmentation (30 tests)
pytest tests/test_clip_adapter.py -v # CLIP zero-shot (46 tests)
pytest tests/test_sam_adapter.py -v # SAM segmentation (19 tests)
pytest tests/test_torchvision_detect_adapter.py -v # Torchvision CNN (32 tests)
pytest tests/test_vlm_adapter.py -v # VLM (63 tests, 57 unit + 6 slow integration)
# Tracking test suites (v1.8)
pytest tests/test_trackers/ -v # Vendored trackers (354 tests)
pytest tests/test_tracking_adapter.py -v # TrackingAdapter (73 tests)
pytest tests/test_track_api.py -v # mata.track() public API (62 tests)
pytest tests/test_tracking_visualization.py -v # Visualization & export (103 tests)
pytest tests/test_video_io.py -v # Video I/O utilities (56 tests)
Test Coverage by Task:
- Detection: 72+ tests (HuggingFace, ONNX, TorchScript, Torchvision, zero-shot)
- Classification: 35+ tests (multi-format, CLIP)
- Segmentation: 30+ tests (instance, panoptic, SAM, SAM3)
- Depth: 10+ tests (Depth Anything V1/V2)
- VLM: 63 tests (57 unit tests + 6 slow integration tests, includes structured output + multi-image)
- Graph System: 1000+ tests (nodes, scheduler, conditionals, presets, infer API, backward compat)
- Universal Loader: 17 tests (5-strategy detection chain)
- Object Tracking: 687 tests (vendored trackers, adapter, API, visualization, video I/O)
🧪 Validation & Evaluation
mata.val() evaluates any model against a labeled dataset and returns structured metrics for all four supported tasks.
Quick Reference
import mata
# Detection: COCO mAP
metrics = mata.val(
"detect",
model="facebook/detr-resnet-50",
data="examples/configs/coco.yaml",
iou=0.50,
conf=0.001,
verbose=True,
save_dir="runs/val/detect",
)
print(f"mAP50 : {metrics.box.map50:.3f}")
print(f"mAP50-95 : {metrics.box.map:.3f}")
# Segmentation: box + mask mAP
metrics = mata.val(
"segment",
model="shi-labs/oneformer_coco_swin_large",
data="examples/configs/coco.yaml",
verbose=True,
)
print(f"Box mAP50 : {metrics.box.map50:.3f}")
print(f"Mask mAP50 : {metrics.seg.map50:.3f}")
# Classification: top-k accuracy
metrics = mata.val(
"classify",
model="microsoft/resnet-101",
data="examples/configs/imagenet.yaml",
verbose=True,
)
print(f"Top-1 : {metrics.top1:.1%}")
print(f"Top-5 : {metrics.top5:.1%}")
# Depth: AbsRel / RMSE / δ thresholds
metrics = mata.val(
"depth",
model="depth-anything/Depth-Anything-V2-Small-hf",
data="examples/configs/diode.yaml",
verbose=True,
)
print(f"AbsRel : {metrics.abs_rel:.4f}")
print(f"RMSE : {metrics.rmse:.4f}")
print(f"δ < 1.25 : {metrics.delta_1:.1%}")
Sample Console Output (detection, verbose=True)
Class Images Instances P R mAP50 mAP50-95
all 5000 36335 0.512 0.487 0.491 0.312
person 5000 10777 0.681 0.621 0.643 0.421
car 5000 1918 0.574 0.531 0.548 0.387
...
Speed: pre-process 2.1ms, inference 48.3ms, post-process 1.8ms per image
Results saved to runs/val/detect/
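The P and R columns above come from matching predictions to ground truth at a fixed IoU threshold. A simplified greedy version of that matching step (pycocotools does this more carefully, per class and over many IoU thresholds):

```python
def greedy_match(preds, gts, iou_thresh=0.5):
    """Greedily match predictions (highest score first) to ground-truth boxes
    at a fixed IoU threshold, then report precision and recall.

    Boxes are xyxy tuples; preds are (box, score) pairs.
    Illustrative sketch -- not the pycocotools evaluator MATA-style mAP uses.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        if inter == 0.0:
            return 0.0
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    unmatched = set(range(len(gts)))
    true_positives = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = None, iou_thresh
        for gi in unmatched:
            overlap = iou(box, gts[gi])
            if overlap >= best_iou:
                best, best_iou = gi, overlap
        if best is not None:
            unmatched.discard(best)  # each ground-truth box matches at most once
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(gts) if gts else 0.0
    return precision, recall
```

mAP50 then averages precision over recall levels at IoU 0.5, and mAP50-95 additionally averages over IoU thresholds from 0.5 to 0.95.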
Dataset YAML Format
Create a YAML file that points mata.val() at your dataset. The format is YOLO-compatible:
# examples/configs/coco.yaml
path: /data/coco # root directory (absolute or relative to CWD)
val: images/val2017 # sub-directory containing validation images
annotations: annotations/instances_val2017.json # COCO-format annotation JSON
names: # optional: class-index → name mapping
  0: person
  1: bicycle
  2: car
  # ...
For classification and depth, the format is the same โ only the annotation content differs:
# examples/configs/imagenet.yaml
path: /data/imagenet
val: val
annotations: imagenet_val_labels.json # {filename: class_index} mapping
# examples/configs/diode.yaml
path: /data/diode
val: val/indoors
annotations: diode_val_depth.json # {filename: depth_npy_path} mapping
Downloading Datasets
COCO (Detection & Segmentation)
The COCO 2017 validation split is ~1 GB of images plus ~240 MB of annotations.
# Create a root directory
mkdir -p /data/coco && cd /data/coco
# Validation images (1 GB)
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip                       # → val2017/
mkdir -p images && mv val2017 images/   # match the layout expected by coco.yaml
# Annotations (241 MB; includes instances, captions, keypoints)
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip      # → annotations/
Expected layout:
/data/coco/
├── images/val2017/    # ~5,000 validation images
└── annotations/
    └── instances_val2017.json
Point coco.yaml at this directory:
path: /data/coco
val: images/val2017
annotations: annotations/instances_val2017.json
ImageNet (Classification)
The ImageNet ILSVRC 2012 validation split requires a free account on image-net.org. Once downloaded:
mkdir -p /data/imagenet && cd /data/imagenet
# Unpack the official validation archive (supplied by image-net.org)
tar -xvf ILSVRC2012_img_val.tar # โ val/ (50 000 JPEG images)
# Unpack the bbox annotations (also supplied by image-net.org)
tar -xvf ILSVRC2012_bbox_val_v3.tgz # โ bbox_val/ (50 000 XML files)
# Generate the MATA annotation file from the XML ground truth
cd /path/to/MATA
python scripts/generate_imagenet_val_labels.py
# → data/imagenet/imagenet_val_labels.json (50 000 entries, 1 000 classes)
Expected layout:
/data/imagenet/
├── val/                         # 50 000 validation images (flat directory)
├── bbox_val/                    # 50 000 XML ground-truth annotations
└── imagenet_val_labels.json     # {filename: class_index} mapping (generated)
Point imagenet.yaml at this directory:
path: /data/imagenet
val: val
annotations: imagenet_val_labels.json
Tip: Many research groups mirror a 10 000-image mini-val subset. Any folder of ImageNet validation images with a matching imagenet_val_labels.json file will work.
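The conversion is handled by scripts/generate_imagenet_val_labels.py; its core step, reading the ground-truth WordNet ID from each ILSVRC validation XML and mapping it to a class index, can be sketched as below. `label_from_xml`, the sample XML, and the tiny WNID mapping are illustrative, not taken from the script:

```python
import xml.etree.ElementTree as ET

def label_from_xml(xml_text: str, wnid_to_index: dict) -> int:
    """Read the ground-truth WordNet ID (e.g. 'n01751748') from an
    ILSVRC validation XML and map it to a class index."""
    root = ET.fromstring(xml_text)
    wnid = root.find("./object/name").text
    return wnid_to_index[wnid]

sample = """<annotation>
  <filename>ILSVRC2012_val_00000001</filename>
  <object><name>n01751748</name></object>
</annotation>"""
index = label_from_xml(sample, {"n01751748": 65})
```

Running this over all 50 000 XML files and dumping `{filename: index}` to JSON yields the annotation file the YAML points at.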
DIODE (Depth Estimation)
DIODE provides high-quality outdoor and indoor depth maps. Download the validation split only (~5 GB):
mkdir -p /data/diode && cd /data/diode
# Full validation split (indoors + outdoors)
wget http://diode-dataset.s3.amazonaws.com/val.tar.gz
tar -xzf val.tar.gz   # → val/indoors/ and val/outdoors/
# Generate the MATA annotation file from the dataset
python scripts/build_diode_depth_gt.py --diode-root /data/diode/val --output /data/diode/diode_val_depth.json
Expected layout:
/data/diode/
├── val/
│   ├── indoors/                 # scene_*/scan_*/ → RGB .png + depth .npy pairs
│   └── outdoors/
└── diode_val_depth.json         # {rel_rgb_path: rel_depth_npy_path} mapping
Point diode.yaml at this directory:
path: /data/diode
val: val/indoors # or val/outdoors for outdoor scenes
annotations: diode_val_depth.json
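The repository's scripts/build_diode_depth_gt.py produces the mapping file; the pairing logic it presumably implements is a directory walk that matches each `<name>.png` with its sibling `<name>_depth.npy`. A minimal sketch (`build_depth_gt` is a hypothetical stand-in, not the actual script):

```python
import os

def build_depth_gt(diode_root: str) -> dict:
    """Pair each RGB .png with its sibling depth map. DIODE stores
    <name>.png next to <name>_depth.npy inside scene_*/scan_* folders."""
    mapping = {}
    for dirpath, _, files in os.walk(diode_root):
        names = set(files)
        for f in sorted(files):
            if f.endswith(".png"):
                depth = f[:-4] + "_depth.npy"
                if depth in names:
                    rel = os.path.relpath(dirpath, diode_root)
                    mapping[os.path.join(rel, f)] = os.path.join(rel, depth)
    return mapping
```

Serializing the returned dict with `json.dump` gives a diode_val_depth.json in the expected `{rel_rgb_path: rel_depth_npy_path}` shape.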
Standalone Mode (pre-run predictions)
Compute metrics from predictions collected earlier; useful when inference runs separately (e.g. on a GPU cluster):
# Step 1: collect predictions however you like
predictions = [mata.run("detect", img, model="facebook/detr-resnet-50") for img in images]
# Step 2: score against ground truth without re-running inference
metrics = mata.val(
"detect",
predictions=predictions,
ground_truth="annotations.json", # COCO-format JSON
iou=0.50,
conf=0.001,
verbose=True,
)
See examples/validation.py for complete, runnable examples of all four tasks plus the standalone workflow.
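mata.val computes full COCO-style metrics internally; to make the scoring step concrete, here is a self-contained sketch of greedy IoU matching at a single threshold for one image. This is an illustration of the general technique, not MATA's actual implementation:

```python
def iou(a, b):
    """IoU of two boxes in xyxy format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(preds, gt_boxes, iou_thr=0.5):
    """Greedily match predictions (highest score first) to unmatched
    ground-truth boxes at a single IoU threshold."""
    tp, used = 0, set()
    for p in sorted(preds, key=lambda p: -p["score"]):
        best, best_iou = None, iou_thr
        for i, g in enumerate(gt_boxes):
            if i not in used and iou(p["bbox"], g) >= best_iou:
                best, best_iou = i, iou(p["bbox"], g)
        if best is not None:
            used.add(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall

preds = [{"bbox": [0, 0, 10, 10], "score": 0.9},
         {"bbox": [50, 50, 60, 60], "score": 0.8}]
gt = [[0, 2, 10, 12], [100, 100, 110, 110]]
p, r = precision_recall(preds, gt)   # one TP, one FP, one FN
```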
Result Format
All detection results follow a consistent format:
{
"detections": [
{
"bbox": [x1, y1, x2, y2],
"score": 0.95,
"label": 0,
"label_name": "person"
}
],
"meta": {
"model_id": "facebook/detr-resnet-50",
"threshold": 0.4,
"device": "cuda"
}
}
Coordinate System: All bboxes are in xyxy format with absolute pixel coordinates.
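Because boxes are absolute-pixel xyxy, converting to other conventions is trivial; for example, COCO tooling expects xywh. A small illustrative helper (not part of MATA):

```python
def xyxy_to_coco(bbox):
    """Convert [x1, y1, x2, y2] to COCO-style [x, y, width, height]."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]

# A detections list in the documented result format
detections = [{"bbox": [120, 40, 220, 300], "score": 0.95,
               "label": 0, "label_name": "person"}]
coco_boxes = [xyxy_to_coco(d["bbox"]) for d in detections]
# coco_boxes == [[120, 40, 100, 260]]
```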
⚙️ Configuration
Override defaults via configuration file or environment:
from mata import MATAConfig, set_config
config = MATAConfig(
default_device="cuda",
default_models={"detect": "dino"},
cache_dir="/path/to/cache",
log_level="DEBUG"
)
set_config(config)
Or via environment variable:
export MATA_CONFIG=/path/to/config.json
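How MATA merges the MATA_CONFIG file with its defaults is internal to the library, but the pattern itself is just an environment lookup plus JSON parsing. A sketch (`load_config_from_env` is illustrative, not MATA's own loader):

```python
import json
import os
import tempfile

def load_config_from_env(env_var: str = "MATA_CONFIG") -> dict:
    """Read a JSON config from the file named by $MATA_CONFIG, if set."""
    path = os.environ.get(env_var)
    if not path:
        return {}
    with open(path) as f:
        return json.load(f)

# Demo: point MATA_CONFIG at a temporary JSON file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"default_device": "cuda", "log_level": "DEBUG"}, f)
os.environ["MATA_CONFIG"] = f.name
cfg = load_config_from_env()
```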
🛣️ Roadmap
For a full history of completed features, see CHANGELOG.md.
In Progress
1. Tracking ReID: Re-identification for cross-camera / occlusion recovery
   - ⏳ ReID model integration: Feature embeddings via HuggingFace ReID models
   - ⏳ Cross-camera tracking: Match track IDs across camera feeds
   - ⏳ BotSort ReID mode: Enable with_reid=true in the botsort config
   - Status: Planned for v1.9.x
2. KACA Integration: MIT-licensed CNN detection with PyTorch and ONNX support
   - ⏳ KACA PyTorch Adapter: Direct PyTorch model loading and inference
   - ⏳ KACA ONNX Adapter: Production deployment without PyTorch dependency
   - ⏳ Universal Loader Integration: mata.load("detect", "kaca/det-s")
   - ⏳ Performance: Target ~45 FPS on GPU for 640x640 input (10-20x faster than DETR)
   - Status: Waiting for the KACA Development Team to release HuggingFace models and ONNX exports (expected Q3 2026).
3. HuggingFace Hub Features: Enhanced model discovery and management
   - Local Model Indexing: Cache HuggingFace Hub queries for offline usage
   - Model Recommendations: Suggest best models based on task and hardware constraints
   - Batch Model Download: Pre-download common models for air-gapped environments
   - Enhanced Search: Filter by task, license, and performance metrics
   - Status: Planned for v2.x
⏳ Planned (v2.0 - Q2 2026)
- 🔲 Training Module: Fine-tuning support for detection and classification
- 🔲 TensorRT Production: Optimized inference for NVIDIA GPUs (10-50x speedup)
- 🔲 Mobile Deployment: ONNX quantization and mobile runtime support (TFLite, CoreML)
- 🔲 Model Zoo: Pre-trained weights for common tasks and datasets
- 🔲 Benchmark Suite: Performance comparisons across models and runtimes
- 🔲 Video Processing Pipeline: Optimized frame-by-frame inference with temporal cache
- 🔲 Breaking Changes: Remove all deprecated APIs, clean v2.0 architecture
🔬 Research Concepts (v2.x)
V2L HyperLoRA: Vision-to-LoRA fast-lane adaptation for new object classes
Inspired by Doc-to-LoRA (Sakana AI, 2026), V2L HyperLoRA adapts the hypernetwork concept from language to vision: a pre-trained hypernetwork takes 5-20 annotated images of a new object and generates RT-DETR LoRA weights in a single forward pass (~100 ms). No retraining required.
- 🔬 HyperLoRA generator: Hypernetwork produces LoRA (A, B) matrices + class-head bias for 34 RT-DETR Linear layers
- 🔬 DINOv2 visual context encoder: Frozen DINOv2-Small encodes exemplar crops + Fourier bbox positional encoding into dense context features
- 🔬 Dynamic class head extension: The class_embed Linear layer is extended for new classes while preserving COCO-80 detection
- 🔬 Eval-mode LoRA patching: Thread-safe monkey-patching of nn.Linear.forward; no gradient tracking, reversible via reset()
- 🔬 LoRA adapter serialization: ~50-200 KB .safetensors files deployable to edge devices
- 🔬 mata.load("detect", ..., fast_lane=True) API: Returns a ModulatedDetector with an .adapt(examples) / .predict() / .reset() lifecycle
- 🔬 Fast lane / slow lane architecture: The fast lane covers the detection gap instantly; the slow lane (full retrain) eventually retires the adapter
- 🔬 Scope: RT-DETR (R18/R50/R101, v1/v2) only; backbone frozen, 34 encoder/decoder Linear layers targeted
- Status: Concept research; hypernetwork meta-training pipeline and dataset design in progress
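The adapter math underlying this idea is simple: a LoRA-patched layer behaves as if its weight were W + (α/r)·B·A, so applying an adapter is pure arithmetic and reverting it just drops the delta. A toy pure-Python illustration of that update (dimensions and values are made up; the real system targets 34 RT-DETR Linear layers):

```python
def matmul(A, B):
    """Plain-Python matrix multiply for the toy example below."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_delta(B, A, alpha=1.0, r=1):
    """Low-rank weight update (alpha / r) * B @ A, the delta a
    hypernetwork-generated adapter adds to a frozen Linear weight."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[1.0], [2.0]]             # out_features x r  (rank r = 1)
A = [[0.5, 0.5]]               # r x in_features
delta = lora_delta(B, A)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
# A reset() call would simply discard `delta` and fall back to W.
```

Because the update is rank-r, only B and A (a few hundred KB) need to be generated and shipped, not a full weight matrix.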
🔮 Future Exploration (v2.5+)
- 🔮 3D Vision: Point cloud, depth fusion, 3D object detection
- 🔮 Enhanced VLM: Multi-image support, video understanding, structured output parsing
- 🔮 Edge Deployment: Raspberry Pi, Jetson Nano optimization
- 🔮 Auto-ML: Neural architecture search for custom tasks
License
MATA is licensed under the Apache License 2.0. See LICENSE and NOTICE for full details.
⚠️ Model Weights Are Not Covered by This License
MATA is a software framework only; it does not distribute model weights. When you load a model via mata.load(), the weights are fetched from HuggingFace Hub or your local filesystem and are governed by their own separate licenses. You are responsible for complying with those terms.
Common model licenses you may encounter:
| License | Examples | Key Restriction |
|---|---|---|
| Apache 2.0 | DETR, RT-DETR, GroundingDINO, SAM, Depth Anything V2, Mask2Former | ✅ Permissive; commercial use allowed |
| MIT | CLIP (OpenAI) | ✅ Permissive; commercial use allowed |
| Tongyi Qianwen | Qwen-VL series | ⚠️ Custom terms; review before commercial use |
| Meta Community License | LLaMA-based VLMs | ⚠️ Usage thresholds and restrictions apply |
| CC-BY-NC | Various fine-tuned models | 🔴 Non-commercial use only |
| AGPL-3.0 | Some open-source fine-tunes | 🔴 Copyleft; derivative works must be open-sourced |
Some models on HuggingFace Hub require you to accept gated terms before downloading. MATA passes your HuggingFace credentials to the Hub client but does not validate or enforce model-specific license agreements on your behalf.
🤝 Contributing
Contributions welcome! Please ensure:
- All code is Apache 2.0 compatible
- Tests pass and coverage >80%
- Code follows Black formatting
- Type hints included
Documentation
- Architecture Document - System architecture and design
- Zero-Shot Detection Guide - SAM, GroundingDINO, CLIP zero-shot guide
- Quick Start Guide - Getting started with MATA
- Installation Guide - Detailed installation instructions
- Config Template - Model configuration examples
- API Reference: See docstrings in src/mata/
Graph System (v1.6):
- Graph System Guide - Architecture overview and design principles
- Graph API Reference - Full API documentation for all nodes, artifacts, and schedulers
- Graph Cookbook - Recipes and patterns for common workflows
- Graph Examples - 5 runnable examples (pipelines, parallel, VLM, presets, and more)
Object Tracking (v1.8):
- Basic Tracking - Video file tracking with save output
- Persistent Tracking - Frame-by-frame with persist=True
- Stream Tracking - Memory-efficient generator mode
- Graph Tracking - Graph pipeline integration
Examples:
- Basic Detection
- Traditional Segmentation
- SAM Zero-Shot Segmentation
- Simple Pipeline (Graph)
- Parallel Tasks (Graph)
- VLM Workflows (Graph)
- Presets Demo (Graph)
Common Issues
Getting 0 detections?
- RT-DETR models often need lower thresholds (0.2-0.3)
- Try: result = mata.run("detect", "image.jpg", threshold=0.2)
- Check mata.list_models("detect") to verify available model aliases
Model loading slow?
- First run downloads models from HuggingFace (~100-500MB)
- Subsequent runs use cached models
Acknowledgments
Built on top of:
Models:
- RT-DETRv2: PekingU
- DINO: IDEA-Research
- Conditional DETR: Microsoft