Visual scene understanding pipeline with multi-model detection, segmentation, and scene graph generation
Graph of Mark (GoM)
Graph of Mark (GoM) is a visual prompting framework that transforms images into structured semantic graphs for enhanced visual scene understanding. The system integrates state-of-the-art object detection, instance segmentation, depth estimation, and relationship extraction models to construct comprehensive scene graphs that can be used as visual prompts for Multimodal Language Models (MLMs).
Example output showing detected objects with segmentation masks and spatial relationships.
Publication
This work has been accepted at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026). The paper and supplementary materials are available in the paper/ directory.
If you use Graph of Mark in your research, please cite:
@inproceedings{gom2026aaai,
  title     = {Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting},
  author    = {Giacomo Frisoni and Lorenzo Molfetta and Mattia Buzzoni and Gianluca Moro},
  booktitle = {AAAI-26, Sponsored by the Association for the Advancement of Artificial Intelligence},
  publisher = {{AAAI} Press},
  year      = {2026},
}
Visit our research group website at: https://disi-unibo-nlp.github.io
Installation
From PyPI
pip install graph-of-mark
With optional dependencies:
# Install with all features
pip install "graph-of-mark[all]"
From Source
git clone https://github.com/disi-unibo-nlp/graph-of-marks.git
cd graph-of-marks
pip install -e ".[all]"
Quick Start
Check out examples/demo_gom.ipynb for detailed examples of how to use GoM.
Python API
from gom import GoM, ProcessingConfig
# Initialize the pipeline
pipeline = GoM(device="cuda") # or "mps" for Apple Silicon, "cpu" for CPU
# Process an image with a question
config = ProcessingConfig(
question="What objects are in the room?",
style="gom_text_labeled",
)
result = pipeline.process("scene.jpg", config=config, save=False)
# Access results
print(f"Detected {len(result['boxes'])} objects")
print(f"Found {len(result['relationships'])} relationships")
# Display the output image
result["output_image"].show() # PIL Image
Visual Prompting Styles
The library implements all visual prompting configurations presented in the paper:
from gom import GoM, ProcessingConfig, GOM_STYLE_PRESETS
pipeline = GoM(device="cuda")
# Use predefined style presets via ProcessingConfig
config = ProcessingConfig(
question="Where is the bowl?",
style="gom_text_labeled", # Recommended for VQA tasks
apply_question_filter=True, # Filter objects by question relevance
)
result = pipeline.process("scene.jpg", config=config, save=False)
# Available styles:
# - "som_text": Set-of-Mark with textual IDs (baseline, no relations)
# - "som_numeric": Set-of-Mark with numeric IDs (baseline, no relations)
# - "gom_text": GoM with textual IDs and relation arrows
# - "gom_numeric": GoM with numeric IDs and relation arrows
# - "gom_text_labeled": GoM with textual IDs and labeled relations
# - "gom_numeric_labeled": GoM with numeric IDs and labeled relations
# Access scene graph representations for VLM prompting
print(result["scene_graph_text"]) # Triple format for LLM prompts
print(result["scene_graph_prompt"]) # Compact inline format
Manual configuration is also supported:
config = ProcessingConfig(
question="What is near the table?",
label_mode="numeric", # "original", "numeric", or "alphabetic"
display_relationships=True,
display_relation_labels=True,
aggressive_pruning=True, # Keep only question-relevant objects
)
result = pipeline.process("scene.jpg", config=config, save=False)
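To compare presets side by side, the same image can be rendered once per style. The loop below uses only the six style names documented above; saving each rendering under the style name is an illustrative choice.
# Render the same scene once per documented style preset
styles = [
    "som_text", "som_numeric",
    "gom_text", "gom_numeric",
    "gom_text_labeled", "gom_numeric_labeled",
]
for style in styles:
    config = ProcessingConfig(question="What is near the table?", style=style)
    result = pipeline.process("scene.jpg", config=config, save=False)
    result["output_image"].save(f"scene_{style}.png")  # one file per style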
Command-Line Interface
# Image preprocessing
gom-preprocess --input_file data.json --image_dir images/ --output_folder output/
# Visual Question Answering
gom-vqa --input_file vqa_data.json --model_name llava-hf/llava-1.5-7b-hf
Pipeline Overview
The GoM pipeline processes images through the following stages:
| Stage | Description | Models |
|---|---|---|
| Detection | Object localization | YOLOv8, OWL-ViT, GroundingDINO, Detectron2 |
| Fusion | Prediction aggregation | Weighted Box Fusion (WBF), NMS |
| Segmentation | Instance mask generation | SAM, SAM2, SAM-HQ, FastSAM |
| Depth Estimation | 3D scene understanding | Depth Anything V2 |
| Relationship Extraction | Spatial/semantic relations | CLIP-based, physics-based |
| Graph Construction | Scene graph generation | NetworkX |
Pipeline stages: object detection, instance segmentation, depth estimation.
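For intuition about the fusion stage: boxes proposed by different detectors that overlap above an IoU threshold are clustered, and each cluster collapses into a single box whose coordinates are the score-weighted average of its members. The sketch below is a generic, simplified Weighted Box Fusion, not the library's internal implementation.
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_box_fusion(boxes, scores, iou_thr=0.55):
    # Greedy clustering: each box (highest score first) joins the first
    # cluster whose running fused box it overlaps; every cluster then
    # becomes one score-weighted average box.
    clusters = []  # list of ([member boxes], [member scores])
    for box, score in sorted(zip(boxes, scores), key=lambda p: -p[1]):
        for members, weights in clusters:
            fused = np.average(members, axis=0, weights=weights)
            if iou(box, fused) >= iou_thr:
                members.append(np.asarray(box, dtype=float))
                weights.append(score)
                break
        else:
            clusters.append(([np.asarray(box, dtype=float)], [score]))
    fused_boxes = [np.average(m, axis=0, weights=w) for m, w in clusters]
    fused_scores = [float(np.mean(w)) for _, w in clusters]
    return fused_boxes, fused_scores

# Two near-duplicate detections merge; the distant box stays separate
boxes = [[10, 10, 50, 50], [12, 11, 52, 49], [200, 80, 260, 140]]
print(weighted_box_fusion(boxes, [0.9, 0.8, 0.7]))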
Return Dictionary
The process() method returns:
result = {
"boxes": [[x1, y1, x2, y2], ...], # Bounding boxes
"labels": ["person", "chair", ...], # Object labels
"scores": [0.95, 0.87, ...], # Confidence scores
"masks": [np.ndarray, ...], # Segmentation masks
"depth": np.ndarray, # Depth map
"relationships": [...], # Extracted relations
"scene_graph": nx.DiGraph, # NetworkX graph
"scene_graph_text": "...", # Triple format for prompts
"scene_graph_prompt": "...", # Compact format
"output_image": PIL.Image.Image, # Rendered visualization as PIL Image
"processing_time": 12.5, # Processing time (seconds)
}
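Since result["scene_graph"] is a standard NetworkX DiGraph, it can be explored with the usual NetworkX API. The snippet below continues from the examples above and prints whatever node and edge attributes the pipeline attached, since the exact attribute names are not fixed here.
# Inspect the scene graph returned by process()
graph = result["scene_graph"]
print(f"{graph.number_of_nodes()} objects, {graph.number_of_edges()} relations")

# Walk the relations; attrs is the attribute dict stored on each edge
for subj, obj, attrs in graph.edges(data=True):
    print(subj, "->", obj, attrs)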
Configuration
Visual Prompting Styles (Paper Table 2)
| Style Preset | Label Mode | Relations | Relation Labels | Recommended Use |
|---|---|---|---|---|
| som_text | Textual | No | No | Set-of-Mark baseline |
| som_numeric | Numeric | No | No | Set-of-Mark baseline |
| gom_text | Textual | Yes | No | GoM with arrows |
| gom_numeric | Numeric | Yes | No | GoM with arrows |
| gom_text_labeled | Textual | Yes | Yes | VQA tasks |
| gom_numeric_labeled | Numeric | Yes | Yes | RefCOCO tasks |
Pipeline Parameters
| Parameter | Description | Default |
|---|---|---|
| detectors_to_use | Detection models to employ | ("yolov8",) |
| sam_version | Segmentation model version | "hq" |
| wbf_iou_threshold | IoU threshold for WBF fusion | 0.55 |
| label_mode | Label format ("original", "numeric", or "alphabetic") | "original" |
| display_labels | Render object labels | True |
| display_relationships | Render relationship arrows | True |
| display_relation_labels | Render labels on arrows | True |
| show_segmentation | Render segmentation masks | True |
| output_format | Output image format | "png" |
Complete configuration options are documented in src/gom/config.py.
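For illustration, here is one way these parameters might be combined. The display_relationships and display_relation_labels fields appear on ProcessingConfig in the manual configuration example above; the placement of the remaining parameters (constructor versus config) is an assumption in this sketch, so verify it against src/gom/config.py.
from gom import GoM, ProcessingConfig

# Assumed split: model choices on the pipeline, rendering on the config
# (check src/gom/config.py for the authoritative placement)
pipeline = GoM(
    detectors_to_use=("yolov8",),  # default from the table above
    sam_version="hq",
    wbf_iou_threshold=0.55,
    device="cuda",
)
config = ProcessingConfig(
    question="What is near the table?",
    display_labels=True,
    display_relationships=True,
    show_segmentation=True,
)
result = pipeline.process("scene.jpg", config=config, save=False)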
Custom Model Integration
GoM supports integration of custom detection, segmentation, and depth models:
from gom import GoM, ProcessingConfig
import numpy as np
def custom_detector(image):
# Custom detection logic
# Returns: boxes, labels, scores
boxes = [[100, 100, 200, 200]]
labels = ["person"]
scores = [0.95]
return boxes, labels, scores
def custom_segmenter(image, boxes):
# Custom segmentation logic
# Returns: list of boolean masks (H, W)
h, w = image.size[1], image.size[0]
masks = [np.ones((h, w), dtype=bool) for _ in boxes]
return masks
def custom_depth(image):
# Custom depth estimation
# Returns: depth map (H, W) normalized to [0, 1]
h, w = image.size[1], image.size[0]
return np.zeros((h, w), dtype=np.float32)
# Create GoM with custom functions
pipeline = GoM(
detect_fn=custom_detector,
segment_fn=custom_segmenter,
depth_fn=custom_depth,
device="cuda"
)
config = ProcessingConfig(
question="What objects are visible?",
style="gom_text_labeled",
)
result = pipeline.process("scene.jpg", config=config, save=False)
Examples
The examples/ directory contains:
- quickstart.py: Basic usage and installation verification
- demo.ipynb: Comprehensive Jupyter notebook demonstrating all features
Docker
# Build the container
docker build -f build/Dockerfile -t gom:latest .
# Run with GPU support
docker run --rm --gpus all -v $(pwd):/workdir gom:latest \
gom-preprocess --input_file data.json
Repository Structure
graph-of-marks/
├── src/gom/ # Main package
│ ├── api.py # High-level API (GoM class)
│ ├── config.py # Configuration management
│ ├── cli/ # Command-line interface
│ ├── detectors/ # Object detection models
│ ├── segmentation/ # Segmentation models
│ ├── fusion/ # Detection fusion strategies
│ ├── relations/ # Relationship extraction
│ ├── graph/ # Scene graph construction
│ ├── viz/ # Visualization utilities
│ ├── vqa/ # VQA inference
│ └── utils/ # Utility functions
├── examples/ # Usage examples
├── scripts/ # Inference scripts
├── external_libs/ # External dependencies (SAM2)
├── paper/ # AAAI 2026 paper
├── pyproject.toml # Package configuration
└── Makefile # Build commands
License
This project is licensed under the MIT License. See LICENSE for details.