Real-Time Spatio-Semantic Memory for spatial AI and robotics

RTSM — Real-Time Spatio-Semantic Memory

RTSM is a real-time spatial memory system that turns RGB-D streams into a persistent, queryable 3D object-centric world state.

Instead of treating perception as disposable frames, RTSM maintains stable object identities over time, enabling robots and embodied agents to answer questions like:

  • What objects exist in this space?
  • Where are they right now?
  • What changed, and when?

Watch Demo Video · Documentation


Why RTSM

Modern perception systems can detect objects, but they lack memory. SLAM systems build geometry, vision models detect semantics, and language models reason abstractly—but there is no shared layer that connects space, objects, and history.

RTSM fills this gap by acting as an explicit spatial memory layer:

  • SLAM provides geometry and poses
  • Vision models provide object masks and semantics
  • RTSM fuses them into a persistent world representation

This makes spatial state inspectable, queryable, and reusable across robots, agents, and applications.


What RTSM Does

  • Builds a live 3D map from RGB-D + pose streams
  • Assigns persistent IDs to objects across viewpoints and time
  • Stores spatial, semantic, and temporal metadata per object
  • Supports semantic + spatial queries (e.g. "red bin near dock 3")
  • Exposes a programmatic API and real-time 3D visualizer

RTSM is SLAM-agnostic and designed to sit above existing perception stacks.
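As a toy illustration of what a combined semantic + spatial query involves — this is not RTSM's actual API; the object records, 2-D stand-in embeddings, and function names here are made up — a query like "red bin near dock 3" amounts to a radius filter around an anchor position plus a cosine-similarity ranking against a text embedding:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def query(objects, text_emb, anchor_xyz, radius_m):
    """Rank objects by semantic similarity, keeping only those near the anchor."""
    hits = []
    for obj in objects:
        if math.dist(obj["xyz"], anchor_xyz) <= radius_m:
            hits.append((cosine(obj["emb"], text_emb), obj["id"]))
    return sorted(hits, reverse=True)

objects = [
    {"id": "bin_07",  "xyz": (1.0, 0.0, 2.0), "emb": (0.9, 0.1)},
    {"id": "cone_02", "xyz": (1.2, 0.0, 2.1), "emb": (0.1, 0.9)},
    {"id": "bin_03",  "xyz": (9.0, 0.0, 9.0), "emb": (0.9, 0.2)},  # outside radius
]
# Toy "red bin" text embedding; dock 3 anchored at (1, 0, 2), 2 m radius
print(query(objects, (1.0, 0.0), (1.0, 0.0, 2.0), 2.0)[0][1])  # → bin_07
```

In RTSM the embeddings are 512-D CLIP vectors and the spatial index is a proximity structure rather than a linear scan, but the query shape is the same.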


Who This Is For

  • Robotics and embodied AI researchers
  • Developers building agentic or world-model-based systems
  • Teams exploring persistent perception, spatial reasoning, or digital twins

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                 RTSM — Real-Time Spatio-Semantic Memory                  │
└──────────────────────────────────────────────────────────────────────────┘

  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │  Calabi Lens     │   │   D435i + SLAM   │   │  Recorded        │
  │  (ARKit iOS)     │   │   (RTABMap)      │   │  Session         │
  └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
           │ WebSocket            │ ZeroMQ               │ --replay
           ▼                      ▼                       ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  I/O Layer                                                               │
│                                                                          │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  WebSocket  │  │  ZMQ Bridge │  │   Replay     │  │  Recorder    │   │
│  │  Receiver   │  │  (sensors)  │  │  Receiver    │  │  (--record)  │   │
│  └──────┬──────┘  └──────┬──────┘  └──────┬───────┘  └──────────────┘   │
│         └────────────────┴────────────────┘                              │
│                          │                                               │
│                   ┌──────▼───────┐     ┌──────────────┐                  │
│                   │ IngestQueue  │────>│ FramePacket  │                  │
│                   │  (buffer)    │     │ (RGB,D,Pose) │                  │
│                   └──────────────┘     └──────┬───────┘                  │
│                                               │                          │
└───────────────────────────────────────────────┼──────────────────────────┘
                                                │
                          ┌─────────────────────▼───────────────────────┐
                          │              Ingest Gate                    │
                          │   (keyframe priority, sweep-based skip)     │
                          └─────────────────────┬───────────────────────┘
                                                │
┌───────────────────────────────────────────────▼──────────────────────────┐
│  Perception Pipeline                                                     │
│                                                                          │
│  ┌────────────────┐  ┌────────────────┐                                  │
│  │ Grounding DINO │  │     SAM2       │   Default: grounded_sam2         │
│  │ (detection +   │─>│ (box-prompted  │   GDINO detects → SAM2 segments  │
│  │  labels)       │  │  masks)        │   (Apache 2.0, no AGPL)          │
│  └────────────────┘  └───────┬────────┘                                  │
│                              │                                           │
│                              ▼                                           │
│  ┌───────────────┐     ┌──────────────┐     ┌──────────────┐             │
│  │ Mask Staging  │────>│ Top-K Select │────>│ CLIP Encode  │             │
│  │ (heuristics)  │     │  (priority)  │     │(224x224 crop)│             │
│  └───────────────┘     └──────────────┘     └──────┬───────┘             │
│                                                    │                     │
│                     ┌───────────────┐        ┌─────▼────────┐            │
│                     │ Vocab Classify│<───────│  Embeddings  │            │
│                     │ (label + conf)│        │  (512-D L2)  │            │
│                     └───────┬───────┘        └──────────────┘            │
│                             │                                            │
└─────────────────────────────┼────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Association                                                             │
│                                                                          │
│  ┌─────────────┐     ┌─────────────┐     ┌───────────────┐               │
│  │  Proximity  │────>│  Embedding  │────>│  Score Fusion │               │
│  │   Query     │     │  Cosine Sim │     │ (match/create)│               │
│  └─────────────┘     └─────────────┘     └───────────────┘               │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Working Memory                                                          │
│                                                                          │
│  ObjectState:                                                            │
│    - id, xyz_world (3D position)                                         │
│    - emb_mean, emb_gallery (CLIP embeddings)                             │
│    - view_bins (multi-view fusion)                                       │
│    - label_scores (EWMA label confidence)                                │
│    - stability, hits, confirmed                                          │
│    - image_crops (JPEG snapshots)                                        │
│                                                                          │
│  Proto -> Confirmed (hits >= 2, stability >= 0.55, views >= 1)           │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Long-Term Memory (FAISS / Milvus)                                       │
│                                                                          │
│  Semantic Search: query(text) -> CLIP -> top-k objects                   │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  API & Visualization                                                     │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │    REST API     │  │    WebSocket    │  │     3D Demo     │           │
│  │    /objects     │  │  point clouds   │  │    (Three.js)   │           │
│  │    /search      │  │  objects_update │  │                 │           │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘           │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
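The Association and Working Memory stages above can be sketched in a few lines. This is a hypothetical reduction, not RTSM's implementation — the field names mirror the ObjectState shown in the diagram, but the EWMA weight, stability increment, and `observe` method are illustrative choices:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """Toy version of the per-object record held in working memory."""
    id: int
    xyz_world: tuple
    emb_mean: list
    label_scores: dict = field(default_factory=dict)
    hits: int = 0
    stability: float = 0.0
    views: int = 1
    confirmed: bool = False

    def observe(self, label: str, conf: float, alpha: float = 0.3):
        """Fold a new matched detection in: EWMA label confidence + counters."""
        prev = self.label_scores.get(label, 0.0)
        self.label_scores[label] = (1 - alpha) * prev + alpha * conf
        self.hits += 1
        self.stability = min(1.0, self.stability + 0.2)  # illustrative increment
        # Proto -> Confirmed gate from the diagram
        if self.hits >= 2 and self.stability >= 0.55 and self.views >= 1:
            self.confirmed = True

obj = ObjectState(id=1, xyz_world=(0.5, 0.0, 1.2), emb_mean=[0.9, 0.1])
for c in (0.8, 0.7, 0.9):
    obj.observe("mug", c)
print(obj.confirmed)  # → True (after 3 hits the promotion gate is satisfied)
```

The real pipeline matches detections to existing objects first (proximity query, then embedding cosine similarity, then score fusion) and only creates a new proto-object when no match clears the threshold.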

Quick Start

Prerequisites

  • Python 3.12+
  • pip (or uv)
  • CUDA-capable GPU (tested on RTX 3080, RTX 5090)
  • One of:
    • iPhone with Calabi Lens app (ARKit, no external SLAM needed)
    • RGB-D camera (Intel RealSense D435i) + SLAM system (RTAB-Map)
    • A recorded session (for replay — no hardware needed)

Tested on: WSL2 Ubuntu 22.04 with RTAB-Map, and macOS/Windows with Calabi Lens.

Installation

git clone https://github.com/calabi-inc/rtsm.git
cd rtsm

# Core only (API server, I/O transports — no GPU needed)
pip install .

# With GPU — permissive license (SAM2 + Grounding DINO, Apache 2.0)
pip install ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128

# With GPU — ultralytics backends (FastSAM + YOLOE, AGPL-3.0)
pip install ".[gpu-ultralytics]" --extra-index-url https://download.pytorch.org/whl/cu128

# Everything (GPU + visualization)
pip install ".[all]" --extra-index-url https://download.pytorch.org/whl/cu128

License note: rtsm[gpu] uses only Apache 2.0 / MIT dependencies. rtsm[gpu-ultralytics] adds the ultralytics package (AGPL-3.0) for FastSAM and YOLOE backends.

CUDA version: Use cu128 for most GPUs (RTX 3080–5090). For Blackwell-only features use cu130. See PyTorch install for other options.

Download Models

# Fetch default models (SAM2, Grounding DINO, CLIP)
python scripts/fetch_models.py

# Or fetch individually
python scripts/fetch_models.py --only sam2
python scripts/fetch_models.py --only gdino
python scripts/fetch_models.py --only clip

# Ultralytics models (only if you installed rtsm[gpu-ultralytics])
python scripts/fetch_models.py --only fastsam
python scripts/fetch_models.py --only yolo

Run

# Live — Calabi Lens (ARKit over WebSocket)
python -m rtsm

# Live — D435i + RTAB-Map (ZeroMQ)
# Set io.receiver: zeromq in config/rtsm.yaml first
python -m rtsm

# Replay a recorded session (no device needed)
python -m rtsm --replay recordings/session1

RTSM will start:

  • Perception pipeline — processing frames
  • REST API — http://localhost:8000
  • Visualization WebSocket — ws://localhost:8081

Record & Replay

Record a session for offline testing and reproducible iteration:

# Record only (no GPU needed, no pipeline — works with core-only install)
python -m rtsm --record recordings/my_session --record-only

# Record while running pipeline
python -m rtsm --record recordings/my_session

# Replay at original recording rate
python -m rtsm --replay recordings/my_session

Recordings are self-contained directories with raw WebSocket data. Replay feeds the exact same bytes through the full decode + pipeline path, preserving all time-dependent behavior (TTL caches, throttles).
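The byte-exact replay idea can be sketched as follows. This is a minimal stand-in, not RTSM's on-disk format — here each message is one JSON line of `{"t": timestamp, "raw": base64(bytes)}`, and the replayer sleeps between messages to preserve the original pacing:

```python
import base64
import json
import os
import tempfile
import time

def record(path, messages):
    """Append raw transport bytes with capture timestamps, one JSON line each."""
    with open(path, "w") as f:
        for raw in messages:
            line = {"t": time.monotonic(), "raw": base64.b64encode(raw).decode()}
            f.write(json.dumps(line) + "\n")

def replay(path, sink, speed=1.0):
    """Feed back the exact same bytes, sleeping to match original inter-frame gaps."""
    prev_t = None
    with open(path) as f:
        for line in f:
            msg = json.loads(line)
            if prev_t is not None:
                time.sleep(max(0.0, (msg["t"] - prev_t) / speed))
            prev_t = msg["t"]
            sink(base64.b64decode(msg["raw"]))

frames = [b"frame-0", b"frame-1"]
path = os.path.join(tempfile.gettempdir(), "demo_session.jsonl")
record(path, frames)
out = []
replay(path, out.append)
print(out == frames)  # → True (identical bytes come back out)
```

Because replay re-enters the pipeline at the raw-bytes level rather than at some decoded intermediate, everything downstream — decoding, TTL caches, throttles — behaves as it did live.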

A/B Segmentation Debug

Compare FastSAM vs YOLOE segmentation on a recorded session:

# Generate side-by-side overlays (cached — skips existing frames)
python scripts/debug_segmentation.py --recording recordings/session1

# Open the viewer
# → debug/session1/compare.html (arrow keys to navigate)

API Examples

# List all objects
curl http://localhost:8000/objects

# Semantic search
curl "http://localhost:8000/search/semantic?query=red%20mug&top_k=5"

# Get system stats
curl http://localhost:8000/stats/detailed

# Runtime analytics (latency, throughput, dual confirmation rates)
curl http://localhost:8000/stats/analytics
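The same endpoints can be hit from Python with the standard library; a sketch (the response schema is not shown here, so the helper just returns the parsed JSON, and it naturally requires a running RTSM server):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def semantic_search(query, top_k=5):
    """Python equivalent of the curl semantic-search call above."""
    url = f"{BASE}/search/semantic?" + urlencode({"query": query, "top_k": top_k})
    with urlopen(url) as resp:
        return json.load(resp)

# URL that would be requested (urlencode handles the space for us):
print(f"{BASE}/search/semantic?" + urlencode({"query": "red mug", "top_k": 5}))
# → http://localhost:8000/search/semantic?query=red+mug&top_k=5
```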

Configuration

See config/rtsm.yaml for full configuration options:

  • Camera intrinsics — focal length, resolution
  • I/O endpoints — ZeroMQ addresses for camera and SLAM
  • Pipeline tuning — mask filtering, association thresholds
  • Memory settings — object promotion, expiry, vector store

Segmentation Backends

RTSM supports multiple segmentation backends via segmentation.backend in config/rtsm.yaml:

Backend        License     Description                            Seg time*  Pipeline total*  Labels
grounded_sam2  Apache 2.0  Grounding DINO detect + SAM2 segment   222 ms     510 ms           Open-vocab (text-prompted)
sam2           Apache 2.0  SAM2 auto-mask (segment everything)    ~860 ms    ~1000 ms         None (class-agnostic)
fastsam        AGPL-3.0    FastSAM (segment everything)           ~50 ms     ~200 ms          None (class-agnostic)
yoloe          AGPL-3.0    YOLOE detection + segmentation         ~60 ms     ~210 ms          Open-vocab / 1200+ built-in
dual           AGPL-3.0    FastSAM + YOLOE with IoU merge         116 ms     210 ms           Dual-confirmed labels

* Mean on RTX 5090, 640x480 input. dual and grounded_sam2 measured via replay benchmark; others estimated.

Default: grounded_sam2 — permissive license, open-vocabulary, no AGPL dependency.

To switch backends, edit config/rtsm.yaml:

segmentation:
  backend: grounded_sam2    # or: sam2, fastsam, yoloe, dual

fastsam, yoloe, and dual require pip install "rtsm[gpu-ultralytics]".


Project Structure

rtsm/
├── core/           # Pipeline, association, ingest gate, data models
├── models/         # SAM2, Grounding DINO, FastSAM, YOLOE, CLIP adapters
├── stores/         # Working memory, proximity index, sweep cache, vector stores
├── io/             # WebSocket + ZeroMQ receivers, recorder, replayer
├── analytics/      # Runtime analytics (latency, segmentation, congestion buffers)
├── api/            # REST API server (FastAPI)
├── visualization/  # WebSocket server, TSDF fusion, 3D demo
└── utils/          # Mask staging, transforms, helpers
config/
├── rtsm.yaml       # Main configuration (models, thresholds, I/O)
└── clip/vocab.yaml  # CLIP vocabulary
scripts/
├── fetch_models.py          # Download all models (SAM2, GDINO, CLIP, FastSAM, YOLOE)
├── debug_segmentation.py    # A/B segmentation viewer (FastSAM vs YOLOE)
└── benchmark_backends.py    # Backend comparison benchmark (generates reports/)
reports/                     # Benchmark results and comparison reports
recordings/                  # Recorded sessions for replay testing (git-lfs)
tests/                       # Unit + integration tests

Performance

Benchmarked on RTX 5090 (32 GB), iPhone ARKit recording (162 frames, 458s indoor scene), 640x480 RGB input. Both backends run the same replay session through the identical 10-stage pipeline.

Backend Comparison

Metric             dual (FastSAM + YOLOE)  grounded_sam2 (GDINO + SAM2)
Mean latency       210 ms                  510 ms
P50 latency        170 ms                  502 ms
P95 latency        509 ms                  721 ms
Masks/frame        28.8                    13.4
Objects confirmed  60                      35
Confirmation rate  52.2%                   45.5%
License            AGPL-3.0                Apache-2.0

Per-stage breakdown, dual confirmation analysis, and full methodology: Benchmarks | reports/backend_comparison.md


Roadmap

  • Dual-confirmation segmentation (FastSAM + YOLOE)
  • AGPL-clean default (SAM2 + Grounding DINO, Apache 2.0)
  • YOLOE prompt-free (1200+ LVIS categories)
  • WebSocket receiver for Calabi Lens (ARKit iOS)
  • Record/replay system for offline testing
  • A/B segmentation debug tooling
  • Real-time analytics dashboard (Looker-style, per-stage latency, dual confirmation rates, congestion detection)
  • Evaluation framework (ArUco ground truth, precision/recall metrics)
  • Agent architecture (MCP interface)
  • More communication protocols (ROS 2, gRPC)
  • LLM integration for high-level queries (agentic mode)
  • Dockerization

Acknowledgments

RTSM builds on excellent open-source work:

  • SAM 2 — Ravi et al., SAM 2: Segment Anything in Images and Videos, 2024. arXiv:2408.00714 · GitHub

  • Grounding DINO — Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023. arXiv:2303.05499 · GitHub

  • FastSAM — Zhao et al., Fast Segment Anything, 2023. arXiv:2306.12156 · GitHub

  • YOLOE — THU-MIG, YOLOE: Real-Time Seeing Anything, ICCV 2025. GitHub · Ultralytics

  • CLIP — Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021. arXiv:2103.00020 · GitHub

  • RTAB-Map — Labbé & Michaud, RTAB-Map as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics, 2019. Paper · GitHub


Cite This

If you use RTSM in your research, please cite:

@software{chang2025rtsm,
  author       = {Chang, Chi Feng},
  title        = {{RTSM}: Real-Time Spatio-Semantic Memory},
  year         = {2025},
  url          = {https://github.com/calabi-inc/rtsm},
  note         = {Object-centric queryable memory for spatial AI and robotics}
}

Community

This project is under active development. If you have questions or run into issues, feel free to open an issue — I'm happy to help.

If you find RTSM useful, please consider giving it a star! I'm also looking for design partners — reach out to Calabi if you're interested in collaborating.


License

Apache-2.0


Author

Built by Chi Feng Chang
