Real-Time Spatio-Semantic Memory for spatial AI and robotics

RTSM — Real-Time Spatio-Semantic Memory

RTSM is a real-time spatial memory system that turns RGB-D streams into a persistent, queryable 3D object-centric world state.

Instead of treating perception as disposable frames, RTSM maintains stable object identities over time, enabling robots and embodied agents to answer questions like:

  • What objects exist in this space?
  • Where are they right now?
  • What changed, and when?

Watch Demo Video · Documentation


Why RTSM

Modern perception systems can detect objects, but they lack memory. SLAM systems build geometry, vision models detect semantics, and language models reason abstractly—but there is no shared layer that connects space, objects, and history.

RTSM fills this gap by acting as an explicit spatial memory layer:

  • SLAM provides geometry and poses
  • Vision models provide object masks and semantics
  • RTSM fuses them into a persistent world representation

This makes spatial state inspectable, queryable, and reusable across robots, agents, and applications.


What RTSM Does

  • Builds a live 3D map from RGB-D + pose streams
  • Assigns persistent IDs to objects across viewpoints and time
  • Stores spatial, semantic, and temporal metadata per object
  • Supports semantic + spatial queries (e.g. "red bin near dock 3")
  • Exposes a programmatic API and real-time 3D visualizer

RTSM is SLAM-agnostic and designed to sit above existing perception stacks.
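As a toy illustration of what a combined semantic + spatial query involves — this is not RTSM's actual API; the object records, 2-D stand-in embeddings, and function names here are made up — a query like "red bin near dock 3" amounts to a radius filter around an anchor position plus a cosine-similarity ranking against a text embedding:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def query(objects, text_emb, anchor_xyz, radius_m):
    """Rank objects by semantic similarity, keeping only those near the anchor."""
    hits = []
    for obj in objects:
        if math.dist(obj["xyz"], anchor_xyz) <= radius_m:
            hits.append((cosine(obj["emb"], text_emb), obj["id"]))
    return sorted(hits, reverse=True)

objects = [
    {"id": "bin_07",  "xyz": (1.0, 0.0, 2.0), "emb": (0.9, 0.1)},
    {"id": "cone_02", "xyz": (1.2, 0.0, 2.1), "emb": (0.1, 0.9)},
    {"id": "bin_03",  "xyz": (9.0, 0.0, 9.0), "emb": (0.9, 0.2)},  # outside radius
]
# Toy "red bin" text embedding; dock 3 anchored at (1, 0, 2), 2 m radius
print(query(objects, (1.0, 0.0), (1.0, 0.0, 2.0), 2.0)[0][1])  # → bin_07
```

In RTSM the embeddings are 512-D CLIP vectors and the spatial index is a proximity structure rather than a linear scan, but the query shape is the same.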


Who This Is For

  • Robotics and embodied AI researchers
  • Developers building agentic or world-model-based systems
  • Teams exploring persistent perception, spatial reasoning, or digital twins

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                 RTSM — Real-Time Spatio-Semantic Memory                  │
└──────────────────────────────────────────────────────────────────────────┘

  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │  Calabi Lens     │   │   D435i + SLAM   │   │  Recorded        │
  │  (ARKit iOS)     │   │   (RTABMap)      │   │  Session         │
  └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
           │ WebSocket            │ ZeroMQ               │ --replay
           ▼                      ▼                       ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  I/O Layer                                                               │
│                                                                          │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  WebSocket  │  │  ZMQ Bridge │  │   Replay     │  │  Recorder    │   │
│  │  Receiver   │  │  (sensors)  │  │  Receiver    │  │  (--record)  │   │
│  └──────┬──────┘  └──────┬──────┘  └──────┬───────┘  └──────────────┘   │
│         └────────────────┴────────────────┘                              │
│                          │                                               │
│                   ┌──────▼───────┐     ┌──────────────┐                  │
│                   │ IngestQueue  │────>│ FramePacket  │                  │
│                   │  (buffer)    │     │ (RGB,D,Pose) │                  │
│                   └──────────────┘     └──────┬───────┘                  │
│                                               │                          │
└───────────────────────────────────────────────┼──────────────────────────┘
                                                │
                          ┌─────────────────────▼───────────────────────┐
                          │              Ingest Gate                    │
                          │   (keyframe priority, sweep-based skip)     │
                          └─────────────────────┬───────────────────────┘
                                                │
┌───────────────────────────────────────────────▼──────────────────────────┐
│  Perception Pipeline                                                     │
│                                                                          │
│  ┌────────────────┐  ┌────────────────┐                                  │
│  │ Grounding DINO │  │     SAM2       │   Default: grounded_sam2         │
│  │ (detection +   │─>│ (box-prompted  │   GDINO detects → SAM2 segments  │
│  │  labels)       │  │  masks)        │   (Apache 2.0, no AGPL)          │
│  └────────────────┘  └───────┬────────┘                                  │
│                              │                                           │
│                              ▼                                           │
│  ┌───────────────┐     ┌──────────────┐     ┌──────────────┐             │
│  │ Mask Staging  │────>│ Top-K Select │────>│ CLIP Encode  │             │
│  │ (heuristics)  │     │  (priority)  │     │(224x224 crop)│             │
│  └───────────────┘     └──────────────┘     └──────┬───────┘             │
│                                                    │                     │
│                     ┌───────────────┐        ┌─────▼────────┐            │
│                     │ Vocab Classify│<───────│  Embeddings  │            │
│                     │ (label + conf)│        │  (512-D L2)  │            │
│                     └───────┬───────┘        └──────────────┘            │
│                             │                                            │
└─────────────────────────────┼────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Association                                                             │
│                                                                          │
│  ┌─────────────┐     ┌─────────────┐     ┌───────────────┐               │
│  │  Proximity  │────>│  Embedding  │────>│  Score Fusion │               │
│  │   Query     │     │  Cosine Sim │     │ (match/create)│               │
│  └─────────────┘     └─────────────┘     └───────────────┘               │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Working Memory                                                          │
│                                                                          │
│  ObjectState:                                                            │
│    - id, xyz_world (3D position)                                         │
│    - emb_mean, emb_gallery (CLIP embeddings)                             │
│    - view_bins (multi-view fusion)                                       │
│    - label_scores (EWMA label confidence)                                │
│    - stability, hits, confirmed                                          │
│    - image_crops (JPEG snapshots)                                        │
│                                                                          │
│  Proto -> Confirmed (hits >= 2, stability >= 0.55, views >= 1)           │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Long-Term Memory (FAISS / Milvus)                                       │
│                                                                          │
│  Semantic Search: query(text) -> CLIP -> top-k objects                   │
│                                                                          │
└───────────────┬──────────────────────────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  API & Visualization                                                     │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐           │
│  │    REST API     │  │    WebSocket    │  │     3D Demo     │           │
│  │    /objects     │  │  point clouds   │  │    (Three.js)   │           │
│  │    /search      │  │  objects_update │  │                 │           │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘           │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
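The Association and Working Memory stages above can be sketched in a few lines. This is a hypothetical reduction, not RTSM's implementation — the field names mirror the ObjectState shown in the diagram, but the EWMA weight, stability increment, and `observe` method are illustrative choices:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """Toy version of the per-object record held in working memory."""
    id: int
    xyz_world: tuple
    emb_mean: list
    label_scores: dict = field(default_factory=dict)
    hits: int = 0
    stability: float = 0.0
    views: int = 1
    confirmed: bool = False

    def observe(self, label: str, conf: float, alpha: float = 0.3):
        """Fold a new matched detection in: EWMA label confidence + counters."""
        prev = self.label_scores.get(label, 0.0)
        self.label_scores[label] = (1 - alpha) * prev + alpha * conf
        self.hits += 1
        self.stability = min(1.0, self.stability + 0.2)  # illustrative increment
        # Proto -> Confirmed gate from the diagram
        if self.hits >= 2 and self.stability >= 0.55 and self.views >= 1:
            self.confirmed = True

obj = ObjectState(id=1, xyz_world=(0.5, 0.0, 1.2), emb_mean=[0.9, 0.1])
for c in (0.8, 0.7, 0.9):
    obj.observe("mug", c)
print(obj.confirmed)  # → True (after 3 hits the promotion gate is satisfied)
```

The real pipeline matches detections to existing objects first (proximity query, then embedding cosine similarity, then score fusion) and only creates a new proto-object when no match clears the threshold.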

Quick Start

Prerequisites

  • Python 3.12+
  • pip (or uv)
  • CUDA-capable GPU (tested on RTX 3080, RTX 5090)
  • One of:
    • iPhone with Calabi Lens app (ARKit, no external SLAM needed)
    • RGB-D camera (Intel RealSense D435i) + SLAM system (RTAB-Map)
    • A recorded session (for replay — no hardware needed)

Tested on: WSL2 Ubuntu 22.04 with RTAB-Map, and macOS/Windows with Calabi Lens.

Installation

git clone https://github.com/calabi-inc/rtsm.git
cd rtsm

# Core only (API server, I/O transports — no GPU needed)
pip install .

# With GPU — permissive license (SAM2 + Grounding DINO, Apache 2.0)
pip install ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128

# With GPU — ultralytics backends (FastSAM + YOLOE, AGPL-3.0)
pip install ".[gpu-ultralytics]" --extra-index-url https://download.pytorch.org/whl/cu128

# Everything (GPU + visualization)
pip install ".[all]" --extra-index-url https://download.pytorch.org/whl/cu128

License note: rtsm[gpu] uses only Apache 2.0 / MIT dependencies. rtsm[gpu-ultralytics] adds the ultralytics package (AGPL-3.0) for FastSAM and YOLOE backends.

CUDA version: Use cu128 for most GPUs (RTX 3080–5090). For Blackwell-only features use cu130. See PyTorch install for other options.

Download Models

# Fetch default models (SAM2, Grounding DINO, CLIP)
python scripts/fetch_models.py

# Or fetch individually
python scripts/fetch_models.py --only sam2
python scripts/fetch_models.py --only gdino
python scripts/fetch_models.py --only clip

# Ultralytics models (only if you installed rtsm[gpu-ultralytics])
python scripts/fetch_models.py --only fastsam
python scripts/fetch_models.py --only yolo

Run

# Live — Calabi Lens (ARKit over WebSocket)
python -m rtsm

# Live — D435i + RTAB-Map (ZeroMQ)
# Set io.receiver: zeromq in config/rtsm.yaml first
python -m rtsm

# Replay a recorded session (no device needed)
python -m rtsm --replay recordings/session1

RTSM will start:

  • Perception pipeline — processing frames
  • REST API — http://localhost:8000
  • Visualization WebSocket — ws://localhost:8081

Record & Replay

Record a session for offline testing and reproducible iteration:

# Record only (no GPU needed, no pipeline — works with core-only install)
python -m rtsm --record recordings/my_session --record-only

# Record while running pipeline
python -m rtsm --record recordings/my_session

# Replay at original recording rate
python -m rtsm --replay recordings/my_session

Recordings are self-contained directories with raw WebSocket data. Replay feeds the exact same bytes through the full decode + pipeline path, preserving all time-dependent behavior (TTL caches, throttles).
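The byte-exact replay idea can be sketched as follows. This is a minimal stand-in, not RTSM's on-disk format — here each message is one JSON line of `{"t": timestamp, "raw": base64(bytes)}`, and the replayer sleeps between messages to preserve the original pacing:

```python
import base64
import json
import os
import tempfile
import time

def record(path, messages):
    """Append raw transport bytes with capture timestamps, one JSON line each."""
    with open(path, "w") as f:
        for raw in messages:
            line = {"t": time.monotonic(), "raw": base64.b64encode(raw).decode()}
            f.write(json.dumps(line) + "\n")

def replay(path, sink, speed=1.0):
    """Feed back the exact same bytes, sleeping to match original inter-frame gaps."""
    prev_t = None
    with open(path) as f:
        for line in f:
            msg = json.loads(line)
            if prev_t is not None:
                time.sleep(max(0.0, (msg["t"] - prev_t) / speed))
            prev_t = msg["t"]
            sink(base64.b64decode(msg["raw"]))

frames = [b"frame-0", b"frame-1"]
path = os.path.join(tempfile.gettempdir(), "demo_session.jsonl")
record(path, frames)
out = []
replay(path, out.append)
print(out == frames)  # → True (identical bytes come back out)
```

Because replay re-enters the pipeline at the raw-bytes level rather than at some decoded intermediate, everything downstream — decoding, TTL caches, throttles — behaves as it did live.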

A/B Segmentation Debug

Compare FastSAM vs YOLOE segmentation on a recorded session:

# Generate side-by-side overlays (cached — skips existing frames)
python scripts/debug_segmentation.py --recording recordings/session1

# Open the viewer
# → debug/session1/compare.html (arrow keys to navigate)

API Examples

# List all objects
curl http://localhost:8000/objects

# Semantic search
curl "http://localhost:8000/search/semantic?query=red%20mug&top_k=5"

# Get system stats
curl http://localhost:8000/stats/detailed

# Runtime analytics (latency, throughput, dual confirmation rates)
curl http://localhost:8000/stats/analytics
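The same endpoints can be hit from Python with the standard library; a sketch (the response schema is not shown here, so the helper just returns the parsed JSON, and it naturally requires a running RTSM server):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def semantic_search(query, top_k=5):
    """Python equivalent of the curl semantic-search call above."""
    url = f"{BASE}/search/semantic?" + urlencode({"query": query, "top_k": top_k})
    with urlopen(url) as resp:
        return json.load(resp)

# URL that would be requested (urlencode handles the space for us):
print(f"{BASE}/search/semantic?" + urlencode({"query": "red mug", "top_k": 5}))
# → http://localhost:8000/search/semantic?query=red+mug&top_k=5
```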

Configuration

See config/rtsm.yaml for full configuration options:

  • Camera intrinsics — focal length, resolution
  • I/O endpoints — ZeroMQ addresses for camera and SLAM
  • Pipeline tuning — mask filtering, association thresholds
  • Memory settings — object promotion, expiry, vector store

Segmentation Backends

RTSM supports multiple segmentation backends via segmentation.backend in config/rtsm.yaml:

Backend        License     Description                            Seg time*  Pipeline total*  Labels
grounded_sam2  Apache 2.0  Grounding DINO detect + SAM2 segment   222 ms     510 ms           Open-vocab (text-prompted)
sam2           Apache 2.0  SAM2 auto-mask (segment everything)    ~860 ms    ~1000 ms         None (class-agnostic)
fastsam        AGPL-3.0    FastSAM (segment everything)           ~50 ms     ~200 ms          None (class-agnostic)
yoloe          AGPL-3.0    YOLOE detection + segmentation         ~60 ms     ~210 ms          Open-vocab / 1200+ built-in
dual           AGPL-3.0    FastSAM + YOLOE with IoU merge         116 ms     210 ms           Dual-confirmed labels

* Mean on RTX 5090, 640x480 input. dual and grounded_sam2 measured via replay benchmark; others estimated.

Default: grounded_sam2 — permissive license, open-vocabulary, no AGPL dependency.

To switch backends, edit config/rtsm.yaml:

segmentation:
  backend: grounded_sam2    # or: sam2, fastsam, yoloe, dual

fastsam, yoloe, and dual require pip install "rtsm[gpu-ultralytics]".


Project Structure

rtsm/
├── core/           # Pipeline, association, ingest gate, data models
├── models/         # SAM2, Grounding DINO, FastSAM, YOLOE, CLIP adapters
├── stores/         # Working memory, proximity index, sweep cache, vector stores
├── io/             # WebSocket + ZeroMQ receivers, recorder, replayer
├── analytics/      # Runtime analytics (latency, segmentation, congestion buffers)
├── api/            # REST API server (FastAPI)
├── visualization/  # WebSocket server, TSDF fusion, 3D demo
└── utils/          # Mask staging, transforms, helpers
config/
├── rtsm.yaml       # Main configuration (models, thresholds, I/O)
└── clip/vocab.yaml  # CLIP vocabulary
scripts/
├── fetch_models.py          # Download all models (SAM2, GDINO, CLIP, FastSAM, YOLOE)
├── debug_segmentation.py    # A/B segmentation viewer (FastSAM vs YOLOE)
└── benchmark_backends.py    # Backend comparison benchmark (generates reports/)
reports/                     # Benchmark results and comparison reports
recordings/                  # Recorded sessions for replay testing (git-lfs)
tests/                       # Unit + integration tests

Performance

Benchmarked on RTX 5090 (32 GB), iPhone ARKit recording (162 frames, 458s indoor scene), 640x480 RGB input. Both backends run the same replay session through the identical 10-stage pipeline.

Backend Comparison

Metric             dual (FastSAM + YOLOE)  grounded_sam2 (GDINO + SAM2)
Mean latency       210 ms                  510 ms
P50 latency        170 ms                  502 ms
P95 latency        509 ms                  721 ms
Masks/frame        28.8                    13.4
Objects confirmed  60                      35
Confirmation rate  52.2%                   45.5%
License            AGPL-3.0                Apache-2.0

Per-stage breakdown, dual confirmation analysis, and full methodology: Benchmarks | reports/backend_comparison.md


Roadmap

  • Dual-confirmation segmentation (FastSAM + YOLOE)
  • AGPL-clean default (SAM2 + Grounding DINO, Apache 2.0)
  • YOLOE prompt-free (1200+ LVIS categories)
  • WebSocket receiver for Calabi Lens (ARKit iOS)
  • Record/replay system for offline testing
  • A/B segmentation debug tooling
  • Real-time analytics dashboard (Looker-style, per-stage latency, dual confirmation rates, congestion detection)
  • Evaluation framework (ArUco ground truth, precision/recall metrics)
  • Agent architecture (MCP interface)
  • More communication protocols (ROS 2, gRPC)
  • LLM integration for high-level queries (agentic mode)
  • Dockerization

Acknowledgments

RTSM builds on excellent open-source work:

  • SAM 2 — Ravi et al., SAM 2: Segment Anything in Images and Videos, 2024. arXiv:2408.00714 · GitHub

  • Grounding DINO — Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023. arXiv:2303.05499 · GitHub

  • FastSAM — Zhao et al., Fast Segment Anything, 2023. arXiv:2306.12156 · GitHub

  • YOLOE — THU-MIG, YOLOE: Real-Time Seeing Anything, ICCV 2025. GitHub · Ultralytics

  • CLIP — Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021. arXiv:2103.00020 · GitHub

  • RTAB-Map — Labbé & Michaud, RTAB-Map as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics, 2019. Paper · GitHub


Cite This

If you use RTSM in your research, please cite:

@software{chang2025rtsm,
  author       = {Chang, Chi Feng},
  title        = {{RTSM}: Real-Time Spatio-Semantic Memory},
  year         = {2025},
  url          = {https://github.com/calabi-inc/rtsm},
  note         = {Object-centric queryable memory for spatial AI and robotics}
}

Community

This project is under active development. If you have questions or run into issues, feel free to open an issue — I'm happy to help.

If you find RTSM useful, please consider giving it a star! I'm also looking for design partners — reach out to Calabi if you're interested in collaborating.


License

Apache-2.0


Author

Built by Chi Feng Chang
