RTSM — Real-Time Spatio-Semantic Memory
RTSM is a real-time spatial memory system that turns RGB-D streams into a persistent, queryable 3D object-centric world state.
Instead of treating perception as disposable frames, RTSM maintains stable object identities over time, enabling robots and embodied agents to answer questions like:
- What objects exist in this space?
- Where are they right now?
- What changed, and when?
Watch Demo Video · Documentation
Why RTSM
Modern perception systems can detect objects, but they lack memory. SLAM systems build geometry, vision models detect semantics, and language models reason abstractly—but there is no shared layer that connects space, objects, and history.
RTSM fills this gap by acting as an explicit spatial memory layer:
- SLAM provides geometry and poses
- Vision models provide object masks and semantics
- RTSM fuses them into a persistent world representation
This makes spatial state inspectable, queryable, and reusable across robots, agents, and applications.
What RTSM Does
- Builds a live 3D map from RGB-D + pose streams
- Assigns persistent IDs to objects across viewpoints and time
- Stores spatial, semantic, and temporal metadata per object
- Supports semantic + spatial queries (e.g. "red bin near dock 3")
- Exposes a programmatic API and real-time 3D visualizer
RTSM is SLAM-agnostic and designed to sit above existing perception stacks.
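The semantic-query path listed above (text to CLIP embedding to top-k match) can be sketched in a few lines of plain Python. The 4-D vectors below are toy stand-ins for RTSM's 512-D CLIP embeddings, and the function names are illustrative, not RTSM's API:

```python
from math import sqrt

def l2(vec):
    """L2-normalize a vector (RTSM stores L2-normalized CLIP embeddings)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_semantic(query_emb, object_embs, k=3):
    """Rank stored object embeddings by cosine similarity to a query embedding.

    For L2-normalized vectors, cosine similarity is just a dot product.
    Returns (index, similarity) pairs, best first.
    """
    sims = [sum(q * o for q, o in zip(query_emb, obj)) for obj in object_embs]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
    return [(i, sims[i]) for i in order]

# Toy 4-D vectors standing in for CLIP embeddings of stored objects.
objects = [l2([1, 0, 0, 0]), l2([0, 1, 0, 0]), l2([1, 1, 0, 0])]
query = l2([1, 0.9, 0, 0])  # stand-in for an embedded text query
print(top_k_semantic(query, objects, k=2))
```

In the real system the query embedding comes from the CLIP text encoder and the ranking runs over the vector store, but the shape of the computation is the same.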
Who This Is For
- Robotics and embodied AI researchers
- Developers building agentic or world-model-based systems
- Teams exploring persistent perception, spatial reasoning, or digital twins
Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ RTSM — Real-Time Spatio-Semantic Memory │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Calabi Lens │ │ D435i + SLAM │ │ Recorded │
│ (ARKit iOS) │ │ (RTABMap) │ │ Session │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ WebSocket │ ZeroMQ │ --replay
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ I/O Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WebSocket │ │ ZMQ Bridge │ │ Replay │ │ Recorder │ │
│ │ Receiver │ │ (sensors) │ │ Receiver │ │ (--record) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ └──────────────┘ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ┌──────▼───────┐ ┌──────────────┐ │
│ │ IngestQueue │────>│ FramePacket │ │
│ │ (buffer) │ │ (RGB,D,Pose) │ │
│ └──────────────┘ └──────┬───────┘ │
│ │ │
└───────────────────────────────────────────────┼──────────────────────────┘
│
┌─────────────────────▼───────────────────────┐
│ Ingest Gate │
│ (keyframe priority, sweep-based skip) │
└─────────────────────┬───────────────────────┘
│
┌───────────────────────────────────────────────▼──────────────────────────┐
│ Perception Pipeline │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Grounding DINO │ │ SAM2 │ Default: grounded_sam2 │
│ │ (detection + │─>│ (box-prompted │ GDINO detects → SAM2 segments │
│ │ labels) │ │ masks) │ (Apache 2.0, no AGPL) │
│ └────────────────┘ └───────┬────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Mask Staging │────>│ Top-K Select │────>│ CLIP Encode │ │
│ │ (heuristics) │ │ (priority) │ │(224x224 crop)│ │
│ └───────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ┌───────────────┐ ┌─────▼────────┐ │
│ │ Vocab Classify│<───────│ Embeddings │ │
│ │ (label + conf)│ │ (512-D L2) │ │
│ └───────┬───────┘ └──────────────┘ │
│ │ │
└─────────────────────────────┼────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Association │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │
│ │ Proximity │────>│ Embedding │────>│ Score Fusion │ │
│ │ Query │ │ Cosine Sim │ │ (match/create)│ │
│ └─────────────┘ └─────────────┘ └───────────────┘ │
│ │
└───────────────┬──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Working Memory │
│ │
│ ObjectState: │
│ - id, xyz_world (3D position) │
│ - emb_mean, emb_gallery (CLIP embeddings) │
│ - view_bins (multi-view fusion) │
│ - label_scores (EWMA label confidence) │
│ - stability, hits, confirmed │
│ - image_crops (JPEG snapshots) │
│ │
│ Proto -> Confirmed (hits >= 2, stability >= 0.55, views >= 1) │
│ │
└───────────────┬──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Long-Term Memory (FAISS / Milvus) │
│ │
│ Semantic Search: query(text) -> CLIP -> top-k objects │
│ │
└───────────────┬──────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ API & Visualization │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ REST API │ │ WebSocket │ │ 3D Demo │ │
│ │ /objects │ │ point clouds │ │ (Three.js) │ │
│ │ /search │ │ objects_update │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
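The Association and Working Memory stages in the diagram can be sketched as follows. Only the ObjectState field names and the promotion rule (hits >= 2, stability >= 0.55, views >= 1) come from the diagram; the fusion weights, match threshold, and function names are illustrative assumptions, not RTSM's actual code:

```python
from dataclasses import dataclass

# Illustrative weights and thresholds; RTSM's real values live in config/rtsm.yaml.
SPATIAL_WEIGHT = 0.5    # assumed
EMBEDDING_WEIGHT = 0.5  # assumed
MATCH_THRESHOLD = 0.6   # assumed

@dataclass
class ObjectState:
    """Trimmed-down version of the ObjectState fields listed in the diagram."""
    id: int
    xyz_world: tuple
    emb_mean: list
    hits: int = 1
    stability: float = 0.0
    views: int = 1
    confirmed: bool = False

    def maybe_confirm(self):
        # Promotion rule from the diagram: Proto -> Confirmed.
        if self.hits >= 2 and self.stability >= 0.55 and self.views >= 1:
            self.confirmed = True
        return self.confirmed

def fuse_score(spatial_sim, cosine_sim):
    """Score fusion: blend proximity and embedding similarity into one score."""
    return SPATIAL_WEIGHT * spatial_sim + EMBEDDING_WEIGHT * cosine_sim

def associate(candidate_scores):
    """Decide match-vs-create from (spatial_sim, cosine_sim) pairs of candidates."""
    best = max((fuse_score(s, c) for s, c in candidate_scores), default=0.0)
    return "match" if best >= MATCH_THRESHOLD else "create"
```

A detection that matches an existing object bumps its `hits` and updates its embeddings; one that matches nothing creates a new proto object, which is only promoted once `maybe_confirm` passes.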
Quick Start
Prerequisites
- Python 3.12+
- pip (or uv)
- CUDA-capable GPU (tested on RTX 3080, RTX 5090)
- One of:
- iPhone with Calabi Lens app (ARKit, no external SLAM needed)
- RGB-D camera (Intel RealSense D435i) + SLAM system (RTAB-Map)
- A recorded session (for replay — no hardware needed)
Tested on: WSL2 Ubuntu 22.04 with RTAB-Map, and macOS/Windows with Calabi Lens.
Installation
git clone https://github.com/calabi-inc/rtsm.git
cd rtsm
# Core only (API server, I/O transports — no GPU needed)
pip install .
# With GPU — permissive license (SAM2 + Grounding DINO, Apache 2.0)
pip install ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
# With GPU — ultralytics backends (FastSAM + YOLOE, AGPL-3.0)
pip install ".[gpu-ultralytics]" --extra-index-url https://download.pytorch.org/whl/cu128
# Everything (GPU + visualization)
pip install ".[all]" --extra-index-url https://download.pytorch.org/whl/cu128
License note: `rtsm[gpu]` uses only Apache 2.0 / MIT dependencies. `rtsm[gpu-ultralytics]` adds the `ultralytics` package (AGPL-3.0) for the FastSAM and YOLOE backends.
CUDA version: use `cu128` for most GPUs (RTX 3080–5090). For Blackwell-only features use `cu130`. See the PyTorch install guide for other options.
Download Models
# Fetch default models (SAM2, Grounding DINO, CLIP)
python scripts/fetch_models.py
# Or fetch individually
python scripts/fetch_models.py --only sam2
python scripts/fetch_models.py --only gdino
python scripts/fetch_models.py --only clip
# Ultralytics models (only if you installed rtsm[gpu-ultralytics])
python scripts/fetch_models.py --only fastsam
python scripts/fetch_models.py --only yolo
Run
# Live — Calabi Lens (ARKit over WebSocket)
python -m rtsm
# Live — D435i + RTAB-Map (ZeroMQ)
# Set io.receiver: zeromq in config/rtsm.yaml first
python -m rtsm
# Replay a recorded session (no device needed)
python -m rtsm --replay recordings/session1
RTSM will start:
- Perception pipeline — processing frames
- REST API — http://localhost:8000
- Visualization WebSocket — ws://localhost:8081
Record & Replay
Record a session for offline testing and reproducible iteration:
# Record only (no GPU needed, no pipeline — works with core-only install)
python -m rtsm --record recordings/my_session --record-only
# Record while running pipeline
python -m rtsm --record recordings/my_session
# Replay at original recording rate
python -m rtsm --replay recordings/my_session
Recordings are self-contained directories with raw WebSocket data. Replay feeds the exact same bytes through the full decode + pipeline path, preserving all time-dependent behavior (TTL caches, throttles).
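A minimal sketch of that replay behavior, assuming packets are stored as (timestamp, raw_bytes) pairs; the actual on-disk format may differ:

```python
import time

def replay(packets, sink, speed=1.0):
    """Feed recorded (timestamp, raw_bytes) pairs to `sink`, preserving the
    original inter-frame gaps (optionally scaled by `speed`).

    Because `sink` receives the exact recorded bytes, the downstream decode
    and pipeline path behaves as it did live, including time-dependent
    behavior driven by frame pacing.
    """
    prev_ts = None
    for ts, payload in packets:
        if prev_ts is not None:
            time.sleep(max(0.0, (ts - prev_ts) / speed))
        prev_ts = ts
        sink(payload)

# Usage: three fake packets recorded 10 ms apart, replayed at 10x speed.
received = []
replay([(0.00, b"frame0"), (0.01, b"frame1"), (0.02, b"frame2")],
       received.append, speed=10.0)
```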
A/B Segmentation Debug
Compare FastSAM vs YOLOE segmentation on a recorded session:
# Generate side-by-side overlays (cached — skips existing frames)
python scripts/debug_segmentation.py --recording recordings/session1
# Open the viewer
# → debug/session1/compare.html (arrow keys to navigate)
API Examples
# List all objects
curl http://localhost:8000/objects
# Semantic search
curl "http://localhost:8000/search/semantic?query=red%20mug&top_k=5"
# Get system stats
curl http://localhost:8000/stats/detailed
# Runtime analytics (latency, throughput, dual confirmation rates)
curl http://localhost:8000/stats/analytics
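The same calls can be made from Python using only the standard library. The endpoint paths match the curl examples above; the helper names are ours:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8000"

def semantic_search_url(query, top_k=5):
    """Build the /search/semantic URL with proper percent-encoding."""
    return f"{BASE}/search/semantic?{urlencode({'query': query, 'top_k': top_k})}"

def list_objects():
    """Equivalent of `curl http://localhost:8000/objects` (needs a running server)."""
    with urlopen(f"{BASE}/objects") as resp:
        return json.load(resp)

print(semantic_search_url("red mug"))
```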
Configuration
See config/rtsm.yaml for full configuration options:
- Camera intrinsics — focal length, resolution
- I/O endpoints — ZeroMQ addresses for camera and SLAM
- Pipeline tuning — mask filtering, association thresholds
- Memory settings — object promotion, expiry, vector store
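A hypothetical sketch of how those option groups might look; the key names below are illustrative (only `io.receiver: zeromq` is documented above), so consult `config/rtsm.yaml` for the real schema:

```yaml
camera:
  fx: 600.0            # focal length (px) -- illustrative value
  width: 640
  height: 480
io:
  receiver: websocket  # or: zeromq (D435i + RTAB-Map)
pipeline:
  min_mask_area: 400   # mask filtering (hypothetical key)
  match_threshold: 0.6 # association (hypothetical key)
memory:
  confirm_hits: 2      # object promotion (hypothetical key)
  expiry_s: 300        # object expiry (hypothetical key)
  vector_store: faiss  # or: milvus
```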
Segmentation Backends
RTSM supports multiple segmentation backends via segmentation.backend in config/rtsm.yaml:
| Backend | License | Description | Seg time* | Pipeline total* | Labels |
|---|---|---|---|---|---|
| `grounded_sam2` | Apache 2.0 | Grounding DINO detect + SAM2 segment | 222 ms | 510 ms | Open-vocab (text-prompted) |
| `sam2` | Apache 2.0 | SAM2 auto-mask (segment everything) | ~860 ms | ~1000 ms | None (class-agnostic) |
| `fastsam` | AGPL-3.0 | FastSAM (segment everything) | ~50 ms | ~200 ms | None (class-agnostic) |
| `yoloe` | AGPL-3.0 | YOLOE detection + segmentation | ~60 ms | ~210 ms | Open-vocab / 1200+ built-in |
| `dual` | AGPL-3.0 | FastSAM + YOLOE with IoU merge | 116 ms | 210 ms | Dual-confirmed labels |

\* Mean on RTX 5090, 640x480 input. `dual` and `grounded_sam2` measured via replay benchmark; others estimated.
Default: grounded_sam2 — permissive license, open-vocabulary, no AGPL dependency.
To switch backends, edit config/rtsm.yaml:
segmentation:
backend: grounded_sam2 # or: sam2, fastsam, yoloe, dual
`fastsam`, `yoloe`, and `dual` require `pip install "rtsm[gpu-ultralytics]"`.
Project Structure
rtsm/
├── core/ # Pipeline, association, ingest gate, data models
├── models/ # SAM2, Grounding DINO, FastSAM, YOLOE, CLIP adapters
├── stores/ # Working memory, proximity index, sweep cache, vector stores
├── io/ # WebSocket + ZeroMQ receivers, recorder, replayer
├── analytics/ # Runtime analytics (latency, segmentation, congestion buffers)
├── api/ # REST API server (FastAPI)
├── visualization/ # WebSocket server, TSDF fusion, 3D demo
└── utils/ # Mask staging, transforms, helpers
config/
├── rtsm.yaml # Main configuration (models, thresholds, I/O)
└── clip/vocab.yaml # CLIP vocabulary
scripts/
├── fetch_models.py # Download all models (SAM2, GDINO, CLIP, FastSAM, YOLOE)
├── debug_segmentation.py # A/B segmentation viewer (FastSAM vs YOLOE)
└── benchmark_backends.py # Backend comparison benchmark (generates reports/)
reports/ # Benchmark results and comparison reports
recordings/ # Recorded sessions for replay testing (git-lfs)
tests/ # Unit + integration tests
Performance
Benchmarked on RTX 5090 (32 GB), iPhone ARKit recording (162 frames, 458s indoor scene), 640x480 RGB input. Both backends run the same replay session through the identical 10-stage pipeline.
Backend Comparison
| Metric | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Mean latency | 210 ms | 510 ms |
| P50 latency | 170 ms | 502 ms |
| P95 latency | 509 ms | 721 ms |
| Masks/frame | 28.8 | 13.4 |
| Objects confirmed | 60 | 35 |
| Confirmation rate | 52.2% | 45.5% |
| License | AGPL-3.0 | Apache-2.0 |
Per-stage breakdown, dual confirmation analysis, and full methodology: Benchmarks · reports/backend_comparison.md
Roadmap
- Dual-confirmation segmentation (FastSAM + YOLOE)
- AGPL-clean default (SAM2 + Grounding DINO, Apache 2.0)
- YOLOE prompt-free (1200+ LVIS categories)
- WebSocket receiver for Calabi Lens (ARKit iOS)
- Record/replay system for offline testing
- A/B segmentation debug tooling
- Real-time analytics dashboard (Looker-style, per-stage latency, dual confirmation rates, congestion detection)
- Evaluation framework (ArUco ground truth, precision/recall metrics)
- Agent architecture (MCP interface)
- More communication protocols (ROS 2, gRPC)
- LLM integration for high-level queries (agentic mode)
- Dockerization
Acknowledgments
RTSM builds on excellent open-source work:
- SAM 2 — Ravi et al., SAM 2: Segment Anything in Images and Videos, 2024. arXiv:2408.00714 · GitHub
- Grounding DINO — Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023. arXiv:2303.05499 · GitHub
- FastSAM — Zhao et al., Fast Segment Anything, 2023. arXiv:2306.12156 · GitHub
- YOLOE — THU-MIG, YOLOE: Real-Time Seeing Anything, ICCV 2025. GitHub · Ultralytics
- CLIP — Radford et al., Learning Transferable Visual Models From Natural Language Supervision, 2021. arXiv:2103.00020 · GitHub
- RTAB-Map — Labbé & Michaud, RTAB-Map as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics, 2019. Paper · GitHub
Cite This
If you use RTSM in your research, please cite:
@software{chang2025rtsm,
author = {Chang, Chi Feng},
title = {{RTSM}: Real-Time Spatio-Semantic Memory},
year = {2025},
url = {https://github.com/calabi-inc/rtsm},
note = {Object-centric queryable memory for spatial AI and robotics}
}
Community
This project is under active development. If you have questions or run into issues, feel free to open an issue — I'm happy to help.
If you find RTSM useful, please consider giving it a star! I'm also looking for design partners — reach out to Calabi if you're interested in collaborating.
License
Apache-2.0
Author
Built by Chi Feng Chang