NVIDIA Cosmos Reason VLM provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence
strands-cosmos
NVIDIA Cosmos toolkit for Strands Agents: omnimodal world-model reasoning and generation, edge deployment, and evaluation, all on local compute.
Cosmos models become first-class Strands model providers: give your agent eyes that understand physics, and hands that can generate video, audio, and robot actions. The package provides 4 Strands model providers (two for the Cosmos-Reason2 VLM, plus the new Cosmos 3 omnimodal Reasoner and Generator) and 44 tools covering the entire NVIDIA Cosmos ecosystem: VLM reasoning, world-model generation (image/video/audio/action), video-to-video (Transfer2.5), data curation (Xenna), post-training, distillation, quantization, edge deployment, and evaluation.
| Family | Providers | Best for |
|---|---|---|
| Cosmos 3 (latest, omnimodal) | Cosmos3ReasonerModel, Cosmos3GeneratorModel | Video/image/audio/action understanding + generation |
| Cosmos-Reason2 (VLM) | CosmosVisionModel, CosmosModel | Lightweight edge VLM (Jetson Thor/Orin) |
🌌 Cosmos 3 — Omnimodal World Models
Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces:
- Reasoner (Cosmos3ReasonerModel, vLLM): text + vision → text
- Generator (Cosmos3GeneratorModel, Diffusers): text/image → image/video/audio/action
See it end-to-end: Reason → Generate
Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (including one with synchronized audio) from its own description — all on a single local GPU.
Demo panels (media not shown): ① Input video · ② Cosmos 3 understands it
The reasoner distills its own understanding into a generation prompt:
"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."
Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49f):
| text → video | text → video + 🔊 sound | image → video |
|---|---|---|
| 55.5s | 43.2s · AAC stereo 48kHz | 42.1s · from a real frame |
→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with python examples/09_cosmos3_showcase.py
Quick start (Cosmos 3)
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, Cosmos3GeneratorModel
# Reasoner — text + vision -> text (local vLLM server; start with `just c3-serve-reason`)
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")
# Generator — text/image -> image/video/sound (in-process Diffusers, no server)
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video", prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.", out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video", prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")
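The Reason → Generate flow from the demo can be sketched as a small helper. This is a minimal sketch, not part of strands-cosmos: `reason_then_generate` is a hypothetical name, and the reasoner and generator are passed in as plain callables so the control flow can be exercised without a GPU.

```python
from typing import Callable


def reason_then_generate(
    describe: Callable[[str], str],
    generate: Callable[[str, str], object],
    video_path: str,
    out_path: str,
) -> str:
    """Caption a clip, then hand the caption to a generator as its prompt.

    `describe` and `generate` are hypothetical adapters around the
    reasoner and the generator; see the wiring comments below.
    """
    # Same inline <video> tag convention as the quick start prompts.
    question = (
        "Describe this clip in one sentence suitable as a video "
        f"generation prompt: <video>{video_path}</video>"
    )
    prompt = describe(question).strip()
    if not prompt:
        raise ValueError("reasoner returned an empty description")
    generate(prompt, out_path)
    return prompt  # returned so the caller can log the prompt that was used


# Possible wiring with the quick start objects (untested assumption that
# str() of the agent result yields the caption text):
#   agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
#   gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
#   reason_then_generate(
#       describe=lambda q: str(agent(q)),
#       generate=lambda p, out: gen.generate(mode="text2video", prompt=p, out_path=out),
#       video_path="scene.mp4",
#       out_path="vid.mp4",
#   )
```

On a single ~46GB GPU the two steps cannot hold both models at once, so the reasoner server would need to be stopped before the generator loads.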
Cosmos 3 capabilities
| Surface | Tools | Backend |
|---|---|---|
| Reasoner | cosmos3_reason, cosmos3_caption, cosmos3_temporal, cosmos3_embodied, cosmos3_ground, cosmos3_plausibility, cosmos3_situation, cosmos3_action_cot | vLLM |
| Generator | cosmos3_text2image, cosmos3_text2video, cosmos3_image2video, cosmos3_text2video_sound | Diffusers Cosmos3OmniPipeline (in-proc) |
| Action / World-Model | cosmos3_forward_dynamics, cosmos3_inverse_dynamics, cosmos3_policy | Cosmos Framework (torchrun) |
| Training (SFT) | cosmos3_train, cosmos3_train_convert, cosmos3_train_show, cosmos3_train_export, … | Cosmos Framework (torchrun) |
| Servers | cosmos3_serve | start / stop / status |
Cosmos 3 models
| Model | Size | Capability |
|---|---|---|
| Cosmos3-Nano | 16B | Omnimodal (reasoner + generator + action) — fits a single ~46GB GPU |
| Cosmos3-Super | 64B | Frontier-scale (multi-GPU / tensor-parallel) |
| Cosmos3-Nano-Policy-DROID | 16B | VL robot policy (DROID) |
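A rough way to read the sizes in this table is weight memory alone. The sketch below assumes 16-bit weights (2 bytes per parameter), which the table does not state, and ignores activations, KV cache, and framework overhead, so real usage will be higher. The helper names are illustrative only.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Assumes 16-bit weights by default; this is an assumption, not a
    documented property of the Cosmos 3 checkpoints.
    """
    return float(params_billion) * bytes_per_param


def fits(models_billion: list[float], gpu_gb: float, bytes_per_param: int = 2) -> bool:
    """True if the summed weights of all models fit in gpu_gb."""
    total = sum(weights_gb(p, bytes_per_param) for p in models_billion)
    return total <= gpu_gb


print(weights_gb(16))      # 32.0 -> one 16B model, weights only
print(fits([16], 46))      # True  -> a single Nano fits a ~46GB GPU
print(fits([16, 16], 46))  # False -> reasoner + generator together do not
print(fits([64], 46))      # False -> Super needs multi-GPU
```

Under that assumption the numbers agree with the table: one 16B model leaves headroom on a ~46GB GPU, while two copies or a single 64B model do not fit.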
Setup
Install the generator extras in one shot (Diffusers + guardrail + audio):
pip install "strands-cosmos[cosmos3-gen]" # text/image -> image/video/sound
pip install -U "git+https://github.com/huggingface/diffusers.git" # Cosmos3OmniPipeline (dev build)
pip install "strands-cosmos[cosmos3]" # reasoner client (vLLM server)
Or use the justfile to build dedicated, CUDA-matched environments:
just c3-doctor # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason # Reasoner env: vllm + vllm-cosmos3
just c3-serve-reason # serve Cosmos3-Nano on :8000
just c3-reason "Caption in detail." "" scene.mp4 caption
just c3-setup-gen # Generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4
just c3-setup-framework # Action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out # forward/inverse dynamics, policy
just c3-train-recipes # list SFT recipes
just c3-train vision_sft_nano # fine-tune (8x H100); see the training guide
CUDA pairing: match the torch backend to your driver. CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.
Single-GPU note: the reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't fit on one ~46GB GPU together. Stop one before running the other, or use separate GPUs.
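The CUDA pairing rule can be written down as a small lookup. This is a hypothetical helper, not the logic of just c3-doctor; it encodes only the two pairings listed above and assumes that driver versions between 12.8 and 13 fall back to the cu128 pairing.

```python
def recommend_pairing(cuda_version: str) -> tuple[str, str]:
    """Map a driver CUDA version such as "12.8" to (torch backend, vLLM pin).

    Only the two pairings named in this README are known; treating
    12.9 like 12.8 is an assumption.
    """
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) >= (13, 0):
        return ("cu130", "vllm==0.21.0")
    if (major, minor) >= (12, 8):
        return ("cu128", "vllm==0.19.1")
    raise ValueError(f"no known pairing for CUDA {cuda_version}")


backend, vllm_pin = recommend_pairing("12.8")
print(backend, vllm_pin)  # cu128 vllm==0.19.1
```

Prefer the output of just c3-doctor on a real machine; the sketch is only meant to make the mapping explicit.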
📖 Full guides: Cosmos 3 · Training/SFT
Install
pip install strands-cosmos
Developer Setup
git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full # Installs system deps, Python deps, clones all Cosmos repos
just doctor # Verify everything
NVIDIA Jetson (Thor, Orin, AGX)
pip install strands-cosmos
strands-cosmos-fix-cublas # Fix CUBLAS for Jetson GPU architecture
Cosmos-Reason2 (Lightweight Edge VLM)
For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint — verified on Jetson AGX Thor with Chain-of-Thought reasoning.
Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor
from strands import Agent
from strands_cosmos import CosmosVisionModel
model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)
agent("Caption in detail: <video>dashcam.mp4</video>") # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?") # image reasoning
agent("What happens when a ball rolls off a table?") # text-only physics
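The prompts above embed media with inline `<video>` and `<image>` tags. A small helper can build such prompts from a file path. This is an illustrative sketch: `media_prompt` and the extension sets are assumptions, not part of strands-cosmos, and the formats the provider actually accepts may differ.

```python
from pathlib import Path

# Assumed extension sets, chosen for illustration only.
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def media_prompt(instruction: str, media_path: str) -> str:
    """Wrap a media path in the tag style used by the examples above."""
    ext = Path(media_path).suffix.lower()
    if ext in VIDEO_EXTS:
        # Video examples put the instruction first, then the tag.
        return f"{instruction} <video>{media_path}</video>"
    if ext in IMAGE_EXTS:
        # Image examples put the tag first, then the question.
        return f"<image>{media_path}</image> {instruction}"
    raise ValueError(f"unrecognized media type: {media_path}")


print(media_prompt("Caption in detail:", "dashcam.mp4"))
print(media_prompt("What should the robot do next?", "robot_view.jpg"))
```

The two calls reproduce the video and image prompts from the example above, so the result can be passed straight to the agent.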
| Model | GPU Memory | Use Case |
|---|---|---|
| Cosmos-Reason2-2B | 24GB | Edge deployment (Jetson Thor/Orin) |
| Cosmos-Reason2-8B | 32GB | Cloud/desktop high-accuracy |
Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.
Pipeline Tools (Cosmos-Reason2 / Predict / Transfer)
Use any tool inside a Strands Agent for full Cosmos pipeline automation:
| Category | Tools | Description |
|---|---|---|
| Reason2 VLM | cosmos_inference, cosmos_reason_hf, cosmos_serve | TRT server inference, HF direct inference, server lifecycle |
| Predict 2.5 | cosmos_predict_generate | World-model video generation (future frame prediction) |
| Transfer 2.5 | cosmos_transfer_generate | ControlNet video-to-video (depth/edge/sketch→video) |
| Model Lifecycle | cosmos_model_download, cosmos_quantize, cosmos_export_onnx, cosmos_build_engine | Download, FP8 quantize, ONNX export, TRT engine build |
| Training | cosmos_post_train, cosmos_distill | SFT/LoRA post-training, knowledge distillation |
| Data | cosmos_curate | Xenna data curation pipeline |
| Evaluation | cosmos_evaluate | FID/FVD/CSE/CLIP benchmark evaluation |
| I/O | rtp_capture_frame, nats_publish, video_probe, video_extract_frames, image_read | RTP capture, NATS messaging, video/image utilities |
| System | cosmos_sysinfo | GPU/platform diagnostics |
from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo
agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")
Architecture
strands_cosmos/
├── cosmos3_reasoner_model.py # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py # CosmosModel (Reason2 text-only)
├── fix_cublas.py # Jetson CUBLAS compatibility fix
├── tools/
│ ├── cosmos3.py # 16 Cosmos 3 tools (reason/generate/action/serve)
│ ├── inference.py · reason_hf.py · serve.py # Reason2 VLM
│ ├── predict_generate.py · transfer_generate.py # Predict2.5 / Transfer2.5
│ ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│ ├── post_train.py · distill.py · curate.py · evaluate.py
│ └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile # Developer workflow + c3-* recipes
Configuration
# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)
# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)
# Cosmos-Reason2 VLM
CosmosVisionModel(
model_id="nvidia/Cosmos-Reason2-8B",
reasoning=True, # Chain-of-thought <think>...</think>
fps=4,
params={"max_tokens": 4096, "temperature": 0.6},
)
Verified Platforms
| Platform | GPU | Status |
|---|---|---|
| Desktop / Cloud | NVIDIA L40S / A100 / H100 / RTX 4090 | ✅ Cosmos 3 + Reason2 |
| Jetson AGX Thor | NVIDIA Thor 132GB | ✅ Reason2 (with CUBLAS fix) |
| Jetson Orin | 32/64GB | ✅ Reason2 (may need CUBLAS fix) |
Troubleshooting
CUBLAS_STATUS_INVALID_VALUE on Jetson
strands-cosmos-fix-cublas # Replaces torch's bundled CUBLAS with JetPack system CUBLAS
Cosmos 3 reasoner OOM on a single GPU
The default sequence length (262K) needs a huge KV cache. Cap it: the just c3-serve-reason recipe sets --max-model-len 32768. Also stop the generator before serving the reasoner (and vice versa).
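To see why the cap helps, note that KV cache size grows linearly with sequence length. The sketch below uses the standard per-sequence estimate with made-up layer and head counts; the real Cosmos 3 dimensions are not given here, and 262K is read as 262144 tokens, which is also an assumption.

```python
def kv_cache_gb(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> float:
    """Per-sequence KV cache size in GB (1 GB = 1e9 bytes)."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 1e9


# Hypothetical architecture, for illustration only.
shape = dict(num_layers=48, num_kv_heads=8, head_dim=128)

full = kv_cache_gb(262_144, **shape)   # default context, read as 262144 tokens
capped = kv_cache_gb(32_768, **shape)  # --max-model-len 32768
print(f"{full:.1f} GB vs {capped:.1f} GB, ratio {full / capped:.0f}x")
```

With these illustrative dimensions the cache drops from roughly 51.5 GB to roughly 6.4 GB per sequence. Whatever the real dimensions are, the 8x ratio holds because only the sequence length changes.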
StopIteration in get_rope_index during video (Reason2)
Already handled — strands-cosmos pins a compatible transformers range. If you see it:
pip install "transformers>=4.57.0,<5.3.0"
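To check whether an environment falls inside that range before debugging further, a standard-library check is enough. This is a simplified sketch: the helper names are hypothetical, and the parser compares only the numeric major.minor.patch parts, so pre-release suffixes such as .dev0 are ignored rather than ordered the way pip orders them.

```python
from importlib.metadata import PackageNotFoundError, version


def parse(v: str) -> tuple[int, ...]:
    """Parse "4.57.1" (or "5.3.0.dev0") into a 3-part numeric tuple."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad "4.57" to (4, 57, 0)
        parts.append(0)
    return tuple(parts)


def in_supported_range(v: str, low: str = "4.57.0", high: str = "5.3.0") -> bool:
    """True if low <= v < high, matching the pin shown above."""
    return parse(low) <= parse(v) < parse(high)


def check_transformers() -> str:
    """Report the installed transformers version against the range."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    status = "ok" if in_supported_range(installed) else "outside >=4.57.0,<5.3.0"
    return f"transformers {installed}: {status}"


print(check_transformers())
```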
Video caption fails with module 'torchvision.io' has no attribute 'read_video'
transformers 5.x decodes video with torchcodec and falls back to torchvision,
which removed io.read_video in >=0.27. Install torchcodec (now a dependency):
pip install torchcodec
TRT tools return exit 127
Expected on workstations — those tools run on Jetson or in TRT Docker. Run just doctor.
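Exit status 127 is the POSIX shell convention for "command not found", so a quick pre-flight check is to look for the required binaries on PATH. The sketch below is a generic helper, not what just doctor does, and the binary name is only an example: this README does not list which executables the TRT tools call.

```python
import shutil


def missing_binaries(names: list[str]) -> list[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "trtexec" is the TensorRT command-line tool; whether the strands-cosmos
# TRT tools call it is an assumption made for illustration.
missing = missing_binaries(["trtexec"])
if missing:
    print("not on PATH:", ", ".join(missing))
else:
    print("all required binaries found")
```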
Resources
- Changelog — Release history
- Cosmos 3 — Latest omnimodal world models
- Cosmos Cookbook — Official recipes
- Cosmos-Reason2 — VLM source
- Strands Agents — Agent framework
- strands-mlx — Apple Silicon provider
License
Apache 2.0 | Built with NVIDIA Cosmos and Strands Agents