NVIDIA Cosmos provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence
strands-cosmos
NVIDIA Cosmos for Strands Agents. Give your agent eyes that understand physics and hands that generate video, audio, and robot actions - on local compute.
4 model providers (Cosmos 3 omnimodal Reasoner & Generator + Cosmos-Reason2 VLM) and 45 tools spanning the full pipeline: reasoning, generation, curation, post-training, quantization, edge deployment, and evaluation.
⏱️ Learn it in 90 seconds
1. Install (we use uv everywhere):
uv pip install strands-cosmos
2. Understand video - the reasoner takes text and vision input and answers in text. It talks to a local vLLM server, so start one first (see "Run the reasoner server" just below), then:
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")
3. Generate video - the generator runs in-process (Diffusers, no server):
from strands_cosmos import Cosmos3GeneratorModel
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video", prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.", out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video", prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")
That's the whole loop: understand → generate. Everything below is depth - expand only what you need.
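The three generator calls in step 3 differ only in `mode` and in the extra inputs each mode takes. A minimal sketch of a helper that assembles those keyword arguments follows. `generate_kwargs` is a hypothetical name, not part of strands-cosmos, and it covers only the three modes shown in the quick start; other modes may exist.

```python
from typing import Any, Dict, Optional

# Modes taken from the quick-start calls; this set is not claimed to be complete.
KNOWN_MODES = {"text2video", "text2video-with-sound", "image2video"}


def generate_kwargs(mode: str, prompt: str, out_path: str,
                    image: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments shaped like the quick-start gen.generate(...) calls."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs: Dict[str, Any] = {"mode": mode, "prompt": prompt, "out_path": out_path}
    if mode == "image2video":
        if image is None:
            raise ValueError("image2video needs a starting image")
        kwargs["image"] = image
    if mode == "text2video-with-sound":
        kwargs["enable_sound"] = True  # mirrors the quick-start call
    return kwargs


if __name__ == "__main__":
    print(generate_kwargs("image2video", "It moves forward.", "i2v.mp4", image="frame.jpg"))
```

The result can be passed on as `gen.generate(**kwargs)`, which fails fast on a missing image before the model is touched.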
| You want to… | Provider | Where |
|---|---|---|
| Understand video/image (text+vision → text) | Cosmos3ReasonerModel | vLLM server |
| Generate image/video/audio/action | Cosmos3GeneratorModel | in-process Diffusers |
| Run a tiny VLM on Jetson edge | CosmosVisionModel, CosmosModel | Edge VLM |
| Drive the full pipeline from tools | 45 cosmos* tools | Tools |
🚀 Run the reasoner server (one-time setup + serve)
The reasoner needs a vLLM server. Build a CUDA-matched env and serve Cosmos3-Nano on :8000:
just c3-doctor # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason # build the reasoner venv: vllm + vllm-cosmos3 (uv-managed)
just c3-serve-reason # serve Cosmos3-Nano on :8000 (--max-model-len 32768)
Verify it's up before pointing an agent at it:
curl -s http://localhost:8000/v1/models # → {"data":[{"id":"nvidia/Cosmos3-Nano",...}]}
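The same check can be done from Python before constructing an agent. This is a minimal sketch using only the standard library; it assumes the server returns the OpenAI-style `{"data": [{"id": ...}]}` body shown in the curl comment, and `model_ids` / `server_has_model` are illustrative names, not strands-cosmos functions.

```python
import json
from typing import List
from urllib.request import urlopen


def model_ids(payload: str) -> List[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON body."""
    data = json.loads(payload).get("data", [])
    return [entry["id"] for entry in data if "id" in entry]


def server_has_model(base_url: str, model_id: str, timeout: float = 5.0) -> bool:
    """Return True if the server lists model_id; False if unreachable or malformed."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
            return model_id in model_ids(resp.read().decode("utf-8"))
    except (OSError, ValueError):  # refused connection, timeout, HTTP error, bad JSON
        return False
```

For example, `server_has_model("http://localhost:8000/v1", "nvidia/Cosmos3-Nano")` should return `True` once `just c3-serve-reason` has finished loading.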
Or one-shot a caption straight from the justfile (no Python):
just c3-reason "Caption in detail." "" scene.mp4 caption
CUDA pairing: match torch to your driver - CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.
Single-GPU note: reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't co-fit on one ~46GB GPU - stop one before running the other, or use separate GPUs.
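The pairing rule can be written down as a small lookup. This is a sketch of the two pairings stated above, not what `just c3-doctor` actually runs; `recommend_pairing` is a hypothetical helper, and it assumes "CUDA 13" covers any 13.x release.

```python
from typing import Optional, Tuple

# Pairings quoted in the text; anything else is left to `just c3-doctor`.
PAIRINGS = {
    "13": ("cu130", "vllm==0.21.0"),
    "12.8": ("cu128", "vllm==0.19.1"),
}


def recommend_pairing(cuda_version: str) -> Optional[Tuple[str, str]]:
    """Map a CUDA version such as '13.0' or '12.8' to (torch wheel tag, vLLM pin)."""
    major, _, minor = cuda_version.partition(".")
    if major == "13":
        return PAIRINGS["13"]
    if (major, minor) == ("12", "8"):
        return PAIRINGS["12.8"]
    return None  # not covered by the documented pairings


if __name__ == "__main__":
    print(recommend_pairing("12.8"))
```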
📦 Install matrix (pick the extra for your task)
uv pip install strands-cosmos # core: Reason2 VLM + all tools
uv pip install "strands-cosmos[cosmos3]" # + Cosmos 3 reasoner client (vLLM server)
uv pip install "strands-cosmos[cosmos3-gen]" # + Cosmos 3 generator (in-proc Diffusers: image/video/sound)
uv pip install "strands-cosmos[vllm]" # + bundled vLLM + openai client
uv pip install "strands-cosmos[all]" # everything (heavy)
The generator needs the diffusers dev build (Cosmos3OmniPipeline). PyPI forbids
direct-URL deps, so install it from Git yourself (or run just c3-setup-gen):
uv pip install -U "git+https://github.com/huggingface/diffusers.git"
| Extra | Pulls in | For |
|---|---|---|
| (none) | transformers, torch, torchvision, torchcodec, av | Reason2 VLM + tools |
| cosmos3 | openai | Cosmos 3 reasoner client |
| cosmos3-gen | diffusers, cosmos_guardrail, soundfile, imageio | Cosmos 3 generator |
| vllm | vllm, openai | self-hosting vLLM |
| jetson | torchcodec | Jetson companions (torch via JetPack) |
| all | all of the above + dev tools | kitchen sink |
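To see which extras are already satisfied in the current environment, something like the following can help. It is a sketch: the extra-to-module mapping is copied from the table, and the import names are assumed to match the package names listed there (for example `cosmos_guardrail`).

```python
from importlib.util import find_spec
from typing import Dict, List

# Extra -> top-level modules, following the install table.
EXTRA_MODULES: Dict[str, List[str]] = {
    "cosmos3": ["openai"],
    "cosmos3-gen": ["diffusers", "cosmos_guardrail", "soundfile", "imageio"],
    "vllm": ["vllm", "openai"],
}


def missing_modules(modules: List[str]) -> List[str]:
    """Return the modules from the list that cannot be located for import."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    for extra, modules in EXTRA_MODULES.items():
        missing = missing_modules(modules)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{extra}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.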
🛠️ Developer setup (clone + build everything)
git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full # system deps, Python deps, clones all Cosmos repos (uv-managed venvs)
just doctor # verify everything
# dedicated, CUDA-matched envs (each is its own uv venv):
just c3-setup-reason # reasoner: vllm + vllm-cosmos3
just c3-setup-gen # generator: diffusers(main) + cosmos_guardrail
just c3-setup-framework # action + training: Cosmos Framework
Run/lint/test against the dev env via uv:
uv pip install -e ".[dev]"
uv run pytest
uv run ruff check .
📜 Single-file script with inline deps (PEP 723 + uv run)
Drop dependencies directly into a script's header - uv run builds an ephemeral env, no
manual install. Save as agent.py:
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "strands-agents[openai]",
# "strands-cosmos",
# ]
# ///
import os, sys
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, cosmos3_caption, cosmos3_temporal, video_probe
model = Cosmos3ReasonerModel(base_url=os.environ.get("COSMOS_BASE_URL", "http://localhost:8000/v1"))
agent = Agent(model=model, tools=[cosmos3_caption, cosmos3_temporal, video_probe])
agent(" ".join(sys.argv[1:]) or "Caption in detail: <video>scene.mp4</video>")
uv run agent.py "List the events with timestamps: <video>scene.mp4</video>"
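For the curious, the header above is plain TOML hidden in comments. The sketch below extracts it with the reference regular expression published in PEP 723; it illustrates the format and makes no claim about how uv implements it internally. `script_metadata_toml` is an illustrative name.

```python
import re
from typing import Optional

# Reference regular expression from PEP 723 for inline metadata blocks.
BLOCK = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"


def script_metadata_toml(script: str) -> Optional[str]:
    """Return the TOML text of the `script` block, or None if there is none."""
    matches = [m for m in re.finditer(BLOCK, script) if m.group("type") == "script"]
    if len(matches) > 1:
        raise ValueError("multiple script blocks found")
    if not matches:
        return None
    # Drop the leading "# " (or bare "#") from every content line.
    return "".join(
        line[2:] if line.startswith("# ") else line[1:]
        for line in matches[0].group("content").splitlines(keepends=True)
    )


EXAMPLE = '''# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "strands-agents[openai]",
#   "strands-cosmos",
# ]
# ///
import os, sys
'''

if __name__ == "__main__":
    print(script_metadata_toml(EXAMPLE))
```

On Python 3.11+, `tomllib.loads(script_metadata_toml(EXAMPLE))` turns the text into a dict with `requires-python` and `dependencies` keys.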
🌌 Cosmos 3 - Omnimodal World Models
Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces - the Reasoner (vLLM, text+vision → text) and the Generator (Diffusers, → image/video/audio/action).
See it end-to-end: Reason → Generate
Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (one with synchronized audio) from its own description - all on a single local GPU.
① Input video · ② Cosmos 3 understands it (clips and reasoning output: demo/cosmos3_showcase/)
The reasoner distills its own understanding into a generation prompt:
"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."
Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49f):
| text → video | text → video + 🔊 sound | image → video |
|---|---|---|
| 55.5s | 43.2s · AAC stereo 48kHz | 42.1s · from a real frame |
→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with uv run examples/09_cosmos3_showcase.py
Capabilities & tool map
| Surface | Tools | Backend |
|---|---|---|
| Reasoner | cosmos3_reason, cosmos3_caption, cosmos3_temporal, cosmos3_embodied, cosmos3_ground, cosmos3_plausibility, cosmos3_situation, cosmos3_action_cot | vLLM |
| Generator | cosmos3_text2image, cosmos3_text2video, cosmos3_image2video, cosmos3_text2video_sound | Diffusers Cosmos3OmniPipeline (in-proc) |
| Video-to-video | cosmos3_video2video (transfer: day→night, recolor, restyle) | vLLM-Omni Docker (vllm/vllm-omni:cosmos3) |
| Action / World-Model | cosmos3_forward_dynamics, cosmos3_inverse_dynamics, cosmos3_policy | Cosmos Framework (torchrun) |
| Training (SFT) | cosmos3_train, cosmos3_train_convert, cosmos3_train_show, cosmos3_train_export, … | Cosmos Framework (torchrun) |
| Servers | cosmos3_serve | start / stop / status |
just c3-setup-gen # generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4
just c3-setup-framework # action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out # forward/inverse dynamics, policy
just c3-train-recipes # list SFT recipes
just c3-train vision_sft_nano # fine-tune (8x H100); see the training guide
📖 Full guides: Cosmos 3 · Training/SFT
Models
| Model | Size | Capability |
|---|---|---|
| Cosmos3-Nano | 16B | Omnimodal (reasoner + generator + action) - fits a single ~46GB GPU |
| Cosmos3-Super | 64B | Frontier-scale (multi-GPU / tensor-parallel) |
| Cosmos3-Nano-Policy-DROID | 16B | VL robot policy (DROID) |
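A back-of-envelope estimate shows why one 16B model fits a ~46GB GPU but two do not. The sketch assumes 16-bit weights (2 bytes per parameter), which this README does not state, and it counts weights only: activations, KV cache, and framework overhead come on top.

```python
def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in decimal GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param


GPU_GB = 46  # the "~46GB" single-GPU figure from the table

if __name__ == "__main__":
    nano = weight_gb(16)    # 32.0 GB of weights under the 2-byte assumption
    print(f"one Cosmos3-Nano:  {nano:.0f} GB, fits: {nano < GPU_GB}")
    print(f"two Cosmos3-Nano:  {2 * nano:.0f} GB, fits: {2 * nano < GPU_GB}")
    print(f"one Cosmos3-Super: {weight_gb(64):.0f} GB, fits: {weight_gb(64) < GPU_GB}")
```

Under the same assumption, Cosmos3-Super's weights alone exceed a single ~46GB GPU, which is consistent with the multi-GPU note in the table.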
🤖 Cosmos-Reason2 - Lightweight Edge VLM
For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint - verified on Jetson AGX Thor with Chain-of-Thought reasoning.
from strands import Agent
from strands_cosmos import CosmosVisionModel
agent = Agent(model=CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B"))
agent("Caption in detail: <video>dashcam.mp4</video>") # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?") # image reasoning
agent("What happens when a ball rolls off a table?") # text-only physics
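The prompts above embed media inline with `<video>…</video>` and `<image>…</image>` tags. As an illustration of that convention, here is a minimal sketch that separates the text from the tagged paths. It is not the provider's actual parser, and `split_prompt` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Matches <video>path</video> and <image>path</image>, as used in the prompts.
MEDIA_TAG = re.compile(r"<(video|image)>(.*?)</\1>", re.DOTALL)


def split_prompt(prompt: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (text without tags, [(kind, path), ...]) for a tagged prompt."""
    media = [(m.group(1), m.group(2).strip()) for m in MEDIA_TAG.finditer(prompt)]
    # Replace each tag with a space, then collapse runs of whitespace.
    text = " ".join(MEDIA_TAG.sub(" ", prompt).split())
    return text, media


if __name__ == "__main__":
    print(split_prompt("<image>robot_view.jpg</image> What should the robot do next?"))
```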
Demo, models & performance
Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor
| Model | GPU Memory | Use Case |
|---|---|---|
| Cosmos-Reason2-2B | 24GB | Edge deployment (Jetson Thor/Orin) |
| Cosmos-Reason2-8B | 32GB | Cloud/desktop high-accuracy |
Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.
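Those figures translate into rough rates. The arithmetic below treats the 1.4s as end-to-end time for all 46 tokens, which is an assumption (it may include prompt processing, so pure decode speed could be higher), and the 10-second clip length is a made-up example.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput over the whole request."""
    return tokens / seconds


def sampled_frames(duration_s: float, fps: float = 4.0) -> int:
    """Number of frames a clip contributes when sampled at a fixed rate."""
    return int(duration_s * fps)


if __name__ == "__main__":
    print(f"{tokens_per_second(46, 1.4):.1f} tokens/s")        # text inference figure
    print(f"{sampled_frames(10)} frames for a 10 s clip at 4 fps")  # hypothetical clip
```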
Jetson install:
uv pip install strands-cosmos
strands-cosmos-fix-cublas # fix CUBLAS for Jetson GPU architecture
🧰 Tools
Use any of the 45 tools inside a Strands Agent for full-pipeline automation:
from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo
agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")
Full tool catalog (Reason2 / Predict / Transfer / lifecycle / data / eval)
| Category | Tools | Description |
|---|---|---|
| Reason2 VLM | cosmos_inference, cosmos_reason_hf, cosmos_serve | TRT server inference, HF direct inference, server lifecycle |
| Predict 2.5 | cosmos_predict_generate | World-model video generation (future frame prediction) |
| Transfer 2.5 | cosmos_transfer_generate | ControlNet video-to-video (depth/edge/sketch→video) |
| Model Lifecycle | cosmos_model_download, cosmos_quantize, cosmos_export_onnx, cosmos_build_engine | Download, FP8 quantize, ONNX export, TRT engine build |
| Training | cosmos_post_train, cosmos_distill | SFT/LoRA post-training, knowledge distillation |
| Data | cosmos_curate | Xenna data curation pipeline |
| Evaluation | cosmos_evaluate | FID/FVD/CSE/CLIP benchmark evaluation |
| I/O | rtp_capture_frame, nats_publish, video_probe, video_extract_frames, image_read | RTP capture, NATS messaging, video/image utilities |
| System | cosmos_sysinfo | GPU/platform diagnostics |
⚙️ Configuration
# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)
# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)
# Cosmos-Reason2 VLM
CosmosVisionModel(
model_id="nvidia/Cosmos-Reason2-8B",
reasoning=True, # Chain-of-thought <think>...</think>
fps=4,
params={"max_tokens": 4096, "temperature": 0.6},
)
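With `reasoning=True` the comment above indicates chain-of-thought wrapped in `<think>...</think>`. If you receive raw text in that shape and want the final answer on its own, a small splitter like this can help. It is a sketch that assumes a single `<think>` block; whether the provider already strips the block for you is not stated here, and `split_reasoning` is an illustrative name.

```python
import re
from typing import Tuple

THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no <think> block exists."""
    match = THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>The ball leaves the edge.</think>\nIt falls in a parabola."
    print(split_reasoning(raw))
```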
🏗️ Architecture
strands_cosmos/
├── cosmos3_reasoner_model.py # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py # CosmosModel (Reason2 text-only)
├── fix_cublas.py # Jetson CUBLAS compatibility fix
├── tools/
│ ├── cosmos3.py # 16 Cosmos 3 tools (reason/generate/action/serve)
│ ├── inference.py · reason_hf.py · serve.py # Reason2 VLM
│ ├── predict_generate.py · transfer_generate.py # Predict2.5 / Transfer2.5
│ ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│ ├── post_train.py · distill.py · curate.py · evaluate.py
│ └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile # Developer workflow + c3-* recipes
✅ Verified platforms
| Platform | GPU | Status |
|---|---|---|
| Desktop / Cloud | NVIDIA L40S / A100 / H100 / RTX 4090 | ✅ Cosmos 3 + Reason2 |
| Jetson AGX Thor | NVIDIA Thor 132GB | ✅ Reason2 (with CUBLAS fix) |
| Jetson Orin | 32/64GB | ✅ Reason2 (may need CUBLAS fix) |
🩺 Troubleshooting
CUBLAS_STATUS_INVALID_VALUE on Jetson
strands-cosmos-fix-cublas # replaces torch's bundled CUBLAS with JetPack system CUBLAS
Cosmos 3 reasoner OOM on a single GPU - the default sequence length (262K) needs a huge
KV cache. just c3-serve-reason caps it at --max-model-len 32768. Stop the generator before
serving the reasoner (and vice versa).
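To see why the cap matters, the usual KV-cache estimate for one full-length sequence is 2 (keys and values) × layers × KV heads × head dim × bytes per value × tokens. The layer and head counts below are hypothetical placeholders, not Cosmos3-Nano's real configuration, and "262K" is assumed to mean 262,144 tokens; only the ratio between the two context lengths is independent of those guesses.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys + values for one sequence under a standard attention layout."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical architecture numbers, for illustration only.
HYPOTHETICAL = dict(layers=40, kv_heads=8, head_dim=128)

DEFAULT_LEN = 262_144  # "262K", assuming it means 2**18 tokens
CAPPED_LEN = 32_768    # --max-model-len used by `just c3-serve-reason`

if __name__ == "__main__":
    full = kv_cache_bytes(DEFAULT_LEN, **HYPOTHETICAL)
    capped = kv_cache_bytes(CAPPED_LEN, **HYPOTHETICAL)
    print(f"default: {full / 1024**3:.0f} GiB, capped: {capped / 1024**3:.0f} GiB")
    print(f"reduction: {full // capped}x")
```

Because the estimate is linear in tokens, capping the context at 32768 cuts the per-sequence cache by 8x whatever the true layer and head counts are.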
StopIteration in get_rope_index during video (Reason2) - already handled;
strands-cosmos pins a compatible transformers range. If you still see it:
uv pip install "transformers>=4.57.0,<5.3.0"
module 'torchvision.io' has no attribute 'read_video' - transformers 5.x decodes video
with torchcodec and falls back to torchvision, which removed io.read_video in >=0.27:
uv pip install torchcodec
TRT tools return exit 127 - expected on workstations; those tools only run on Jetson or in TRT Docker.
Run just doctor to check your environment.
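Exit status 127 is the shell's conventional "command not found" code, so the failure can be detected up front by checking PATH before launching anything. A minimal sketch with a hypothetical helper name; the executables the TRT tools actually invoke are not listed in this README.

```python
import shutil
import subprocess
import sys
from typing import List, Optional


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run cmd and return its exit code, or None when the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None  # a shell would report this case as exit status 127
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # The current interpreter is always present, so this prints 0.
    print(run_if_available([sys.executable, "-c", "pass"]))
```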
Resources
- Changelog - Release history
- Cosmos 3 - Latest omnimodal world models
- Cosmos Cookbook - Official recipes
- Cosmos-Reason2 - VLM source
- Strands Agents - Agent framework
License
Licensed under the Apache License 2.0. See NOTICE for attribution and SECURITY.md for vulnerability reporting.
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | Built with NVIDIA Cosmos and Strands Agents