Skip to main content

NVIDIA Cosmos provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence

Project description

strands-cosmos

PyPI version Docs Awesome Strands Agents

Strands Cosmos

NVIDIA Cosmos for Strands Agents. Give your agent eyes that understand physics and hands that generate video, audio, and robot actions - on local compute.

4 model providers (Cosmos 3 omnimodal Reasoner & Generator + Cosmos-Reason2 VLM) and 45 tools spanning the full pipeline: reasoning, generation, curation, post-training, quantization, edge deployment, and evaluation.


⏱️ Learn it in 90 seconds

1. Install (we use uv everywhere):

uv pip install strands-cosmos

2. Understand video - the reasoner reads vision and reasons in text. It talks to a local vLLM server - see the Run the reasoner server dropdown just below to start one, then:

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel

agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")

3. Generate video - the generator runs in-process (Diffusers, no server):

from strands_cosmos import Cosmos3GeneratorModel

gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video",            prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.",           out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video",           prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")

That's the whole loop: understand → generate. Everything below is depth - expand only what you need.

You want to… Provider Where
Understand video/image (text+vision → text) Cosmos3ReasonerModel vLLM server
Generate image/video/audio/action Cosmos3GeneratorModel in-process Diffusers
Run a tiny VLM on Jetson edge CosmosVisionModel, CosmosModel Edge VLM
Drive the full pipeline from tools 45 cosmos* tools Tools

🚀 Run the reasoner server (one-time setup + serve)

The reasoner needs a vLLM server. Build a CUDA-matched env and serve Cosmos3-Nano on :8000:

just c3-doctor          # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason    # build the reasoner venv: vllm + vllm-cosmos3 (uv-managed)
just c3-serve-reason    # serve Cosmos3-Nano on :8000 (--max-model-len 32768)

Verify it's up before pointing an agent at it:

curl -s http://localhost:8000/v1/models    # → {"data":[{"id":"nvidia/Cosmos3-Nano",...}]}

Or one-shot a caption straight from the justfile (no Python):

just c3-reason "Caption in detail." "" scene.mp4 caption

CUDA pairing: match torch to your driver - CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.

Single-GPU note: reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't co-fit on one ~46GB GPU - stop one before running the other, or use separate GPUs.

📦 Install matrix (pick the extra for your task)
uv pip install strands-cosmos                  # core: Reason2 VLM + all tools
uv pip install "strands-cosmos[cosmos3]"       # + Cosmos 3 reasoner client (vLLM server)
uv pip install "strands-cosmos[cosmos3-gen]"   # + Cosmos 3 generator (in-proc Diffusers: image/video/sound)
uv pip install "strands-cosmos[vllm]"          # + bundled vLLM + openai client
uv pip install "strands-cosmos[all]"           # everything (heavy)

The generator needs the diffusers dev build (Cosmos3OmniPipeline); PyPI forbids direct-URL deps, so pin it at install time (or just use just c3-setup-gen):

uv pip install -U "git+https://github.com/huggingface/diffusers.git"
Extra Pulls in For
(none) transformers, torch, torchvision, torchcodec, av Reason2 VLM + tools
cosmos3 openai Cosmos 3 reasoner client
cosmos3-gen diffusers, cosmos_guardrail, soundfile, imageio Cosmos 3 generator
vllm vllm, openai self-hosting vLLM
jetson torchcodec Jetson companions (torch via JetPack)
all all of the above + dev tools kitchen sink
🛠️ Developer setup (clone + build everything)
git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full    # system deps, Python deps, clones all Cosmos repos (uv-managed venvs)
just doctor        # verify everything

# dedicated, CUDA-matched envs (each is its own uv venv):
just c3-setup-reason      # reasoner: vllm + vllm-cosmos3
just c3-setup-gen         # generator: diffusers(main) + cosmos_guardrail
just c3-setup-framework   # action + training: Cosmos Framework

Run/lint/test against the dev env via uv:

uv pip install -e ".[dev]"
uv run pytest
uv run ruff check .
📜 Single-file script with inline deps (PEP 723 + uv run)

Drop dependencies directly into a script's header - uv run builds an ephemeral env, no manual install. Save as agent.py:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "strands-agents[openai]",
#     "strands-cosmos",
# ]
# ///
import os, sys
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, cosmos3_caption, cosmos3_temporal, video_probe

model = Cosmos3ReasonerModel(base_url=os.environ.get("COSMOS_BASE_URL", "http://localhost:8000/v1"))
agent = Agent(model=model, tools=[cosmos3_caption, cosmos3_temporal, video_probe])
agent(" ".join(sys.argv[1:]) or "Caption in detail: <video>scene.mp4</video>")
uv run agent.py "List the events with timestamps: <video>scene.mp4</video>"

🌌 Cosmos 3 - Omnimodal World Models

Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces - the Reasoner (vLLM, text+vision → text) and the Generator (Diffusers, → image/video/audio/action).

See it end-to-end: Reason → Generate

Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (one with synchronized audio) from its own description - all on a single local GPU.

① Input video ② Cosmos 3 understands it
input

"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions."

  • Cosmos3ReasonerModel (caption in 5.2s)

The reasoner distills its own understanding into a generation prompt:

"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."

Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49f):

text → video text → video + 🔊 sound image → video
text2video text2video+sound image2video
55.5s 43.2s · AAC stereo 48kHz 42.1s · from a real frame

→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with uv run examples/09_cosmos3_showcase.py

Capabilities & tool map
Surface Tools Backend
Reasoner cosmos3_reason, cosmos3_caption, cosmos3_temporal, cosmos3_embodied, cosmos3_ground, cosmos3_plausibility, cosmos3_situation, cosmos3_action_cot vLLM
Generator cosmos3_text2image, cosmos3_text2video, cosmos3_image2video, cosmos3_text2video_sound Diffusers Cosmos3OmniPipeline (in-proc)
Video-to-video cosmos3_video2video (transfer: day→night, recolor, restyle) vLLM-Omni Docker (vllm/vllm-omni:cosmos3)
Action / World-Model cosmos3_forward_dynamics, cosmos3_inverse_dynamics, cosmos3_policy Cosmos Framework (torchrun)
Training (SFT) cosmos3_train, cosmos3_train_convert, cosmos3_train_show, cosmos3_train_export, … Cosmos Framework (torchrun)
Servers cosmos3_serve start / stop / status
just c3-setup-gen       # generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4

just c3-setup-framework # action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out      # forward/inverse dynamics, policy
just c3-train-recipes                   # list SFT recipes
just c3-train vision_sft_nano           # fine-tune (8x H100); see the training guide

📖 Full guides: Cosmos 3 · Training/SFT

Models
Model Size Capability
Cosmos3-Nano 16B Omnimodal (reasoner + generator + action) - fits a single ~46GB GPU
Cosmos3-Super 64B Frontier-scale (multi-GPU / tensor-parallel)
Cosmos3-Nano-Policy-DROID 16B VL robot policy (DROID)

🤖 Cosmos-Reason2 - Lightweight Edge VLM

For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint - verified on Jetson AGX Thor with Chain-of-Thought reasoning.

from strands import Agent
from strands_cosmos import CosmosVisionModel

agent = Agent(model=CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B"))
agent("Caption in detail: <video>dashcam.mp4</video>")                  # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?")   # image reasoning
agent("What happens when a ball rolls off a table?")                    # text-only physics
Demo, models & performance

Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor

Strands Cosmos Demo
Model GPU Memory Use Case
Cosmos-Reason2-2B 24GB Edge deployment (Jetson Thor/Orin)
Cosmos-Reason2-8B 32GB Cloud/desktop high-accuracy

Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.

Jetson install:

uv pip install strands-cosmos
strands-cosmos-fix-cublas   # fix CUBLAS for Jetson GPU architecture

🧰 Tools

Use any of the 45 tools inside a Strands Agent for full-pipeline automation:

from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")
Full tool catalog (Reason2 / Predict / Transfer / lifecycle / data / eval)
Category Tools Description
Reason2 VLM cosmos_inference, cosmos_reason_hf, cosmos_serve TRT server inference, HF direct inference, server lifecycle
Predict 2.5 cosmos_predict_generate World-model video generation (future frame prediction)
Transfer 2.5 cosmos_transfer_generate ControlNet video-to-video (depth/edge/sketch→video)
Model Lifecycle cosmos_model_download, cosmos_quantize, cosmos_export_onnx, cosmos_build_engine Download, FP8 quantize, ONNX export, TRT engine build
Training cosmos_post_train, cosmos_distill SFT/LoRA post-training, knowledge distillation
Data cosmos_curate Xenna data curation pipeline
Evaluation cosmos_evaluate FID/FVD/CSE/CLIP benchmark evaluation
I/O rtp_capture_frame, nats_publish, video_probe, video_extract_frames, image_read RTP capture, NATS messaging, video/image utilities
System cosmos_sysinfo GPU/platform diagnostics

⚙️ Configuration
# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)

# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)

# Cosmos-Reason2 VLM
CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-8B",
    reasoning=True,           # Chain-of-thought <think>...</think>
    fps=4,
    params={"max_tokens": 4096, "temperature": 0.6},
)
🏗️ Architecture
strands_cosmos/
├── cosmos3_reasoner_model.py    # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py   # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py       # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py              # CosmosModel (Reason2 text-only)
├── fix_cublas.py                # Jetson CUBLAS compatibility fix
├── tools/
│   ├── cosmos3.py               # 16 Cosmos 3 tools (reason/generate/action/serve)
│   ├── inference.py · reason_hf.py · serve.py        # Reason2 VLM
│   ├── predict_generate.py · transfer_generate.py    # Predict2.5 / Transfer2.5
│   ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│   ├── post_train.py · distill.py · curate.py · evaluate.py
│   └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile                     # Developer workflow + c3-* recipes
✅ Verified platforms
Platform GPU Status
Desktop / Cloud NVIDIA L40S / A100 / H100 / RTX 4090 ✅ Cosmos 3 + Reason2
Jetson AGX Thor NVIDIA Thor 132GB ✅ Reason2 (with CUBLAS fix)
Jetson Orin 32/64GB ✅ Reason2 (may need CUBLAS fix)
🩺 Troubleshooting

CUBLAS_STATUS_INVALID_VALUE on Jetson

strands-cosmos-fix-cublas    # replaces torch's bundled CUBLAS with JetPack system CUBLAS

Cosmos 3 reasoner OOM on a single GPU - the default sequence length (262K) needs a huge KV cache. just c3-serve-reason caps it at --max-model-len 32768. Stop the generator before serving the reasoner (and vice versa).

StopIteration in get_rope_index during video (Reason2) - already handled; strands-cosmos pins a compatible transformers range. If you still see it:

uv pip install "transformers>=4.57.0,<5.3.0"

module 'torchvision.io' has no attribute 'read_video' - transformers 5.x decodes video with torchcodec and falls back to torchvision, which removed io.read_video in >=0.27:

uv pip install torchcodec

TRT tools return exit 127 - expected on workstations; those run on Jetson or in TRT Docker. Run just doctor.


Resources


License

Licensed under the Apache License 2.0. See NOTICE for attribution and SECURITY.md for vulnerability reporting.

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | Built with NVIDIA Cosmos and Strands Agents

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

strands_for_cosmos-0.7.0.tar.gz (102.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

strands_for_cosmos-0.7.0-py3-none-any.whl (109.6 kB view details)

Uploaded Python 3

File details

Details for the file strands_for_cosmos-0.7.0.tar.gz.

File metadata

  • Download URL: strands_for_cosmos-0.7.0.tar.gz
  • Upload date:
  • Size: 102.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for strands_for_cosmos-0.7.0.tar.gz
Algorithm Hash digest
SHA256 4705acd8d51c8c0beb612105baf7c19ab84be5edb9aa57deebba96a0998c6e77
MD5 7b68c34c63affc783ef78658360e3121
BLAKE2b-256 3431b69b2d6f55574a614de46dff4f1ce72a6b266cf50c5105ccc1c8120fb727

See more details on using hashes here.

File details

Details for the file strands_for_cosmos-0.7.0-py3-none-any.whl.

File metadata

File hashes

Hashes for strands_for_cosmos-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3f607960c58df1137f44730198bc2272f333c6a631e293885d37421f448ee994
MD5 5f0e88eb240c1847fb6fa0f3cefae46a
BLAKE2b-256 51135f7894ce25fb865be51abdeb2c41dfef80c2f11ce19686b643e1c8a7021b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page