
NVIDIA Cosmos Reason VLM provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence


strands-cosmos


Strands Cosmos

NVIDIA Cosmos toolkit for Strands Agents — from VLM reasoning to world-model generation, edge deployment, and evaluation.

Provides 4 Strands model providers (two for the Cosmos-Reason2 VLM, plus the new Cosmos 3 omnimodal Reasoner and Generator) and 45 tools covering the entire NVIDIA Cosmos ecosystem: VLM reasoning, world-model generation (image/video/audio/action), video-to-video (Transfer2.5), data curation (Xenna), post-training, distillation, quantization, edge deployment, and evaluation. Everything runs on local compute.


NVIDIA Cosmos toolkit for Strands Agents — omnimodal world-model reasoning and generation, on local compute.

Cosmos models become first-class Strands model providers — give your agent eyes that understand physics, and hands that can generate video, audio, and robot actions. Plus 45 tools spanning the full Cosmos pipeline (inference, generation, curation, post-training, quantization, edge deployment, evaluation).

| Family | Providers | Best for |
|---|---|---|
| Cosmos 3 (latest, omnimodal) | Cosmos3ReasonerModel, Cosmos3GeneratorModel | Video/image/audio/action understanding + generation |
| Cosmos-Reason2 (VLM) | CosmosVisionModel, CosmosModel | Lightweight edge VLM (Jetson Thor/Orin) |

🌌 Cosmos 3 — Omnimodal World Models

Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces:

  • Reasoner (Cosmos3ReasonerModel, vLLM) — text + vision → text
  • Generator (Cosmos3GeneratorModel, Diffusers) — text/image → image/video/audio/action
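
A minimal sketch of that split, assuming you want to route a request by the output modality it needs. The function and constant names are hypothetical and not part of strands-cosmos; only the modality lists come from the two bullets above.

```python
# Hypothetical routing helper; not part of strands-cosmos.
# Modalities are taken from the Reasoner/Generator descriptions above.
REASONER_OUTPUTS = {"text"}
GENERATOR_OUTPUTS = {"image", "video", "audio", "action"}


def surface_for(output_modality: str) -> str:
    """Return which Cosmos 3 surface produces the requested output modality."""
    modality = output_modality.lower()
    if modality in REASONER_OUTPUTS:
        return "reasoner"   # Cosmos3ReasonerModel (vLLM)
    if modality in GENERATOR_OUTPUTS:
        return "generator"  # Cosmos3GeneratorModel (Diffusers)
    raise ValueError(f"unsupported output modality: {output_modality!r}")
```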

See it end-to-end: Reason → Generate

Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (including one with synchronized audio) from its own description — all on a single local GPU.

① Input video: [input clip]

② Cosmos 3 understands it:

"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions."

Cosmos3ReasonerModel (caption in 5.2s)

The reasoner distills its own understanding into a generation prompt:

"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."

Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49f):

| Mode | Clip | Time | Notes |
|---|---|---|---|
| text → video | text2video | 55.5s | |
| text → video + 🔊 sound | text2video+sound | 43.2s | AAC stereo 48kHz |
| image → video | image2video | 42.1s | from a real frame |

→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with python examples/09_cosmos3_showcase.py
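
The Reason → Generate flow above can be sketched as three chained steps. This is an illustration only: `reason_then_generate` and its callable parameters are hypothetical names, and the stand-in lambdas replace the real models so the sketch runs without a GPU.

```python
from typing import Callable


def reason_then_generate(
    caption: Callable[[str], str],
    distill: Callable[[str], str],
    generate: Callable[[str, str], str],
    video_path: str,
    out_path: str,
) -> dict:
    """Run caption -> distilled prompt -> generation and return every stage."""
    description = caption(video_path)     # understand the input clip
    prompt = distill(description)         # compress it into a generation prompt
    result = generate(prompt, out_path)   # generate a new clip from that prompt
    return {"description": description, "prompt": prompt, "output": result}


# Stand-in callables (assumptions) so the sketch runs without model weights.
demo = reason_then_generate(
    caption=lambda path: f"Detailed description of {path}",
    distill=lambda text: text.split(" of ")[0],
    generate=lambda prompt, out: out,
    video_path="scene.mp4",
    out_path="vid.mp4",
)
print(demo["prompt"])  # Detailed description
```

With the real providers, `caption` and `distill` would wrap calls to an Agent backed by Cosmos3ReasonerModel, and `generate` would wrap Cosmos3GeneratorModel.generate, as in the quick start.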

Quick start (Cosmos 3)

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, Cosmos3GeneratorModel

# Reasoner — text + vision -> text (local vLLM server; start with `just c3-serve-reason`)
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")

# Generator — text/image -> image/video/sound (in-process Diffusers, no server)
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video",            prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.", out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video",           prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")
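
The prompts above embed media as inline tags. A small helper can keep that formatting consistent; this is a sketch, and `media_prompt` is a hypothetical name, not a strands-cosmos function. It only knows the `<video>` and `<image>` tags that appear in this README.

```python
# Tag names as they appear in this README's prompts; no other tags are assumed.
_TAGS = {"video", "image"}


def media_prompt(instruction: str, path: str, kind: str = "video") -> str:
    """Build a prompt like 'Caption in detail: <video>scene.mp4</video>'."""
    if kind not in _TAGS:
        raise ValueError(f"unknown media kind: {kind!r}")
    return f"{instruction.strip()} <{kind}>{path}</{kind}>"


print(media_prompt("Caption in detail:", "scene.mp4"))
# Caption in detail: <video>scene.mp4</video>
```

The result is a plain string, so it can be passed straight to `agent(...)`.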

Cosmos 3 capabilities

| Surface | Tools | Backend |
|---|---|---|
| Reasoner | cosmos3_reason, cosmos3_caption, cosmos3_temporal, cosmos3_embodied, cosmos3_ground, cosmos3_plausibility, cosmos3_situation, cosmos3_action_cot | vLLM |
| Generator | cosmos3_text2image, cosmos3_text2video, cosmos3_image2video, cosmos3_text2video_sound | Diffusers Cosmos3OmniPipeline (in-proc) |
| Video-to-video | cosmos3_video2video (transfer: day→night, recolor, restyle) | vLLM-Omni Docker (vllm/vllm-omni:cosmos3) |
| Action / World-Model | cosmos3_forward_dynamics, cosmos3_inverse_dynamics, cosmos3_policy | Cosmos Framework (torchrun) |
| Training (SFT) | cosmos3_train, cosmos3_train_convert, cosmos3_train_show, cosmos3_train_export, … | Cosmos Framework (torchrun) |
| Servers | cosmos3_serve | start / stop / status |

Cosmos 3 models

| Model | Size | Capability |
|---|---|---|
| Cosmos3-Nano | 16B | Omnimodal (reasoner + generator + action); fits a single ~46GB GPU |
| Cosmos3-Super | 64B | Frontier-scale (multi-GPU / tensor-parallel) |
| Cosmos3-Nano-Policy-DROID | 16B | VL robot policy (DROID) |

Setup

Install the generator extras in one shot (Diffusers + guardrail + audio), then the Diffusers dev build and the reasoner client:

pip install "strands-cosmos[cosmos3-gen]"   # text/image -> image/video/sound
pip install -U "git+https://github.com/huggingface/diffusers.git"  # Cosmos3OmniPipeline (dev build)
pip install "strands-cosmos[cosmos3]"       # reasoner client (vLLM server)

Or use the justfile to build dedicated, CUDA-matched environments:

just c3-doctor          # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason    # Reasoner env: vllm + vllm-cosmos3
just c3-serve-reason    # serve Cosmos3-Nano on :8000
just c3-reason "Caption in detail." "" scene.mp4 caption

just c3-setup-gen       # Generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4

just c3-setup-framework # Action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out      # forward/inverse dynamics, policy
just c3-train-recipes                   # list SFT recipes
just c3-train vision_sft_nano           # fine-tune (8x H100); see the training guide

CUDA pairing: match the torch backend to your driver — CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.
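
As a sketch, the two pairings above can be written as a lookup. The function name and the matching rule (any CUDA 13.x uses the CUDA 13 row, 12.8 must match exactly) are assumptions; `just c3-doctor` remains the authoritative check.

```python
from typing import Optional, Tuple

# Pairings copied from the CUDA pairing note; anything else is left undecided.
PAIRINGS = {
    (13, None): ("cu130", "vllm==0.21.0"),
    (12, 8): ("cu128", "vllm==0.19.1"),
}


def recommend_pairing(cuda_version: str) -> Optional[Tuple[str, str]]:
    """Map a CUDA version string such as '12.8' to (torch backend, vLLM pin)."""
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Hypothetical rule: try a major-only row first, then an exact major.minor row.
    return PAIRINGS.get((major, None)) or PAIRINGS.get((major, minor))


print(recommend_pairing("12.8"))  # ('cu128', 'vllm==0.19.1')
```

Versions outside the two documented rows return None rather than a guess.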

Single-GPU note: the reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't fit on one ~46GB GPU together — stop one before running the other, or use separate GPUs.
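
A back-of-the-envelope check of that note, assuming bf16/fp16 weights at 2 bytes per parameter and ignoring activations and KV cache (both assumptions; the helper names are hypothetical):

```python
from typing import Sequence


def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes); 2 bytes/param assumed."""
    return float(params_billion) * bytes_per_param


def fits(models_billion: Sequence[float], gpu_gb: float) -> bool:
    """True if the summed weights alone fit; activations and KV cache are ignored."""
    return sum(weight_gb(p) for p in models_billion) <= gpu_gb


print(weight_gb(16))        # 32.0 -> one 16B model, weights only
print(fits([16], 46))       # True
print(fits([16, 16], 46))   # False: 64 GB of weights on a ~46 GB card
```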

📖 Full guides: Cosmos 3 · Training/SFT


Install

pip install strands-cosmos

Developer Setup

git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full    # Installs system deps, Python deps, clones all Cosmos repos
just doctor        # Verify everything

NVIDIA Jetson (Thor, Orin, AGX)

pip install strands-cosmos
strands-cosmos-fix-cublas   # Fix CUBLAS for Jetson GPU architecture

Cosmos-Reason2 (Lightweight Edge VLM)

For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint — verified on Jetson AGX Thor with Chain-of-Thought reasoning.

Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

agent("Caption in detail: <video>dashcam.mp4</video>")          # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?")  # image reasoning
agent("What happens when a ball rolls off a table?")            # text-only physics

| Model | GPU Memory | Use Case |
|---|---|---|
| Cosmos-Reason2-2B | 24GB | Edge deployment (Jetson Thor/Orin) |
| Cosmos-Reason2-8B | 32GB | Cloud/desktop high-accuracy |

Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.
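
For comparison with other setups, the text-inference figure converts to an average rate. This treats the 1.4s as the whole request (prefill included), which is an assumption about how the number was measured.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average end-to-end throughput for one request."""
    return tokens / seconds


rate = tokens_per_second(46, 1.4)
print(f"{rate:.1f} tok/s")  # 32.9 tok/s
```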


Pipeline Tools (Cosmos-Reason2 / Predict / Transfer)

Use any tool inside a Strands Agent for full Cosmos pipeline automation:

| Category | Tools | Description |
|---|---|---|
| Reason2 VLM | cosmos_inference, cosmos_reason_hf, cosmos_serve | TRT server inference, HF direct inference, server lifecycle |
| Predict 2.5 | cosmos_predict_generate | World-model video generation (future frame prediction) |
| Transfer 2.5 | cosmos_transfer_generate | ControlNet video-to-video (depth/edge/sketch→video) |
| Model Lifecycle | cosmos_model_download, cosmos_quantize, cosmos_export_onnx, cosmos_build_engine | Download, FP8 quantize, ONNX export, TRT engine build |
| Training | cosmos_post_train, cosmos_distill | SFT/LoRA post-training, knowledge distillation |
| Data | cosmos_curate | Xenna data curation pipeline |
| Evaluation | cosmos_evaluate | FID/FVD/CSE/CLIP benchmark evaluation |
| I/O | rtp_capture_frame, nats_publish, video_probe, video_extract_frames, image_read | RTP capture, NATS messaging, video/image utilities |
| System | cosmos_sysinfo | GPU/platform diagnostics |

from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")

Architecture

strands_cosmos/
├── cosmos3_reasoner_model.py    # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py   # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py       # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py              # CosmosModel (Reason2 text-only)
├── fix_cublas.py                # Jetson CUBLAS compatibility fix
├── tools/
│   ├── cosmos3.py               # 16 Cosmos 3 tools (reason/generate/action/serve)
│   ├── inference.py · reason_hf.py · serve.py        # Reason2 VLM
│   ├── predict_generate.py · transfer_generate.py    # Predict2.5 / Transfer2.5
│   ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│   ├── post_train.py · distill.py · curate.py · evaluate.py
│   └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile                     # Developer workflow + c3-* recipes

Configuration

# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)

# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)

# Cosmos-Reason2 VLM
CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-8B",
    reasoning=True,           # Chain-of-thought <think>...</think>
    fps=4,
    params={"max_tokens": 4096, "temperature": 0.6},
)
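
With reasoning=True the model emits its chain of thought in a <think>...</think> block. A minimal sketch for separating it from the final answer, assuming the block arrives inline in the response text (how Strands surfaces it may differ; `split_reasoning` is a hypothetical helper):

```python
import re
from typing import Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no <think> block exists."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>The ball keeps its horizontal speed.</think>\nIt falls in a parabola."
print(split_reasoning(sample))
# ('The ball keeps its horizontal speed.', 'It falls in a parabola.')
```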

Verified Platforms

| Platform | GPU | Status |
|---|---|---|
| Desktop / Cloud | NVIDIA L40S / A100 / H100 / RTX 4090 | ✅ Cosmos 3 + Reason2 |
| Jetson AGX Thor | NVIDIA Thor 132GB | ✅ Reason2 (with CUBLAS fix) |
| Jetson Orin | 32/64GB | ✅ Reason2 (may need CUBLAS fix) |

Troubleshooting

CUBLAS_STATUS_INVALID_VALUE on Jetson

strands-cosmos-fix-cublas    # Replaces torch's bundled CUBLAS with JetPack system CUBLAS

Cosmos 3 reasoner OOM on a single GPU

The default sequence length (262K tokens) needs a huge KV cache. Cap it with --max-model-len; just c3-serve-reason already sets --max-model-len 32768. Also stop the generator before serving the reasoner (and vice versa).
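
Why the cap helps: for standard attention, the per-sequence KV cache grows linearly with sequence length. The sketch below uses the usual sizing formula with hypothetical architecture numbers (not Cosmos 3's real configuration, which may differ given its Mixture-of-Transformers design) and reads "262K" as 2**18 tokens, also an assumption.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache size: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes


# Hypothetical architecture numbers, NOT Cosmos 3's real configuration.
shape = dict(layers=48, kv_heads=8, head_dim=128)

full = kv_cache_bytes(262_144, **shape)    # "262K" read as 2**18 tokens (assumption)
capped = kv_cache_bytes(32_768, **shape)   # --max-model-len 32768
print(full / capped)                       # 8.0, independent of the shape values
```

Whatever the real layer and head counts are, the worst-case cache for one sequence shrinks by the same factor of 8.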

StopIteration in get_rope_index during video (Reason2)

strands-cosmos already pins a compatible transformers range, so this should not occur. If it still does, reinstall within that range:

pip install "transformers>=4.57.0,<5.3.0"
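
To check an environment against that range before running video inference, a small sketch (the helper names are hypothetical, and pre-release suffixes such as rc1 are simply ignored):

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '5.0.0rc1' -> (5, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '4.57' so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def in_pinned_range(version: str) -> bool:
    """True if version satisfies >=4.57.0,<5.3.0."""
    return (4, 57, 0) <= release_tuple(version) < (5, 3, 0)


print(in_pinned_range("4.57.0"), in_pinned_range("5.3.0"))  # True False
```

The installed version can be read with `importlib.metadata.version("transformers")` and passed to `in_pinned_range`.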

Video caption fails with module 'torchvision.io' has no attribute 'read_video'

transformers 5.x decodes video with torchcodec and falls back to torchvision, which removed io.read_video in >=0.27. Install torchcodec (now a dependency):

pip install torchcodec

TRT tools return exit 127

Expected on workstations: exit code 127 means the command was not found, and those tools only run on Jetson or in the TRT Docker image. Run just doctor to check your setup.
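
A sketch of guarding against this in your own scripts by checking PATH before invoking a binary. `run_if_available` is a hypothetical helper, and using `trtexec` (TensorRT's CLI) as the probe is an assumption about what the TRT tools call.

```python
import shutil
import subprocess
from typing import List, Optional


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run cmd and return its exit code, or None when the binary is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None  # a shell would have reported exit status 127 here
    return subprocess.run(cmd, check=False).returncode


# "trtexec" is TensorRT's CLI; whether the TRT tools call it is an assumption.
if run_if_available(["trtexec", "--help"]) is None:
    print("TensorRT CLI not found: run on Jetson or inside the TRT Docker image")
```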


Resources


License

Apache 2.0 | Built with NVIDIA Cosmos and Strands Agents
