NVIDIA Cosmos Reason VLM provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence

strands-cosmos

NVIDIA Cosmos toolkit for Strands Agents: omnimodal world-model reasoning and generation, edge deployment, and evaluation, all on local compute.

Cosmos models become first-class Strands model providers, giving your agent eyes that understand physics and hands that can generate video, audio, and robot actions. The package provides 4 model providers (the Cosmos-Reason2 VLM plus the new Cosmos 3 omnimodal Reasoner and Generator) and 44 tools covering the NVIDIA Cosmos ecosystem: VLM reasoning, world-model generation (image/video/audio/action), video-to-video (Transfer2.5), data curation (Xenna), post-training, distillation, quantization, edge deployment, and evaluation.

Family	Providers	Best for
Cosmos 3 (latest, omnimodal)	`Cosmos3ReasonerModel`, `Cosmos3GeneratorModel`	Video/image/audio/action understanding + generation
Cosmos-Reason2 (VLM)	`CosmosVisionModel`, `CosmosModel`	Lightweight edge VLM (Jetson Thor/Orin)

🌌 Cosmos 3 — Omnimodal World Models

Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces:

Reasoner (Cosmos3ReasonerModel, vLLM) — text + vision → text
Generator (Cosmos3GeneratorModel, Diffusers) — text/image → image/video/audio/action

See it end-to-end: Reason → Generate

Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (including one with synchronized audio) from its own description — all on a single local GPU.

① Input video	② Cosmos 3 understands it
	"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions." — `Cosmos3ReasonerModel` (caption in 5.2s)

The reasoner distills its own understanding into a generation prompt:

"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."

Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49 frames):

text → video	text → video + 🔊 sound	image → video
55.5s	43.2s · AAC stereo 48kHz	42.1s · from a real frame

→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with python examples/09_cosmos3_showcase.py

Quick start (Cosmos 3)

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, Cosmos3GeneratorModel

# Reasoner — text + vision -> text (local vLLM server; start with `just c3-serve-reason`)
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")

# Generator — text/image -> image/video/sound (in-process Diffusers, no server)
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video",            prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.", out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video",           prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")
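
The prompts above embed media by wrapping a file path in `<video>…</video>` tags (the Reason2 examples further down also use `<image>…</image>`). As a minimal sketch, a small helper can build such prompts consistently. `media_prompt` and its extension table are hypothetical and not part of strands-cosmos; the sketch assumes the provider only needs the path wrapped in the matching tag, as the examples suggest.

```python
from pathlib import Path

# Hypothetical helper, not part of strands-cosmos: wraps a media path in the
# <video>/<image> tags used by the prompts in this README.
_TAGS = {".mp4": "video", ".jpg": "image", ".jpeg": "image", ".png": "image"}


def media_prompt(instruction: str, path: str) -> str:
    """Return `instruction` followed by the path wrapped in its media tag."""
    suffix = Path(path).suffix.lower()
    if suffix not in _TAGS:
        raise ValueError(f"unsupported media type: {suffix!r}")
    tag = _TAGS[suffix]
    return f"{instruction} <{tag}>{path}</{tag}>"


print(media_prompt("Caption in detail:", "scene.mp4"))
# Caption in detail: <video>scene.mp4</video>
```

The resulting string can be passed to the agent exactly like the hand-written prompts above.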

Cosmos 3 capabilities

Surface	Tools	Backend
Reasoner	`cosmos3_reason`, `cosmos3_caption`, `cosmos3_temporal`, `cosmos3_embodied`, `cosmos3_ground`, `cosmos3_plausibility`, `cosmos3_situation`, `cosmos3_action_cot`	vLLM
Generator	`cosmos3_text2image`, `cosmos3_text2video`, `cosmos3_image2video`, `cosmos3_text2video_sound`	Diffusers `Cosmos3OmniPipeline` (in-proc)
Action / World-Model	`cosmos3_forward_dynamics`, `cosmos3_inverse_dynamics`, `cosmos3_policy`	Cosmos Framework (torchrun)
Training (SFT)	`cosmos3_train`, `cosmos3_train_convert`, `cosmos3_train_show`, `cosmos3_train_export`, …	Cosmos Framework (torchrun)
Servers	`cosmos3_serve`	start / stop / status

Cosmos 3 models

Model	Size	Capability
Cosmos3-Nano	16B	Omnimodal (reasoner + generator + action) — fits a single ~46GB GPU
Cosmos3-Super	64B	Frontier-scale (multi-GPU / tensor-parallel)
Cosmos3-Nano-Policy-DROID	16B	VL robot policy (DROID)

Setup

Install the generator extras in one shot (Diffusers + guardrail + audio):

pip install "strands-cosmos[cosmos3-gen]"   # text/image -> image/video/sound
pip install -U "git+https://github.com/huggingface/diffusers.git"  # Cosmos3OmniPipeline (dev build)
pip install "strands-cosmos[cosmos3]"       # reasoner client (vLLM server)

Or use the justfile to build dedicated, CUDA-matched environments:

just c3-doctor          # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason    # Reasoner env: vllm + vllm-cosmos3
just c3-serve-reason    # serve Cosmos3-Nano on :8000
just c3-reason "Caption in detail." "" scene.mp4 caption

just c3-setup-gen       # Generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4

just c3-setup-framework # Action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out      # forward/inverse dynamics, policy
just c3-train-recipes                   # list SFT recipes
just c3-train vision_sft_nano           # fine-tune (8x H100); see the training guide

CUDA pairing: match the torch backend to your driver — CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.
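
The pairing rule can be written as a small lookup. This is only an illustrative sketch: the two pairings are copied from the note above, while `recommend` and its matching logic are hypothetical and do not reproduce what `just c3-doctor` actually does.

```python
# Pairings copied from the CUDA pairing note; everything else is illustrative.
PAIRINGS = {
    "13": ("cu130", "vllm==0.21.0"),
    "12.8": ("cu128", "vllm==0.19.1"),
}


def recommend(cuda_version: str) -> tuple[str, str]:
    """Return (torch backend, vLLM pin) for a driver-reported CUDA version."""
    major = cuda_version.split(".")[0]
    # Try the exact version first (e.g. "12.8"), then the major version ("13").
    for key in (cuda_version, major):
        if key in PAIRINGS:
            return PAIRINGS[key]
    raise LookupError(f"no documented pairing for CUDA {cuda_version}")


print(recommend("13.0"))  # ('cu130', 'vllm==0.21.0')
print(recommend("12.8"))  # ('cu128', 'vllm==0.19.1')
```

Versions outside the two documented pairings raise `LookupError` rather than guessing.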

Single-GPU note: the reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't fit on one ~46GB GPU together — stop one before running the other, or use separate GPUs.
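
A rough back-of-the-envelope check shows why the two do not fit together. The sketch assumes 2 bytes per parameter (bf16/fp16 weights) and counts weights only; activations, the KV cache, and framework overhead come on top, so real usage is higher.

```python
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights
GPU_GB = 46          # the ~46GB single GPU mentioned above


def weights_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight memory in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * bytes_per_param


one_model = weights_gb(16)   # 32 GB for a single 16B model
both_models = 2 * one_model  # 64 GB for reasoner + generator together

print(one_model <= GPU_GB)    # True: one 16B model fits
print(both_models <= GPU_GB)  # False: both at once do not
```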

📖 Full guides: Cosmos 3 · Training/SFT

Install

pip install strands-cosmos

Developer Setup

git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full    # Installs system deps, Python deps, clones all Cosmos repos
just doctor        # Verify everything

NVIDIA Jetson (Thor, Orin, AGX)

pip install strands-cosmos
strands-cosmos-fix-cublas   # Fix CUBLAS for Jetson GPU architecture

Cosmos-Reason2 (Lightweight Edge VLM)

For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint — verified on Jetson AGX Thor with Chain-of-Thought reasoning.

Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

agent("Caption in detail: <video>dashcam.mp4</video>")          # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?")  # image reasoning
agent("What happens when a ball rolls off a table?")            # text-only physics

Model	GPU Memory	Use Case
Cosmos-Reason2-2B	24GB	Edge deployment (Jetson Thor/Orin)
Cosmos-Reason2-8B	32GB	Cloud/desktop high-accuracy
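
The table can be turned into a simple selection rule. In this sketch the memory figures and model IDs come from this README, while `pick_reason2_model` is a hypothetical helper and not part of strands-cosmos.

```python
from typing import Optional

# (model_id, GPU memory in GB) taken from the table above, largest first.
REASON2_MODELS = [
    ("nvidia/Cosmos-Reason2-8B", 32),
    ("nvidia/Cosmos-Reason2-2B", 24),
]


def pick_reason2_model(available_gb: float) -> Optional[str]:
    """Return the largest Reason2 model whose listed memory fits, else None."""
    for model_id, needed_gb in REASON2_MODELS:
        if available_gb >= needed_gb:
            return model_id
    return None


print(pick_reason2_model(48))  # nvidia/Cosmos-Reason2-8B
print(pick_reason2_model(24))  # nvidia/Cosmos-Reason2-2B
print(pick_reason2_model(16))  # None
```

The returned ID can be passed as `model_id` to `CosmosVisionModel`, as in the example above.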

Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.

Pipeline Tools (Cosmos-Reason2 / Predict / Transfer)

Use any tool inside a Strands Agent for full Cosmos pipeline automation:

Category	Tools	Description
Reason2 VLM	`cosmos_inference`, `cosmos_reason_hf`, `cosmos_serve`	TRT server inference, HF direct inference, server lifecycle
Predict 2.5	`cosmos_predict_generate`	World-model video generation (future frame prediction)
Transfer 2.5	`cosmos_transfer_generate`	ControlNet video-to-video (depth/edge/sketch→video)
Model Lifecycle	`cosmos_model_download`, `cosmos_quantize`, `cosmos_export_onnx`, `cosmos_build_engine`	Download, FP8 quantize, ONNX export, TRT engine build
Training	`cosmos_post_train`, `cosmos_distill`	SFT/LoRA post-training, knowledge distillation
Data	`cosmos_curate`	Xenna data curation pipeline
Evaluation	`cosmos_evaluate`	FID/FVD/CSE/CLIP benchmark evaluation
I/O	`rtp_capture_frame`, `nats_publish`, `video_probe`, `video_extract_frames`, `image_read`	RTP capture, NATS messaging, video/image utilities
System	`cosmos_sysinfo`	GPU/platform diagnostics

from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")

Architecture

strands_cosmos/
├── cosmos3_reasoner_model.py    # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py   # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py       # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py              # CosmosModel (Reason2 text-only)
├── fix_cublas.py                # Jetson CUBLAS compatibility fix
├── tools/
│   ├── cosmos3.py               # 16 Cosmos 3 tools (reason/generate/action/serve)
│   ├── inference.py · reason_hf.py · serve.py        # Reason2 VLM
│   ├── predict_generate.py · transfer_generate.py    # Predict2.5 / Transfer2.5
│   ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│   ├── post_train.py · distill.py · curate.py · evaluate.py
│   └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile                     # Developer workflow + c3-* recipes

Configuration

# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)

# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)

# Cosmos-Reason2 VLM
CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-8B",
    reasoning=True,           # Chain-of-thought <think>...</think>
    fps=4,
    params={"max_tokens": 4096, "temperature": 0.6},
)

Verified Platforms

Platform	GPU	Status
Desktop / Cloud	NVIDIA L40S / A100 / H100 / RTX 4090	✅ Cosmos 3 + Reason2
Jetson AGX Thor	NVIDIA Thor 132GB	✅ Reason2 (with CUBLAS fix)
Jetson Orin	32/64GB	✅ Reason2 (may need CUBLAS fix)

Troubleshooting

`CUBLAS_STATUS_INVALID_VALUE` on Jetson

strands-cosmos-fix-cublas    # Replaces torch's bundled CUBLAS with JetPack system CUBLAS

Cosmos 3 reasoner OOM on a single GPU

The default sequence length (262K) needs a huge KV cache. Cap it with --max-model-len; the just c3-serve-reason recipe already sets --max-model-len 32768. Stop the generator before serving the reasoner (and vice versa).
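
The KV cache grows linearly with the maximum sequence length, which is why capping it helps. The sketch below only illustrates the arithmetic: the layer count, KV-head count, and head dimension are placeholder values, not the published Cosmos 3 architecture, and 262K is read as 262,144 tokens.

```python
def kv_cache_gib(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: keys + values for every layer and token."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len
    return total_bytes / 2**30


# Placeholder architecture, chosen only to make the numbers concrete.
ARCH = dict(layers=48, kv_heads=8, head_dim=128)

full = kv_cache_gib(262_144, **ARCH)   # default context length
capped = kv_cache_gib(32_768, **ARCH)  # --max-model-len 32768

print(f"{full:.1f} GiB vs {capped:.1f} GiB")  # 48.0 GiB vs 6.0 GiB
print(full / capped)                          # 8.0
```

Whatever the real architecture is, the 8x ratio between the two context lengths holds, because every other factor cancels out.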

`StopIteration` in `get_rope_index` during video (Reason2)

Already handled — strands-cosmos pins a compatible transformers range. If you see it:

pip install "transformers>=4.57.0,<5.3.0"
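
To check whether an installed version falls inside that range, a standard-library sketch such as the following can help. `release_tuple` and `in_supported_range` are hypothetical helpers; they compare only the leading numeric release segment, so a pre-release like `5.0.0rc1` is treated as `5.0.0`.

```python
import re
from importlib import metadata

LOWER, UPPER = (4, 57, 0), (5, 3, 0)  # from the pin: >=4.57.0,<5.3.0


def release_tuple(version: str) -> tuple:
    """Parse the leading 'X.Y.Z' of a version string into a 3-tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = [int(p) for p in match.group().split(".")][:3]
    return tuple(parts + [0] * (3 - len(parts)))


def in_supported_range(version: str) -> bool:
    return LOWER <= release_tuple(version) < UPPER


try:
    installed = metadata.version("transformers")
    print(installed, "ok" if in_supported_range(installed) else "outside pinned range")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```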

Video caption fails with `module 'torchvision.io' has no attribute 'read_video'`

transformers 5.x decodes video with torchcodec and falls back to torchvision, which removed io.read_video in >=0.27. Install torchcodec (now a dependency):

pip install torchcodec
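
Before running a video caption, you can confirm that the decoder package is importable. This is a generic standard-library check, not a strands-cosmos API; `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["torchcodec"])
if missing:
    print("install first:", ", ".join(missing))
else:
    print("torchcodec is available")
```

`find_spec` locates the package without importing it, so the check stays cheap even when the package is heavy.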

TRT tools return exit 127

Expected on workstations — those tools run on Jetson or in TRT Docker. Run just doctor.
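
Exit status 127 is the shell's conventional "command not found" code, so the underlying binary is simply absent from `PATH`. A quick way to see which executables are missing is sketched below. The binary name `trtexec` (TensorRT's command-line tool) is only an assumption about what the TRT tools invoke; substitute whatever `just doctor` reports.

```python
import shutil


def missing_binaries(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumption: the TRT tools shell out to binaries such as `trtexec`.
for name in missing_binaries(["trtexec"]):
    print(f"{name}: not on PATH, so a shell call would exit with 127")
```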

Resources

Changelog — Release history
Cosmos 3 — Latest omnimodal world models
Cosmos Cookbook — Official recipes
Cosmos-Reason2 — VLM source
Strands Agents — Agent framework
strands-mlx — Apple Silicon provider

License

Apache 2.0 | Built with NVIDIA Cosmos and Strands Agents

Release history

Version	Date
0.7.0	Jun 13, 2026
0.6.0	Jun 13, 2026
0.5.0	Jun 5, 2026
0.4.4	Jun 4, 2026
0.4.3	Jun 4, 2026
0.4.2	Jun 4, 2026
0.4.1 (this version)	Jun 4, 2026
0.3.1	Jun 4, 2026
0.3.0	Jun 4, 2026
0.2.0	May 8, 2026
0.1.2	Mar 22, 2026
0.1.1	Mar 9, 2026

Download files

Source Distribution

strands_cosmos-0.4.1.tar.gz (19.0 MB)

Uploaded Jun 4, 2026 Source

Built Distribution

strands_cosmos-0.4.1-py3-none-any.whl (66.1 kB)

Uploaded Jun 4, 2026 Python 3

File details

Details for the file strands_cosmos-0.4.1.tar.gz.

File metadata

Download URL: strands_cosmos-0.4.1.tar.gz
Upload date: Jun 4, 2026
Size: 19.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for strands_cosmos-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`c25e7a3ccc477030d618c36f191a45519c138ee1f20dd800ead958f2f9d84a5a`
MD5	`da12ea71fa4c43862d7bdb73b950026e`
BLAKE2b-256	`814c69086237e539af57dba95500b04e7ecaa8ef63c3d1dc83fd2f02a5e2b86a`

File details

Details for the file strands_cosmos-0.4.1-py3-none-any.whl.

File metadata

Download URL: strands_cosmos-0.4.1-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 66.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for strands_cosmos-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`020ed4fe3d47b0126ae737296c57656c89c2d4a98b9144fd158e9e2c9520ef5d`
MD5	`c7e152e44b0f8e039f2dd4c2939a71a0`
BLAKE2b-256	`fd4d2e28e41f6771c5cdf4e5e531d14e2cebbb055f3f9c7b9b95f017c88667ae`
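
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib`. This is a minimal sketch; `sha256_of` and `verify` are illustrative helper names, and the file is assumed to sit in the current directory.

```python
import hashlib

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "strands_cosmos-0.4.1.tar.gz":
        "c25e7a3ccc477030d618c36f191a45519c138ee1f20dd800ead958f2f9d84a5a",
    "strands_cosmos-0.4.1-py3-none-any.whl":
        "020ed4fe3d47b0126ae737296c57656c89c2d4a98b9144fd158e9e2c9520ef5d",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the expected digest."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `verify("strands_cosmos-0.4.1.tar.gz", EXPECTED_SHA256["strands_cosmos-0.4.1.tar.gz"])` returns `True` only for an unmodified download.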
