NVIDIA Cosmos provider for Strands Agents - physical AI reasoning, video understanding, and embodied intelligence

Strands Cosmos

NVIDIA Cosmos for Strands Agents. Give your agent eyes that understand physics and hands that generate video, audio, and robot actions - on local compute.

4 model providers (Cosmos 3 omnimodal Reasoner & Generator + Cosmos-Reason2 VLM) and 45 tools spanning the full pipeline: reasoning, generation, curation, post-training, quantization, edge deployment, and evaluation.

⏱️ Learn it in 90 seconds

1. Install (we use uv everywhere):

uv pip install strands-cosmos

2. Understand video - the reasoner takes text plus vision input and answers in text. It talks to a local vLLM server; start one as described under "Run the reasoner server" below, then:

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel

agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")
agent("List the notable events with timestamps: <video>scene.mp4</video>")

3. Generate video - the generator runs in-process (Diffusers, no server):

from strands_cosmos import Cosmos3GeneratorModel

gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video",            prompt="A robot navigates a warehouse.", out_path="vid.mp4")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.",           out_path="av.mp4", enable_sound=True)
gen.generate(mode="image2video",           prompt="It moves forward.", image="frame.jpg", out_path="i2v.mp4")

That's the whole loop: understand → generate. Everything below is depth - expand only what you need.

You want to…	Provider	Where
Understand video/image (text+vision → text)	`Cosmos3ReasonerModel`	vLLM server
Generate image/video/audio/action	`Cosmos3GeneratorModel`	in-process Diffusers
Run a tiny VLM on Jetson edge	`CosmosVisionModel`, `CosmosModel`	Edge VLM
Drive the full pipeline from tools	45 `cosmos*` tools	Tools

🚀 Run the reasoner server (one-time setup + serve)

The reasoner needs a vLLM server. Build a CUDA-matched env and serve Cosmos3-Nano on :8000:

just c3-doctor          # check GPU / CUDA / uv / venvs / disk + recommended CUDA pairing
just c3-setup-reason    # build the reasoner venv: vllm + vllm-cosmos3 (uv-managed)
just c3-serve-reason    # serve Cosmos3-Nano on :8000 (--max-model-len 32768)

Verify it's up before pointing an agent at it:

curl -s http://localhost:8000/v1/models    # → {"data":[{"id":"nvidia/Cosmos3-Nano",...}]}
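
The same readiness check can be done from Python before constructing an agent. This is a minimal sketch using only the standard library; the helper names `served_model_ids` and `fetch_model_ids` are hypothetical (not part of strands-cosmos), and it assumes the server returns the OpenAI-style `{"data": [{"id": ...}]}` body shown in the curl output.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload: dict) -> list[str]:
    """Return the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> list[str]:
    """Query a running server; raises URLError if nothing is listening."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))


# Offline check against the response shape shown in the curl example.
sample = {"data": [{"id": "nvidia/Cosmos3-Nano"}]}
print(served_model_ids(sample))
```

Call `fetch_model_ids()` once the server is running and confirm that `nvidia/Cosmos3-Nano` is in the returned list.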

Or one-shot a caption straight from the justfile (no Python):

just c3-reason "Caption in detail." "" scene.mp4 caption

CUDA pairing: match torch to your driver - CUDA 13 → cu130 + vllm==0.21.0; CUDA 12.8 → cu128 + vllm==0.19.1. just c3-doctor reports your driver's recommendation.
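
As an illustration of the pairing rule, the lookup below encodes only the two combinations listed above. `recommend_pairing` is a hypothetical helper, not something the package ships; for any other CUDA version it returns `None`, and `just c3-doctor` remains the authoritative source.

```python
# Pairings copied from the CUDA pairing note: CUDA version -> (torch wheel tag, vLLM pin).
PAIRINGS = {
    (13, 0): ("cu130", "vllm==0.21.0"),
    (12, 8): ("cu128", "vllm==0.19.1"),
}


def recommend_pairing(cuda_version: str):
    """Map a CUDA version string such as '12.8' or '13' to a documented pairing."""
    parts = cuda_version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return PAIRINGS.get((major, minor))  # None when the version is not listed


print(recommend_pairing("12.8"))
print(recommend_pairing("13"))
```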

Single-GPU note: reasoner (vLLM) and generator (Diffusers) each load a 16B model and won't co-fit on one ~46GB GPU - stop one before running the other, or use separate GPUs.
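
A rough back-of-the-envelope check shows why the two processes cannot share one card. The sketch assumes 2 bytes per parameter (bf16/fp16 weights), which is an assumption rather than a documented setting, and it ignores activations and the KV cache, so real usage is higher.

```python
def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


GPU_GB = 46                      # the "~46GB GPU" from the note above
one_model = weights_gb(16e9)     # one 16B model, assuming 2-byte weights
both_models = 2 * one_model      # reasoner + generator loaded together

print(f"one model:   {one_model:.0f} GB, fits: {one_model <= GPU_GB}")
print(f"both models: {both_models:.0f} GB, fits: {both_models <= GPU_GB}")
```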

📦 Install matrix (pick the extra for your task)

uv pip install strands-cosmos                  # core: Reason2 VLM + all tools
uv pip install "strands-cosmos[cosmos3]"       # + Cosmos 3 reasoner client (vLLM server)
uv pip install "strands-cosmos[cosmos3-gen]"   # + Cosmos 3 generator (in-proc Diffusers: image/video/sound)
uv pip install "strands-cosmos[vllm]"          # + bundled vLLM + openai client
uv pip install "strands-cosmos[all]"           # everything (heavy)

The generator needs the diffusers development build (for Cosmos3OmniPipeline). PyPI forbids direct-URL dependencies, so install it from GitHub yourself (or run just c3-setup-gen, which builds the generator env with diffusers main):

uv pip install -U "git+https://github.com/huggingface/diffusers.git"

Extra	Pulls in	For
(none)	transformers, torch, torchvision, torchcodec, av	Reason2 VLM + tools
`cosmos3`	`openai`	Cosmos 3 reasoner client
`cosmos3-gen`	diffusers, cosmos_guardrail, soundfile, imageio	Cosmos 3 generator
`vllm`	vllm, openai	self-hosting vLLM
`jetson`	torchcodec	Jetson companions (torch via JetPack)
`all`	all of the above + dev tools	kitchen sink
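
To see which extra is still missing in the current environment, a small import probe can help. This sketch assumes each package in the table is importable under the same name (for example `diffusers`, `soundfile`); `EXTRA_MODULES` and `missing_modules` are hypothetical names used only for illustration.

```python
from importlib.util import find_spec

# Module names taken from the extras table; the import names are assumed
# to match the package names.
EXTRA_MODULES = {
    "cosmos3": ["openai"],
    "cosmos3-gen": ["diffusers", "cosmos_guardrail", "soundfile", "imageio"],
    "vllm": ["vllm", "openai"],
}


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


for extra, modules in EXTRA_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```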

🛠️ Developer setup (clone + build everything)

git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full    # system deps, Python deps, clones all Cosmos repos (uv-managed venvs)
just doctor        # verify everything

# dedicated, CUDA-matched envs (each is its own uv venv):
just c3-setup-reason      # reasoner: vllm + vllm-cosmos3
just c3-setup-gen         # generator: diffusers(main) + cosmos_guardrail
just c3-setup-framework   # action + training: Cosmos Framework

Run/lint/test against the dev env via uv:

uv pip install -e ".[dev]"
uv run pytest
uv run ruff check .

📜 Single-file script with inline deps (PEP 723 + uv run)

Drop dependencies directly into a script's header - uv run builds an ephemeral env, no manual install. Save as agent.py:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "strands-agents[openai]",
#     "strands-cosmos",
# ]
# ///
import os, sys
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, cosmos3_caption, cosmos3_temporal, video_probe

model = Cosmos3ReasonerModel(base_url=os.environ.get("COSMOS_BASE_URL", "http://localhost:8000/v1"))
agent = Agent(model=model, tools=[cosmos3_caption, cosmos3_temporal, video_probe])
agent(" ".join(sys.argv[1:]) or "Caption in detail: <video>scene.mp4</video>")

uv run agent.py "List the events with timestamps: <video>scene.mp4</video>"
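
For readers curious how a runner locates that header, the sketch below extracts the inline metadata with the reference regular expression published in PEP 723. It is an illustration only, not uv's actual implementation; `read_script_metadata` is a hypothetical helper, and the returned text is TOML that could be handed to a TOML parser.

```python
import re

# Reference pattern from PEP 723 for locating an inline metadata block.
PEP723_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(source: str):
    """Return the TOML text of the `script` block, or None if there is none."""
    for match in PEP723_BLOCK.finditer(source):
        if match.group("type") != "script":
            continue
        # Drop the leading "# " (or a bare "#") from every content line.
        return "".join(
            line[2:] if line.startswith("# ") else line[1:]
            for line in match.group("content").splitlines(keepends=True)
        )
    return None


AGENT_PY = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "strands-agents[openai]",
#     "strands-cosmos",
# ]
# ///
import os, sys
'''

print(read_script_metadata(AGENT_PY))
```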

🌌 Cosmos 3 - Omnimodal World Models

Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. strands-cosmos exposes both runtime surfaces - the Reasoner (vLLM, text+vision → text) and the Generator (Diffusers, → image/video/audio/action).

See it end-to-end: Reason → Generate

Cosmos 3 watches a real construction-site clip, describes it, then generates new videos (one with synchronized audio) from its own description - all on a single local GPU.

① Input video	② Cosmos 3 understands it
(video: construction-site clip)	"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions." `Cosmos3ReasonerModel` (caption in 5.2s)

The reasoner distills its own understanding into a generation prompt:

"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."

Then Cosmos3GeneratorModel generates similar videos from that prompt (832×480, 49 frames):

text → video	text → video + 🔊 sound	image → video
55.5s	43.2s · AAC stereo 48kHz	42.1s · from a real frame

→ Full demo + MP4s + reasoning: demo/cosmos3_showcase/ · reproduce with uv run examples/09_cosmos3_showcase.py

Capabilities & tool map

Surface	Tools	Backend
Reasoner	`cosmos3_reason`, `cosmos3_caption`, `cosmos3_temporal`, `cosmos3_embodied`, `cosmos3_ground`, `cosmos3_plausibility`, `cosmos3_situation`, `cosmos3_action_cot`	vLLM
Generator	`cosmos3_text2image`, `cosmos3_text2video`, `cosmos3_image2video`, `cosmos3_text2video_sound`	Diffusers `Cosmos3OmniPipeline` (in-proc)
Video-to-video	`cosmos3_video2video` (transfer: day→night, recolor, restyle)	vLLM-Omni Docker (`vllm/vllm-omni:cosmos3`)
Action / World-Model	`cosmos3_forward_dynamics`, `cosmos3_inverse_dynamics`, `cosmos3_policy`	Cosmos Framework (torchrun)
Training (SFT)	`cosmos3_train`, `cosmos3_train_convert`, `cosmos3_train_show`, `cosmos3_train_export`, …	Cosmos Framework (torchrun)
Servers	`cosmos3_serve`	start / stop / status
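
Because each surface needs a different backend running, it can be handy to look up what a tool depends on before calling it. The mapping below is transcribed from the table above (only the tools explicitly listed; the elided training tools are left out), and `backend_for` is a hypothetical helper, not part of the package.

```python
# Tool -> backend, transcribed from the capability table.
BACKENDS = {
    "vLLM": [
        "cosmos3_reason", "cosmos3_caption", "cosmos3_temporal", "cosmos3_embodied",
        "cosmos3_ground", "cosmos3_plausibility", "cosmos3_situation", "cosmos3_action_cot",
    ],
    "Diffusers Cosmos3OmniPipeline (in-proc)": [
        "cosmos3_text2image", "cosmos3_text2video", "cosmos3_image2video",
        "cosmos3_text2video_sound",
    ],
    "vLLM-Omni Docker": ["cosmos3_video2video"],
    "Cosmos Framework (torchrun)": [
        "cosmos3_forward_dynamics", "cosmos3_inverse_dynamics", "cosmos3_policy",
        "cosmos3_train", "cosmos3_train_convert", "cosmos3_train_show", "cosmos3_train_export",
    ],
}

# Invert the table once so lookups are a single dict access.
TOOL_TO_BACKEND = {tool: backend for backend, tools in BACKENDS.items() for tool in tools}


def backend_for(tool_name: str):
    """Return the backend a tool needs, or None if the tool is not in the table."""
    return TOOL_TO_BACKEND.get(tool_name)


print(backend_for("cosmos3_caption"))
print(backend_for("cosmos3_policy"))
```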

just c3-setup-gen       # generator env: diffusers(main) + cosmos_guardrail
just c3-gen text2video "A robot in a warehouse." "" out.mp4

just c3-setup-framework # action + training env: Cosmos Framework
just c3-action spec.jsonl /tmp/out      # forward/inverse dynamics, policy
just c3-train-recipes                   # list SFT recipes
just c3-train vision_sft_nano           # fine-tune (8x H100); see the training guide

📖 Full guides: Cosmos 3 · Training/SFT

Models

Model	Size	Capability
Cosmos3-Nano	16B	Omnimodal (reasoner + generator + action) - fits a single ~46GB GPU
Cosmos3-Super	64B	Frontier-scale (multi-GPU / tensor-parallel)
Cosmos3-Nano-Policy-DROID	16B	VL robot policy (DROID)

🤖 Cosmos-Reason2 - Lightweight Edge VLM

For edge/Jetson deployments, the Cosmos-Reason2 VLM runs as a Strands model provider with a tiny footprint - verified on Jetson AGX Thor with Chain-of-Thought reasoning.

from strands import Agent
from strands_cosmos import CosmosVisionModel

agent = Agent(model=CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B"))
agent("Caption in detail: <video>dashcam.mp4</video>")                  # video understanding
agent("<image>robot_view.jpg</image> What should the robot do next?")   # image reasoning
agent("What happens when a ball rolls off a table?")                    # text-only physics
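
The prompts above embed media with inline `<video>` and `<image>` tags. The sketch below shows one way such tags could be pulled out of a prompt; it illustrates the convention only and is not necessarily how strands-cosmos parses them. `media_refs` and `strip_media` are hypothetical helpers.

```python
import re

# Matches <video>path</video> or <image>path</image>; \1 forces matching close tags.
_MEDIA_TAG = re.compile(r"<(video|image)>(.*?)</\1>")


def media_refs(prompt: str):
    """Return (kind, path) pairs for every media tag in the prompt."""
    return [(kind, path.strip()) for kind, path in _MEDIA_TAG.findall(prompt)]


def strip_media(prompt: str) -> str:
    """Return the prompt text with all media tags removed."""
    return _MEDIA_TAG.sub("", prompt).strip()


prompt = "<image>robot_view.jpg</image> What should the robot do next?"
print(media_refs(prompt))
print(strip_media(prompt))
```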

Demo, models & performance

Dashcam safety analysis with Chain-of-Thought reasoning on Jetson AGX Thor

Model	GPU Memory	Use Case
Cosmos-Reason2-2B	24GB	Edge deployment (Jetson Thor/Orin)
Cosmos-Reason2-8B	32GB	Cloud/desktop high-accuracy

Performance (Jetson AGX Thor, Reason2-2B): text inference 1.4s (46 tokens) · video caption 2.2s (short clip @ 4fps), 7s load.
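
The text-inference figure translates into an end-to-end rate with simple arithmetic. The sketch assumes the 1.4s covers the whole request (prompt processing plus decoding), so the result is a lower bound on pure decode speed.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """End-to-end generation rate for one request."""
    return tokens / seconds


rate = tokens_per_second(46, 1.4)   # figures from the performance line above
print(f"{rate:.1f} tokens/s")
```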

Jetson install:

uv pip install strands-cosmos
strands-cosmos-fix-cublas   # fix CUBLAS for Jetson GPU architecture

🧰 Tools

Use any of the 45 tools inside a Strands Agent for full-pipeline automation:

from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check the system, then analyze the video at /tmp/scene.mp4")

Full tool catalog (Reason2 / Predict / Transfer / lifecycle / data / eval)

Category	Tools	Description
Reason2 VLM	`cosmos_inference`, `cosmos_reason_hf`, `cosmos_serve`	TRT server inference, HF direct inference, server lifecycle
Predict 2.5	`cosmos_predict_generate`	World-model video generation (future frame prediction)
Transfer 2.5	`cosmos_transfer_generate`	ControlNet video-to-video (depth/edge/sketch→video)
Model Lifecycle	`cosmos_model_download`, `cosmos_quantize`, `cosmos_export_onnx`, `cosmos_build_engine`	Download, FP8 quantize, ONNX export, TRT engine build
Training	`cosmos_post_train`, `cosmos_distill`	SFT/LoRA post-training, knowledge distillation
Data	`cosmos_curate`	Xenna data curation pipeline
Evaluation	`cosmos_evaluate`	FID/FVD/CSE/CLIP benchmark evaluation
I/O	`rtp_capture_frame`, `nats_publish`, `video_probe`, `video_extract_frames`, `image_read`	RTP capture, NATS messaging, video/image utilities
System	`cosmos_sysinfo`	GPU/platform diagnostics

⚙️ Configuration

# Cosmos 3 Reasoner
Cosmos3ReasonerModel(base_url="http://localhost:8000/v1", reasoning=True, max_tokens=4096)

# Cosmos 3 Generator
Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano", guardrails=True)

# Cosmos-Reason2 VLM
CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-8B",
    reasoning=True,           # Chain-of-thought <think>...</think>
    fps=4,
    params={"max_tokens": 4096, "temperature": 0.6},
)
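
With `reasoning=True` the model emits its chain of thought inside `<think>...</think>`. If you only want the final answer, the tags can be separated out afterwards. This is a minimal sketch that assumes you already have the raw response as a string; `split_reasoning` is a hypothetical helper, and how the response text is obtained from the agent is left out.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()          # no reasoning block present
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>Gravity acts once the ball leaves the table.</think>\nThe ball falls to the floor."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```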

🏗️ Architecture

strands_cosmos/
├── cosmos3_reasoner_model.py    # Cosmos3ReasonerModel (vLLM, text+vision -> text)
├── cosmos3_generator_model.py   # Cosmos3GeneratorModel (Diffusers, -> image/video/sound)
├── cosmos_vision_model.py       # CosmosVisionModel (Reason2 VLM: video+image+text)
├── cosmos_model.py              # CosmosModel (Reason2 text-only)
├── fix_cublas.py                # Jetson CUBLAS compatibility fix
├── tools/
│   ├── cosmos3.py               # 16 Cosmos 3 tools (reason/generate/action/serve)
│   ├── inference.py · reason_hf.py · serve.py        # Reason2 VLM
│   ├── predict_generate.py · transfer_generate.py    # Predict2.5 / Transfer2.5
│   ├── model_download.py · quantize.py · export_onnx.py · build_engine.py
│   ├── post_train.py · distill.py · curate.py · evaluate.py
│   └── rtp.py · nats_pub.py · video_utils.py · image_read.py · sysinfo.py
└── justfile                     # Developer workflow + c3-* recipes

✅ Verified platforms

Platform	GPU	Status
Desktop / Cloud	NVIDIA L40S / A100 / H100 / RTX 4090	✅ Cosmos 3 + Reason2
Jetson AGX Thor	NVIDIA Thor 132GB	✅ Reason2 (with CUBLAS fix)
Jetson Orin	32/64GB	✅ Reason2 (may need CUBLAS fix)

🩺 Troubleshooting

CUBLAS_STATUS_INVALID_VALUE on Jetson

strands-cosmos-fix-cublas    # replaces torch's bundled CUBLAS with JetPack system CUBLAS

Cosmos 3 reasoner OOM on a single GPU - the default sequence length (262K) needs a huge KV cache. just c3-serve-reason caps it at --max-model-len 32768. Stop the generator before serving the reasoner (and vice versa).
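
The effect of the cap follows from the usual KV-cache size formula, which grows linearly with sequence length. In the sketch below the layer count, KV-head count, and head dimension are hypothetical placeholders (the real Cosmos3-Nano values are not given here), and "262K" is assumed to mean 262,144 tokens; only the ratio between the two settings is independent of those placeholders.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache for one sequence: a K and a V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes


# Hypothetical architecture numbers, for illustration only.
ARCH = dict(layers=48, kv_heads=8, head_dim=128)

full = kv_cache_bytes(262_144, **ARCH)    # assumed default context ("262K")
capped = kv_cache_bytes(32_768, **ARCH)   # --max-model-len 32768

print(f"default: {full / 2**30:.0f} GiB per sequence")
print(f"capped:  {capped / 2**30:.0f} GiB per sequence")
print(f"reduction: {full // capped}x")
```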

StopIteration in get_rope_index during video (Reason2) - already handled; strands-cosmos pins a compatible transformers range. If you still see it:

uv pip install "transformers>=4.57.0,<5.3.0"
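
To check whether an environment already satisfies that range, the version can be compared in Python. This is a simplified sketch using only the standard library: it compares the numeric release parts and ignores pre-release suffixes, so `packaging.version` is the more robust choice where it is available. The helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple[int, ...]:
    """Parse the leading numeric parts of a version string, padded to three fields."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_supported_range(ver: str, low: str = "4.57.0", high: str = "5.3.0") -> bool:
    """True when low <= ver < high, using the range from the pin above."""
    return parse_release(low) <= parse_release(ver) < parse_release(high)


try:
    installed = version("transformers")
    print(installed, "ok" if in_supported_range(installed) else "outside supported range")
except PackageNotFoundError:
    print("transformers is not installed")
```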

module 'torchvision.io' has no attribute 'read_video' - transformers 5.x decodes video with torchcodec and falls back to torchvision, which removed io.read_video in >=0.27:

uv pip install torchcodec

TRT tools return exit 127 - exit code 127 is the shell's "command not found", which is expected on workstations; those tools run on Jetson or in TRT Docker. Run just doctor to see what your machine supports.

Resources

Changelog - Release history
Cosmos 3 - Latest omnimodal world models
Cosmos Cookbook - Official recipes
Cosmos-Reason2 - VLM source
Strands Agents - Agent framework

License

Licensed under the Apache License 2.0. See NOTICE for attribution and SECURITY.md for vulnerability reporting.

Release history (PyPI distribution: strands-for-cosmos)

Version	Released
0.7.0 (this version)	Jun 13, 2026
0.6.0	Jun 13, 2026
