strands-transformers

The universal entrypoint to HuggingFace transformers for Strands agents - 100% task & modality coverage, zero hardcoding.

These details have not been verified by PyPI

Project links

Project description

Strands Transformers - every modality in, every modality out: one tool, one local brain, zero hardcoding

Run any HuggingFace transformers model from a Strands agent. A tool for every task, or the agent's own multimodal brain. Local. No API keys.


📖 Docs · ⚡ 60-second hello · 👁️ See it work · 🧩 Two ways · 🧪 Examples

🤔 The idea

HuggingFace transformers already runs every model on earth. The missing piece is a clean, dynamic bridge into an agent loop - without writing per-model glue every time. This library is that bridge, two ways:

Entry point	What it is	You get
🛠️ `use_transformers`	one tool exposing every transformers task	discover · run a pipeline · `call` any class/method
🧠 `TransformerModel`	a local model as your `Agent(model=…)` brain	it sees, hears & speaks via content blocks

Zero hardcoding. core/registry.py reads transformers' own SUPPORTED_TASKS taxonomy at runtime - the day a task or model lands upstream, it works here. No code change. No version bump.
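To make the runtime-discovery idea concrete, here is a minimal sketch of reading a task taxonomy at call time instead of hardcoding a task list. It is not the code in core/registry.py: `load_supported_tasks`, `group_tasks_by_type`, and the `STAND_IN` data are hypothetical names for illustration, and the sketch assumes each entry in `transformers.pipelines.SUPPORTED_TASKS` carries a `"type"` key.

```python
from collections import defaultdict


def load_supported_tasks():
    """Import the taxonomy lazily, so nothing is fixed at import time."""
    # Assumption: transformers exposes this mapping at this import path.
    from transformers.pipelines import SUPPORTED_TASKS
    return SUPPORTED_TASKS


def group_tasks_by_type(supported_tasks):
    """Group task names by their declared type (hypothetical helper)."""
    grouped = defaultdict(list)
    for task, spec in supported_tasks.items():
        grouped[spec.get("type", "unknown")].append(task)
    return {kind: sorted(tasks) for kind, tasks in grouped.items()}


# Stand-in data so the sketch runs without transformers installed.
# These entries are illustrative, not a copy of the real taxonomy.
STAND_IN = {
    "text-generation": {"type": "text"},
    "fill-mask": {"type": "text"},
    "image-classification": {"type": "image"},
}

print(group_tasks_by_type(STAND_IN)["text"])  # → ['fill-mask', 'text-generation']
```

With transformers installed, `group_tasks_by_type(load_supported_tasks())` would reflect whatever tasks the installed version ships, which is the property the paragraph above describes.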

📦 Install

uv pip install strands-transformers          # from PyPI
PYTHONPATH=. python examples/smoke.py         # from a repo checkout: verify → "18/18 checks passed"

From source · optional extras (audio · vision · training)

uv pip install -e .                # editable from source
uv pip install -e ".[audio]"       # soundfile, librosa  (mp3/flac/ogg decode)
uv pip install -e ".[vision]"      # torchvision (needed by VLMs!), opencv, av
uv pip install -e ".[training]"    # trl, peft, accelerate
uv pip install -e ".[all]"         # everything

WAV audio works with no extras. Vision models (SmolVLM, Qwen-VL, …) need [vision]. device="auto" picks cuda → mps → cpu (bf16 on GPU).
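The cuda → mps → cpu fallback can be sketched as below. This is an illustration of the ordering, not the library's own selection code: `pick_device` and `probe_torch` are hypothetical names, and treating both cuda and mps as "GPU" for the bf16 rule is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> tuple:
    """Return (device, dtype name) following the cuda → mps → cpu order."""
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        # Assumption: mps counts as "GPU" for the bf16 default.
        return "mps", "bfloat16"
    return "cpu", "float32"


def probe_torch() -> tuple:
    """Ask torch what is available; report nothing if torch is missing."""
    try:
        import torch
    except ImportError:
        return False, False
    mps = getattr(torch.backends, "mps", None)
    return torch.cuda.is_available(), bool(mps is not None and mps.is_available())


# Pure-function demo; call pick_device(*probe_torch()) on a real machine.
print(pick_device(False, True))  # → ('mps', 'bfloat16')
```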

⚡ 60-second hello

A 256M-param vision model, seeing pixels in the standard Strands loop - no key, no server:

import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

buf = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")  # a green square

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.

Swap model_path for any HF VLM and the code is identical.
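If you build many prompts, small helpers keep the content-block shape in one place. The dict layout below is copied from the hello example; the helper names (`image_block`, `text_block`, `image_block_from_file`) and the suffix-to-format table are hypothetical, and which formats a given model accepts is an assumption.

```python
from pathlib import Path

# Assumption: these format names are accepted; the README only shows "png".
SUFFIX_TO_FORMAT = {".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg"}


def image_block(data: bytes, fmt: str = "png") -> dict:
    """Wrap already-encoded image bytes in the block shape used above."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def text_block(text: str) -> dict:
    """Wrap a prompt string in a text content block."""
    return {"text": text}


def image_block_from_file(path: str) -> dict:
    """Read an image file and infer the format from its suffix."""
    file = Path(path)
    fmt = SUFFIX_TO_FORMAT[file.suffix.lower()]  # KeyError on unknown suffixes
    return image_block(file.read_bytes(), fmt)


# Placeholder bytes, not a real PNG; pass real encoded bytes in practice.
prompt = [image_block(b"<png bytes>"), text_block("Color? One word.")]
print([next(iter(block)) for block in prompt])  # → ['image', 'text']
```

The resulting list can be passed to `agent(...)` exactly like the literal list in the hello example.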

👁️ See it work

Every result below is a real model output (CUDA · transformers 5.12 · torch 2.10):

You give it	It returns	Example
🖼️ a green image + "Color?"	`"Green."`	`multimodal_agent.py`
🎬 brightening video frames	`"BRIGHTER."`	`multimodal_advanced.py`
🧰 a blue tool screenshot (in `toolResult`)	`"Blue."`	`multimodal_advanced.py`
📄 a text document	recovers `BANANA-42`	`document_and_audio.py`
🔊 a 440 Hz tone (Omni)	`"It's a pure tone."`	`omni_audio.py`
💬 "say: …can speak" (Omni)	🔊 real 24 kHz speech	`omni_audio.py`
🦾 camera + "pick the cube"	actions `[1, 30, 6]`	`molmoact_vla.py`

📋 Copy-paste & run - the snippet behind each row

# 🖼️ image → "Green."   (local VLM brain, content blocks)
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

png = io.BytesIO(); Image.new("RGB", (224, 224), (20, 200, 40)).save(png, "PNG")
agent = Agent(model=TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct"))
print(agent([
    {"image": {"format": "png", "source": {"bytes": png.getvalue()}}},
    {"text": "What color is this image? One word."},
]))  # → Green.

# 🎬 video → "BRIGHTER."   (a video content block of brightening frames)
import asyncio, numpy as np
from PIL import Image
from strands_transformers import TransformerModel

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
                         params={"max_tokens": 48, "do_sample": False})
frames = [Image.fromarray(np.full((224, 224, 3), v, np.uint8)) for v in (10,40,80,120,160,200,230,250)]
msgs = [{"role": "user", "content": [
    {"video": {"format": "mp4", "fps": 2.0, "source": {"bytes": frames}}},
    {"text": "Does this video get brighter or darker? Answer brighter or darker."},
]}]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → ...brighter...

# 🧰 tool screenshot → "Blue."   (an image returned inside a toolResult)
import asyncio, io
from PIL import Image
from strands_transformers import TransformerModel

blue = io.BytesIO(); Image.new("RGB", (224, 224), (25, 25, 210)).save(blue, "PNG")
model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct",
                         params={"max_tokens": 32, "do_sample": False})
msgs = [
    {"role": "user", "content": [{"text": "Capture the screen, then name its color."}]},
    {"role": "assistant", "content": [{"toolUse": {"name": "capture", "toolUseId": "t1", "input": {}}}]},
    {"role": "user", "content": [{"toolResult": {"toolUseId": "t1", "status": "success", "content": [
        {"text": "Here is the captured screen:"},
        {"image": {"format": "png", "source": {"bytes": blue.getvalue()}}}]}}]},
    {"role": "user", "content": [{"text": "Dominant color of the captured screen? One word."}]},
]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → Blue.

# 📄 document → recovers "BANANA-42"   (a document content block → text LM prompt)
import asyncio
from strands_transformers import TransformerModel

model = TransformerModel(model_path="Qwen/Qwen3-0.6B", enable_thinking=False,
                         params={"max_tokens": 64, "do_sample": False})
body = b"The secret passphrase for the vault is BANANA-42. Keep it safe."
msgs = [{"role": "user", "content": [
    {"document": {"name": "secret", "format": "txt", "source": {"bytes": body}}},
    {"text": "What is the secret passphrase? Answer with just the passphrase."},
]}]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → BANANA-42

# 🔊 text → speech → text   (TTS then ASR, the tool path; the library narrating itself)
from strands_transformers import use_transformers

tts = use_transformers(action="run", task="text-to-audio",
                       model="facebook/mms-tts-eng",
                       inputs="the quick brown fox jumps over the lazy dog")
wav = tts["artifacts"][0]
asr = use_transformers(action="run", task="automatic-speech-recognition",
                       model="openai/whisper-tiny", inputs=wav)
print(asr["content"][0]["text"])  # → "...quick brown fox..."

# 🦾 camera + instruction → robot actions [1, 30, 6]   (VLA via the `call` path)
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download
from strands_transformers import use_transformers

REPO = "allenai/MolmoAct2-SO100_101"
top  = Image.open(hf_hub_download(REPO, "assets/sample_realsense_top_rgb.png")).convert("RGB")
side = Image.open(hf_hub_download(REPO, "assets/sample_realsense_side_rgb.png")).convert("RGB")
state = [-0.527, 189.14, 181.41, 60.64, -3.60, 1.097]

use_transformers(action="call", target="AutoProcessor.from_pretrained",
    parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True}, cache_key="proc")
use_transformers(action="call", target="AutoModelForImageTextToText.from_pretrained",
    parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True, "dtype": "float32"}, cache_key="vla")
print(use_transformers(action="call", target="cached:vla.predict_action",
    parameters={"processor": "cached:proc", "images": [top, side],
                "task": "Pick up the lemon and drop it in the red bowl.",
                "state": state, "norm_tag": "so100_so101_molmoact2",
                "inference_action_mode": "continuous", "num_steps": 10})["content"][0]["text"][:200])
# → MolmoAct2ActionOutput ... actions [1, 30, 6]

Omni audio-in + speech-out needs one bigger model - see examples/omni_audio.py.
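Three of the snippets above repeat the same `go()` coroutine to join streamed text deltas. A small reusable version might look like this; `collect_text` is a hypothetical helper, and `fake_stream` is a stand-in for `model.stream(msgs)` so the sketch runs without a model. The non-delta events in the stand-in are illustrative only.

```python
import asyncio


async def collect_text(events) -> str:
    """Join the text of every contentBlockDelta event in an async stream."""
    parts = []
    async for event in events:
        # Events without a text delta contribute an empty string.
        parts.append(event.get("contentBlockDelta", {}).get("delta", {}).get("text", ""))
    return "".join(parts)


async def fake_stream():
    """Stand-in for model.stream(msgs); yields events in the same dict shape."""
    yield {"messageStart": {"role": "assistant"}}
    yield {"contentBlockDelta": {"delta": {"text": "Gre"}}}
    yield {"contentBlockDelta": {"delta": {"text": "en."}}}
    yield {"messageStop": {"stopReason": "end_turn"}}


print(asyncio.run(collect_text(fake_stream())))  # → Green.
```

With a real model, `asyncio.run(collect_text(model.stream(msgs)))` would replace each per-snippet `go()`.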

Vision tasks on one COCO photo - detection · depth · panoptic segmentation:

[Image: object detection, depth estimation, and panoptic segmentation on a single photo]

[Images: video understanding demo · generated speech waveform]

🎬 video → label · 🔊 text-to-audio then re-transcribed by whisper (the library narrating itself)
▶️ Hear it speak & play every example on the docs site →

🧩 Two ways to use it

🛠️ As a tool - `use_transformers`

from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                       # image-text-to-text
agent("Say 'hello from strands' as audio")          # text-to-audio
agent("Detect objects in https://.../street.jpg")   # object-detection

Discover everything at runtime (action="tasks" | "modalities" | "inspect" | …), run high-level pipelines, or call any class / fn / method for custom models. → The tool guide
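The `call` examples earlier pass strings such as `cached:vla.predict_action` and `cached:proc`. One way such references could be resolved is sketched below. This is a guess at the mechanism for illustration, not the tool's implementation: `resolve`, `call_cached`, and the `FakeVLA` class are hypothetical.

```python
CACHE_PREFIX = "cached:"


def resolve(ref, cache):
    """Turn 'cached:key' or 'cached:key.attr' into the cached object or attribute."""
    if not (isinstance(ref, str) and ref.startswith(CACHE_PREFIX)):
        return ref  # plain values pass through unchanged
    key, _, attr_path = ref[len(CACHE_PREFIX):].partition(".")
    obj = cache[key]
    for name in filter(None, attr_path.split(".")):
        obj = getattr(obj, name)
    return obj


def call_cached(target, parameters, cache):
    """Resolve the target and every parameter, then invoke the target."""
    fn = resolve(target, cache)
    kwargs = {name: resolve(value, cache) for name, value in parameters.items()}
    return fn(**kwargs)


class FakeVLA:
    """Stand-in model so the sketch runs without downloading anything."""

    def predict_action(self, processor, task):
        return f"{processor} handles: {task}"


cache = {"vla": FakeVLA(), "proc": "fake-processor"}
print(call_cached("cached:vla.predict_action",
                  {"processor": "cached:proc", "task": "pick the cube"}, cache))
# → fake-processor handles: pick the cube
```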

🧠 As the agent's brain - `TransformerModel`

Pass image / video / audio / document blocks (and media inside a toolResult) - the provider auto-detects the model's processor and routes them.

Content block	Verified output	Example
`image`	`"Green."`	`multimodal_agent.py`
`video` (with `fps`)	`"BRIGHTER."`	`multimodal_advanced.py`
`image` in `toolResult`	`"Blue."`	`multimodal_advanced.py`
`document`	recovers `BANANA-42`	`document_and_audio.py`
`audio` (our schema extension)	audio → text	`audio_content_block.py`
`audio` in and speech out	hears + speaks	`omni_audio.py`
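To illustrate what routing media blocks involves, the sketch below walks a message list and reports each media block, including ones nested inside a `toolResult`. It mirrors the message shapes shown in this README but is not the provider's code; `iter_media` and `media_kinds` are hypothetical names.

```python
MEDIA_KINDS = ("image", "video", "audio", "document")


def iter_media(blocks):
    """Yield (kind, payload) for media blocks, descending into toolResult content."""
    for block in blocks:
        for kind in MEDIA_KINDS:
            if kind in block:
                yield kind, block[kind]
        if "toolResult" in block:
            yield from iter_media(block["toolResult"].get("content", []))


def media_kinds(messages):
    """List the media kinds found across all messages, in order."""
    return [kind for message in messages
            for kind, _ in iter_media(message.get("content", []))]


messages = [
    {"role": "user", "content": [{"text": "Capture the screen."}]},
    {"role": "assistant", "content": [{"toolUse": {"name": "capture", "toolUseId": "t1", "input": {}}}]},
    {"role": "user", "content": [{"toolResult": {"toolUseId": "t1", "status": "success", "content": [
        {"text": "Here is the captured screen:"},
        {"image": {"format": "png", "source": {"bytes": b"<png bytes>"}}}]}}]},
]
print(media_kinds(messages))  # → ['image']
```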

→ Agent brain · Content blocks · Audio

🦾 Robotics / VLA - camera + instruction → actions

Two transformers-native layers, both GPU-verified:

🧠 reason - Cosmos-Reason2-2B (a physical-AI VLM) plans over a scene via run: "the red cube is bottom-left, move the arm there first."
⚙️ act - VLA models expose predict_action via call: MolmoAct2 → [1, 30, 6]; OpenVLA-7b → a 7-DoF action (with automatic 4.x→5.x shims).

🔗 Full agentic loop (robot_reason_act_agent.py): Cosmos plans over real RealSense frames → MolmoAct acts: perception → plan → action through one tool. (LeRobot policies like SmolVLA / π0 / GR00T run their own runtimes; pair them with use_lerobot.) → Robotics guide

🌟 Featured models

Examples use tiny models so they run in seconds. Point the same code at any current library_name: transformers model - swap the id, the plumbing is identical:

Modality	Strong open model	How
Vision-language	`Qwen/Qwen3-VL-8B-Instruct` · `google/gemma-3-4b-it`	brain or `run` (image-text-to-text)
Speech → text	`openai/whisper-large-v3-turbo`	`run` (automatic-speech-recognition)
Audio in + speech out	`Qwen/Qwen2.5-Omni-3B`	brain (`speak=True`)
Multimodal (audio+vision+text)	`microsoft/Phi-4-multimodal-instruct`	brain
Robot actions (VLA)	`allenai/MolmoAct2` · `openvla/openvla-7b`	`call` → `predict_action`
Embodied reasoning	`nvidia/Cosmos-Reason2-2B`	`run` (image-text-to-text)

# swap the tiny demo model for a SOTA one - same code:
model = TransformerModel(model_path="Qwen/Qwen3-VL-8B-Instruct")

🏗️ How it works

strands_transformers/
├── tools/use_transformers.py            # the one @tool: discover · run · call
├── models/transformers.py               # TransformerModel - local multimodal brain
├── types/audio.py                       # audio content-block extension
└── core/{registry,engine,io,compat}.py  # taxonomy · load/cache · I/O · legacy shims

Nothing is hardcoded per task - registry.py reads transformers' SUPPORTED_TASKS at runtime, so coverage tracks upstream automatically. → Architecture · API reference

🧪 Examples

Runnable, GPU-verified examples in examples/ - image, video, audio, document, Omni speech, VLA, and pipelines. Run any:

PYTHONPATH=. python examples/<name>.py

→ Examples & FAQ


License

MIT - built with the Strands Agents SDK and HuggingFace Transformers.

If this saved you a pile of per-model glue code, consider giving it a ⭐.

Project details


Release history

Version	Released
0.4.0 (this version)	Jun 15, 2026
0.3.0	Jun 15, 2026
0.2.0	Jun 14, 2026

Download files


Source Distribution

strands_transformers-0.4.0.tar.gz (2.2 MB)

Uploaded Jun 15, 2026 Source

Built Distribution


strands_transformers-0.4.0-py3-none-any.whl (46.7 kB)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file strands_transformers-0.4.0.tar.gz.

File metadata

Download URL: strands_transformers-0.4.0.tar.gz
Upload date: Jun 15, 2026
Size: 2.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_transformers-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`4c8bc81d07563d91206ac73d549c9a3c36346c0037ec61387aabceb55aa7de71`
MD5	`34d1c41786cecee31b40c7038e5815ab`
BLAKE2b-256	`b21037720e23160eadfdbc1ad81300d71d96699071e5a9a95df687770d65f63f`
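A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of_stream` and `verify_file` are hypothetical helper names, and the local path in the usage note is an assumption about where you saved the file.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected_hex.lower()


# Self-check on in-memory bytes, using the well-known SHA256 of b"abc".
print(sha256_of_stream(io.BytesIO(b"abc")))
# → ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

For the source distribution, that would be `verify_file("strands_transformers-0.4.0.tar.gz", "4c8bc81d07563d91206ac73d549c9a3c36346c0037ec61387aabceb55aa7de71")`.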


File details

Details for the file strands_transformers-0.4.0-py3-none-any.whl.

File metadata

Download URL: strands_transformers-0.4.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 46.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_transformers-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60f0da611d24bcfc5a3da12da0cbc9152cc6e3fb6c1ad9b6284ccfbc35ddeec4`
MD5	`dab2bd49e4fc58d0067ab63af518c3d1`
BLAKE2b-256	`66b1e0ab437208f1c817a5c79f9a99397d9dbf2975c37fc98819004e94707765`

