The universal entrypoint to HuggingFace transformers for Strands agents — 100% task & modality coverage, zero hardcoding.

🤗 Strands Transformers

One tool wraps all of HuggingFace transformers. One provider makes any local model a multimodal agent brain.

Agents that see, hear, and speak — 100% task coverage, zero hardcoding, fully local.

use_aws wraps all of boto3. use_lerobot wraps all of lerobot. use_transformers wraps all of HuggingFace transformers — every task, every modality, in one tool that reads transformers' own taxonomy at runtime (new task upstream ⇒ supported here with no code change). And TransformerModel makes any local HF model a drop-in Strands brain that speaks the full content-block protocol — image, video, audio, document. With Qwen2.5-Omni it even speaks back.

flowchart LR
    IN["📥 text · image · video<br/>audio · document · robot-state"]
    TOOL["🛠️ use_transformers<br/><i>tool</i>"]
    BRAIN["🧠 TransformerModel<br/><i>local agent brain</i>"]
    OUT["📤 text · speech · image<br/>labels · actions"]
    IN --> TOOL --> OUT
    IN --> BRAIN --> OUT
    classDef i fill:#7C4DFF,stroke:#5b34d6,color:#fff;
    classDef c fill:#FFD21E,stroke:#E68A00,color:#3a2d00;
    classDef o fill:#00E5FF,stroke:#00b3cc,color:#003844;
    class IN i;
    class TOOL,BRAIN c;
    class OUT o;

📖 Full documentation →  ·  built with MkDocs (docs/)

Install

uv pip install strands-transformers        # from PyPI
# or from source:
uv pip install -e .                         # or: pip install -e .
PYTHONPATH=. python examples/smoke.py       # verify → "12/12 checks passed"

Optional extras (audio · vision · training · docs):

uv pip install -e ".[audio]"      # soundfile, librosa  (mp3/flac/ogg decode)
uv pip install -e ".[vision]"     # opencv, av  (video)
uv pip install -e ".[training]"   # trl, peft, accelerate
uv pip install -e ".[docs]"       # mkdocs-material, mkdocstrings
uv pip install -e ".[all]"        # everything

WAV audio works without extras. device="auto" picks cuda → mps → cpu (bf16 on GPU).
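
The device="auto" fallback order described above can be sketched roughly as follows. This is an illustration only, not the package's implementation: pick_device and probe_and_pick are hypothetical names, and the dtype chosen for MPS is an assumption, since the text only says "bf16 on GPU".

```python
from typing import Tuple


def pick_device(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype name) following the cuda -> mps -> cpu order."""
    if cuda_available:
        return "cuda", "bfloat16"  # bf16 on GPU, as the text states
    if mps_available:
        return "mps", "float32"    # ASSUMPTION: the dtype used on MPS is not stated
    return "cpu", "float32"


def probe_and_pick() -> Tuple[str, str]:
    """Probe torch when it is installed; otherwise fall back to CPU."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False)
    mps = getattr(torch.backends, "mps", None)  # absent on older torch builds
    return pick_device(
        torch.cuda.is_available(),
        bool(mps is not None and mps.is_available()),
    )


print(probe_and_pick())
```

Keeping the decision in a pure function makes the ordering easy to check without any accelerator present.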

60-second hello — a local vision agent

import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

buf = io.BytesIO()
Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")  # green square

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.

A 256M-param model in the standard Strands loop, seeing pixels through a content block — no API key, no server. Swap model_path for any HF VLM.
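
For images that live on disk, a small helper can build the same content block as the hello example. This is a minimal sketch: image_block, image_block_from_file, and the extension map are hypothetical names introduced here, not part of strands_transformers. The block layout is copied from the example above.

```python
from pathlib import Path
from typing import Any, Dict

# Hypothetical extension-to-format map for this sketch; extend as needed.
_FORMATS = {".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg", ".gif": "gif", ".webp": "webp"}


def image_block(data: bytes, fmt: str) -> Dict[str, Any]:
    """Wrap raw image bytes in the content-block shape used in the hello example."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def image_block_from_file(path: str) -> Dict[str, Any]:
    """Read an image file and infer the block format from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported image extension: {suffix!r}")
    return image_block(Path(path).read_bytes(), _FORMATS[suffix])
```

With the agent from the example, a call would then look like agent([image_block_from_file("scene.jpg"), {"text": "What's in this image?"}]).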

See it work

Every output below is a real model result (CUDA · transformers 5.12 · torch 2.10):

| You give it | Script | It returns |
|---|---|---|
| 🖼️ a green image + "Color?" | examples/multimodal_agent.py | "Green." |
| 🎬 brightening frames | examples/multimodal_advanced.py | "BRIGHTER." |
| 🧰 a tool screenshot (blue) | examples/multimodal_advanced.py | "Blue." |
| 📄 a text document | examples/document_and_audio.py | recovers BANANA-42 |
| 🔊 a 440 Hz tone (Omni) | examples/omni_audio.py | "It's a pure tone." |
| 💬 "say: …can speak" (Omni) | examples/omni_audio.py | 🔊 real 24 kHz speech |

▶️ Hear Omni speak + see all diagrams in the docs →

Two ways to use it

As a tool: use_transformers (discover · run · call)

from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                       # image-text-to-text
agent("Say 'hello from strands' as audio")          # text-to-audio
agent("Detect objects in https://.../street.jpg")   # object-detection

Discover everything at runtime (action="tasks" | "modalities" | "inspect" | …), run high-level pipelines, or call any class/fn/method for custom models. → The tool guide

As the agent's brain: TransformerModel (multimodal content blocks)

Pass image / video / audio / document content blocks (and media inside a toolResult) — the provider auto-detects the model's processor and routes them. All outputs below are real results (CUDA, transformers 5.12 / torch 2.10):

| Content block | Example | Verified output |
|---|---|---|
| image | multimodal_agent.py | "Green." |
| video (with fps) | multimodal_advanced.py | "BRIGHTER." |
| image in toolResult | multimodal_advanced.py | "Blue." |
| document | document_and_audio.py | recovers BANANA-42 |
| audio (our schema extension) | audio_content_block.py | audio → text |
| audio in and speech out | omni_audio.py | hears + speaks (Qwen2.5-Omni) |
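
To try the audio path without a recording, a test tone can be synthesized with the standard library alone (WAV needs no extras). The sketch below builds a 440 Hz WAV in memory and wraps it in a block. Note that the audio block layout here is an assumption that mirrors the image block; the actual schema is defined in types/audio.py and may differ. sine_wav and audio_block are hypothetical helper names.

```python
import io
import math
import struct
import wave


def sine_wav(freq_hz: float = 440.0, seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Synthesize a mono 16-bit PCM WAV tone in memory."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)    # mono
        w.setsampwidth(2)    # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)
    return buf.getvalue()


def audio_block(data: bytes, fmt: str = "wav") -> dict:
    # ASSUMPTION: same layout as the image block; check types/audio.py for the real schema.
    return {"audio": {"format": fmt, "source": {"bytes": data}}}


tone = sine_wav()
block = audio_block(tone)
print(len(tone), block["audio"]["format"])
```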

Agent brain · Content blocks · Audio

Robotics / VLA — camera + instruction → robot actions

Two layers, both transformers-native and GPU-verified:

  • 🧠 reason: Cosmos-Reason2-2B (a physical-AI VLM) plans over a scene via the run path, e.g. "the red cube is in the bottom left corner, so the arm should move there first."
  • ⚙️ act: VLA models expose predict_action via the call path. MolmoAct2 → [1,30,6]; OpenVLA-7b → 7-DoF (4.x→5.x compatibility shims applied automatically).

🔗 Full agentic loop (examples/robot_reason_act_agent.py): Cosmos-Reason plans over real RealSense frames → MolmoAct acts ([1,30,6]) — perception→plan→action through one tool.

LeRobot-ecosystem policies (SmolVLA, π0, ACT, GR00T) use their own runtimes, so pair them with use_lerobot. → Robotics guide

How it works

Nothing is hardcoded per task — core/registry.py reads transformers' own SUPPORTED_TASKS at runtime, so coverage tracks upstream automatically.

Project layout
strands_transformers/
├── tools/use_transformers.py   # the one @tool: discover · run · call
├── models/transformers.py      # TransformerModel — local multimodal agent brain
├── types/audio.py              # audio content-block extension
└── core/{registry,engine,io,compat}.py   # taxonomy · load/cache · I/O · legacy shims

Architecture · API reference

Examples

12 runnable, GPU-verified examples in examples/ — image, video, audio, document, Omni speech, VLA, and pipelines. Run any:

PYTHONPATH=. python examples/<name>.py

Examples & FAQ

License

MIT — built with Strands Agents SDK and HuggingFace Transformers.

If this saved you a pile of per-model glue code, consider giving it a ⭐
