Python SDK for SAA, the addressee layer for voice agents — device-directed speech detection that emits a turn_ready event per addressed utterance, so only speech meant for your agent reaches your STT, LLM, or TTS.

attenlabs-saa

Python SDK for Attention Labs real-time selective auditory attention.

Every voice pipeline has the same problem: the microphone hears everything, but your ASR should only process speech directed at the device. Wake words solve this with a rigid trigger phrase. SAA solves it without one, classifying every audio frame as silent, human-directed, or device-directed and routing only what matters.

attenlabs-saa is a thin Apache-2.0 client: it captures and encodes your mic and webcam locally and streams them to the hosted SAA inference server over WebSocket, where the addressee model runs. It emits typed events back: attention predictions, voice activity, conversation state, and ready-to-forward speech audio. The pipeline is audio in, addressee gate, only addressed audio out. It is model-agnostic and drop-in; LLM routing is left to you.

Sign up

Get your API key at attentionlabs.ai.

You need an API key for this project to work.

Install

pip install attenlabs-saa

Requires Python 3.10+. sounddevice and opencv-python are pulled in automatically for mic and camera access.

Quickstart

import time
from saa import AttentionClient

client = AttentionClient(token="your-token")

@client.on_prediction
def _(event):
    label = {0: "silent", 1: "human", 2: "device"}.get(event.cls, "?")
    print(f"{label}  {event.confidence:.0%}  faces={event.num_faces}  src={event.source}")

@client.on_turn_ready
def _(turn):
    # turn.audio_base64: base64 PCM16 @ 16 kHz mono, ready for OpenAI Realtime / any LLM
    # turn.audio_pcm16: the same audio as an np.int16 array
    print(f"turn ready ({turn.duration_sec:.2f}s)")

@client.on_error
def _(event):
    print(f"ERROR: {event.title}: {event.message}")

client.start()
try:
    while True:
        time.sleep(0.1)
except KeyboardInterrupt:
    client.stop()

Run client.start() and you'll see live silent / human / device predictions stream from your own mic in seconds — all you need is your API key (the token= argument above). A full CLI demo wiring SAA + OpenAI Realtime lives at saa-py-demo.

API

`AttentionClient`

from saa import AttentionClient, CameraConfig, MicConfig

client = AttentionClient(
    token="...",                    # API key, sent as WS subprotocol
    url=None,                      # Server URL (default: https://broker.attentionlabs.ai)
    video=CameraConfig(),          # Webcam config
    audio=MicConfig(),             # Mic config
    initial_threshold=0.7,         # Device-class confidence threshold (0..1)
    enable_audio=True,             # Set False to skip mic capture
    enable_video=True,             # Set False to skip webcam capture
    server_profile=None,           # Override server profile; auto "audio_only" when enable_video=False
    auto_reconnect=True,           # Auto-reconnect with backoff after a retriable drop
)

Configuration

`MicConfig`

field	type	default	notes
`device`	`int | str | None`	`None`	Device index, name, or `None` for system default
`channels`	`int`	`1`	Number of input channels

`CameraConfig`

field	type	default	notes
`device_index`	`int`	`0`	Webcam device index
`width`	`int`	`1920`	Capture width
`height`	`int`	`1080`	Capture height
`jpeg_quality`	`int`	`50`	JPEG compression quality 0-100

Methods

method	description
`start()`	Probes the camera, opens the WebSocket, acquires mic + camera, starts capture threads. Non-blocking. Raises on handshake failure.
`stop()`	Tears down capture, joins threads, closes WebSocket.
`mute()`	Signals the server to stop feeding your speech to the turn/VAD pipeline.
`unmute()`	Resumes server-side turn/VAD processing.
`mark_responding(bool)`	Tells the server an LLM response is in flight. The server stops emitting predictions while `True`.
`set_threshold(value: float)`	Update device-class confidence threshold (0..1). Server acks via `config` event.
`feed_audio(audio, *, sample_rate=16000)`	Stream audio captured by another stack instead of the SDK's own mic. Requires `enable_audio=False`. See Feeding external audio and video.
`feed_video(frame)`	Push an externally-captured frame instead of the SDK's own camera. Requires `enable_video=False`. Accepts pre-encoded JPEG `bytes` or a raw `np.ndarray` (BGR, JPEG-encoded internally). See Feeding external audio and video.

When enable_video=True but the camera can't be opened (missing device, or one held by another app), start() continues audio-only if enable_audio=True. The original enable_video request is restored on the next start(), so a later session retries video.

Feeding external audio and video

When another stack already owns the microphone (an ElevenLabs / OpenAI Realtime AudioInterface tap, a Twilio media stream, a game engine), construct the client with enable_audio=False and push frames in with feed_audio() instead of letting the SDK open its own mic:

client = AttentionClient(token="...", enable_audio=False, enable_video=False)
client.start()                       # opens the WebSocket; captures nothing itself

# in your existing audio callback (any chunk size, mono):
client.feed_audio(pcm_chunk)         # bytes (int16 LE), np.int16, or np.float32 [-1, 1]

feed_audio accepts arbitrary chunk sizes and re-chunks internally to the wire's 100 ms blocks; pass sample_rate= if your audio isn't already 16 kHz and it'll resample. Calling it while enable_audio=True raises (that would double the audio source). A runnable ElevenLabs Conversational AI example lives in saa-sdk/examples/elevenlabs.
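
To make the re-chunking concrete, here is a minimal sketch of the buffering idea, assuming 16 kHz mono PCM16 so that a 100 ms block is 1600 samples, or 3200 bytes. It is an illustration only, not the SDK's actual implementation; the `Rechunker` class and its names are hypothetical, and you never need to do this yourself because feed_audio() handles it.

```python
SAMPLE_RATE = 16_000   # Hz, the wire rate described above
BYTES_PER_SAMPLE = 2   # PCM16, mono
BLOCK_MS = 100         # wire block length described above
BLOCK_BYTES = SAMPLE_RATE * BLOCK_MS // 1000 * BYTES_PER_SAMPLE  # 3200


class Rechunker:
    """Accumulate arbitrary-sized PCM16 chunks and emit fixed-size blocks.

    Hypothetical helper for illustration; not part of the SDK.
    """

    def __init__(self, block_bytes: int = BLOCK_BYTES) -> None:
        self._block_bytes = block_bytes
        self._buffer = bytearray()

    def push(self, chunk: bytes) -> list[bytes]:
        """Add a chunk and return every complete block now available."""
        self._buffer.extend(chunk)
        blocks = []
        while len(self._buffer) >= self._block_bytes:
            blocks.append(bytes(self._buffer[: self._block_bytes]))
            del self._buffer[: self._block_bytes]
        return blocks

    @property
    def pending_bytes(self) -> int:
        """Bytes held back until the next block fills up."""
        return len(self._buffer)
```

Leftover bytes stay in the buffer until a later chunk completes the block, which is why the caller's chunk size does not matter.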

Frames work the same way: construct with enable_video=False and push with feed_video():

client = AttentionClient(token="...", enable_video=False)
client.start()

# in your frame callback, pre-encoded JPEG bytes, or a raw BGR np.ndarray:
client.feed_video(jpeg_bytes)        # bytes / bytearray / memoryview, sent as-is
client.feed_video(bgr_frame)         # np.ndarray, JPEG-encoded at CameraConfig.jpeg_quality

Calling feed_video while enable_video=True raises. The two switches are independent: enable_video=False with the internal mic pairs SDK-captured audio with externally fed frames, while enable_audio=False with feed_audio() lets the SDK keep grabbing camera frames.

Events

Register handlers with decorators. All callbacks fire on internal threads, so keep them fast or hand work off to your own thread.

@client.on_prediction
def handle(event):
    ...

decorator	payload	fires when
`@on_connected`	none	WebSocket opens
`@on_started`	none	Server has loaded the model
`@on_warmup_complete`	none	Model warmed up and producing predictions
`@on_prediction`	`PredictionEvent`	Each attention prediction
`@on_vad`	`VadEvent`	Voice activity update
`@on_state`	`StateEvent`	Conversation state transition
`@on_turn_ready`	`TurnReadyEvent`	Complete user turn ready to forward
`@on_config`	`ConfigEvent`	Server acks a threshold change
`@on_stats`	`StatsEvent`	Every ~10s with connection health
`@on_interrupt`	`InterruptEvent`	User is barging in mid-LLM-response
`@on_interjection`	`InterjectionEvent`	Proactive AI volunteer after humans go quiet
`@on_error`	`AttentionErrorEvent`	Connection, auth, or server error
`@on_disconnected`	`DisconnectedEvent`	WebSocket closes
`@on_reconnecting`	`ReconnectingEvent`	Before each auto-reconnect attempt
`@on_reconnected`	`ReconnectedEvent`	Auto-reconnect succeeded

Event types

`PredictionEvent`

cls: int            # 0 = silent, 1 = human-directed, 2 = device-directed
confidence: float   # 0..1
source: str         # "model" | "rules" | "ai_responding"
num_faces: int      # faces detected in frame
responding: bool    # True while the AI is mid-playback
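
As a rough illustration of how these fields might be combined on the client side (for a UI indicator, say), the sketch below filters for confident device-directed predictions. `FakePrediction` is a hypothetical stand-in that only mirrors the documented fields; in real use the event comes from the SDK. The 0.7 default simply echoes `initial_threshold`, and whether the server has already applied its own threshold before emitting an event is not stated here.

```python
from dataclasses import dataclass

LABELS = {0: "silent", 1: "human", 2: "device"}
DEVICE_CLASS = 2


@dataclass
class FakePrediction:
    """Hypothetical stand-in with the fields documented for PredictionEvent."""
    cls: int
    confidence: float
    source: str = "model"
    num_faces: int = 0
    responding: bool = False


def is_addressed(event, threshold: float = 0.7) -> bool:
    """True for a confident device-directed prediction while the AI is not responding."""
    return (
        event.cls == DEVICE_CLASS
        and event.confidence >= threshold
        and not event.responding
    )


def describe(event) -> str:
    """One-line summary in the same shape as the Quickstart print."""
    label = LABELS.get(event.cls, "?")
    return f"{label} {event.confidence:.0%} faces={event.num_faces} src={event.source}"
```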

`VadEvent`

probability: float  # VAD probability 0..1
is_speech: bool     # whether speech was detected

`StateEvent`

state: ConversationState  # "listening" | "sending" | "cancelled" | "idle"

`TurnReadyEvent`

audio_pcm16: np.ndarray   # int16 array @ 16 kHz mono
audio_base64: str          # same audio as base64, ready for OpenAI Realtime, etc.
duration_sec: float        # duration in seconds
frames: list[TurnFrame]    # JPEG stills, empty unless the server has frames_per_turn > 0
context: str | None        # e.g. "interjection_follow_up"; None for normal turns
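
If your downstream service expects a WAV file rather than raw samples, the base64 payload can be wrapped with the standard library alone. This is a minimal sketch, assuming `audio_base64` decodes to headerless little-endian PCM16 at 16 kHz mono (the byte order is an assumption); the function names are hypothetical.

```python
import base64
import io
import wave


def turn_audio_to_wav(audio_base64: str, sample_rate: int = 16_000) -> bytes:
    """Wrap base64-encoded mono PCM16 in a WAV container and return the file bytes."""
    pcm = base64.b64decode(audio_base64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm16_duration_sec(audio_base64: str, sample_rate: int = 16_000) -> float:
    """Duration implied by the payload size: bytes / 2 bytes per sample / rate."""
    return len(base64.b64decode(audio_base64)) / 2 / sample_rate
```

Comparing `pcm16_duration_sec(turn.audio_base64)` against `turn.duration_sec` is a cheap sanity check that the payload was decoded with the right format assumptions.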

`ConfigEvent`

model_class2_threshold: float  # server-confirmed threshold

`StatsEvent`

rtt_ms: float | None  # round-trip latency in ms
sent_video: int        # total video frames sent
skipped_video: int     # total video frames skipped
sent_audio: int        # total audio chunks sent
uptime_s: float        # connection uptime in seconds

`InterruptEvent`

fade_ms: int        # suggested fade duration (ms) before stopping playback
confidence: float   # raw model confidence of the class-2 prediction that fired

Fires when the server detects the user trying to take the turn back while the LLM is mid-response. The server has already moved its state machine to listening and pre-rolled the user's recent audio into the next turn, so the following turn_ready event will carry the actual barge-in question. The consumer's job is to (a) fade and stop its local LLM playback over fade_ms, (b) cancel any in-flight LLM response, and (c) re-open the mic immediately. Do not wait for the fade to finish, or the user's continued speech is dropped for the duration of the fade.

`InterjectionEvent`

reason: str                # why the volunteer fired
audio_pcm16: np.ndarray    # int16 array @ 16 kHz mono (recent conversation audio)
audio_base64: str          # same audio as base64
duration_sec: float        # duration in seconds

Fires when humans chat and then go quiet, so the agent can volunteer a brief check-in. Hand audio_base64 to your LLM as context for the volunteer prompt.

`AttentionErrorEvent`

title: str                  # error category ("Auth Failed", "Connection Stalled", etc.)
message: str                # human-readable message
detail: str | None = None   # technical detail
code: int | None = None     # WebSocket close code, if applicable
kind: str | None = None     # transport | auth | rate_limit | audio | server | environment
retriable: bool = False     # True if the SDK will (or you could) retry
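
One way an application might act on `kind` and `retriable` is sketched below. The policy and the action names are purely hypothetical, chosen for illustration; the SDK does not prescribe them. The function takes plain values rather than an event so it can be tried without a live connection.

```python
def next_action(kind, retriable, auto_reconnect=True):
    """Map an error's documented fields to a hypothetical application action."""
    if kind == "auth":
        # Retrying with the same key is unlikely to help.
        return "check-api-key"
    if retriable:
        # With auto_reconnect=True the SDK retries on its own; otherwise the app must.
        return "wait-for-reconnect" if auto_reconnect else "restart-client"
    return "stop"
```

Inside an `@client.on_error` handler this would be called as `next_action(event.kind, event.retriable)`.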

`DisconnectedEvent`

code: int        # WebSocket close code
reason: str      # close reason
was_clean: bool  # True if code == 1000

`ReconnectingEvent`

attempt: int      # 1-based attempt counter
delay_s: float    # backoff delay before this attempt
last_code: int    # close code that triggered the reconnect

`ReconnectedEvent`

attempts: int     # attempts it took to reconnect

LLM integration

LLM routing is intentionally not part of the SDK. The turn_ready event hands you PCM16 audio, both as a NumPy array and as base64; forward it wherever you like.

When your LLM starts generating, call mute() and mark_responding(True) to suppress predictions during playback. When it finishes, call unmute() and mark_responding(False).

from saa import AttentionClient

client = AttentionClient(token="...")

@client.on_turn_ready
def _(turn):
    # Forward to your LLM of choice (your_llm is a placeholder for your own client)
    your_llm.send(turn.audio_base64)

def on_llm_speaking():
    client.mute()
    client.mark_responding(True)

def on_llm_done():
    client.unmute()
    client.mark_responding(False)
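
Because the two pairs of calls must always be balanced, one option is to wrap them in a context manager so the mic is restored even if playback raises. This is a sketch, not an SDK feature: `responding` is a hypothetical helper that uses only the documented mute(), unmute(), and mark_responding() methods, and `RecordingClient` is a stand-in for AttentionClient so the example runs without a connection.

```python
from contextlib import contextmanager


@contextmanager
def responding(client):
    """Mute SAA for the duration of an LLM response, restoring it even on error."""
    client.mute()
    client.mark_responding(True)
    try:
        yield
    finally:
        client.unmute()
        client.mark_responding(False)


class RecordingClient:
    """Hypothetical stand-in for AttentionClient that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def mute(self):
        self.calls.append("mute")

    def unmute(self):
        self.calls.append("unmute")

    def mark_responding(self, value):
        self.calls.append(f"mark_responding({value})")
```

With a real client, the body of `with responding(client):` would be whatever plays your LLM's audio. If you also handle interrupt events, the restore calls may run twice; whether repeated calls are harmless is not stated here, so treat that as something to verify.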

Barge-in (interrupt) handling

When the server detects the user trying to take the turn back while the LLM is speaking, it fires interrupt. Wire it to a fade-and-cancel on your LLM playback layer, then re-open the mic immediately:

@client.on_interrupt
def _(event):
    # Fade your local LLM audio and cancel its in-flight response.
    your_llm.interrupt(event.fade_ms)
    # Re-open the mic immediately; do NOT wait for the fade to finish,
    # or the user's continued speech is dropped for the fade duration.
    client.unmute()
    client.mark_responding(False)

The server has already moved its state machine to listening and pre-rolled the user's recent audio into the chunk accumulator by the time this event arrives. The next turn_ready event will carry the user's actual barge-in question.
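
What "fade over fade_ms" means for your playback layer depends on your audio stack. As one possible reading, the sketch below applies a linear ramp to zero over the first fade_ms of the PCM16 audio still queued for playback and drops the rest. The function name is hypothetical, little-endian mono PCM16 is assumed, and `sample_rate` must be your playback rate, which may differ from SAA's 16 kHz.

```python
import struct


def fade_out_pcm16(pcm: bytes, fade_ms: int, sample_rate: int) -> bytes:
    """Return the first fade_ms of pcm with a linear ramp from full volume to silence.

    Audio beyond the fade window is discarded, since playback stops there.
    """
    total_samples = len(pcm) // 2
    fade_len = min(total_samples, sample_rate * fade_ms // 1000)
    samples = struct.unpack(f"<{fade_len}h", pcm[: fade_len * 2])
    # Integer gain (fade_len - i) / fade_len avoids float rounding surprises.
    faded = [s * (fade_len - i) // fade_len for i, s in enumerate(samples)]
    return struct.pack(f"<{fade_len}h", *faded)
```

The faded tail can be handed to the output device while the handler has already returned and re-opened the mic, which keeps the fade from delaying the user's barge-in.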

See saa-py-demo for a full working example with OpenAI Realtime.

Threading model

The SDK manages these threads internally:

thread	purpose
`saa-ws`	WebSocket send/receive
`saa-heartbeat`	JSON pings every 5s, stats every 10s
`saa-camera-read`	Drains frames off the webcam into a latest-frame buffer
`saa-camera-send`	JPEG-encodes the latest frame and sends it at 4 fps (250 ms)
(sounddevice)	Audio callback at native sample rate, resampled to 16 kHz

Camera capture (when enable_video=True) runs as two threads — a reader and a sender — so a slow encode never stalls frame acquisition. Two more threads spawn transiently: saa-reconnect runs the backoff loop while auto-reconnecting, and saa-clientlog posts the HTTP-beacon fallback for send_client_log().

All event callbacks fire on saa-ws, saa-heartbeat, or saa-reconnect (the last for reconnect events). Don't block them; offload heavy work to your own thread.
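
A common way to keep callbacks short is to push events onto a queue that a dedicated thread drains. The sketch below uses only the standard library; `CallbackWorker` and the thread name `app-worker` are hypothetical and not part of the SDK.

```python
import queue
import threading


class CallbackWorker:
    """Run slow event handling on a dedicated thread so SDK callbacks return quickly."""

    def __init__(self, handler, maxsize: int = 32) -> None:
        self._handler = handler
        self._queue = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(target=self._run, name="app-worker", daemon=True)
        self._thread.start()

    def submit(self, event) -> bool:
        """Enqueue without blocking; returns False if the queue is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Send the stop sentinel and wait for queued events to finish."""
        self._queue.put(None)
        self._thread.join()

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:   # sentinel from close()
                break
            self._handler(event)
```

An `@client.on_turn_ready` handler then reduces to `worker.submit(turn)`, and the slow part (an LLM request, for example) runs in `handler` on the worker thread. The bounded queue is a deliberate choice: a full queue drops the event instead of stalling an SDK thread.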

License

Apache-2.0

Release history

version	date
`0.7.2`	Jul 15, 2026 (this version)
`0.7.1`	Jun 25, 2026
`0.7.0`	Jun 23, 2026
`0.6.1`	Jun 19, 2026
`0.6.0`	Jun 19, 2026
`0.3.1`	May 19, 2026
`0.3.0`	May 14, 2026
