Python SDK for Hathora Voice AI API - Speech-to-Text and Text-to-Speech (pip install yapp, import hathora)

Hathora Python SDK

The official Python SDK for the Hathora Voice AI API. Easily integrate speech-to-text (STT) and text-to-speech (TTS) capabilities into your Python applications.

Features

Simple, intuitive API - Clean, Pythonic interface
Multiple TTS models - Kokoro-82M and ResembleAI Chatterbox
Model-specific parameters - Each model has its own unique parameters with validation
Voice cloning with ResembleAI's audio prompt feature
Flexible audio handling - Works with file paths, file objects, or raw bytes
Type hints for better IDE support
Comprehensive error handling

Available Models

Speech-to-Text (STT)

Model	Parameters	Description
Parakeet	`file`	Audio file to transcribe (required, positional)
	`start_time`	Start time in seconds for transcription window (optional)
	`end_time`	End time in seconds for transcription window (optional)

Example:

# Basic usage
client.speech_to_text.convert("parakeet", "audio.wav")

# With time window
client.speech_to_text.convert("parakeet", "audio.wav", start_time=3.0, end_time=9.0)

Text-to-Speech (TTS)

Model	Parameters	Description
Kokoro	`voice`	Voice ID (default: "af_bella")
	`speed`	Speech speed multiplier: 0.5-2.0 (default: 1.0)
ResembleAI	`audio_prompt`	Reference audio file for voice cloning (optional)
	`exaggeration`	Emotional intensity: 0.0-1.0 (default: 0.5)
	`cfg_weight`	Adherence to reference voice: 0.0-1.0 (default: 0.5)
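
The numeric ranges in the table can also be checked on the client side before a request is sent. The following is a minimal sketch: `TTS_RANGES` and `out_of_range` are hypothetical names that are not part of the SDK, and the bounds are copied from the table above rather than read from the library.

```python
# Ranges copied from the TTS table above. The helper is hypothetical and
# is not part of the SDK.
TTS_RANGES = {
    "kokoro": {"speed": (0.5, 2.0)},
    "resemble": {"exaggeration": (0.0, 1.0), "cfg_weight": (0.0, 1.0)},
}


def out_of_range(model, **params):
    """Return {name: (value, low, high)} for params outside their documented range."""
    problems = {}
    for name, value in params.items():
        bounds = TTS_RANGES.get(model, {}).get(name)
        if bounds is None:
            continue  # no documented numeric range (e.g. voice, audio_prompt)
        low, high = bounds
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems


print(out_of_range("kokoro", voice="af_bella", speed=2.5))
# {'speed': (2.5, 0.5, 2.0)}
print(out_of_range("resemble", exaggeration=0.7, cfg_weight=0.6))
# {}
```

An empty dict means every numeric value sits inside its documented range. The check says nothing about whether the parameter names themselves are valid for the model.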

Installation

Install from PyPI:

pip install yapp

Or install from source:

git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .

Quick Start

import hathora

# Initialize the client
client = hathora.Hathora(api_key="your-api-key")

# Transcribe audio to text
transcription = client.speech_to_text.convert("parakeet", "audio.wav")
print(transcription.text)

# Generate speech from text
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")

Authentication

You can provide your API key in two ways:

1. Pass it directly to the client:

client = hathora.Hathora(api_key="your-api-key")

2. Set it as an environment variable:

export HATHORA_API_KEY="your-api-key"

client = hathora.Hathora()  # Will use HATHORA_API_KEY from environment
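
If an application supports both options, it can help to resolve the key in one place and fail early with a clear message. The sketch below uses a hypothetical helper, `resolve_api_key`. Preferring the explicit argument over the environment variable is an assumption made for this example, not documented SDK behavior.

```python
import os


def resolve_api_key(api_key=None, env_var="HATHORA_API_KEY"):
    """Return the explicit key if given, otherwise the value of env_var.

    Hypothetical helper: the precedence (argument first, then environment)
    is an assumption for illustration, not taken from the SDK.
    """
    key = api_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found: pass api_key or set {env_var}")
    return key
```

The result can then be passed on as `hathora.Hathora(api_key=resolve_api_key())`.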

Usage Examples

Speech-to-Text (Transcription)

Basic Transcription

The SDK uses the Parakeet multilingual STT model for transcription.

import hathora

client = hathora.Hathora(api_key="your-api-key")

# Transcribe an entire audio file using Parakeet
response = client.speech_to_text.convert("parakeet", "audio.wav")
print(response.text)

Transcription with Time Window

# Transcribe only a specific time range
response = client.speech_to_text.convert(
    "parakeet",   # Model (positional)
    "audio.wav",  # File (positional)
    start_time=3.0,  # Start at 3 seconds
    end_time=9.0     # End at 9 seconds
)
print(response.text)
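
For a long recording, the time window can be stepped across the file in fixed-size chunks. The sketch below only computes the `(start_time, end_time)` pairs. `time_windows` is a hypothetical helper, and the total duration is assumed to be known in advance.

```python
def time_windows(total_seconds, window_seconds):
    """Yield consecutive (start_time, end_time) pairs covering the recording."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        yield start, end
        start = end


print(list(time_windows(25.0, 10.0)))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

Each pair could be passed as `start_time` and `end_time` to `client.speech_to_text.convert()`. Words that straddle a window boundary may be cut, so overlapping windows might be preferable in practice.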

Multiple Audio Formats

The SDK accepts audio as a file path, a pathlib.Path, an open binary file object, or raw bytes:

# From file path (string)
response = client.speech_to_text.convert("parakeet", "audio.wav")

# From pathlib.Path
from pathlib import Path
response = client.speech_to_text.convert("parakeet", Path("audio.mp3"))

# From file object
with open("audio.wav", "rb") as f:
    response = client.speech_to_text.convert("parakeet", f)

# From bytes (the with block closes the file after reading)
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
response = client.speech_to_text.convert("parakeet", audio_bytes)

Text-to-Speech (Synthesis)

Using Kokoro-82M Model

Kokoro parameters: voice, speed

import hathora

client = hathora.Hathora(api_key="your-api-key")

# Simple synthesis (uses defaults)
response = client.text_to_speech.convert(
    "kokoro",  # Model first
    "Hello world!"
)
response.save("output.wav")

# With custom voice and speed
response = client.text_to_speech.convert(
    "kokoro",  # Model first
    "The quick brown fox jumps over the lazy dog.",
    voice="af_bella",  # Kokoro parameter
    speed=1.2          # Kokoro parameter - 20% faster
)
response.save("output_fast.wav")

# Or use the kokoro() method directly
response = client.text_to_speech.kokoro(
    text="Direct method call",
    voice="af_bella",
    speed=0.8  # 20% slower
)
response.save("output_slow.wav")

Using ResembleAI Model (with Voice Cloning)

ResembleAI parameters: audio_prompt, exaggeration, cfg_weight

# Simple generation
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "Hello world!",
    exaggeration=0.5,  # Emotional intensity (0.0 - 1.0)
    cfg_weight=0.5     # Adherence to reference voice (0.0 - 1.0)
)
response.save("output.wav")

# Voice cloning with audio prompt
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "This should sound like the reference voice.",
    audio_prompt="reference_voice.wav",  # Reference audio for cloning
    cfg_weight=0.9                       # High adherence to reference
)
response.save("cloned_voice.wav")

# Highly expressive speech
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "Wow! This is amazing!",
    exaggeration=0.9,  # High emotional intensity
    cfg_weight=0.5
)
response.save("expressive.wav")

# Or use the resemble() method directly
response = client.text_to_speech.resemble(
    text="Direct method call",
    audio_prompt="reference.wav",
    exaggeration=0.7,
    cfg_weight=0.8
)
response.save("output.wav")

Discovering Model Parameters

The TTS resource provides methods to discover which parameters each model accepts. The Parakeet (STT) parameters are listed in comments first for reference:

# Parakeet (STT) parameters
# Model: parakeet
# Parameters:
#   - file (required): Audio file to transcribe
#   - start_time (optional): Start time in seconds
#   - end_time (optional): End time in seconds
client.speech_to_text.convert("parakeet", "audio.wav", start_time=0, end_time=10)

# List all available TTS models
models = client.text_to_speech.list_models()
print(models)  # ['kokoro', 'resemble']

# Print help for a specific TTS model
client.text_to_speech.print_model_help("kokoro")
# Output:
# Model: kokoro
# Parameters:
#   - voice (str, default='af_bella'): Voice to use for synthesis
#   - speed (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)

client.text_to_speech.print_model_help("resemble")
# Output:
# Model: resemble
# Parameters:
#   - audio_prompt (AudioFile, default=None): Reference audio file for voice cloning (optional)
#   - exaggeration (float, default=0.5): Emotional intensity, range 0.0-1.0
#   - cfg_weight (float, default=0.5): Adherence to reference voice, range 0.0-1.0

# Get parameter specifications programmatically
params = client.text_to_speech.get_model_parameters("kokoro")
for param_name, param_info in params.items():
    print(f"{param_name}: {param_info['description']}")
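
The dictionary returned by `get_model_parameters()` can be turned into a compact summary. The sketch below assumes each entry is a dict with a `description` key (as used above) and a `default` key. The `default` key name is a guess based on the description of the return value, so it is read with `.get()`. `format_parameters` and the sample data are hypothetical.

```python
def format_parameters(params):
    """Render a parameter-spec dict as one line per parameter, sorted by name."""
    lines = []
    for name, info in sorted(params.items()):
        default = info.get("default", "n/a")  # 'default' key name is an assumption
        description = info.get("description", "")
        lines.append(f"{name} (default={default!r}): {description}")
    return "\n".join(lines)


# Hypothetical sample shaped like the documented return value.
sample = {
    "voice": {"default": "af_bella", "description": "Voice to use for synthesis"},
    "speed": {"default": 1.0, "description": "Speech speed multiplier"},
}
print(format_parameters(sample))
# speed (default=1.0): Speech speed multiplier
# voice (default='af_bella'): Voice to use for synthesis
```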

Parameter Validation

The SDK validates that you're using the correct parameters for each model:

# This works - correct Kokoro parameters
response = client.text_to_speech.convert(
    "kokoro", "Hello", voice="af_bella", speed=1.2
)

# This raises ValidationError with a helpful message
from hathora import ValidationError  # assumes the exception is exported by the hathora package

try:
    response = client.text_to_speech.convert(
        "resemble", "Hello", speed=1.2  # ERROR!
    )
except ValidationError as e:
    print(e)
    # Output: Unknown parameters for ResembleAI model: speed.
    #         Valid parameters: audio_prompt, exaggeration, cfg_weight
    #         Use client.text_to_speech.print_model_help('resemble') for more details.

# This also raises ValidationError
response = client.text_to_speech.convert(
    "kokoro", "Hello", exaggeration=0.5  # ERROR!
)
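
When one settings dictionary is shared across models, unsupported keys can be separated out before calling `convert()`, which avoids triggering the `ValidationError`. This is a minimal sketch: `VALID_TTS_PARAMS` and `split_params` are hypothetical, and the parameter names are copied from this document rather than queried from the SDK.

```python
# Parameter names as documented for each TTS model in this README.
VALID_TTS_PARAMS = {
    "kokoro": {"voice", "speed"},
    "resemble": {"audio_prompt", "exaggeration", "cfg_weight"},
}


def split_params(model, params):
    """Return (accepted, rejected) dicts for the given model."""
    valid = VALID_TTS_PARAMS[model]
    accepted = {k: v for k, v in params.items() if k in valid}
    rejected = {k: v for k, v in params.items() if k not in valid}
    return accepted, rejected


settings = {"voice": "af_bella", "speed": 1.2, "exaggeration": 0.7}
accepted, rejected = split_params("kokoro", settings)
print(accepted)  # {'voice': 'af_bella', 'speed': 1.2}
print(rejected)  # {'exaggeration': 0.7}
```

The accepted dict could then be unpacked into the call, for example `client.text_to_speech.convert("kokoro", "Hello", **accepted)`.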

Working with Audio Responses

# Save to file
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")

# Or use stream_to_file (alias for save)
response.stream_to_file("output.wav")

# Get raw bytes
audio_bytes = response.content
print(f"Generated {len(audio_bytes)} bytes")

# Check content type
print(response.content_type)  # e.g., "audio/wav"
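
The raw bytes can be inspected without writing a file. The sketch below uses the standard-library `wave` module to compute a clip's duration. It assumes the audio is PCM WAV that `wave` can parse; other encodings raise `wave.Error`. `wav_duration_seconds` is a hypothetical helper, and the demo builds its own silent clip instead of calling the API.

```python
import io
import wave


def wav_duration_seconds(audio_bytes):
    """Return the duration in seconds of in-memory PCM WAV data."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Self-contained demo: one second of 16-bit mono silence at 8 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)

print(wav_duration_seconds(buffer.getvalue()))  # 1.0
```

With a real response, the call would be `wav_duration_seconds(response.content)`.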

API Reference

`hathora.Hathora`

Main client class for the Hathora API.

Parameters:

api_key (str, optional): Your Hathora API key
timeout (int, default=30): Request timeout in seconds

Properties:

speech_to_text: Speech-to-text (STT) resource for audio transcription
text_to_speech: Text-to-speech (TTS) resource for audio synthesis

`client.speech_to_text.convert()`

Transcribe audio to text using the Parakeet STT model.

Parameters:

model (str): STT model to use (currently: "parakeet") - positional, required
file (str | Path | BinaryIO | bytes): Audio file to transcribe - positional, required
start_time (float, optional): Start time in seconds for transcription window
end_time (float, optional): End time in seconds for transcription window
**kwargs: Additional model-specific parameters (reserved for future use)

Example:

# Both model and file are positional
response = client.speech_to_text.convert("parakeet", "audio.wav")

Available Models:

"parakeet" - nvidia/parakeet-tdt-0.6b-v3 - Multilingual ASR with word-level timestamps

Returns: TranscriptionResponse

.text: The transcribed text
.metadata: Additional metadata from the API (may include word-level timestamps)

Supported audio formats: WAV, MP3, MP4, M4A, OGG, FLAC, PCM

`client.text_to_speech.convert()`

Generate speech from text. This is a unified interface that routes to the appropriate model.

Parameters:

model (str): Model to use ("kokoro" or "resemble") - required, first parameter
text (str): Text to convert to speech
**kwargs: Model-specific parameters (see below)

Model-Specific Parameters:

For Kokoro model:

voice (str, default="af_bella"): Voice to use for synthesis
speed (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)

For ResembleAI model:

audio_prompt (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
exaggeration (float, default=0.5): Emotional intensity, range 0.0-1.0
cfg_weight (float, default=0.5): Adherence to reference voice, range 0.0-1.0

Returns: AudioResponse

Examples:

# Kokoro - model comes first!
response = client.text_to_speech.convert(
    "kokoro", "Hello", voice="af_bella", speed=1.2
)

# ResembleAI - model comes first!
response = client.text_to_speech.convert(
    "resemble", "Hello", exaggeration=0.7, cfg_weight=0.6
)

See also: Use print_model_help() to discover parameters

`client.text_to_speech.list_models()`

List all available TTS models.

Returns: list - List of model names

Example:

models = client.text_to_speech.list_models()
print(models)  # ['kokoro', 'resemble']

`client.text_to_speech.get_model_parameters()`

Get parameter specifications for a specific model.

Parameters:

model (str): Model name

Returns: dict - Parameter specifications with types, defaults, and descriptions

Example:

params = client.text_to_speech.get_model_parameters("kokoro")
for name, info in params.items():
    print(f"{name}: {info['description']}")

`client.text_to_speech.print_model_help()`

Print helpful information about a model's parameters to the console.

Parameters:

model (str): Model name

Example:

client.text_to_speech.print_model_help("kokoro")
# Prints:
# Model: kokoro
# Parameters:
#   - voice (str, default='af_bella'): Voice to use for synthesis
#   - speed (float, default=1.0): Speech speed multiplier...

`client.text_to_speech.kokoro()`

Generate speech using the Kokoro-82M model.

Parameters:

text (str): Text to convert to speech
voice (str, default="af_bella"): Voice to use
speed (float, default=1.0): Speech speed multiplier

Returns: AudioResponse

`client.text_to_speech.resemble()`

Generate speech using ResembleAI Chatterbox with voice cloning.

Parameters:

text (str): Text to convert to speech
audio_prompt (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
exaggeration (float, default=0.5): Emotional intensity (0.0 - 1.0)
cfg_weight (float, default=0.5): Adherence to reference voice (0.0 - 1.0)

Returns: AudioResponse

`AudioResponse`

Response object containing generated audio.

Properties:

content: Raw audio bytes
content_type: MIME type of the audio

Methods:

save(file_path): Save audio to file
stream_to_file(file_path): Alias for save()

`TranscriptionResponse`

Response object containing transcribed text.

Properties:

text: The transcribed text
metadata: Additional metadata

Complete Workflow Example

import hathora

# Initialize client
client = hathora.Hathora(api_key="your-api-key")

# 1. Transcribe audio
transcription = client.speech_to_text.convert(
    "parakeet",      # Model (positional)
    "original.wav",  # File (positional)
    start_time=0,
    end_time=10
)
print(f"Original: {transcription.text}")

# 2. Modify the text
modified_text = transcription.text.upper()

# 3. Generate new speech with Kokoro
response = client.text_to_speech.convert(
    "kokoro", modified_text, voice="af_bella", speed=1.0
)
response.save("output_kokoro.wav")

# 4. Clone voice from original audio
cloned = client.text_to_speech.convert(
    "resemble", "New text in the original voice",
    audio_prompt="original.wav", cfg_weight=0.9
)
cloned.save("cloned_voice.wav")

Error Handling

The SDK provides specific exception types for different error scenarios:

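import hathora
# The package is installed as "yapp" but imported as "hathora"
from hathora import HathoraError, APIError, AuthenticationError, ValidationError

client = hathora.Hathora(api_key="your-api-key")

try: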
    response = client.text_to_speech.convert("kokoro", "Hello world!")
    response.save("output.wav")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except ValidationError as e:
    print(f"Invalid parameters: {e}")
except APIError as e:
    print(f"API error (status {e.status_code}): {e.message}")
except HathoraError as e:
    print(f"Hathora SDK error: {e}")
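
Transient failures can be wrapped in a retry loop with exponential backoff. The sketch below is generic and does not depend on the SDK. `call_with_retries` is a hypothetical helper, and which exception types are worth retrying is an assumption left to the caller.

```python
import time


def call_with_retries(func, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on retry_on exceptions, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


# Demo with a function that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retries(flaky, ConnectionError, sleep=lambda seconds: None))  # ok
```

With the SDK, the wrapped call might look like `call_with_retries(lambda: client.text_to_speech.convert("kokoro", "Hello world!"), APIError)`. Authentication and validation errors are unlikely to succeed on a retry, so they are probably best left out of `retry_on`.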

Supported Audio Formats

Input (Transcription)

WAV (.wav)
MP3 (.mp3)
MP4 Audio (.mp4, .m4a)
OGG (.ogg)
FLAC (.flac)
PCM (.pcm)
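
A quick extension check can catch unsupported files before they are uploaded. This is a minimal sketch: `has_supported_extension` is a hypothetical helper, and it looks only at the file name, not at the actual audio encoding.

```python
from pathlib import Path

# Extensions from the input format list above.
SUPPORTED_INPUT_EXTENSIONS = {".wav", ".mp3", ".mp4", ".m4a", ".ogg", ".flac", ".pcm"}


def has_supported_extension(path):
    """True if the file's extension (case-insensitive) is in the documented list."""
    return Path(path).suffix.lower() in SUPPORTED_INPUT_EXTENSIONS


print(has_supported_extension("meeting.MP3"))  # True
print(has_supported_extension("notes.txt"))    # False
```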

Output (Synthesis)

WAV (default output format)

Development

Running Examples

cd examples
python discover_parameters.py   # Learn about model parameters
python transcribe_audio.py      # Speech-to-text examples
python synthesize_speech.py     # Text-to-speech examples
python voice_cloning.py         # Voice cloning with ResembleAI
python model_parameters.py      # Model-specific parameter examples
python full_workflow.py         # Complete workflow

Installing for Development

git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .

Roadmap

Add streaming support for real-time TTS
Support for additional TTS models
Async client support
Audio format conversion utilities
Batch processing capabilities
WebSocket support for real-time conversations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

GitHub Issues: https://github.com/hathora/yapp-sdk/issues
Documentation: https://docs.hathora.com
Email: support@hathora.com

Release history

0.3.0 (this version): Oct 29, 2025
0.2.0: Oct 27, 2025

Download files

Source Distribution

yapp-0.3.0.tar.gz (19.1 kB)

Uploaded Oct 29, 2025 Source

Built Distribution

yapp-0.3.0-py3-none-any.whl (15.9 kB)

Uploaded Oct 29, 2025 Python 3

File details

Details for the file yapp-0.3.0.tar.gz.

File metadata

Download URL: yapp-0.3.0.tar.gz
Upload date: Oct 29, 2025
Size: 19.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for yapp-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`e06d3ba60dcfd277445ea60f303b1b95ca1aee8d5e97bbd8e43dea13321d7c6d`
MD5	`86eccaec62fd6262e8215dd79857b6d8`
BLAKE2b-256	`49c82d085787cd197de164c3c28b2822ce70773a39b84abdc0aea62307edf761`

File details

Details for the file yapp-0.3.0-py3-none-any.whl.

File metadata

Download URL: yapp-0.3.0-py3-none-any.whl
Upload date: Oct 29, 2025
Size: 15.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for yapp-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cdea6e1b1e1517cabb2b02dccf9d5faa50c1fdf8b1cf7eca279c74460e8d8718`
MD5	`29e9455f8b01bce11b58b02d5f7317ed`
BLAKE2b-256	`c30d7b356eb3f07d64c41939fa0527bd26b1ee9076ffc4145d20ddec955ee5e0`
