Python SDK for Hathora Voice AI API - Speech-to-Text and Text-to-Speech (pip install yapp, import hathora)
Hathora Python SDK
The official Python SDK for the Hathora Voice AI API. Easily integrate speech-to-text (STT) and text-to-speech (TTS) capabilities into your Python applications.
Features
- Simple, intuitive API - Clean, Pythonic interface
- Multiple TTS models - Kokoro-82M and ResembleAI Chatterbox
- Model-specific parameters - Each model has its own unique parameters with validation
- Voice cloning with ResembleAI's audio prompt feature
- Flexible audio handling - Works with file paths, file objects, or raw bytes
- Type hints for better IDE support
- Comprehensive error handling
Available Models
Speech-to-Text (STT)
| Model | Parameters | Description |
|---|---|---|
| Parakeet | file | Audio file to transcribe (required, positional) |
| | start_time | Start time in seconds for transcription window (optional) |
| | end_time | End time in seconds for transcription window (optional) |
Example:
# Basic usage
client.speech_to_text.convert("parakeet", "audio.wav")
# With time window
client.speech_to_text.convert("parakeet", "audio.wav", start_time=3.0, end_time=9.0)
Text-to-Speech (TTS)
| Model | Parameters | Description |
|---|---|---|
| Kokoro | voice | Voice ID (default: "af_bella") |
| | speed | Speech speed multiplier: 0.5-2.0 (default: 1.0) |
| ResembleAI | audio_prompt | Reference audio file for voice cloning (optional) |
| | exaggeration | Emotional intensity: 0.0-1.0 (default: 0.5) |
| | cfg_weight | Adherence to reference voice: 0.0-1.0 (default: 0.5) |
Installation
Install from PyPI:
pip install yapp
Or install from source:
git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .
Quick Start
import hathora
# Initialize the client
client = hathora.Hathora(api_key="your-api-key")
# Transcribe audio to text
transcription = client.speech_to_text.convert("parakeet", "audio.wav")
print(transcription.text)
# Generate speech from text
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")
Authentication
You can provide your API key in two ways:
1. Pass it directly to the client:
client = hathora.Hathora(api_key="your-api-key")
2. Set it as an environment variable:
export HATHORA_API_KEY="your-api-key"
client = hathora.Hathora() # Will use HATHORA_API_KEY from environment
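The two options above imply a lookup order, which can be sketched as follows. This is a minimal illustration, not the SDK's actual implementation: `resolve_api_key` is a hypothetical helper, and the rule that an explicit argument takes precedence over `HATHORA_API_KEY` is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to HATHORA_API_KEY.

    Hypothetical helper: the SDK's own resolution logic may differ.
    """
    key = api_key or os.environ.get("HATHORA_API_KEY")
    if not key:
        raise ValueError(
            "No API key found: pass api_key=... or set HATHORA_API_KEY"
        )
    return key
```

Failing early with a clear message like this tends to be easier to debug than an authentication error returned by the first request.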
Usage Examples
Speech-to-Text (Transcription)
Basic Transcription
The SDK uses the Parakeet multilingual STT model for transcription.
import hathora
client = hathora.Hathora(api_key="your-api-key")
# Transcribe an entire audio file using Parakeet
response = client.speech_to_text.convert("parakeet", "audio.wav")
print(response.text)
Transcription with Time Window
# Transcribe only a specific time range
response = client.speech_to_text.convert(
    "parakeet",      # Model (positional)
    "audio.wav",     # File (positional)
    start_time=3.0,  # Start at 3 seconds
    end_time=9.0     # End at 9 seconds
)
print(response.text)
Multiple Audio Formats
The SDK automatically handles various audio formats:
# From file path (string)
response = client.speech_to_text.convert("parakeet", "audio.wav")
# From pathlib.Path
from pathlib import Path
response = client.speech_to_text.convert("parakeet", Path("audio.mp3"))
# From file object
with open("audio.wav", "rb") as f:
    response = client.speech_to_text.convert("parakeet", f)
# From bytes
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
response = client.speech_to_text.convert("parakeet", audio_bytes)
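Accepting four input kinds usually comes down to normalizing them to bytes before upload. The sketch below shows one way to do that; `read_audio_bytes` and the `AudioInput` alias are hypothetical names used for illustration, and the SDK's internal handling may differ.

```python
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical alias for the four input kinds listed above.
AudioInput = Union[str, Path, BinaryIO, bytes]


def read_audio_bytes(source: AudioInput) -> bytes:
    """Return the raw bytes behind a path, a binary file object, or bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        # File-like object: it must have been opened in binary mode.
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("file object must be opened in binary mode ('rb')")
        return data
    raise TypeError(f"unsupported audio input type: {type(source).__name__}")
```

Note that a plain `str` is treated as a file path here, never as audio data, which matches how the examples above pass `"audio.wav"`.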
Text-to-Speech (Synthesis)
Using Kokoro-82M Model
Kokoro parameters: voice, speed
import hathora
client = hathora.Hathora(api_key="your-api-key")
# Simple synthesis (uses defaults)
response = client.text_to_speech.convert(
    "kokoro",  # Model first
    "Hello world!"
)
response.save("output.wav")
# With custom voice and speed
response = client.text_to_speech.convert(
    "kokoro",  # Model first
    "The quick brown fox jumps over the lazy dog.",
    voice="af_bella",  # Kokoro parameter
    speed=1.2          # Kokoro parameter - 20% faster
)
response.save("output_fast.wav")
# Or use the kokoro() method directly
response = client.text_to_speech.kokoro(
    text="Direct method call",
    voice="af_bella",
    speed=0.8  # 20% slower
)
response.save("output_slow.wav")
Using ResembleAI Model (with Voice Cloning)
ResembleAI parameters: audio_prompt, exaggeration, cfg_weight
# Simple generation
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "Hello world!",
    exaggeration=0.5,  # Emotional intensity (0.0 - 1.0)
    cfg_weight=0.5     # Adherence to reference voice (0.0 - 1.0)
)
response.save("output.wav")
# Voice cloning with audio prompt
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "This should sound like the reference voice.",
    audio_prompt="reference_voice.wav",  # Reference audio for cloning
    cfg_weight=0.9                       # High adherence to reference
)
response.save("cloned_voice.wav")
# Highly expressive speech
response = client.text_to_speech.convert(
    "resemble",  # Model first
    "Wow! This is amazing!",
    exaggeration=0.9,  # High emotional intensity
    cfg_weight=0.5
)
response.save("expressive.wav")
# Or use the resemble() method directly
response = client.text_to_speech.resemble(
    text="Direct method call",
    audio_prompt="reference.wav",
    exaggeration=0.7,
    cfg_weight=0.8
)
response.save("output.wav")
Discovering Model Parameters
The SDK provides methods to discover what parameters are available for each TTS model:
# Parakeet (STT) parameters
# Model: parakeet
# Parameters:
# - file (required): Audio file to transcribe
# - start_time (optional): Start time in seconds
# - end_time (optional): End time in seconds
client.speech_to_text.convert("parakeet", "audio.wav", start_time=0, end_time=10)
# List all available TTS models
models = client.text_to_speech.list_models()
print(models) # ['kokoro', 'resemble']
# Print help for a specific TTS model
client.text_to_speech.print_model_help("kokoro")
# Output:
# Model: kokoro
# Parameters:
# - voice (str, default='af_bella'): Voice to use for synthesis
# - speed (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)
client.text_to_speech.print_model_help("resemble")
# Output:
# Model: resemble
# Parameters:
# - audio_prompt (AudioFile, default=None): Reference audio file for voice cloning (optional)
# - exaggeration (float, default=0.5): Emotional intensity, range 0.0-1.0
# - cfg_weight (float, default=0.5): Adherence to reference voice, range 0.0-1.0
# Get parameter specifications programmatically
params = client.text_to_speech.get_model_parameters("kokoro")
for param_name, param_info in params.items():
    print(f"{param_name}: {param_info['description']}")
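To show how a parameter specification could map onto the help text above, here is a small formatter. It is a sketch under assumptions: only the `description` key is confirmed by the example above, while the `type` and `default` keys, the `sample` dictionary, and the `format_model_help` function are hypothetical.

```python
from typing import Any, Dict


def format_model_help(model: str, params: Dict[str, Dict[str, Any]]) -> str:
    """Render a parameter spec as text similar to the print_model_help() output."""
    lines = [f"Model: {model}", "Parameters:"]
    for name, info in params.items():
        type_name = info.get("type", "unknown")  # assumed key
        default = info.get("default")            # assumed key
        lines.append(
            f"  - {name} ({type_name}, default={default!r}): {info['description']}"
        )
    return "\n".join(lines)


# Hand-written stand-in for what get_model_parameters("kokoro") might return.
sample = {
    "voice": {
        "type": "str",
        "default": "af_bella",
        "description": "Voice to use for synthesis",
    },
    "speed": {
        "type": "float",
        "default": 1.0,
        "description": "Speech speed multiplier",
    },
}
print(format_model_help("kokoro", sample))
```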
Parameter Validation
The SDK validates that you're using the correct parameters for each model:
# This works - correct Kokoro parameters
response = client.text_to_speech.convert(
    "kokoro", "Hello", voice="af_bella", speed=1.2
)
# This raises ValidationError with helpful message
try:
    response = client.text_to_speech.convert(
        "resemble", "Hello", speed=1.2  # ERROR!
    )
except ValidationError as e:
    print(e)
    # Output: Unknown parameters for ResembleAI model: speed.
    # Valid parameters: audio_prompt, exaggeration, cfg_weight
    # Use client.text_to_speech.print_model_help('resemble') for more details.
# This also raises ValidationError
response = client.text_to_speech.convert(
    "kokoro", "Hello", exaggeration=0.5  # ERROR!
)
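The kind of check described above can be reproduced locally, for example to validate user input before making a request. The sketch below is not the SDK's validator: `MODEL_PARAMS` and `check_tts_params` are hypothetical, the ranges are copied from the parameter tables in this document, and `ValueError` stands in for the SDK's `ValidationError`.

```python
# Hypothetical local mirror of the documented parameter tables:
# name -> (min, max) for numeric ranges, or None when no range is documented.
MODEL_PARAMS = {
    "kokoro": {"voice": None, "speed": (0.5, 2.0)},
    "resemble": {
        "audio_prompt": None,
        "exaggeration": (0.0, 1.0),
        "cfg_weight": (0.0, 1.0),
    },
}


def check_tts_params(model: str, **params) -> None:
    """Raise ValueError for unknown parameter names or out-of-range values."""
    if model not in MODEL_PARAMS:
        raise ValueError(f"Unknown model: {model}")
    spec = MODEL_PARAMS[model]
    unknown = [name for name in params if name not in spec]
    if unknown:
        raise ValueError(
            f"Unknown parameters for {model}: {', '.join(unknown)}. "
            f"Valid parameters: {', '.join(spec)}"
        )
    for name, value in params.items():
        bounds = spec[name]
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            raise ValueError(
                f"{name}={value} is outside the documented range "
                f"{bounds[0]}-{bounds[1]}"
            )
```

For example, `check_tts_params("resemble", speed=1.2)` raises because `speed` is a Kokoro parameter, and `check_tts_params("kokoro", speed=3.0)` raises because 3.0 is above the documented maximum of 2.0.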
Working with Audio Responses
# Save to file
response = client.text_to_speech.convert("kokoro", "Hello world!")
response.save("output.wav")
# Or use stream_to_file (alias for save)
response.stream_to_file("output.wav")
# Get raw bytes
audio_bytes = response.content
print(f"Generated {len(audio_bytes)} bytes")
# Check content type
print(response.content_type) # e.g., "audio/wav"
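Once you have the raw bytes, the standard library can inspect them. The sketch below reads the duration of an in-memory WAV file with the `wave` module. It assumes the audio is PCM-encoded WAV, which `wave` requires; whether the API always returns PCM is not stated here. `make_silent_wav` is a hypothetical stand-in that builds sample bytes locally in place of `response.content`.

```python
import io
import wave


def wav_duration_seconds(audio_bytes: bytes) -> float:
    """Return the duration of an in-memory PCM WAV file in seconds."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, standing in for response.content."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


audio_bytes = make_silent_wav(0.5)
print(f"{wav_duration_seconds(audio_bytes):.2f} s")
```

With a real response you would pass `response.content` to `wav_duration_seconds` instead of the generated bytes.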
API Reference
hathora.Hathora
Main client class for the Hathora API.
Parameters:
- api_key (str, optional): Your Hathora API key
- timeout (int, default=30): Request timeout in seconds
Properties:
- speech_to_text: Speech-to-text (STT) resource for audio transcription
- text_to_speech: Text-to-speech (TTS) resource for audio synthesis
client.speech_to_text.convert()
Transcribe audio to text using the Parakeet STT model.
Parameters:
- model (str): STT model to use (currently: "parakeet") - positional, required
- file (str | Path | BinaryIO | bytes): Audio file to transcribe - positional, required
- start_time (float, optional): Start time in seconds for transcription window
- end_time (float, optional): End time in seconds for transcription window
- **kwargs: Additional model-specific parameters (reserved for future use)
Example:
# Both model and file are positional
response = client.speech_to_text.convert("parakeet", "audio.wav")
Available Models:
- "parakeet": nvidia/parakeet-tdt-0.6b-v3 - Multilingual ASR with word-level timestamps
Returns: TranscriptionResponse
- .text: The transcribed text
- .metadata: Additional metadata from the API (may include word-level timestamps)
Supported audio formats: WAV, MP3, MP4, M4A, OGG, FLAC, PCM
client.text_to_speech.convert()
Generate speech from text. This is a unified interface that routes to the appropriate model.
Parameters:
- model (str): Model to use ("kokoro" or "resemble") - required, first parameter
- text (str): Text to convert to speech
- **kwargs: Model-specific parameters (see below)
Model-Specific Parameters:
For Kokoro model:
- voice (str, default="af_bella"): Voice to use for synthesis
- speed (float, default=1.0): Speech speed multiplier (0.5 = half speed, 2.0 = double speed)
For ResembleAI model:
- audio_prompt (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
- exaggeration (float, default=0.5): Emotional intensity, range 0.0-1.0
- cfg_weight (float, default=0.5): Adherence to reference voice, range 0.0-1.0
Returns: AudioResponse
Examples:
# Kokoro - model comes first!
response = client.text_to_speech.convert(
    "kokoro", "Hello", voice="af_bella", speed=1.2
)
# ResembleAI - model comes first!
response = client.text_to_speech.convert(
    "resemble", "Hello", exaggeration=0.7, cfg_weight=0.6
)
See also: Use print_model_help() to discover parameters
client.text_to_speech.list_models()
List all available TTS models.
Returns: list - List of model names
Example:
models = client.text_to_speech.list_models()
print(models) # ['kokoro', 'resemble']
client.text_to_speech.get_model_parameters()
Get parameter specifications for a specific model.
Parameters:
model(str): Model name
Returns: dict - Parameter specifications with types, defaults, and descriptions
Example:
params = client.text_to_speech.get_model_parameters("kokoro")
for name, info in params.items():
    print(f"{name}: {info['description']}")
client.text_to_speech.print_model_help()
Print helpful information about a model's parameters to console.
Parameters:
model(str): Model name
Example:
client.text_to_speech.print_model_help("kokoro")
# Prints:
# Model: kokoro
# Parameters:
# - voice (str, default='af_bella'): Voice to use for synthesis
# - speed (float, default=1.0): Speech speed multiplier...
client.text_to_speech.kokoro()
Generate speech using the Kokoro-82M model.
Parameters:
- text (str): Text to convert to speech
- voice (str, default="af_bella"): Voice to use
- speed (float, default=1.0): Speech speed multiplier
Returns: AudioResponse
client.text_to_speech.resemble()
Generate speech using ResembleAI Chatterbox with voice cloning.
Parameters:
- text (str): Text to convert to speech
- audio_prompt (str | Path | BinaryIO | bytes, optional): Reference audio for voice cloning
- exaggeration (float, default=0.5): Emotional intensity (0.0 - 1.0)
- cfg_weight (float, default=0.5): Adherence to reference voice (0.0 - 1.0)
Returns: AudioResponse
AudioResponse
Response object containing generated audio.
Properties:
- content: Raw audio bytes
- content_type: MIME type of the audio
Methods:
- save(file_path): Save audio to file
- stream_to_file(file_path): Alias for save()
TranscriptionResponse
Response object containing transcribed text.
Properties:
- text: The transcribed text
- metadata: Additional metadata
Complete Workflow Example
import hathora
# Initialize client
client = hathora.Hathora(api_key="your-api-key")
# 1. Transcribe audio
transcription = client.speech_to_text.convert(
    "parakeet",      # Model (positional)
    "original.wav",  # File (positional)
    start_time=0,
    end_time=10
)
print(f"Original: {transcription.text}")
# 2. Modify the text
modified_text = transcription.text.upper()
# 3. Generate new speech with Kokoro
response = client.text_to_speech.convert(
    "kokoro", modified_text, voice="af_bella", speed=1.0
)
response.save("output_kokoro.wav")
# 4. Clone voice from original audio
cloned = client.text_to_speech.convert(
    "resemble", "New text in the original voice",
    audio_prompt="original.wav", cfg_weight=0.9
)
cloned.save("cloned_voice.wav")
Error Handling
The SDK provides specific exception types for different error scenarios:
from hathora import HathoraError, APIError, AuthenticationError, ValidationError
try:
    response = client.text_to_speech.convert("kokoro", "Hello world!")
    response.save("output.wav")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except ValidationError as e:
    print(f"Invalid parameters: {e}")
except APIError as e:
    print(f"API error (status {e.status_code}): {e.message}")
except HathoraError as e:
    print(f"Hathora SDK error: {e}")
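Distinct exception types also make it possible to retry only the failures that might be transient. The helper below is a generic sketch, not part of the SDK: `with_retries` is a hypothetical name, and which exceptions are worth retrying is an assumption you would need to confirm for your own use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying the given exception types with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1.0 s, 2.0 s, ...
    raise ValueError("attempts must be at least 1")
```

A call might look like `with_retries(lambda: client.text_to_speech.convert("kokoro", "Hello world!"), retry_on=(APIError,))`. Authentication and validation failures will not succeed on a second attempt, so they are best kept out of `retry_on`. If those exception types turn out to subclass `APIError`, they would need to be excluded explicitly.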
Supported Audio Formats
Input (Transcription)
- WAV (.wav)
- MP3 (.mp3)
- MP4 Audio (.mp4, .m4a)
- OGG (.ogg)
- FLAC (.flac)
- PCM (.pcm)
Output (Synthesis)
- WAV (default output format)
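A quick extension check before uploading can catch unsupported files early. This is a minimal sketch: `SUPPORTED_INPUT_EXTENSIONS` and `is_supported_input` are hypothetical names, the extensions are copied from the input list above, and the check looks only at the file name, not at the actual encoding.

```python
from pathlib import Path

# Extensions copied from the input (transcription) list in this section.
SUPPORTED_INPUT_EXTENSIONS = {
    ".wav", ".mp3", ".mp4", ".m4a", ".ogg", ".flac", ".pcm",
}


def is_supported_input(path) -> bool:
    """True if the file extension is in the documented list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_INPUT_EXTENSIONS
```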
Development
Running Examples
cd examples
python discover_parameters.py # Learn about model parameters
python transcribe_audio.py # Speech-to-text examples
python synthesize_speech.py # Text-to-speech examples
python voice_cloning.py # Voice cloning with ResembleAI
python model_parameters.py # Model-specific parameter examples
python full_workflow.py # Complete workflow
Installing for Development
git clone https://github.com/hathora/yapp-sdk.git
cd yapp-sdk
pip install -e .
Roadmap
- Add streaming support for real-time TTS
- Support for additional TTS models
- Async client support
- Audio format conversion utilities
- Batch processing capabilities
- WebSocket support for real-time conversations
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Support
For issues and questions:
- GitHub Issues: https://github.com/hathora/yapp-sdk/issues
- Documentation: https://docs.hathora.com
- Email: support@hathora.com