LiteSpeech
Unified SDK for speech operations (ASR/TTS) with streaming support across multiple providers.
LiteSpeech provides a consistent interface for text-to-speech and speech-to-text across providers like ElevenLabs, Deepgram, Cartesia, OpenAI, and Azure. It features first-class support for streaming and seamless integration with LLM outputs.
Table of Contents
- Features
- Installation
- Quick Start
- Provider String Format
- Supported Providers
- API Reference
- LLM Integration
- Audio Processing
- ASR Streaming Results
- Configuration
- Provider-Specific Details
- Error Handling
- Examples
- Development
- License
Features
- Multi-Provider Support: ElevenLabs, Deepgram, Cartesia, OpenAI, Azure Speech Services
- Streaming-First: True streaming TTS and ASR where supported
- LLM Integration: Auto-detect and pipe OpenAI/Anthropic/LiteLLM streams to TTS
- Unified API: Same interface across all providers
- Sync + Async: Primary async interface with sync wrapper
- Audio Preprocessing: Auto-detect and convert audio formats
- Interim Results: Real-time partial transcriptions with clear final/interim marking
- Deduplication: Smart filtering of duplicate transcripts in streaming ASR
Installation
pip install litespeech
With audio conversion support (recommended for format conversion):
pip install litespeech[audio]
With development dependencies:
pip install litespeech[dev]
Quick Start
Text-to-Speech
from litespeech import LiteSpeech
import asyncio

async def main():
    ls = LiteSpeech()

    # Batch TTS
    audio = await ls.text_to_speech(
        text="Hello, world!",
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb"
    )
    with open("output.mp3", "wb") as f:
        f.write(audio)

    # Streaming TTS
    async for chunk in ls.text_to_speech_stream(
        text="Hello, this is streaming TTS!",
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb",
        output_format="pcm_16000"
    ):
        # Play or process audio chunk
        pass

asyncio.run(main())
Speech-to-Text
from litespeech import LiteSpeech
import asyncio

async def main():
    ls = LiteSpeech()

    # Batch ASR
    text = await ls.speech_to_text(
        audio="recording.mp3",
        provider="deepgram/nova-2"
    )
    print(text)

    # Streaming ASR with interim results
    async def microphone_stream():
        # Yield audio chunks from microphone
        ...

    async for result in ls.speech_to_text_stream(
        audio_stream=microphone_stream(),
        provider="deepgram/nova-2",
        interim_results=True
    ):
        if result.is_final:
            print(f"✓ {result.text}")
        else:
            print(f" {result.text}...", end="\r", flush=True)

asyncio.run(main())
LLM to TTS (Voice Assistant)
from openai import AsyncOpenAI
from litespeech import LiteSpeech
import asyncio

async def main():
    openai = AsyncOpenAI()
    ls = LiteSpeech()

    # Get LLM stream
    llm_stream = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True
    )

    # Pipe directly to TTS (auto-detects OpenAI stream!)
    async for audio_chunk in ls.text_to_speech_stream(
        text_stream=llm_stream,  # Works with OpenAI, Anthropic, LiteLLM
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb"
    ):
        # Play audio in real-time
        pass

asyncio.run(main())
Sync Interface
from litespeech import LiteSpeech

ls = LiteSpeech()

# Use sync interface
audio = ls.sync.text_to_speech(
    text="Hello, world!",
    provider="elevenlabs/eleven_turbo_v2_5"
)
text = ls.sync.speech_to_text(
    audio="recording.mp3",
    provider="deepgram/nova-2"
)

# Streaming (returns sync iterator)
for chunk in ls.sync.text_to_speech_stream(
    text="Hello",
    provider="elevenlabs/eleven_turbo_v2_5"
):
    process(chunk)  # process: your own chunk handler

for result in ls.sync.speech_to_text_stream(
    audio_stream=mic_stream,  # mic_stream: your own audio source
    provider="deepgram/nova-2",
    interim_results=True
):
    print(result.text, result.is_final)
Provider String Format
LiteSpeech uses a unified provider string format: provider/model[/voice]
TTS Examples:
- elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb - ElevenLabs with specific voice
- deepgram/aura-asteria-en - Deepgram Aura
- cartesia/sonic-3 - Cartesia Sonic
- openai/tts-1/alloy - OpenAI TTS
- azure/en-US-AvaMultilingualNeural - Azure Speech
ASR Examples:
- deepgram/nova-2 - Deepgram Nova
- elevenlabs/scribe_v1 - ElevenLabs Scribe (batch)
- elevenlabs - ElevenLabs Scribe (streaming, uses scribe_v2_realtime)
- cartesia/ink-whisper - Cartesia Ink
- openai/whisper-1 - OpenAI Whisper
- azure - Azure Speech-to-Text
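The provider/model[/voice] layout can be illustrated with a small parser. This is a sketch only, not LiteSpeech's own parsing code: ProviderSpec and parse_provider_string are hypothetical names, and rejecting empty or extra segments is an assumption.

```python
from typing import NamedTuple, Optional


class ProviderSpec(NamedTuple):
    """Hypothetical container for the parts of a provider string."""
    provider: str
    model: Optional[str] = None
    voice: Optional[str] = None


def parse_provider_string(value: str) -> ProviderSpec:
    """Split 'provider[/model[/voice]]' into its parts (illustrative only)."""
    parts = value.strip().split("/")
    # Assumption: empty segments and more than three segments are invalid.
    if len(parts) > 3 or any(not part for part in parts):
        raise ValueError(f"Expected 'provider[/model[/voice]]', got {value!r}")
    # Pad with None so that a missing model or voice is explicit.
    padded = parts + [None] * (3 - len(parts))
    return ProviderSpec(*padded)
```

Note that azure/en-US-AvaMultilingualNeural carries a voice name in the second segment, so a real implementation would likely need provider-specific interpretation on top of a generic split like this.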
Supported Providers
| Provider | TTS Batch | TTS Streaming | ASR Batch | ASR Streaming |
|---|---|---|---|---|
| ElevenLabs | ✅ | ✅ | ✅ | ✅ |
| Deepgram | ✅ | ✅ | ✅ | ✅ |
| Cartesia | ✅ | ✅ | ✅ | ✅ |
| OpenAI | ✅ | ❌ | ✅ | ❌ |
| Azure | ✅ | ✅ | ✅ | ❌ |
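The streaming columns of the table can be restated as a lookup, which may help when choosing a provider programmatically. This is a standalone sketch: STREAMING_SUPPORT, streaming_supported, and providers_with_streaming are hypothetical names and do not come from the library.

```python
# Streaming support copied from the table above (batch is supported everywhere).
STREAMING_SUPPORT = {
    "elevenlabs": {"tts": True, "asr": True},
    "deepgram": {"tts": True, "asr": True},
    "cartesia": {"tts": True, "asr": True},
    "openai": {"tts": False, "asr": False},
    "azure": {"tts": True, "asr": False},
}


def streaming_supported(provider: str, capability: str) -> bool:
    """Return True if the table lists streaming for this provider and capability."""
    try:
        return STREAMING_SUPPORT[provider.lower()][capability.lower()]
    except KeyError:
        raise ValueError(f"Unknown provider/capability: {provider}/{capability}") from None


def providers_with_streaming(capability: str) -> list[str]:
    """List providers (sorted by name) that stream the given capability."""
    return sorted(name for name in STREAMING_SUPPORT if streaming_supported(name, capability))
```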
API Reference
LiteSpeech Client
from litespeech import LiteSpeech
ls = LiteSpeech(
    elevenlabs_api_key="sk_...",   # Optional, uses ELEVENLABS_API_KEY env var
    deepgram_api_key="...",        # Optional, uses DEEPGRAM_API_KEY env var
    cartesia_api_key="...",        # Optional, uses CARTESIA_API_KEY env var
    openai_api_key="sk-...",       # Optional, uses OPENAI_API_KEY env var
    azure_speech_key="...",        # Optional, uses AZURE_SPEECH_KEY env var
    azure_speech_region="eastus"   # Optional, uses AZURE_SPEECH_REGION env var
)
Utility Methods:
# List available providers
ls.list_providers() # All providers
ls.list_providers(capability="tts") # Only TTS providers
ls.list_providers(capability="asr") # Only ASR providers
# Check streaming support
ls.supports_streaming("deepgram", "tts") # True
ls.supports_streaming("openai", "tts") # False
# Access provider registry
ls.registry.list_tts_providers()
ls.registry.list_asr_providers()
Text-to-Speech
Batch TTS
audio = await ls.text_to_speech(
    text="Hello, world!",
    provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb",
    voice=None,            # Override voice from provider string
    language=None,         # Language code (provider-specific)
    output_format="mp3",   # Output format (mp3, wav, pcm, etc.)
    **kwargs               # Provider-specific options
)
# Returns: bytes (audio data)
Streaming TTS
async for chunk in ls.text_to_speech_stream(
    text="Hello, this is streaming!",  # Static text
    # OR
    text_stream=llm_stream,            # Async iterator or LLM stream
    provider="elevenlabs/eleven_turbo_v2_5",
    voice=None,
    language=None,
    output_format="pcm_16000",
    sample_rate=16000,                 # Optional: for providers that support it
    **kwargs
):
    # Process audio chunk
    pass
# Yields: bytes (audio chunks)
Note: Some providers (Cartesia, Deepgram) accept sample_rate as a separate parameter for streaming output.
Output Formats (provider-specific):
| Provider | Formats |
|---|---|
| ElevenLabs | mp3_44100_128, mp3_32000_128, pcm_16000, pcm_22050, pcm_24000, pcm_44100 |
| Deepgram | mp3, linear16, alaw, mulaw |
| Cartesia | pcm_s16le, wav, mp3 |
| OpenAI | mp3, opus, aac, flac |
| Azure | audio-16khz-128kbitrate-mono-mp3, audio-24khz-160kbitrate-mono-mp3, riff-16khz-16bit-mono-pcm |
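Raw PCM formats such as pcm_16000 have no header, so most players cannot open the bytes directly. The sketch below wraps such chunks in a WAV container using the standard library. It assumes the chunks are 16-bit little-endian mono PCM at 16 kHz, which should be checked against the provider's documentation; write_pcm_to_wav is a hypothetical helper, not part of LiteSpeech.

```python
import wave
from collections.abc import Iterable


def write_pcm_to_wav(
    chunks: Iterable[bytes],
    path: str,
    sample_rate: int = 16000,
    channels: int = 1,
    sample_width: int = 2,
) -> int:
    """Write raw PCM chunks to a WAV file and return the number of frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // (channels * sample_width)
    return frames
```

With the streaming API, the chunks would first be gathered from the async for loop (for example into a list) and then passed to this function.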
Speech-to-Text
Batch ASR
text = await ls.speech_to_text(
    audio="recording.mp3",   # File path (str or Path) or bytes
    provider="deepgram/nova-2",
    language=None,           # Language code
    preprocess=True,         # Auto-detect and convert audio format
    **kwargs                 # Provider-specific options (e.g., punctuate, smart_format)
)
# Returns: str (transcribed text)
Provider-specific kwargs:
- Deepgram: punctuate, smart_format, diarize, detect_language, paragraphs, utterances
- ElevenLabs: Language auto-detection built-in
- OpenAI: response_format, temperature
Streaming ASR
async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,   # AsyncIterator[bytes] of audio chunks
    provider="deepgram/nova-2",
    language=None,
    interim_results=False,     # Include partial transcriptions
    deduplicate=True,          # Filter duplicate transcripts (default: True)
    sample_rate=16000,         # Audio sample rate (MUST match your audio!)
    channels=1,                # Number of audio channels
    encoding="linear16",       # Audio encoding
    **kwargs                   # Provider-specific options
):
    print(result.text, result.is_final)
# Yields: ASRResult(text: str, is_final: bool)
Provider-specific kwargs for streaming:
- Deepgram: diarize, vad_events, endpointing
- ElevenLabs: audio_format (e.g., pcm_16000)
- Cartesia: encoding (e.g., pcm_s16le)
Sync Interface
All async methods are available synchronously via the .sync property:
# Batch operations
audio = ls.sync.text_to_speech(text="Hello", provider="elevenlabs")
text = ls.sync.speech_to_text(audio="file.wav", provider="deepgram")
# Streaming operations (returns sync iterators)
for chunk in ls.sync.text_to_speech_stream(text="Hello", provider="elevenlabs"):
    process(chunk)

for result in ls.sync.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    interim_results=True
):
    print(result.text)
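For readers curious how a synchronous iterator can sit on top of an async one, the sketch below drives an async iterator from a private event loop. It is one possible approach and not necessarily how the .sync property is implemented; iterate_sync is a hypothetical name.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator
from typing import TypeVar

T = TypeVar("T")


def iterate_sync(agen: AsyncIterator[T]) -> Iterator[T]:
    """Consume an async iterator from synchronous code, one item at a time."""

    async def _next() -> T:
        return await agen.__anext__()

    # A private loop; this must not be called from inside a running event loop.
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                item = loop.run_until_complete(_next())
            except StopAsyncIteration:
                break
            yield item
    finally:
        # Finalize any async generators that were left unfinished, then close.
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()
```

Each item is produced only when the caller asks for the next one, so streaming behaviour is preserved rather than buffering the whole stream first.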
LLM Integration
LiteSpeech automatically detects and adapts LLM completion streams for TTS.
Supported LLM Providers
| Provider | Stream Types |
|---|---|
| OpenAI | AsyncStream[ChatCompletionChunk], Responses API |
| Anthropic | AsyncMessageStream, MessageStream, .text_stream |
| LiteLLM | LiteLLM completion streams |
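To give an idea of what adapting an LLM stream involves, the sketch below turns OpenAI-style chat completion chunks into a plain stream of text pieces. It is an illustration under assumptions, not LiteSpeech's adapter: openai_chunks_to_text, fake_chunk, and fake_stream are hypothetical names, and the fake chunks only mimic the choices[0].delta.content shape of ChatCompletionChunk.

```python
import asyncio
from collections.abc import AsyncIterator
from types import SimpleNamespace
from typing import Any, Optional


async def openai_chunks_to_text(stream: AsyncIterator[Any]) -> AsyncIterator[str]:
    """Yield the text deltas carried by OpenAI-style chat completion chunks."""
    async for chunk in stream:
        choices = getattr(chunk, "choices", None)
        if not choices:
            continue  # some chunks carry no choices at all
        content = getattr(choices[0].delta, "content", None)
        if content:  # role-only and terminating chunks have no text
            yield content


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a ChatCompletionChunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def fake_stream() -> AsyncIterator[SimpleNamespace]:
    """Simulate an LLM stream, including chunks without text."""
    for piece in [None, "Hello", ", ", "world", None]:
        yield fake_chunk(piece)


async def collect() -> str:
    """Join every text piece produced by the adapter."""
    return "".join([text async for text in openai_chunks_to_text(fake_stream())])
```

The resulting async iterator of strings has the same shape as the plain async iterator that text_stream accepts.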
OpenAI Example
import asyncio
from openai import AsyncOpenAI
from litespeech import LiteSpeech

async def main():
    openai = AsyncOpenAI()
    ls = LiteSpeech()

    llm_stream = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a joke"}],
        stream=True
    )

    # Auto-detected and adapted!
    async for audio in ls.text_to_speech_stream(
        text_stream=llm_stream,
        provider="elevenlabs/eleven_turbo_v2_5"
    ):
        play_audio(audio)  # play_audio: your own playback function

asyncio.run(main())
Anthropic Example
import asyncio
from anthropic import AsyncAnthropic
from litespeech import LiteSpeech

async def main():
    anthropic = AsyncAnthropic()
    ls = LiteSpeech()

    # messages.stream() returns a context manager; entering it yields the stream
    async with anthropic.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Say something interesting"}]
    ) as stream:
        # Works with Anthropic too!
        async for audio in ls.text_to_speech_stream(
            text_stream=stream,
            provider="elevenlabs/eleven_turbo_v2_5"
        ):
            play_audio(audio)  # play_audio: your own playback function

asyncio.run(main())
Plain Async Iterator (Simulated LLM)
import asyncio
from litespeech import LiteSpeech

async def simulate_llm_stream(text: str, delay: float = 0.1):
    """Simulate LLM token streaming by yielding words with a delay."""
    words = text.split()
    for i, word in enumerate(words):
        await asyncio.sleep(delay)
        yield word if i == 0 else f" {word}"

async def main():
    ls = LiteSpeech()
    text = "Hello! This is simulated LLM output being streamed to TTS."
    async for audio in ls.text_to_speech_stream(
        text_stream=simulate_llm_stream(text),
        provider="cartesia/sonic-3",
        voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
        language="en",
        sample_rate=16000,
    ):
        play_audio(audio)  # play_audio: your own playback function

asyncio.run(main())
Audio Processing
Audio Format Detection
LiteSpeech automatically detects audio formats via magic bytes and header parsing:
- WAV: RIFF header, sample rate, channels, bit depth
- MP3: ID3 tags, sync words, MPEG version, bitrate
- FLAC: STREAMINFO metadata block
- OGG/OPUS: OggS container
- WEBM: EBML header
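The magic-byte part of detection can be sketched in a few lines. This is a simplified illustration rather than the library's detector: detect_container is a hypothetical name, and it only identifies the container, without parsing sample rate, channels, or bitrate.

```python
from typing import Optional


def detect_container(data: bytes) -> Optional[str]:
    """Guess the audio container from the first bytes of a file."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"  # Ogg container; Vorbis vs Opus needs further parsing
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML header, also used by Matroska
    if data[:3] == b"ID3":
        return "mp3"  # ID3v2 tag at the start of the file
    # MPEG audio frame sync: the first 11 bits of the frame header are set.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    return None
```

The frame-sync check is deliberately loose and can also match other MPEG-style streams such as AAC in ADTS framing, which is one reason a real detector parses the header further. Raw PCM has no signature at all, so it returns None here.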
Audio Conversion
With litespeech[audio] installed, automatic format conversion is available:
# Auto-converts to provider's preferred format
text = await ls.speech_to_text(
    audio="recording.m4a",  # Will be converted to WAV/PCM
    provider="deepgram/nova-2",
    preprocess=True         # Default: True
)
Supported Conversions:
- Format changes (MP3 → WAV, etc.)
- Sample rate resampling
- Channel mixing (stereo → mono)
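As an illustration of channel mixing, the sketch below downmixes interleaved 16-bit little-endian stereo PCM to mono by averaging the two channels. Averaging is one common approach and an assumption here; it is not necessarily what LiteSpeech or its audio backend does, and stereo_to_mono_s16le is a hypothetical helper.

```python
import struct


def stereo_to_mono_s16le(pcm: bytes) -> bytes:
    """Average left/right samples of interleaved 16-bit little-endian stereo PCM."""
    if len(pcm) % 4:
        raise ValueError("Stereo 16-bit PCM length must be a multiple of 4 bytes")
    mono = bytearray()
    # Each stereo frame is two signed 16-bit samples: left, then right.
    for left, right in struct.iter_unpack("<hh", pcm):
        mono += struct.pack("<h", (left + right) // 2)
    return bytes(mono)
```

The output is half the size of the input and keeps the original sample rate; resampling is a separate step.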
Streaming Audio Parameters
For streaming ASR, you must specify audio parameters (cannot be auto-detected from raw PCM):
async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram/nova-2",
    sample_rate=16000,    # REQUIRED: Audio sample rate
    channels=1,           # Audio channels (default: 1)
    encoding="linear16",  # Audio encoding (default: linear16)
):
    print(result.text)
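The arithmetic below shows why these parameters matter for raw PCM: the same bytes represent different durations depending on the declared sample rate and channel count. It is a small worked example, assuming 16-bit (2-byte) samples for linear16; chunk_duration_seconds is a hypothetical helper.

```python
def chunk_duration_seconds(
    num_bytes: int,
    sample_rate: int,
    channels: int = 1,
    bytes_per_sample: int = 2,
) -> float:
    """Return how many seconds of audio a raw PCM chunk holds."""
    frame_size = channels * bytes_per_sample  # bytes per sample frame
    if num_bytes % frame_size:
        raise ValueError("Chunk does not contain a whole number of frames")
    return num_bytes / frame_size / sample_rate


# 4096 bytes of 16 kHz mono audio: 2048 samples, 0.128 seconds.
duration_16k = chunk_duration_seconds(4096, sample_rate=16000)
# The same bytes recorded at 48 kHz cover only a third of that time.
duration_48k = chunk_duration_seconds(4096, sample_rate=48000)
```

If 48 kHz audio is declared as 16 kHz, the recognizer effectively hears it slowed down by a factor of three, which typically ruins transcription quality without raising any error.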
ASR Streaming Results
All ASR streaming methods return AsyncIterator[ASRResult]:
from litespeech import ASRResult

# Shape of ASRResult, shown for reference:
from dataclasses import dataclass

@dataclass
class ASRResult:
    text: str       # Transcribed text
    is_final: bool  # True for final results, False for interim
Interim vs Final Results
async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram/nova-2",
    interim_results=True,  # Enable interim results
):
    if result.is_final:
        # Committed transcription - won't change
        print(f"✓ Final: {result.text}")
    else:
        # Partial transcription - may change
        print(f" Interim: {result.text}...", end="\r", flush=True)
Behavior:
- interim_results=False (default): Only yields final results (is_final=True)
- interim_results=True: Yields both interim and final results
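A common way to use the two result kinds is to display interim text as it arrives and keep only final text for the transcript. The sketch below builds a transcript from final results. It assumes each final result covers a new segment rather than repeating earlier finals, which depends on the provider; Result is a stand-in with the same two fields as ASRResult, and collect_transcript is a hypothetical helper.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in with the same fields as ASRResult."""
    text: str
    is_final: bool


async def collect_transcript(results: AsyncIterator[Result]) -> str:
    """Join the text of final results, ignoring interim ones."""
    finals = []
    async for result in results:
        if result.is_final:
            finals.append(result.text.strip())
    return " ".join(part for part in finals if part)


async def fake_results() -> AsyncIterator[Result]:
    """Simulate a stream that mixes interim and final results."""
    updates = [
        ("Hel", False),
        ("Hello", False),
        ("Hello world", True),
        ("how", False),
        ("how are you", True),
    ]
    for text, is_final in updates:
        yield Result(text, is_final)
```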
Deduplication
Most ASR providers send the full accumulated transcript with each update (not deltas):
Provider sends: "Hello" → "Hello world" → "Hello world how" → "Hello world how are you"
With deduplicate=True (default): Only yields when text changes
async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    deduplicate=True  # Default
):
    # Only unique text values are yielded
    print(result.text)
With deduplicate=False: Pass through every message
async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    deduplicate=False  # Raw provider behavior
):
    # May receive duplicate values
    print(result.text)
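The idea behind deduplication can be sketched as a filter over the result stream that skips consecutive repeats. This is not the library's implementation: dedupe is a hypothetical name, Result is a stand-in for ASRResult, and comparing is_final alongside the text (so that a final result is never swallowed by an identical interim one) is an assumption.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in with the same fields as ASRResult."""
    text: str
    is_final: bool


async def dedupe(results: AsyncIterator[Result]) -> AsyncIterator[Result]:
    """Drop a result when it repeats the previous one exactly."""
    last = None
    async for result in results:
        key = (result.text, result.is_final)
        if key == last:
            continue  # same text and same finality as the previous message
        last = key
        yield result


async def fake_provider() -> AsyncIterator[Result]:
    """Simulate a provider that resends the accumulated transcript."""
    updates = [
        ("Hello", False),
        ("Hello", False),
        ("Hello world", False),
        ("Hello world", False),
        ("Hello world", True),
    ]
    for text, is_final in updates:
        yield Result(text, is_final)


async def collect() -> list[tuple[str, bool]]:
    """Gather the filtered results as (text, is_final) pairs."""
    return [(r.text, r.is_final) async for r in dedupe(fake_provider())]
```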
Configuration
API Keys
LiteSpeech accepts explicit parameter names that map to environment variables.
Option 1: Environment Variables (Recommended)
export ELEVENLABS_API_KEY=sk_...
export DEEPGRAM_API_KEY=...
export CARTESIA_API_KEY=...
export OPENAI_API_KEY=sk-...
export AZURE_SPEECH_KEY=...
export AZURE_SPEECH_REGION=eastus
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export GOOGLE_PROJECT_ID=my-project
ls = LiteSpeech() # Auto-detects from environment
Option 2: Explicit Parameters
ls = LiteSpeech(
    elevenlabs_api_key="sk_...",
    deepgram_api_key="...",
    cartesia_api_key="...",
    openai_api_key="sk-...",
    azure_speech_key="...",
    azure_speech_region="eastus"
)
Parameter Mapping:
| Parameter | Environment Variable |
|---|---|
| elevenlabs_api_key | ELEVENLABS_API_KEY |
| openai_api_key | OPENAI_API_KEY |
| deepgram_api_key | DEEPGRAM_API_KEY |
| cartesia_api_key | CARTESIA_API_KEY |
| azure_speech_key | AZURE_SPEECH_KEY |
| azure_speech_region | AZURE_SPEECH_REGION |
| google_application_credentials | GOOGLE_APPLICATION_CREDENTIALS |
| google_project_id | GOOGLE_PROJECT_ID |
Validation:
# ❌ Raises ValueError - unknown parameter
ls = LiteSpeech(invalid_param="value")
# ✅ Correct usage
ls = LiteSpeech(cartesia_api_key="sk_car_...")
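The mapping and validation rules can be pictured with the following sketch. It is not the library's configuration code: resolve_settings and PARAM_TO_ENV are hypothetical names, and giving explicit parameters precedence over environment variables is an assumption based on the "Optional, uses ... env var" comments.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Parameter names and environment variables from the mapping table.
PARAM_TO_ENV = {
    "elevenlabs_api_key": "ELEVENLABS_API_KEY",
    "openai_api_key": "OPENAI_API_KEY",
    "deepgram_api_key": "DEEPGRAM_API_KEY",
    "cartesia_api_key": "CARTESIA_API_KEY",
    "azure_speech_key": "AZURE_SPEECH_KEY",
    "azure_speech_region": "AZURE_SPEECH_REGION",
    "google_application_credentials": "GOOGLE_APPLICATION_CREDENTIALS",
    "google_project_id": "GOOGLE_PROJECT_ID",
}


def resolve_settings(
    environ: Optional[Mapping[str, str]] = None, **params: str
) -> dict[str, str]:
    """Merge explicit parameters with environment variables."""
    env = os.environ if environ is None else environ
    unknown = sorted(set(params) - set(PARAM_TO_ENV))
    if unknown:
        raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")
    resolved = {}
    for name, env_name in PARAM_TO_ENV.items():
        # Assumption: an explicit value wins over the environment variable.
        value = params.get(name) or env.get(env_name)
        if value:
            resolved[name] = value
    return resolved
```

Rejecting unknown names early, as in the ValueError case above, catches typos such as elevenlabs_key before any request is made.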
Debug Logging
# Enable debug logging for all components
export LITESPEECH_LOG_LEVEL=DEBUG
python your_script.py
# Log format options
export LITESPEECH_LOG_FORMAT=detailed # or simple, json
Log Levels:
- DEBUG: Verbose WebSocket/chunk details
- INFO: General operation info
- WARNING: Non-optimal configurations (default)
- ERROR: Errors and exceptions
Provider-Specific Details
ElevenLabs
TTS:
- Models: eleven_turbo_v2_5, eleven_multilingual_v2, eleven_monolingual_v1
- Default voice: JBFqnCBsd6RMkjVDRZzb (George)
- Formats: mp3_44100_128, mp3_32000_128, pcm_16000, pcm_22050, pcm_24000, pcm_44100
ASR:
- Batch models: scribe_v1, scribe_v1_experimental
- Streaming model: scribe_v2_realtime (different from batch!)
- Format: pcm_16000
Important: Batch and streaming use different models. Using scribe_v1 for streaming will raise an error.
# TTS with specific voice
audio = await ls.text_to_speech(
    text="Hello world",
    provider="elevenlabs/eleven_turbo_v2_5",
    voice="JBFqnCBsd6RMkjVDRZzb",  # George voice
    output_format="mp3_44100_128",
)

# Batch ASR (uses scribe_v1)
text = await ls.speech_to_text(audio, provider="elevenlabs/scribe_v1")

# Streaming ASR (must use scribe_v2_realtime or omit model)
async for result in ls.speech_to_text_stream(
    audio_stream=mic,
    provider="elevenlabs",  # Defaults to scribe_v2_realtime
    sample_rate=16000,
):
    if result.is_final:
        print(f"Final: {result.text}")
    else:
        print(f"Interim: {result.text}")
Deepgram
TTS (Aura):
- Models: Aura voices follow pattern aura-{voice}-{language} (e.g., aura-asteria-en)
- Voices: asteria, luna, stella, athena, hera, orion, arcas, perseus, angus, orpheus, helios, zeus
- You can specify voice and language separately: provider="deepgram/aura" + voice="asteria" + language="en"
- Formats: mp3, linear16, alaw, mulaw
ASR:
- Models: nova-3, nova-2, nova-2-general, nova-2-meeting, nova-2-phonecall, nova-2-medical, enhanced, base
- Recommended: 16kHz PCM mono
- Language: ISO-639-1, ISO-639-3, BCP-47, or multi for auto-detection
- Provider-specific kwargs: punctuate, smart_format, diarize, detect_language
# Nova-2 with language and formatting options
text = await ls.speech_to_text(
    audio="recording.wav",
    provider="deepgram/nova-2",
    language="en-US",
    punctuate=True,     # Add punctuation (default: True)
    smart_format=True,  # Smart formatting (default: True)
)

# Deepgram Aura TTS streaming
async for chunk in ls.text_to_speech_stream(
    text="Hello world",
    provider="deepgram/aura",
    voice="asteria",
    language="en",
    sample_rate=24000,
):
    play_audio(chunk)
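The Aura naming pattern described above can be expressed as a tiny helper that combines a voice and a language into a model name. This is an illustrative sketch: aura_model is a hypothetical name, and validating against the voice list from this section is an assumption rather than library behaviour.

```python
# Voice names as listed in this section.
AURA_VOICES = {
    "asteria", "luna", "stella", "athena", "hera", "orion",
    "arcas", "perseus", "angus", "orpheus", "helios", "zeus",
}


def aura_model(voice: str, language: str = "en") -> str:
    """Build an Aura model name following the aura-{voice}-{language} pattern."""
    voice = voice.lower()
    if voice not in AURA_VOICES:
        raise ValueError(f"Unknown Aura voice: {voice!r}")
    return f"aura-{voice}-{language}"
```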
Cartesia
TTS:
- Models: sonic-3, sonic-2, sonic
- Voices: UUID format (e.g., 79a125e8-cd45-4c13-8a67-188112f4dd22)
- Formats: pcm_s16le, wav, mp3
- Streaming sample rate: 16000Hz (batch can use 44100Hz)
ASR:
- Model: ink-whisper
- Encoding: pcm_s16le, linear16
# Cartesia TTS streaming
async for chunk in ls.text_to_speech_stream(
    text="Hello world",
    provider="cartesia/sonic-3",
    voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
    language="en",
    sample_rate=16000,
):
    play_audio(chunk)
OpenAI
TTS (Batch only, no streaming):
- Models: tts-1, tts-1-hd
- Voices: alloy, echo, fable, onyx, nova, shimmer
ASR (Batch only, no streaming):
- Model: whisper-1
# OpenAI TTS
audio = await ls.text_to_speech(
    text="Hello",
    provider="openai/tts-1/alloy"
)

# OpenAI Whisper
text = await ls.speech_to_text(
    audio="recording.mp3",
    provider="openai/whisper-1"
)
Azure Speech Services
TTS:
- Voices: Format like en-US-AvaMultilingualNeural
- Requires: azure_speech_key and azure_speech_region
- Two ways to specify voice (both work): the full voice name in the provider string, or voice and language as separate parameters
ASR (Batch only):
- Requires: azure_speech_key and azure_speech_region
- Language: BCP-47 format (e.g., en-US, es-MX)
ls = LiteSpeech(
    azure_speech_key="your-key",
    azure_speech_region="eastus"
)

# Azure TTS - Full format (voice in provider string)
audio = await ls.text_to_speech(
    text="Hello",
    provider="azure/en-US-AvaMultilingualNeural"
)

# Azure TTS - Split format (voice + language separate)
audio = await ls.text_to_speech(
    text="Hello",
    provider="azure",
    voice="AvaMultilingualNeural",
    language="en-US"
)

# Azure ASR (uses BCP-47 language codes)
text = await ls.speech_to_text(
    audio="recording.wav",
    provider="azure",
    language="en-US"
)
Error Handling
Exception Hierarchy
LiteSpeechError (base)
├── ProviderError # Provider-specific errors (includes status_code)
├── StreamingError # Streaming-related errors
├── AudioFormatError # Audio format/conversion errors
├── AuthenticationError # API key/authentication errors
├── ProviderNotFoundError # Provider not found in registry
└── UnsupportedOperationError # Operation not supported by provider
Usage
import asyncio
from litespeech import LiteSpeech
from litespeech.exceptions import (
    AuthenticationError,
    ProviderError,
    AudioFormatError,
    UnsupportedOperationError
)

async def main():
    ls = LiteSpeech()
    try:
        text = await ls.speech_to_text("recording.mp3", provider="deepgram/nova-2")
    except AuthenticationError as e:
        print(f"Auth failed for {e.provider}: {e}")
    except ProviderError as e:
        print(f"Provider error (status {e.status_code}): {e}")
    except AudioFormatError as e:
        print(f"Audio format issue: {e}")
    except UnsupportedOperationError as e:
        print(f"Not supported: {e}")

asyncio.run(main())
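Because every exception derives from LiteSpeechError, handlers can be ordered from specific to general, with the base class as a catch-all. The sketch below demonstrates that ordering with stand-in classes that mirror part of the hierarchy; the real classes live in litespeech.exceptions, and the constructor signatures here are assumptions made for the demonstration, as are the classify and fail_with helpers.

```python
from typing import Callable, Optional


# Stand-ins mirroring part of the documented hierarchy (not the real classes).
class LiteSpeechError(Exception):
    pass


class ProviderError(LiteSpeechError):
    def __init__(self, message: str, status_code: Optional[int] = None):
        super().__init__(message)
        self.status_code = status_code


class AuthenticationError(LiteSpeechError):
    pass


class AudioFormatError(LiteSpeechError):
    pass


def classify(action: Callable[[], object]) -> str:
    """Run an action and report which handler caught its failure."""
    try:
        action()
    except AuthenticationError:
        return "auth"
    except ProviderError as e:
        return f"provider:{e.status_code}"
    except LiteSpeechError:
        return "other"  # base class catches every remaining library error
    return "ok"


def fail_with(exc: Exception) -> Callable[[], None]:
    """Return a callable that raises the given exception."""
    def _raise() -> None:
        raise exc
    return _raise
```

Placing the base class first would shadow the more specific handlers, so it belongs last.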
Error Philosophy
- Fail fast with actionable errors: Shows current state, expected state, and specific fixes
- Warn, don't block: Non-optimal configs (like non-recommended sample rates) warn but proceed
- Trust user for raw PCM: Can't validate format without headers - user must know their audio
Examples
FastAPI Voice Assistant
from fastapi import FastAPI, WebSocket
from litespeech import LiteSpeech
from openai import AsyncOpenAI

app = FastAPI()
ls = LiteSpeech()
openai = AsyncOpenAI()

@app.websocket("/voice-assistant")
async def voice_assistant(ws: WebSocket):
    await ws.accept()

    # ASR: Transcribe user speech
    async def audio_stream():
        while True:
            data = await ws.receive_bytes()
            if not data:
                break
            yield data

    async for result in ls.speech_to_text_stream(
        audio_stream=audio_stream(),
        provider="deepgram/nova-2",
        sample_rate=16000
    ):
        if not result.is_final:
            continue

        # LLM: Generate response
        llm_stream = await openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": result.text}],
            stream=True
        )

        # TTS: Stream audio back
        async for audio in ls.text_to_speech_stream(
            text_stream=llm_stream,
            provider="elevenlabs/eleven_turbo_v2_5",
            output_format="pcm_16000"
        ):
            await ws.send_bytes(audio)
Microphone Streaming with sounddevice
import sounddevice as sd
import queue
import asyncio
from collections.abc import AsyncIterator
from litespeech import LiteSpeech

# Audio configuration
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 4096

async def microphone_stream() -> AsyncIterator[bytes]:
    """Stream audio from microphone in real-time."""
    # Use thread-safe queue since callback runs in different thread
    audio_queue = queue.Queue()

    def audio_callback(indata, frames, time, status):
        if status:
            print(f"[Audio Status] {status}")
        # Copy data - sounddevice reuses buffers!
        audio_queue.put(indata.copy().tobytes())

    # Open microphone stream
    stream = sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype='int16',
        blocksize=CHUNK_SIZE // 2,
        callback=audio_callback,
    )
    with stream:
        while True:
            try:
                # Non-blocking get: a blocking get() would stall the event loop
                chunk = audio_queue.get_nowait()
                yield chunk
            except queue.Empty:
                await asyncio.sleep(0.01)
                continue

async def main():
    ls = LiteSpeech()
    async for result in ls.speech_to_text_stream(
        audio_stream=microphone_stream(),
        provider="deepgram/nova-2",
        language="en",
        sample_rate=SAMPLE_RATE,
        channels=CHANNELS,
        encoding="linear16",
        interim_results=True,
    ):
        if result.is_final:
            print(f"\n✓ {result.text}")
        else:
            print(f"\r {result.text}...", end="", flush=True)

asyncio.run(main())
Batch Processing Multiple Files
import asyncio
from pathlib import Path
from litespeech import LiteSpeech

async def transcribe_all(directory: str):
    ls = LiteSpeech()
    # Materialize the glob: a generator would be exhausted before zip() below
    audio_files = list(Path(directory).glob("*.wav"))
    tasks = [
        ls.speech_to_text(str(f), provider="deepgram/nova-2")
        for f in audio_files
    ]
    results = await asyncio.gather(*tasks)
    return dict(zip(audio_files, results))

transcriptions = asyncio.run(transcribe_all("./recordings"))
for file, text in transcriptions.items():
    print(f"{file.name}: {text[:100]}...")
Development
Setup
# Clone repository
git clone https://github.com/your-org/litespeech.git
cd litespeech
# Install with dev dependencies
uv pip install -e ".[dev]"
# Install with audio support
uv pip install -e ".[audio]"
Testing
# Run all tests
pytest
# Run specific test file
pytest tests/test_audio.py
# Run with coverage
pytest --cov=litespeech --cov-report=html
# Run specific test
pytest tests/test_audio.py::test_wav_to_wav_no_conversion -v
Linting & Type Checking
# Format and lint with ruff
ruff check litespeech/
ruff format litespeech/
# Type check with mypy
mypy litespeech/
Project Structure
litespeech/
├── __init__.py # Public API exports
├── client.py # Main LiteSpeech class
├── config.py # API key configuration
├── exceptions.py # Exception hierarchy
├── version.py # Version info
├── providers/
│ ├── base.py # Abstract provider interfaces
│ ├── registry.py # Provider discovery and routing
│ ├── tts/ # TTS providers (elevenlabs, deepgram, cartesia, openai, azure)
│ └── asr/ # ASR providers (elevenlabs, deepgram, cartesia, openai, azure)
├── audio/
│ ├── types.py # AudioFormat, AudioInfo, AudioChunk
│ ├── detection.py # Format detection
│ ├── conversion.py # Format conversion
│ ├── specs.py # Provider specifications
│ └── stream_validator.py # Stream validation
├── adapters/
│ ├── base.py # StreamAdapter interface
│ ├── auto_detect.py # LLM stream auto-detection
│ ├── openai_adapter.py
│ ├── anthropic_adapter.py
│ └── litellm_adapter.py
└── utils/
    ├── logging.py # Logging setup
    └── debug.py # Debug utilities
Adding a New Provider
- Create provider class in providers/{tts,asr}/{provider_name}.py:
import os

from litespeech.providers.base import ASRProvider, ProviderInfo, ProviderCapabilities

class MyProviderASRProvider(ASRProvider):
    """My Provider ASR implementation."""
    DEFAULT_MODEL = "my-model"

    def __init__(self, api_key: str | None = None):
        super().__init__(api_key)
        self._api_key = api_key or os.environ.get("MYPROVIDER_API_KEY")

    @property
    def info(self) -> ProviderInfo:
        return ProviderInfo(
            name="myprovider",
            display_name="My Provider",
            capabilities=ProviderCapabilities(asr_batch=True, asr_streaming=True),
            default_model=self.DEFAULT_MODEL,
        )

    @classmethod
    def get_audio_specs(cls, model: str | None = None) -> dict:
        return {"preferred": {"format": "wav"}, "recommended_sample_rate": 16000}

    async def speech_to_text(self, audio, model=None, language=None, **kwargs) -> str:
        # Implementation
        ...

    async def speech_to_text_stream(self, audio_stream, model=None, **kwargs):
        # Implementation
        ...
- Register in providers/{tts,asr}/__init__.py:
from .myprovider import MyProviderASRProvider
That's it! Your provider is now available: ls.speech_to_text(audio, provider="myprovider")
Publishing to PyPI
For maintainers: How to publish a new release
- Update version in pyproject.toml:
version = "0.2.0" # Bump version
- Commit and tag:
git add .
git commit -m "Release v0.2.0"
git tag v0.2.0
git push origin main --tags
- Build and publish:
# Clean old builds
rm -rf dist/ build/ *.egg-info
# Build with UV
uv build
# Test on TestPyPI (optional but recommended)
uv publish --publish-url https://test.pypi.org/legacy/
# Publish to PyPI
uv publish
What gets published:
- Wheel file (.whl) - Contains litespeech/ package code
- Source distribution (.tar.gz) - Contains code + examples + docs
Note: Examples are included in the source distribution and visible on PyPI, but not installed with pip install. Users can find examples on GitHub or by downloading the source tarball.
License
MIT License