Unified SDK for speech operations (ASR/TTS) with streaming support across multiple providers

LiteSpeech

LiteSpeech provides a consistent interface for text-to-speech and speech-to-text across providers like ElevenLabs, Deepgram, Cartesia, OpenAI, and Azure. It features first-class support for streaming and seamless integration with LLM outputs.

Table of Contents

Features
Installation
Quick Start
Provider String Format
Supported Providers
API Reference
LLM Integration
Audio Processing
ASR Streaming Results
Configuration
Provider-Specific Details
Error Handling
Examples
Development
License

Features

Multi-Provider Support: ElevenLabs, Deepgram, Cartesia, OpenAI, Azure Speech Services
Streaming-First: True streaming TTS and ASR where supported
LLM Integration: Auto-detect and pipe OpenAI/Anthropic/LiteLLM streams to TTS
Unified API: Same interface across all providers
Sync + Async: Primary async interface with sync wrapper
Audio Preprocessing: Auto-detect and convert audio formats
Interim Results: Real-time partial transcriptions with clear final/interim marking
Deduplication: Smart filtering of duplicate transcripts in streaming ASR

Installation

pip install litespeech

With audio conversion support (recommended for format conversion):

pip install "litespeech[audio]"

With development dependencies:

pip install "litespeech[dev]"

Quick Start

Text-to-Speech

from litespeech import LiteSpeech
import asyncio

async def main():
    ls = LiteSpeech()

    # Batch TTS
    audio = await ls.text_to_speech(
        text="Hello, world!",
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb"
    )

    with open("output.mp3", "wb") as f:
        f.write(audio)

    # Streaming TTS
    async for chunk in ls.text_to_speech_stream(
        text="Hello, this is streaming TTS!",
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb",
        output_format="pcm_16000"
    ):
        # Play or process audio chunk
        pass

asyncio.run(main())

Speech-to-Text

from litespeech import LiteSpeech
import asyncio

async def main():
    ls = LiteSpeech()

    # Batch ASR
    text = await ls.speech_to_text(
        audio="recording.mp3",
        provider="deepgram/nova-2"
    )
    print(text)

    # Streaming ASR with interim results
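    async def microphone_stream():
        # Placeholder: yield raw 16-bit mono PCM chunks (bytes) from your
        # microphone; see "Microphone Streaming with sounddevice" below
        for chunk in ():
            yield chunk

    async for result in ls.speech_to_text_stream(
        audio_stream=microphone_stream(),
        provider="deepgram/nova-2",
        sample_rate=16000,  # must match the audio you yield
        interim_results=True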
    ):
        if result.is_final:
            print(f"✓ {result.text}")
        else:
            print(f"  {result.text}...", end="\r", flush=True)

asyncio.run(main())

LLM to TTS (Voice Assistant)

from openai import AsyncOpenAI
from litespeech import LiteSpeech
import asyncio

async def main():
    openai = AsyncOpenAI()
    ls = LiteSpeech()

    # Get LLM stream
    llm_stream = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True
    )

    # Pipe directly to TTS (auto-detects OpenAI stream!)
    async for audio_chunk in ls.text_to_speech_stream(
        text_stream=llm_stream,  # Works with OpenAI, Anthropic, LiteLLM
        provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb"
    ):
        # Play audio in real-time
        pass

asyncio.run(main())

Sync Interface

from litespeech import LiteSpeech

ls = LiteSpeech()

# Use sync interface
audio = ls.sync.text_to_speech(
    text="Hello, world!",
    provider="elevenlabs/eleven_turbo_v2_5"
)

text = ls.sync.speech_to_text(
    audio="recording.mp3",
    provider="deepgram/nova-2"
)

# Streaming (returns sync iterators); process() and mic_stream stand in for your own code
for chunk in ls.sync.text_to_speech_stream(
    text="Hello",
    provider="elevenlabs/eleven_turbo_v2_5"
):
    process(chunk)

for result in ls.sync.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram/nova-2",
    interim_results=True
):
    print(result.text, result.is_final)

Provider String Format

LiteSpeech uses a unified provider string format: provider/model[/voice]

TTS Examples:

elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb - ElevenLabs with specific voice
deepgram/aura-asteria-en - Deepgram Aura
cartesia/sonic-3 - Cartesia Sonic
openai/tts-1/alloy - OpenAI TTS
azure/en-US-AvaMultilingualNeural - Azure Speech

ASR Examples:

deepgram/nova-2 - Deepgram Nova
elevenlabs/scribe_v1 - ElevenLabs Scribe (batch)
elevenlabs - ElevenLabs Scribe (streaming, uses scribe_v2_realtime)
cartesia/ink-whisper - Cartesia Ink
openai/whisper-1 - OpenAI Whisper
azure - Azure Speech-to-Text
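To make the provider/model[/voice] convention concrete, here is a minimal sketch of how such a string can be split into its parts. `parse_provider_string` and `ProviderSpec` are hypothetical names used for illustration only. They are not part of LiteSpeech's API. The sketch also does not special-case Azure, where the second segment is a voice name rather than a model.

```python
from typing import List, NamedTuple, Optional


class ProviderSpec(NamedTuple):
    """Parts of a provider string (hypothetical helper, not LiteSpeech's API)."""
    provider: str
    model: Optional[str]
    voice: Optional[str]


def parse_provider_string(spec: str) -> ProviderSpec:
    """Split 'provider/model[/voice]' into its segments.

    Assumptions: segments are separated by '/', there are at most three,
    none may be empty, and missing trailing segments become None.
    """
    parts = spec.split("/")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"Invalid provider string: {spec!r}")
    # Pad with None so that "deepgram/nova-2" gets voice=None, etc.
    padded: List[Optional[str]] = [*parts, *([None] * (3 - len(parts)))]
    return ProviderSpec(parts[0], padded[1], padded[2])


print(parse_provider_string("openai/tts-1/alloy"))
# ProviderSpec(provider='openai', model='tts-1', voice='alloy')
print(parse_provider_string("deepgram/nova-2"))
# ProviderSpec(provider='deepgram', model='nova-2', voice=None)
```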

Supported Providers

Provider	TTS Batch	TTS Streaming	ASR Batch	ASR Streaming
ElevenLabs	✅	✅	✅	✅
Deepgram	✅	✅	✅	✅
Cartesia	✅	✅	✅	✅
OpenAI	✅	❌	✅	❌
Azure	✅	✅	✅	❌

API Reference

LiteSpeech Client

from litespeech import LiteSpeech

ls = LiteSpeech(
    elevenlabs_api_key="sk_...",      # Optional, uses ELEVENLABS_API_KEY env var
    deepgram_api_key="...",            # Optional, uses DEEPGRAM_API_KEY env var
    cartesia_api_key="...",            # Optional, uses CARTESIA_API_KEY env var
    openai_api_key="sk-...",           # Optional, uses OPENAI_API_KEY env var
    azure_speech_key="...",            # Optional, uses AZURE_SPEECH_KEY env var
    azure_speech_region="eastus"       # Optional, uses AZURE_SPEECH_REGION env var
)

Utility Methods:

# List available providers
ls.list_providers()                    # All providers
ls.list_providers(capability="tts")    # Only TTS providers
ls.list_providers(capability="asr")    # Only ASR providers

# Check streaming support
ls.supports_streaming("deepgram", "tts")   # True
ls.supports_streaming("openai", "tts")     # False

# Access provider registry
ls.registry.list_tts_providers()
ls.registry.list_asr_providers()

Text-to-Speech

Batch TTS

audio = await ls.text_to_speech(
    text="Hello, world!",
    provider="elevenlabs/eleven_turbo_v2_5/JBFqnCBsd6RMkjVDRZzb",
    voice=None,           # Override voice from provider string
    language=None,        # Language code (provider-specific)
    output_format="mp3",  # Output format (mp3, wav, pcm, etc.)
    **kwargs              # Provider-specific options
)
# Returns: bytes (audio data)

Streaming TTS

async for chunk in ls.text_to_speech_stream(
    text="Hello, this is streaming!",   # Static text
    # OR (pass exactly one of text / text_stream)
    text_stream=llm_stream,             # Async iterator or LLM stream
    provider="elevenlabs/eleven_turbo_v2_5",
    voice=None,
    language=None,
    output_format="pcm_16000",
    sample_rate=16000,                  # Optional: for providers that support it
    **kwargs
):
    # Process audio chunk
    pass
# Yields: bytes (audio chunks)

Note: Some providers (Cartesia, Deepgram) accept sample_rate as a separate parameter for streaming output.

Output Formats (provider-specific):

Provider	Formats
ElevenLabs	`mp3_44100_128`, `mp3_32000_128`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `pcm_44100`
Deepgram	`mp3`, `linear16`, `alaw`, `mulaw`
Cartesia	`pcm_s16le`, `wav`, `mp3`
OpenAI	`mp3`, `opus`, `aac`, `flac`
Azure	`audio-16khz-128kbitrate-mono-mp3`, `audio-24khz-160kbitrate-mono-mp3`, `riff-16khz-16bit-mono-pcm`
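When you consume raw PCM output such as `pcm_16000`, it helps to know how much audio each chunk holds, for example for buffering or latency measurements. The sketch below assumes a `pcm_<sample_rate>` format string with signed 16-bit mono samples. That layout is typical for these raw PCM formats, but you should confirm it against each provider's documentation. `pcm_chunk_duration_ms` is a hypothetical helper.

```python
def pcm_chunk_duration_ms(chunk: bytes, output_format: str,
                          channels: int = 1, sample_width: int = 2) -> float:
    """Return the playback duration of a raw PCM chunk in milliseconds.

    Assumes output_format looks like 'pcm_<sample_rate>' and that samples
    are sample_width bytes each (2 = 16-bit), interleaved across channels.
    """
    prefix, _, rate = output_format.partition("_")
    if prefix != "pcm" or not rate.isdigit():
        raise ValueError(f"Expected a pcm_<rate> format, got {output_format!r}")
    bytes_per_second = int(rate) * channels * sample_width
    return len(chunk) * 1000 / bytes_per_second


# One second of 16 kHz, 16-bit mono audio is 16000 * 2 = 32000 bytes
print(pcm_chunk_duration_ms(bytes(32000), "pcm_16000"))  # 1000.0
```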

Speech-to-Text

Batch ASR

text = await ls.speech_to_text(
    audio="recording.mp3",   # File path (str or Path) or bytes
    provider="deepgram/nova-2",
    language=None,           # Language code
    preprocess=True,         # Auto-detect and convert audio format
    **kwargs                 # Provider-specific options (e.g., punctuate, smart_format)
)
# Returns: str (transcribed text)

Provider-specific kwargs:

Deepgram: punctuate, smart_format, diarize, detect_language, paragraphs, utterances
ElevenLabs: language is auto-detected when language is not set
OpenAI: response_format, temperature

Streaming ASR

async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,     # AsyncIterator[bytes] of audio chunks
    provider="deepgram/nova-2",
    language=None,
    interim_results=False,       # Include partial transcriptions
    deduplicate=True,            # Filter duplicate transcripts (default: True)
    sample_rate=16000,           # Audio sample rate (MUST match your audio!)
    channels=1,                  # Number of audio channels
    encoding="linear16",         # Audio encoding
    **kwargs                     # Provider-specific options
):
    print(result.text, result.is_final)
# Yields: ASRResult(text: str, is_final: bool)

Provider-specific kwargs for streaming:

Deepgram: diarize, vad_events, endpointing
ElevenLabs: audio_format (e.g., pcm_16000)
Cartesia: encoding (e.g., pcm_s16le)

Sync Interface

All async methods are available synchronously via the .sync property:

# Batch operations
audio = ls.sync.text_to_speech(text="Hello", provider="elevenlabs")
text = ls.sync.speech_to_text(audio="file.wav", provider="deepgram")

# Streaming operations (returns sync iterators)
for chunk in ls.sync.text_to_speech_stream(text="Hello", provider="elevenlabs"):
    process(chunk)

for result in ls.sync.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    interim_results=True
):
    print(result.text)
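This README does not document how the .sync property works internally. As a rough illustration of the idea, the sketch below drives an async iterator from synchronous code on a private event loop. `iterate_sync` is a hypothetical helper. Like most wrappers of this kind, it cannot be used from code that is already running inside an event loop, such as a Jupyter cell or an async web handler, because `run_until_complete` raises `RuntimeError` there.

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")


def iterate_sync(source: AsyncIterator[T]) -> Iterator[T]:
    """Yield the items of an async iterator from synchronous code."""
    loop = asyncio.new_event_loop()

    async def next_item() -> T:
        return await source.__anext__()

    try:
        while True:
            try:
                # Run the event loop just long enough to produce one item
                yield loop.run_until_complete(next_item())
            except StopAsyncIteration:
                return
    finally:
        # Let async generators run their cleanup code (e.g. after an early break)
        aclose = getattr(source, "aclose", None)
        if aclose is not None:
            loop.run_until_complete(aclose())
        loop.close()


async def count_to(n: int):
    for i in range(n):
        await asyncio.sleep(0)
        yield i


print(list(iterate_sync(count_to(3))))  # [0, 1, 2]
```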

LLM Integration

LiteSpeech automatically detects and adapts LLM completion streams for TTS.

Supported LLM Providers

Provider	Stream Types
OpenAI	`AsyncStream[ChatCompletionChunk]`, Responses API
Anthropic	`AsyncMessageStream`, `MessageStream`, `.text_stream`
LiteLLM	LiteLLM completion streams
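This README does not show the adapters themselves. Conceptually, adapting an LLM stream means pulling plain text out of each provider's chunk objects. The sketch below does that by duck typing:

- OpenAI Chat Completions chunks carry text in `choices[0].delta.content`.
- Anthropic's message stream exposes a `text_stream` async iterator.
- Plain strings pass through unchanged.

`iter_text_deltas` is a hypothetical function, and LiteSpeech's real adapters may detect stream types differently. The sketch does not handle the Responses API or LiteLLM.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, AsyncIterator


async def iter_text_deltas(stream: Any) -> AsyncIterator[str]:
    """Yield plain text pieces from an LLM stream (duck-typed sketch)."""
    # Anthropic AsyncMessageStream: already exposes text via .text_stream
    if hasattr(stream, "text_stream"):
        async for text in stream.text_stream:
            yield text
        return
    async for item in stream:
        if isinstance(item, str):  # plain async iterator of strings
            if item:
                yield item
            continue
        # OpenAI ChatCompletionChunk: text lives in choices[0].delta.content;
        # some chunks (e.g. the last one) carry no choices or no content
        choices = getattr(item, "choices", None)
        if choices:
            content = getattr(choices[0].delta, "content", None)
            if content:
                yield content


async def fake_openai_stream():
    """Mimics the shape of OpenAI streaming chunks with SimpleNamespace."""
    for piece in ["Hel", None, "lo!"]:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def collect(stream: Any) -> list:
    return [text async for text in iter_text_deltas(stream)]


print(asyncio.run(collect(fake_openai_stream())))  # ['Hel', 'lo!']
```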

OpenAI Example

import asyncio

from openai import AsyncOpenAI
from litespeech import LiteSpeech

async def main():
    openai = AsyncOpenAI()
    ls = LiteSpeech()

    llm_stream = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a joke"}],
        stream=True
    )

    # Auto-detected and adapted!
    async for audio in ls.text_to_speech_stream(
        text_stream=llm_stream,
        provider="elevenlabs/eleven_turbo_v2_5"
    ):
        play_audio(audio)

asyncio.run(main())

Anthropic Example

import asyncio

from anthropic import AsyncAnthropic
from litespeech import LiteSpeech

async def main():
    anthropic = AsyncAnthropic()
    ls = LiteSpeech()

    # messages.stream() returns a context manager; entering it yields the
    # AsyncMessageStream listed in the table above
    async with anthropic.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Say something interesting"}]
    ) as stream:
        # Works with Anthropic too!
        async for audio in ls.text_to_speech_stream(
            text_stream=stream,
            provider="elevenlabs/eleven_turbo_v2_5"
        ):
            play_audio(audio)  # placeholder for your playback code

asyncio.run(main())

Plain Async Iterator (Simulated LLM)

import asyncio

from litespeech import LiteSpeech

async def simulate_llm_stream(text: str, delay: float = 0.1):
    """Simulate LLM token streaming by yielding words with a delay."""
    words = text.split()
    for i, word in enumerate(words):
        await asyncio.sleep(delay)
        yield word if i == 0 else f" {word}"

async def main():
    ls = LiteSpeech()

    text = "Hello! This is simulated LLM output being streamed to TTS."

    async for audio in ls.text_to_speech_stream(
        text_stream=simulate_llm_stream(text),
        provider="cartesia/sonic-3",
        voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
        language="en",
        sample_rate=16000,
    ):
        play_audio(audio)

asyncio.run(main())

Audio Processing

Audio Format Detection

LiteSpeech automatically detects audio formats via magic bytes and header parsing:

WAV: RIFF header, sample rate, channels, bit depth
MP3: ID3 tags, sync words, MPEG version, bitrate
FLAC: STREAMINFO metadata block
OGG/OPUS: OggS container
WEBM: EBML header
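The following simplified sketch illustrates magic-byte detection. It checks only the leading signatures of the formats listed above. Unlike the real detector, it does not parse sample rate, channels, or bitrate. `detect_audio_format` is a hypothetical name.

```python
from typing import Optional


def detect_audio_format(header: bytes) -> Optional[str]:
    """Guess the audio container from its first bytes (simplified sketch)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"   # RIFF chunk whose form type is WAVE
    if header[:3] == b"ID3":
        return "mp3"   # ID3v2 tag placed before the MPEG frames
    if (len(header) >= 2 and header[0] == 0xFF
            and (header[1] & 0xE0) == 0xE0    # 11-bit MPEG frame sync
            and (header[1] & 0x06) != 0):     # layer 00 is reserved (used by AAC ADTS)
        return "mp3"
    if header[:4] == b"fLaC":
        return "flac"  # FLAC stream marker, followed by the STREAMINFO block
    if header[:4] == b"OggS":
        return "ogg"   # Ogg page header (Vorbis or Opus payload)
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML header (shared by WebM and Matroska)
    return None


print(detect_audio_format(b"RIFF\x24\x00\x00\x00WAVEfmt "))  # wav
```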

Audio Conversion

With litespeech[audio] installed, automatic format conversion is available:

# Auto-converts to provider's preferred format
text = await ls.speech_to_text(
    audio="recording.m4a",  # Will be converted to WAV/PCM
    provider="deepgram/nova-2",
    preprocess=True         # Default: True
)

Supported Conversions:

Format changes (MP3 → WAV, etc.)
Sample rate resampling
Channel mixing (stereo → mono)
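Channel mixing is the simplest of these conversions. For interleaved signed 16-bit little-endian PCM (the `linear16` / `pcm_s16le` layout), stereo becomes mono by averaging each left/right pair. The sketch below does this with the standard library only, under these assumptions:

- The input uses the layout above.
- The `"h"` array type is 2 bytes, as it is on mainstream platforms.

The function name is hypothetical. This is not LiteSpeech's converter, which may use a different library or mixing rule.

```python
import array
import struct
import sys


def stereo_to_mono_s16le(pcm: bytes) -> bytes:
    """Downmix interleaved stereo s16le PCM to mono by averaging L and R."""
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(pcm)       # raises ValueError on an odd byte count
    if sys.byteorder == "big":
        samples.byteswap()       # array uses native order; input is little-endian
    if len(samples) % 2:
        raise ValueError("stereo PCM needs an even number of samples")
    # Average each (left, right) pair; the mean of two int16 values fits in int16
    mono = array.array("h", [(samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)])
    if sys.byteorder == "big":
        mono.byteswap()
    return mono.tobytes()


stereo = struct.pack("<4h", 100, 300, -100, -300)  # frames (100, 300), (-100, -300)
print(struct.unpack("<2h", stereo_to_mono_s16le(stereo)))  # (200, -200)
```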

Streaming Audio Parameters

For streaming ASR, you must specify the audio parameters yourself, because raw PCM has no header to auto-detect them from:

async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram/nova-2",
    sample_rate=16000,       # REQUIRED: Audio sample rate
    channels=1,              # Audio channels (default: 1)
    encoding="linear16",     # Audio encoding (default: linear16)
):
    print(result.text)
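Because raw PCM carries no header, the byte size of a chunk follows directly from these parameters. For linear16, one millisecond is `sample_rate * channels * 2 / 1000` bytes, so 100 ms at 16 kHz mono is 3200 bytes. The hypothetical helper below uses that formula to turn a buffer of raw PCM into an async iterator suitable for `audio_stream`. The buffer could be, for example, audio you have already decoded from a file. The helper can optionally pace the chunks like a live microphone.

```python
import asyncio
from typing import AsyncIterator, List


async def pcm_chunks(pcm: bytes, sample_rate: int = 16000, channels: int = 1,
                     chunk_ms: int = 100, realtime: bool = True) -> AsyncIterator[bytes]:
    """Yield headerless linear16 PCM in chunks of about chunk_ms each."""
    # linear16 = 2 bytes per sample, one sample per channel per frame
    bytes_per_chunk = sample_rate * channels * 2 * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]
        if realtime:
            await asyncio.sleep(chunk_ms / 1000)  # pace like a live microphone


async def chunk_sizes(pcm: bytes) -> List[int]:
    return [len(chunk) async for chunk in pcm_chunks(pcm, realtime=False)]


# 8000 bytes of 16 kHz mono audio = 250 ms: two 100 ms chunks plus a remainder
print(asyncio.run(chunk_sizes(bytes(8000))))  # [3200, 3200, 1600]
```

For example, pass `audio_stream=pcm_chunks(raw_pcm, sample_rate=16000)` together with `sample_rate=16000` in the `speech_to_text_stream` call, so the two values stay in sync.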

ASR Streaming Results

All ASR streaming methods return AsyncIterator[ASRResult]:

from dataclasses import dataclass

# Reference definition; in your code, use: from litespeech import ASRResult
@dataclass
class ASRResult:
    text: str       # Transcribed text
    is_final: bool  # True for final results, False for interim

Interim vs Final Results

async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram/nova-2",
    interim_results=True,  # Enable interim results
):
    if result.is_final:
        # Committed transcription - won't change
        print(f"✓ Final: {result.text}")
    else:
        # Partial transcription - may change
        print(f"  Interim: {result.text}...", end="\r", flush=True)

Behavior:

interim_results=False (default): Only yields final results (is_final=True)
interim_results=True: Yields both interim and final results
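A common way to present these results is to keep final text permanently and let interim text overwrite a single pending line. The sketch below folds a sequence of results into that shape. It makes three assumptions:

- It uses a local stand-in for ASRResult with the two documented fields.
- Each final result covers a new segment of speech rather than repeating earlier finals.
- `assemble_transcript` is a hypothetical helper, not part of LiteSpeech.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ASRResult:  # local stand-in with the two documented fields
    text: str
    is_final: bool


def assemble_transcript(results: Iterable[ASRResult]) -> Tuple[str, str]:
    """Return (committed text, pending interim text) after a run of results."""
    committed: List[str] = []
    pending = ""
    for result in results:
        if result.is_final:
            committed.append(result.text)  # final: never changes again
            pending = ""
        else:
            pending = result.text          # interim: replaces the previous guess
    return " ".join(committed), pending


print(assemble_transcript([
    ASRResult("hello", False),
    ASRResult("hello world", True),
    ASRResult("how are", False),
]))  # ('hello world', 'how are')
```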

Deduplication

Most ASR providers send the full accumulated transcript with each update (not deltas):

Provider sends: "Hello" → "Hello world" → "Hello world how" → "Hello world how are you"

With deduplicate=True (default): Only yields when text changes

async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    deduplicate=True  # Default
):
    # Only unique text values are yielded
    print(result.text)

With deduplicate=False: Pass through every message

async for result in ls.speech_to_text_stream(
    audio_stream=mic_stream,
    provider="deepgram",
    deduplicate=False  # Raw provider behavior
):
    # May receive duplicate values
    print(result.text)
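The deduplication idea can be sketched as a filter that remembers the last result it let through. This version compares both `text` and `is_final`. As a result, a final that repeats the last interim text is still delivered, so the commit signal is not lost. The README describes LiteSpeech's rule only as yielding when the text changes, so the actual behavior may differ. ASRResult here is a local stand-in, and `deduplicate` is a hypothetical function.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional, Tuple


@dataclass
class ASRResult:  # local stand-in with the two documented fields
    text: str
    is_final: bool


async def deduplicate(results: AsyncIterator[ASRResult]) -> AsyncIterator[ASRResult]:
    """Skip any result identical to the one yielded just before it."""
    last: Optional[Tuple[str, bool]] = None
    async for result in results:
        key = (result.text, result.is_final)
        if key != last:
            last = key
            yield result


async def demo() -> List[ASRResult]:
    async def provider_messages():
        for text, final in [("Hello", False), ("Hello", False),
                            ("Hello world", False), ("Hello world", True)]:
            yield ASRResult(text, final)
    return [r async for r in deduplicate(provider_messages())]


for r in asyncio.run(demo()):
    print(r)
# ASRResult(text='Hello', is_final=False)
# ASRResult(text='Hello world', is_final=False)
# ASRResult(text='Hello world', is_final=True)
```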

Configuration

API Keys

LiteSpeech accepts explicit parameter names that map to environment variables.

Option 1: Environment Variables (Recommended)

export ELEVENLABS_API_KEY=sk_...
export DEEPGRAM_API_KEY=...
export CARTESIA_API_KEY=...
export OPENAI_API_KEY=sk-...
export AZURE_SPEECH_KEY=...
export AZURE_SPEECH_REGION=eastus
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export GOOGLE_PROJECT_ID=my-project

ls = LiteSpeech()  # Auto-detects from environment

Option 2: Explicit Parameters

ls = LiteSpeech(
    elevenlabs_api_key="sk_...",
    deepgram_api_key="...",
    cartesia_api_key="...",
    openai_api_key="sk-...",
    azure_speech_key="...",
    azure_speech_region="eastus"
)

Parameter Mapping:

Parameter	Environment Variable
`elevenlabs_api_key`	`ELEVENLABS_API_KEY`
`openai_api_key`	`OPENAI_API_KEY`
`deepgram_api_key`	`DEEPGRAM_API_KEY`
`cartesia_api_key`	`CARTESIA_API_KEY`
`azure_speech_key`	`AZURE_SPEECH_KEY`
`azure_speech_region`	`AZURE_SPEECH_REGION`
`google_application_credentials`	`GOOGLE_APPLICATION_CREDENTIALS`
`google_project_id`	`GOOGLE_PROJECT_ID`

Validation:

# ❌ Raises ValueError - unknown parameter
ls = LiteSpeech(invalid_param="value")

# ✅ Correct usage
ls = LiteSpeech(cartesia_api_key="sk_car_...")
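The following sketch combines the parameter-to-environment mapping above with the ValueError on unknown parameters. `resolve_credentials` is hypothetical and is not LiteSpeech's configuration code. It assumes that an explicitly passed value takes precedence over the environment variable. That is the usual convention, but this README does not spell it out.

```python
import os
from typing import Dict, Mapping, Optional

# Parameter name -> environment variable, as listed in the table above
PARAM_TO_ENV: Dict[str, str] = {
    "elevenlabs_api_key": "ELEVENLABS_API_KEY",
    "openai_api_key": "OPENAI_API_KEY",
    "deepgram_api_key": "DEEPGRAM_API_KEY",
    "cartesia_api_key": "CARTESIA_API_KEY",
    "azure_speech_key": "AZURE_SPEECH_KEY",
    "azure_speech_region": "AZURE_SPEECH_REGION",
    "google_application_credentials": "GOOGLE_APPLICATION_CREDENTIALS",
    "google_project_id": "GOOGLE_PROJECT_ID",
}


def resolve_credentials(env: Optional[Mapping[str, str]] = None,
                        **params: Optional[str]) -> Dict[str, Optional[str]]:
    """Merge explicit parameters with environment variables (explicit wins)."""
    env = os.environ if env is None else env
    unknown = sorted(set(params) - set(PARAM_TO_ENV))
    if unknown:
        raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")
    return {name: params.get(name) or env.get(var)
            for name, var in PARAM_TO_ENV.items()}


creds = resolve_credentials(env={"DEEPGRAM_API_KEY": "dg-from-env"},
                            cartesia_api_key="sk_car_explicit")
print(creds["deepgram_api_key"], creds["cartesia_api_key"])  # dg-from-env sk_car_explicit
```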

Debug Logging

# Enable debug logging for all components
export LITESPEECH_LOG_LEVEL=DEBUG
python your_script.py

# Log format options
export LITESPEECH_LOG_FORMAT=detailed  # or simple, json

Log Levels:

DEBUG: Verbose WebSocket/chunk details
INFO: General operation info
WARNING: Non-optimal configurations (default)
ERROR: Errors and exceptions
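For reference, the sketch below maps a level name such as LITESPEECH_LOG_LEVEL onto Python's standard logging levels, falling back to WARNING. `log_level_from_env` is a hypothetical helper. It does not reproduce LiteSpeech's own logging setup or the LITESPEECH_LOG_FORMAT options.

```python
import logging
import os
from typing import Mapping, Optional


def log_level_from_env(env: Optional[Mapping[str, str]] = None) -> int:
    """Translate LITESPEECH_LOG_LEVEL into a logging level (WARNING by default)."""
    env = os.environ if env is None else env
    name = env.get("LITESPEECH_LOG_LEVEL", "WARNING").strip().upper()
    level = logging.getLevelName(name)  # int for known names, a str otherwise
    return level if isinstance(level, int) else logging.WARNING


print(log_level_from_env({"LITESPEECH_LOG_LEVEL": "debug"}) == logging.DEBUG)  # True
```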

Provider-Specific Details

ElevenLabs

TTS:

Models: eleven_turbo_v2_5, eleven_multilingual_v2, eleven_monolingual_v1
Default voice: JBFqnCBsd6RMkjVDRZzb (George)
Formats: mp3_44100_128, mp3_32000_128, pcm_16000, pcm_22050, pcm_24000, pcm_44100

ASR:

Batch models: scribe_v1, scribe_v1_experimental
Streaming model: scribe_v2_realtime (different from batch!)
Format: pcm_16000

Important: Batch and streaming use different models. Using scribe_v1 for streaming will raise an error.

# TTS with specific voice
audio = await ls.text_to_speech(
    text="Hello world",
    provider="elevenlabs/eleven_turbo_v2_5",
    voice="JBFqnCBsd6RMkjVDRZzb",  # George voice
    output_format="mp3_44100_128",
)

# Batch ASR (uses scribe_v1)
text = await ls.speech_to_text(audio, provider="elevenlabs/scribe_v1")

# Streaming ASR (must use scribe_v2_realtime or omit model)
async for result in ls.speech_to_text_stream(
    audio_stream=mic,
    provider="elevenlabs",  # Defaults to scribe_v2_realtime
    sample_rate=16000,
    interim_results=True,   # without this, only final results are yielded
):
    if result.is_final:
        print(f"Final: {result.text}")
    else:
        print(f"Interim: {result.text}")

Deepgram

TTS (Aura):

Models: Aura voices follow pattern aura-{voice}-{language} (e.g., aura-asteria-en)
Voices: asteria, luna, stella, athena, hera, orion, arcas, perseus, angus, orpheus, helios, zeus
You can specify voice and language separately: provider="deepgram/aura" + voice="asteria" + language="en"
Formats: mp3, linear16, alaw, mulaw
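Because of the aura-{voice}-{language} pattern, a voice and language pair maps mechanically onto a model name. This hypothetical helper builds the name from the voice list above and rejects unknown voices. It assumes `language` is a short code like the `en` used in the examples, and it does not cover any other Deepgram naming scheme.

```python
AURA_VOICES = {
    "asteria", "luna", "stella", "athena", "hera", "orion",
    "arcas", "perseus", "angus", "orpheus", "helios", "zeus",
}


def aura_model_name(voice: str, language: str = "en") -> str:
    """Build a Deepgram Aura model id following aura-{voice}-{language}."""
    voice = voice.strip().lower()
    if voice not in AURA_VOICES:
        raise ValueError(f"Unknown Aura voice {voice!r}; expected one of {sorted(AURA_VOICES)}")
    return f"aura-{voice}-{language}"


print(aura_model_name("asteria"))       # aura-asteria-en
print(aura_model_name("Orion", "en"))   # aura-orion-en
```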

ASR:

Models: nova-3, nova-2, nova-2-general, nova-2-meeting, nova-2-phonecall, nova-2-medical, enhanced, base
Recommended: 16kHz PCM mono
Language: ISO-639-1, ISO-639-3, BCP-47, or multi for auto-detection
Provider-specific kwargs: punctuate, smart_format, diarize, detect_language

# Nova-2 with language and formatting options
text = await ls.speech_to_text(
    audio="recording.wav",
    provider="deepgram/nova-2",
    language="en-US",
    punctuate=True,       # Add punctuation (default: True)
    smart_format=True,    # Smart formatting (default: True)
)

# Deepgram Aura TTS streaming
async for chunk in ls.text_to_speech_stream(
    text="Hello world",
    provider="deepgram/aura",
    voice="asteria",
    language="en",
    sample_rate=24000,
):
    play_audio(chunk)

Cartesia

TTS:

Models: sonic-3, sonic-2, sonic
Voices: UUID format (e.g., 79a125e8-cd45-4c13-8a67-188112f4dd22)
Formats: pcm_s16le, wav, mp3
Streaming sample rate: 16000 Hz (batch can use 44100 Hz)

ASR:

Model: ink-whisper
Encoding: pcm_s16le, linear16

# Cartesia TTS streaming
async for chunk in ls.text_to_speech_stream(
    text="Hello world",
    provider="cartesia/sonic-3",
    voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
    language="en",
    sample_rate=16000,
):
    play_audio(chunk)
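Streamed `pcm_s16le` audio has no header, so saving it as a playable file means wrapping it in a WAV container. The standard-library sketch below assumes 16-bit mono samples at the sample_rate you requested (16000 in the example above). `pcm_chunks_to_wav` is a hypothetical helper.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 16000,
                      channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM chunks in a WAV (RIFF) container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)    # must equal the rate the audio was generated at
        for chunk in chunks:
            wav.writeframes(chunk)       # header sizes are patched as data grows
    return buffer.getvalue()


# Two fake "streamed" chunks: 100 + 60 frames of 16-bit mono silence
data = pcm_chunks_to_wav([bytes(200), bytes(120)])
with wave.open(io.BytesIO(data), "rb") as wav:
    print(wav.getframerate(), wav.getnframes())  # 16000 160
```

Because the stream is asynchronous, collect the chunks first, for example `chunks = [c async for c in ls.text_to_speech_stream(...)]`. Then write `pcm_chunks_to_wav(chunks, sample_rate=16000)` to a .wav file.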

OpenAI

TTS (Batch only, no streaming):

Models: tts-1, tts-1-hd
Voices: alloy, echo, fable, onyx, nova, shimmer

ASR (Batch only, no streaming):

Model: whisper-1

# OpenAI TTS
audio = await ls.text_to_speech(
    text="Hello",
    provider="openai/tts-1/alloy"
)

# OpenAI Whisper
text = await ls.speech_to_text(
    audio="recording.mp3",
    provider="openai/whisper-1"
)

Azure Speech Services

TTS:

Voices: Format like en-US-AvaMultilingualNeural
Requires: azure_speech_key and azure_speech_region
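Voice: two equivalent forms, both shown below: the full name in the provider string (azure/en-US-AvaMultilingualNeural), or provider="azure" with separate voice and language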

ASR (Batch only):

Requires: azure_speech_key and azure_speech_region
Language: BCP-47 format (e.g., en-US, es-MX)

ls = LiteSpeech(
    azure_speech_key="your-key",
    azure_speech_region="eastus"
)

# Azure TTS - Full format (voice in provider string)
audio = await ls.text_to_speech(
    text="Hello",
    provider="azure/en-US-AvaMultilingualNeural"
)

# Azure TTS - Split format (voice + language separate)
audio = await ls.text_to_speech(
    text="Hello",
    provider="azure",
    voice="AvaMultilingualNeural",
    language="en-US"
)

# Azure ASR (uses BCP-47 language codes)
text = await ls.speech_to_text(
    audio="recording.wav",
    provider="azure",
    language="en-US"
)

Error Handling

Exception Hierarchy

LiteSpeechError (base)
├── ProviderError          # Provider-specific errors (includes status_code)
├── StreamingError         # Streaming-related errors
├── AudioFormatError       # Audio format/conversion errors
├── AuthenticationError    # API key/authentication errors
├── ProviderNotFoundError  # Provider not found in registry
└── UnsupportedOperationError  # Operation not supported by provider

Usage

from litespeech import LiteSpeech
from litespeech.exceptions import (
    AuthenticationError,
    ProviderError,
    AudioFormatError,
    UnsupportedOperationError
)

ls = LiteSpeech()

async def transcribe(audio):
    try:
        return await ls.speech_to_text(audio, provider="deepgram/nova-2")
    except AuthenticationError as e:
        print(f"Auth failed for {e.provider}: {e}")
    except ProviderError as e:
        print(f"Provider error (status {e.status_code}): {e}")
    except AudioFormatError as e:
        print(f"Audio format issue: {e}")
    except UnsupportedOperationError as e:
        print(f"Not supported: {e}")
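Rate limits and transient server errors are usually worth retrying, while authentication or bad-request errors are not. This sketch retries on a ProviderError whose status_code is 429 or 5xx, using exponential backoff with jitter. It defines a local stand-in for ProviderError so that it runs on its own. In real code you would import ProviderError from litespeech.exceptions instead. The set of retryable status codes and the `with_retries` helper are assumptions for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # assumption: rate limit + server errors


class ProviderError(Exception):
    """Local stand-in for litespeech.exceptions.ProviderError."""

    def __init__(self, message: str, status_code: Optional[int] = None):
        super().__init__(message)
        self.status_code = status_code


async def with_retries(call: Callable[[], Awaitable[T]], attempts: int = 3,
                       base_delay: float = 0.5) -> T:
    """Await call(), retrying transient ProviderErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await call()
        except ProviderError as e:
            if e.status_code not in RETRYABLE_STATUS or attempt == attempts - 1:
                raise  # not transient, or out of attempts
            delay = base_delay * 2 ** attempt  # 0.5 s, 1 s, 2 s, ...
            await asyncio.sleep(delay + random.uniform(0, delay / 2))  # add jitter
    raise ValueError("attempts must be at least 1")


async def main() -> None:
    calls = 0

    async def flaky() -> str:
        nonlocal calls
        calls += 1
        if calls < 3:
            raise ProviderError("service unavailable", status_code=503)
        return "transcript"

    print(await with_retries(flaky, base_delay=0.01), calls)  # transcript 3


asyncio.run(main())
```

With LiteSpeech, this could look like `text = await with_retries(lambda: ls.speech_to_text(audio, provider="deepgram/nova-2"))`.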

Error Philosophy

Fail fast with actionable errors: Shows current state, expected state, and specific fixes
Warn, don't block: Non-optimal configs (like non-recommended sample rates) warn but proceed
Trust user for raw PCM: Can't validate format without headers - user must know their audio

Examples

FastAPI Voice Assistant

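from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from litespeech import LiteSpeech
from openai import AsyncOpenAI

app = FastAPI()
ls = LiteSpeech()
openai = AsyncOpenAI()

@app.websocket("/voice-assistant")
async def voice_assistant(ws: WebSocket):
    await ws.accept()

    # ASR: Transcribe user speech (raw 16-bit PCM at 16 kHz, matching sample_rate below)
    async def audio_stream():
        try:
            while True:
                data = await ws.receive_bytes()
                if not data:  # empty message: client finished sending audio
                    break
                yield data
        except WebSocketDisconnect:
            # receive_bytes() raises this when the client disconnects
            return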

    async for result in ls.speech_to_text_stream(
        audio_stream=audio_stream(),
        provider="deepgram/nova-2",
        sample_rate=16000
    ):
        if not result.is_final:
            continue

        # LLM: Generate response
        llm_stream = await openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": result.text}],
            stream=True
        )

        # TTS: Stream audio back
        async for audio in ls.text_to_speech_stream(
            text_stream=llm_stream,
            provider="elevenlabs/eleven_turbo_v2_5",
            output_format="pcm_16000"
        ):
            await ws.send_bytes(audio)

Microphone Streaming with sounddevice

import asyncio
from collections.abc import AsyncIterator

import sounddevice as sd
from litespeech import LiteSpeech

# Audio configuration
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 4096  # bytes per chunk; int16 mono = 2 bytes per frame

async def microphone_stream() -> AsyncIterator[bytes]:
    """Stream audio from microphone in real-time."""
    loop = asyncio.get_running_loop()
    # asyncio.Queue lets the event loop await new audio instead of blocking on it
    audio_queue: asyncio.Queue[bytes] = asyncio.Queue()

    def audio_callback(indata, frames, time, status):
        # Runs on sounddevice's audio thread, not on the event loop thread
        if status:
            print(f"[Audio Status] {status}")
        # tobytes() copies the data, which matters because sounddevice reuses buffers;
        # call_soon_threadsafe hands the chunk to the event loop safely
        loop.call_soon_threadsafe(audio_queue.put_nowait, indata.tobytes())

    # Open microphone stream
    stream = sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype='int16',
        blocksize=CHUNK_SIZE // 2,  # frames per callback
        callback=audio_callback,
    )

    with stream:
        while True:
            yield await audio_queue.get()

async def main():
    ls = LiteSpeech()

    async for result in ls.speech_to_text_stream(
        audio_stream=microphone_stream(),
        provider="deepgram/nova-2",
        language="en",
        sample_rate=SAMPLE_RATE,
        channels=CHANNELS,
        encoding="linear16",
        interim_results=True,
    ):
        if result.is_final:
            print(f"\n✓ {result.text}")
        else:
            print(f"\r  {result.text}...", end="", flush=True)

asyncio.run(main())

Batch Processing Multiple Files

import asyncio
from pathlib import Path
from litespeech import LiteSpeech

async def transcribe_all(directory: str):
    ls = LiteSpeech()
    # A list (not the glob generator) so it can be iterated twice: once here, once in zip()
    audio_files = sorted(Path(directory).glob("*.wav"))

    tasks = [
        ls.speech_to_text(str(f), provider="deepgram/nova-2")
        for f in audio_files
    ]

    results = await asyncio.gather(*tasks)
    return dict(zip(audio_files, results))

transcriptions = asyncio.run(transcribe_all("./recordings"))
for file, text in transcriptions.items():
    print(f"{file.name}: {text[:100]}...")
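`asyncio.gather` starts every request at once, which can trip provider rate limits for large directories. A common remedy is to bound concurrency with a semaphore. The sketch below uses a hypothetical `gather_limited` helper that runs an async function over a list of keys, at most `limit` at a time. It returns a dict like the example above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, TypeVar

K = TypeVar("K")
V = TypeVar("V")


async def gather_limited(keys: Iterable[K], work: Callable[[K], Awaitable[V]],
                         limit: int = 4) -> Dict[K, V]:
    """Run work(key) for every key with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)
    key_list: List[K] = list(keys)  # materialize so keys line up with results

    async def run(key: K) -> V:
        async with semaphore:  # waits while `limit` calls are already running
            return await work(key)

    results = await asyncio.gather(*(run(key) for key in key_list))
    return dict(zip(key_list, results))


async def fake_transcribe(name: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a network call
    return f"transcript of {name}"


print(asyncio.run(gather_limited(["a.wav", "b.wav"], fake_transcribe, limit=2)))
# {'a.wav': 'transcript of a.wav', 'b.wav': 'transcript of b.wav'}
```

With LiteSpeech, this could be `await gather_limited(sorted(Path(directory).glob("*.wav")), lambda f: ls.speech_to_text(str(f), provider="deepgram/nova-2"), limit=4)`.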

Development

Setup

# Clone repository
git clone https://github.com/your-org/litespeech.git
cd litespeech

# Install with dev dependencies
uv pip install -e ".[dev]"

# Install with audio support
uv pip install -e ".[audio]"

Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_audio.py

# Run with coverage
pytest --cov=litespeech --cov-report=html

# Run specific test
pytest tests/test_audio.py::test_wav_to_wav_no_conversion -v

Linting & Type Checking

# Format and lint with ruff
ruff check litespeech/
ruff format litespeech/

# Type check with mypy
mypy litespeech/

Project Structure

litespeech/
├── __init__.py          # Public API exports
├── client.py            # Main LiteSpeech class
├── config.py            # API key configuration
├── exceptions.py        # Exception hierarchy
├── version.py           # Version info
├── providers/
│   ├── base.py          # Abstract provider interfaces
│   ├── registry.py      # Provider discovery and routing
│   ├── tts/             # TTS providers (elevenlabs, deepgram, cartesia, openai, azure)
│   └── asr/             # ASR providers (elevenlabs, deepgram, cartesia, openai, azure)
├── audio/
│   ├── types.py         # AudioFormat, AudioInfo, AudioChunk
│   ├── detection.py     # Format detection
│   ├── conversion.py    # Format conversion
│   ├── specs.py         # Provider specifications
│   └── stream_validator.py  # Stream validation
├── adapters/
│   ├── base.py          # StreamAdapter interface
│   ├── auto_detect.py   # LLM stream auto-detection
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── litellm_adapter.py
└── utils/
    ├── logging.py       # Logging setup
    └── debug.py         # Debug utilities

Adding a New Provider

Create provider class in providers/{tts,asr}/{provider_name}.py:

import os

from litespeech.providers.base import ASRProvider, ProviderInfo, ProviderCapabilities

class MyProviderASRProvider(ASRProvider):
    """My Provider ASR implementation."""

    DEFAULT_MODEL = "my-model"

    def __init__(self, api_key: str | None = None):
        super().__init__(api_key)
        self._api_key = api_key or os.environ.get("MYPROVIDER_API_KEY")

    @property
    def info(self) -> ProviderInfo:
        return ProviderInfo(
            name="myprovider",
            display_name="My Provider",
            capabilities=ProviderCapabilities(asr_batch=True, asr_streaming=True),
            default_model=self.DEFAULT_MODEL,
        )

    @classmethod
    def get_audio_specs(cls, model: str | None = None) -> dict:
        return {"preferred": {"format": "wav"}, "recommended_sample_rate": 16000}

    async def speech_to_text(self, audio, model=None, language=None, **kwargs) -> str:
        # Implementation
        ...

    async def speech_to_text_stream(self, audio_stream, model=None, **kwargs):
        # Implementation
        ...

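Then import it in the package's __init__.py (e.g. providers/asr/__init__.py for an ASR provider):

from .myprovider import MyProviderASRProvider

That's it! Your provider is now available: await ls.speech_to_text(audio, provider="myprovider")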

Publishing to PyPI

For maintainers: How to publish a new release

Update version in pyproject.toml:
```
version = "0.2.0"  # Bump version
```

Commit and tag:

git add .
git commit -m "Release v0.2.0"
git tag v0.2.0
git push origin main --tags

Build and publish:

# Clean old builds
rm -rf dist/ build/ *.egg-info

# Build with UV
uv build

# Test on TestPyPI (optional but recommended)
uv publish --publish-url https://test.pypi.org/legacy/

# Publish to PyPI
uv publish

What gets published:

Wheel file (.whl) - Contains litespeech/ package code
Source distribution (.tar.gz) - Contains code + examples + docs

Note: Examples are included in the source distribution and visible on PyPI, but not installed with pip install. Users can find examples on GitHub or by downloading the source tarball.

License

MIT License

Release History

Version	Release date
0.2.0 (this version)	Jan 26, 2026
0.1.0	Jan 26, 2026

Download Files

Source distribution: litespeech-0.2.0.tar.gz (81.7 kB, tag: Source), uploaded Jan 26, 2026 via uv/0.8.4, not via Trusted Publishing.

Hashes for litespeech-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`591114166a5be221450b468ef398b7d9725fb63370965b2bcc1ec8f140830aaa`
MD5	`e13c8042b11e01845f816fdb3c670f78`
BLAKE2b-256	`e48b691f72dc40270006718105bdbe05e5f317d631b1708114d3e97ff5ca2ca7`

Built distribution: litespeech-0.2.0-py3-none-any.whl (90.3 kB, tag: Python 3), uploaded Jan 26, 2026 via uv/0.8.4, not via Trusted Publishing.

Hashes for litespeech-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5221b57a167e0e783560e90f1893747f81fe5248f06042f8cfeee184bfb102f`
MD5	`bfcc469dd1bf01bc94646b2cae588eaa`
BLAKE2b-256	`5c06dcd8aad524b78014c42f40fc0d07391f78fc8e6df67213a919e832c4470c`
