Low-latency speech-to-speech pipeline
Speech To Speech: Build local voice agents with open-source models
📖 Quick Index
- Approach: Structure, Modularity
- Setup
- Usage
- Command-line Usage
- Citations
This repository implements a cascaded speech-to-speech pipeline consisting of the following parts:
- Voice Activity Detection (VAD)
- Speech to Text (STT)
- Language Model (LM)
- Text to Speech (TTS)
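Each part consumes the previous part's output and streams its own output downstream as soon as it is ready, which keeps latency low. The sketch below illustrates that queue-based cascade pattern in miniature; the stage functions are stand-ins for illustration, not the package's actual handler classes:

```python
# Illustrative queue-based cascade: VAD -> STT -> LM -> TTS.
from queue import Queue
from threading import Thread

# Stand-in stage functions (the real pipeline wraps model handlers here).
def detect_voice_activity(chunk): return chunk           # VAD
def transcribe(speech): return "hello"                   # STT
def generate_response(text): return f"you said {text}"   # LM
def synthesize(reply): return reply.encode()             # TTS

def run_stage(fn, inq, outq):
    """Pump items from inq through fn into outq until a None sentinel."""
    while (item := inq.get()) is not None:
        outq.put(fn(item))
    outq.put(None)  # propagate shutdown downstream

queues = [Queue() for _ in range(5)]  # audio_in ... audio_out
stages = [detect_voice_activity, transcribe, generate_response, synthesize]
for fn, inq, outq in zip(stages, queues, queues[1:]):
    Thread(target=run_stage, args=(fn, inq, outq), daemon=True).start()

queues[0].put(b"\x00\x01")  # feed one dummy audio chunk
queues[0].put(None)
while (out := queues[-1].get()) is not None:
    print(out)  # b'you said hello'
```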
Modularity
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face Hub. The code is designed for easy modification, and we already support device-specific and external library implementations:
VAD
- Silero VAD
STT
- Any Whisper model checkpoint on the Hugging Face Hub through Transformers 🤗, including whisper-large-v3 and distil-large-v3
- Lightning Whisper MLX
- MLX Audio Whisper - Fast Whisper inference on Apple Silicon
- Parakeet TDT - Real-time streaming STT with sub-100ms latency on Apple Silicon (CUDA/CPU via nano-parakeet, no NeMo)
- Paraformer - FunASR
LLM
- Any instruction-following model on the Hugging Face Hub via Transformers 🤗
- mlx-lm
- OpenAI API
TTS
- ChatTTS
- Pocket TTS - Streaming TTS with voice cloning from Kyutai Labs
- Kokoro-82M - Fast and high-quality TTS optimized for Apple Silicon
- Qwen3-TTS
Setup
Install the default package from PyPI:
pip install speech-to-speech
The default install is scoped to the standard realtime voice-agent path:
- Parakeet TDT for STT
- OpenAI-compatible API for the language model
- Qwen3-TTS for speech output
- Local audio and realtime server modes
Optional backends are installed with extras:
pip install "speech-to-speech[kokoro]"
pip install "speech-to-speech[pocket]"
pip install "speech-to-speech[faster-whisper]"
pip install "speech-to-speech[paraformer]"
pip install "speech-to-speech[mlx-lm]"
pip install "speech-to-speech[websocket]"
Deprecated model implementations, including MeloTTS, live in archive/ and are no longer wired into the CLI.
For development from source:
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
uv sync
This installs the speech_to_speech package in editable mode and makes the speech-to-speech CLI command available. The project uses a single pyproject.toml with platform markers, so macOS and non-macOS dependencies are resolved automatically from one file.
Note on DeepFilterNet: DeepFilterNet (used for optional audio enhancement in VAD) requires numpy<2 and conflicts with Pocket TTS, which requires numpy>=2. Install DeepFilterNet manually only in environments where you are not using Pocket TTS.
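A hedged example of such a manual install, assuming the PyPI package name deepfilternet (verify before installing) and a pinned NumPy:

pip install deepfilternet "numpy<2"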
Usage
The default CLI is equivalent to a realtime Parakeet + OpenAI-compatible LLM + Qwen3-TTS setup. It uses OPENAI_API_KEY from the environment unless --open_api_api_key is provided:
speech-to-speech
The pipeline can be run in four ways:
- Realtime approach: Models run locally or on a server, and an OpenAI Realtime-compatible WebSocket API is exposed for another app.
- Server/Client approach: Models run on a server, and audio input/output are streamed from a client using TCP sockets.
- WebSocket approach: Models run on a server, and audio input/output are streamed from a client using WebSockets.
- Local approach: Everything runs on a single machine, with microphone input and speaker output handled locally.
Recommended setup
Realtime Approach
The default realtime setup uses --llm open_api, so it needs an OpenAI API key. Export OPENAI_API_KEY before launching, or pass --open_api_api_key explicitly. For a deployed OpenAI-compatible LLM, also set --open_api_base_url.
export OPENAI_API_KEY=...
The default mode starts the OpenAI Realtime-compatible server:
speech-to-speech
This is equivalent to:
speech-to-speech \
--thresh 0.6 \
--stt parakeet-tdt \
--llm open_api \
--tts qwen3 \
--qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--qwen3_tts_speaker Aiden \
--qwen3_tts_language auto \
--qwen3_tts_non_streaming_mode True \
--qwen3_tts_mlx_quantization 6bit \
--open_api_model_name gpt-5.4-mini \
--open_api_chat_size 30 \
--open_api_stream \
--enable_live_transcription \
--mode realtime
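If you are building your own client, the sketch below shows one way to talk to the realtime server. It assumes the server accepts standard OpenAI Realtime WebSocket events and listens at ws://localhost:8765; both the URL and port here are assumptions, so check speech-to-speech -h for the actual host/port flags. It also requires the third-party websockets package:

```python
# Hypothetical realtime-mode client sketch. Assumptions: the server
# listens at ws://localhost:8765 (verify via `speech-to-speech -h`)
# and speaks the standard OpenAI Realtime event format.
import asyncio
import base64
import json

import websockets  # pip install websockets

async def main():
    async with websockets.connect("ws://localhost:8765") as ws:
        # Append one second of 16 kHz int16 mono silence to the input buffer.
        pcm = b"\x00\x00" * 16000
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        }))
        # Print the size of each audio delta the server streams back.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.audio.delta":
                print(len(base64.b64decode(event["delta"])), "bytes of audio")

asyncio.run(main())
```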
Server/Client Approach
- Run the pipeline on the server:

  speech-to-speech --recv_host 0.0.0.0 --send_host 0.0.0.0

- Run the client locally to handle microphone input and receive generated audio:

  python scripts/listen_and_play.py --host <IP address of your server>
WebSocket Approach
- Run the pipeline with WebSocket mode:

  speech-to-speech --mode websocket --ws_host 0.0.0.0 --ws_port 8765

- Connect to the WebSocket server from your client application at ws://<server-ip>:8765. The server handles bidirectional audio streaming:
  - Send raw audio bytes to the server (16kHz, int16, mono)
  - Receive generated audio bytes from the server
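As a sketch, a minimal Python client for this mode could look like the following, using the third-party websockets package; the one-second buffer of silence stands in for real microphone audio:

```python
# Minimal WebSocket-mode client sketch: send raw PCM, print what comes back.
import asyncio

import websockets  # pip install websockets

async def main():
    async with websockets.connect("ws://<server-ip>:8765") as ws:
        # One second of silence at 16 kHz, int16, mono = 32000 bytes.
        await ws.send(b"\x00\x00" * 16000)
        # The server streams generated audio back as raw bytes.
        async for chunk in ws:
            print(f"received {len(chunk)} bytes of audio")

asyncio.run(main())
```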
Local Approach (Mac)
- For optimal settings on Mac:

  speech-to-speech --local_mac_optimal_settings

  You can also specify a particular LLM model:

  speech-to-speech \
    --local_mac_optimal_settings \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16

This setting:
- Adds --device mps to use MPS for all models.
- Sets Parakeet TDT for STT (fast streaming ASR on Apple Silicon).
- Sets MLX LM for the language model (uses --lm_model_name to specify the model).
- Sets Qwen3-TTS for TTS.

--tts pocket and --tts kokoro are also valid TTS options on macOS. Qwen3 on Apple Silicon uses mlx-audio and defaults to the 6bit MLX variant unless you explicitly select a different quantization or model suffix. To compare the MLX variants locally, run:

python scripts/benchmark_tts.py \
  --handlers qwen3 \
  --iterations 3 \
  --qwen3_mlx_quantizations bf16 4bit 6bit 8bit
Docker Server
Install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Start the Docker container:

docker compose up
Recommended usage with CUDA
Leverage torch.compile for Whisper together with Pocket TTS for a simple low-latency setup:
speech-to-speech \
--lm_model_name microsoft/Phi-3-mini-4k-instruct \
--stt_compile_mode reduce-overhead \
--tts pocket \
--recv_host 0.0.0.0 \
--send_host 0.0.0.0
Multi-language Support
The pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.
Two use cases are considered:
- Single-language conversation: Enforce the language setting using the --language flag, specifying the target language code (default is 'en').
- Language switching: Set --language to 'auto'. The STT detects the language of each spoken prompt and forwards it to the LLM. Optionally, opt in with --lm_enable_lang_prompt (or --open_api_enable_lang_prompt for the OpenAI-compatible backend) to also append a "Please reply to my message in ..." instruction so the LLM replies in the detected language. Both flags default to False; large LLMs usually pick up the language from context on their own, but the explicit instruction can help smaller models stay in the right language.
Please note that you must use STT and LLM checkpoints compatible with the target language(s). For multilingual TTS, use ChatTTS or another backend that supports the target language.
With the server version:
For automatic language detection:
speech-to-speech \
--stt whisper-mlx \
--stt_model_name large-v3 \
--language auto \
--llm mlx-lm \
--lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
Or for one language in particular, Chinese in this example:
speech-to-speech \
--stt whisper-mlx \
--stt_model_name large-v3 \
--language zh \
--llm mlx-lm \
--lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
Local Mac Setup
For automatic language detection (note: --stt whisper-mlx overrides the default parakeet-tdt from optimal settings, since Whisper large-v3 has broader language coverage):
speech-to-speech \
--local_mac_optimal_settings \
--stt whisper-mlx \
--stt_model_name large-v3 \
--language auto \
--lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
Or for one language in particular, Chinese in this example:
speech-to-speech \
--local_mac_optimal_settings \
--stt whisper-mlx \
--stt_model_name large-v3 \
--language zh \
--lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
Using Pocket TTS
Pocket TTS from Kyutai Labs provides streaming TTS with voice cloning capabilities. To use it:
speech-to-speech \
--tts pocket \
--pocket_tts_voice jean \
--pocket_tts_device cpu
Available voice presets: alba, marius, javert, jean, fantine, cosette, eponine, azelma. You can also use custom voice files or HuggingFace paths.
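For example, with a custom voice file (the path below is purely illustrative):

speech-to-speech \
  --tts pocket \
  --pocket_tts_voice ./voices/my_voice.wav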
Command-line Usage
NOTE: References for all the CLI arguments can be found directly in the arguments classes or by running speech-to-speech -h.
Module level Parameters
See the ModuleArguments class. It allows you to set:
- a common --device (if you want every part to run on the same device)
- --mode: local or server
- the chosen STT implementation
- the chosen LM implementation
- the chosen TTS implementation
- the logging level
VAD parameters
See the VADHandlerArguments class. Notably:
- --thresh: Threshold value to trigger voice activity detection.
- --min_speech_ms: Minimum duration of detected voice activity to be considered speech.
- --min_silence_ms: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.
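For example, a configuration that only accepts clips with at least half a second of speech and ends a turn after 300 ms of silence might look like this (the values are illustrative starting points, not tuned defaults):

speech-to-speech \
  --thresh 0.6 \
  --min_speech_ms 500 \
  --min_silence_ms 300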
STT, LM and TTS parameters
model_name, torch_dtype, and device are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech parts. Specify the targeted pipeline part with the corresponding prefix (e.g. stt, lm, or tts; check the implementations' arguments classes for more details).
For example:
--lm_model_name google/gemma-2b-it
Generation parameters
Other generation parameters of the model's generate method can be set using the part's prefix + _gen_, e.g., --stt_gen_max_new_tokens 128. These parameters can be added to the pipeline part's arguments class if not already exposed.
Citations
Silero VAD
@misc{Silero VAD,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
commit = {insert_some_commit_here},
email = {hello@silero.ai}
}
Distil-Whisper
@misc{gandhi2023distilwhisper,
title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
year={2023},
eprint={2311.00430},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Parler-TTS
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
Project details
File details
Details for the file speech_to_speech-0.2.2.tar.gz.
File metadata
- Download URL: speech_to_speech-0.2.2.tar.gz
- Size: 264.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9f00325f8f62dab42b6fa048f9541de48c0d8a40ec227072603fdb90ba58cce |
| MD5 | e4826da2b38a608e1bb50d14f76135ad |
| BLAKE2b-256 | 3a10b7da68bd0bc5501486feb77bc674dc1ccb368f004e4e3a94b50be0e49f4e |
Provenance
The following attestation bundles were made for speech_to_speech-0.2.2.tar.gz:
Publisher: publish.yml on huggingface/speech-to-speech
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: speech_to_speech-0.2.2.tar.gz
- Subject digest: b9f00325f8f62dab42b6fa048f9541de48c0d8a40ec227072603fdb90ba58cce
- Sigstore transparency entry: 1437313885
- Permalink: huggingface/speech-to-speech@24a38ad2244d28419fa5141fdfba786f161eca87
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/huggingface
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@24a38ad2244d28419fa5141fdfba786f161eca87
- Trigger Event: push
File details
Details for the file speech_to_speech-0.2.2-py3-none-any.whl.
File metadata
- Download URL: speech_to_speech-0.2.2-py3-none-any.whl
- Size: 291.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5e91c023e7aef897d84920c0836d42d35ad13cd3a617484e84def1685697f59d |
| MD5 | 47454ae09b75d745a6f29cd16de7666e |
| BLAKE2b-256 | 1035147aff201cb4dc936f4a41925c3305c57d0d804e09560c06ef789e3a00c7 |
Provenance
The following attestation bundles were made for speech_to_speech-0.2.2-py3-none-any.whl:
Publisher: publish.yml on huggingface/speech-to-speech
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: speech_to_speech-0.2.2-py3-none-any.whl
- Subject digest: 5e91c023e7aef897d84920c0836d42d35ad13cd3a617484e84def1685697f59d
- Sigstore transparency entry: 1437313896
- Permalink: huggingface/speech-to-speech@24a38ad2244d28419fa5141fdfba786f161eca87
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/huggingface
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@24a38ad2244d28419fa5141fdfba786f161eca87
- Trigger Event: push