Pipecat community TTS integration for the XTTSv2-vLLM streaming server
pipecat-xtts-vllm
A Pipecat community TTS integration that streams synthesized speech from an XTTSv2-vLLM streaming server.
XTTSVLLMTTSService is a drop-in Pipecat TTSService that:
- Clones a voice from a reference audio clip.
- Computes XTTSv2 conditioning once (via POST /v1/tts/conditioning) and caches it for the service lifetime, so there is no per-request conditioning overhead.
- Streams raw PCM audio chunks (via POST /v1/audio/speech) directly into the Pipecat pipeline.
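The compute-once behaviour in the list above can be illustrated with a minimal sketch. This is not the package's actual implementation: `ConditioningCache` and the injected `fetch` coroutine are hypothetical names, and `fetch` merely stands in for whatever performs the POST /v1/tts/conditioning round trip.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class ConditioningCache:
    """Hypothetical helper: compute a value once, then reuse it."""

    def __init__(self, fetch: Callable[[], Awaitable[Any]]) -> None:
        # `fetch` stands in for the POST /v1/tts/conditioning request.
        self._fetch = fetch
        self._value: Optional[Any] = None
        self._ready = False
        self._lock = asyncio.Lock()

    async def get(self) -> Any:
        # Fast path: conditioning was already computed.
        if self._ready:
            return self._value
        # The lock keeps concurrent first callers from fetching twice.
        async with self._lock:
            if not self._ready:
                self._value = await self._fetch()
                self._ready = True
        return self._value


async def demo() -> int:
    """Issue three concurrent requests and report how many fetches ran."""
    calls = 0

    async def fake_fetch() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # yield control, as a network call would
        return {"conditioning": "placeholder"}

    cache = ConditioningCache(fake_fetch)
    await asyncio.gather(cache.get(), cache.get(), cache.get())
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The double check around the lock is what guarantees a single fetch even when several synthesis requests arrive before the first conditioning response has come back.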
Installation
pip install pipecat-xtts-vllm
To work on the package from source instead:
pip install -e .
Start the XTTSv2-vLLM server
The client talks to a separate server that runs in Docker and uses a GPU: wuxuedaifu/xttsv2-vllm-streaming-server. Follow its README to pull and run the image, for example:
docker run --gpus all -p 8000:8000 ghcr.io/wuxuedaifu/xttsv2-vllm-streaming-server:latest
Usage with a Pipeline
The snippet below shows the essential setup. See examples/foundational/xtts_vllm_say_one_thing.py for the full, runnable version.
import asyncio
from pathlib import Path

from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import PipelineParams, PipelineWorker
from pipecat.workers.runner import WorkerRunner
from pipecat_xtts_vllm import XTTSVLLMTTSService


async def main():
    reference_audio = Path("reference.wav").read_bytes()

    tts = XTTSVLLMTTSService(
        base_url="http://localhost:8000",
        reference_audio=reference_audio,
        language="en",
        sample_rate=24000,
    )

    # Add tts (and any downstream processors) to a Pipeline.
    pipeline = Pipeline([tts, ...])

    worker = PipelineWorker(
        pipeline,
        params=PipelineParams(audio_out_sample_rate=24000),
        idle_timeout_secs=None,
    )
    runner = WorkerRunner()
    await runner.add_workers(worker)

    async def say():
        await worker.queue_frames([
            TTSSpeakFrame("Hello from the XTTSv2 vLLM streaming server."),
            EndFrame(),
        ])

    await asyncio.gather(runner.run(), say())


asyncio.run(main())
Alternatively, pass a precomputed XTTSVLLMConditioning object instead of reference_audio if
you have cached the conditioning data externally:
from pipecat_xtts_vllm import XTTSVLLMConditioning, XTTSVLLMTTSService

conditioning = XTTSVLLMConditioning(
    gpt_cond_latent_b64="<base64-encoded latent>",
    speaker_embeddings_b64="<base64-encoded embeddings>",
)

tts = XTTSVLLMTTSService(
    base_url="http://localhost:8000",
    conditioning=conditioning,
)
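One way to cache conditioning data externally is to keep the two base64 strings in a JSON file. The sketch below is only an illustration: `save_conditioning` and `load_conditioning` are hypothetical helpers that are not part of the package, and the on-disk layout is an assumption. Only the two field names come from the constructor call shown above.

```python
import json
from pathlib import Path

# Field names mirror the XTTSVLLMConditioning keyword arguments shown above.
FIELDS = ("gpt_cond_latent_b64", "speaker_embeddings_b64")


def save_conditioning(
    path: Path, gpt_cond_latent_b64: str, speaker_embeddings_b64: str
) -> None:
    """Write both base64 strings to a JSON file (hypothetical helper)."""
    payload = {
        "gpt_cond_latent_b64": gpt_cond_latent_b64,
        "speaker_embeddings_b64": speaker_embeddings_b64,
    }
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_conditioning(path: Path) -> dict:
    """Read the JSON file back and check that both fields are present."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    missing = [name for name in FIELDS if name not in payload]
    if missing:
        raise ValueError(f"conditioning file is missing: {', '.join(missing)}")
    # Drop any unrelated keys so the result can be used as keyword arguments.
    return {name: payload[name] for name in FIELDS}
```

Assuming the constructor takes exactly these two keyword arguments, the loaded dictionary could be unpacked directly, as in `XTTSVLLMConditioning(**load_conditioning(Path("voice.json")))`.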
Running the Example
Set the required environment variables, then run the script:
export XTTS_VLLM_BASE_URL=http://localhost:8000
export XTTS_VLLM_REFERENCE_AUDIO=/path/to/reference.wav
python examples/foundational/xtts_vllm_say_one_thing.py
The script synthesizes one sentence and writes the output to output.wav in the current
directory.
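For readers who want to capture streamed audio themselves, the following sketch writes raw PCM chunks to a WAV file with Python's standard `wave` module. It is not taken from the example script: `write_pcm_chunks_to_wav` is a hypothetical helper, and the mono, 16-bit layout is an assumption that should be checked against the server's actual output format.

```python
import wave
from typing import Iterable


def write_pcm_chunks_to_wav(
    path: str,
    chunks: Iterable[bytes],
    sample_rate: int = 24000,
    channels: int = 1,      # assumption: mono audio
    sample_width: int = 2,  # assumption: 16-bit samples
) -> int:
    """Write raw PCM chunks to a WAV file and return the number of whole frames."""
    total_bytes = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Chunks are appended in arrival order; the header is fixed up on close.
            wav.writeframes(chunk)
            total_bytes += len(chunk)
    return total_bytes // (channels * sample_width)
```

At 24000 Hz, the returned frame count divided by 24000 gives the clip duration in seconds.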
Configuration
XTTSVLLMTTSService accepts keyword-only arguments. At least one of reference_audio or
conditioning must be given; if both are provided, conditioning takes precedence.
| Parameter | Default | Description |
|---|---|---|
| base_url | (required) | Base URL of the XTTSv2-vLLM streaming server (e.g. http://localhost:8000). |
| reference_audio | None | Raw bytes of a reference WAV clip (~6 s) used for voice cloning. Required unless conditioning is given. |
| conditioning | None | Optional precomputed XTTSVLLMConditioning (skips the /v1/tts/conditioning call); if set, it takes precedence over reference_audio. |
| language | "en" | Language code passed to the server (see Supported languages). |
| chunk_size | 20 | Number of tokens per streaming chunk. |
| speed | 1.0 | Speech rate multiplier. |
| sample_rate | 24000 | PCM sample rate in Hz (should match the server output). |
| aiohttp_session | None | External aiohttp.ClientSession to reuse. If None, a session is created and closed by the service. |
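The precedence rule between reference_audio and conditioning can be written out as a small decision function. This is a sketch of the documented behaviour only: `choose_voice_source` is a hypothetical name, and the exception type and message are assumptions, not necessarily what the service itself raises.

```python
from typing import Optional


def choose_voice_source(
    reference_audio: Optional[bytes], conditioning: Optional[object]
) -> str:
    """Return which input would be used, following the documented precedence."""
    if conditioning is not None:
        # Precomputed conditioning wins, so no /v1/tts/conditioning call is needed.
        return "conditioning"
    if reference_audio is not None:
        # Conditioning must first be computed from the reference clip.
        return "reference_audio"
    # Neither input was supplied (the error type here is an assumption).
    raise ValueError("pass reference_audio or conditioning")
```

For example, `choose_voice_source(b"RIFF", object())` returns `"conditioning"`, because a supplied conditioning object overrides the reference clip.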
Supported languages
XTTSv2 supports 17 languages. Pass the matching code as the language argument:
| Code | Language | Code | Language |
|---|---|---|---|
| en | English | nl | Dutch |
| es | Spanish | cs | Czech |
| fr | French | ar | Arabic |
| de | German | zh-cn | Chinese (Simplified) |
| it | Italian | hu | Hungarian |
| pt | Portuguese | ko | Korean |
| pl | Polish | ja | Japanese |
| tr | Turkish | hi | Hindi |
| ru | Russian | | |
Pass auto to let the server auto-detect the language.
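A client-side check can catch a mistyped language code before any request is sent. The sketch below is an optional convenience, not part of the package: `normalize_language` is a hypothetical helper, and lowercasing the input is a design choice made here, not a documented server behaviour.

```python
# The 17 codes from the table above.
SUPPORTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
})


def normalize_language(code: str) -> str:
    """Return a cleaned-up language code, or raise ValueError if it is unknown."""
    normalized = code.strip().lower()
    # "auto" asks the server to detect the language itself.
    if normalized == "auto" or normalized in SUPPORTED_LANGUAGES:
        return normalized
    raise ValueError(f"unsupported language code: {code!r}")
```

The returned value could then be passed as the language argument of XTTSVLLMTTSService.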
Compatibility
Supports Python 3.11+; last tested with pipecat-ai v1.4.0 on Python 3.12.
License
Integration code: MIT — see LICENSE.
XTTSv2 model weights: distributed under the Coqui Public Model License (non-commercial use only). See the server repository for details.
Attribution
Developed by wuxuedaifu.