A simple FastAPI server to host XTTSv2

A simple FastAPI Server to run XTTSv2

This project is inspired by silero-api-server and utilizes XTTSv2.

This server was created for SillyTavern, but you can use it for your own needs.

Feel free to make PRs or use the code in your own projects.

There is a Google Colab version that you can use if your computer is not powerful enough.

If you want a high-quality voice clone, I advise you to use the webui for fine-tuning XTTS.

If you are looking for an option for regular XTTS use, look here: https://github.com/daswer123/xtts-webui

Changelog

You can keep track of all changes on the release page

Installation

Simple installation:

pip install xtts-api-server

This will install all the necessary dependencies, including a CPU-only version of PyTorch.

I recommend installing the GPU version to improve processing speed (up to 3 times faster).

Installation into a virtual environment on Windows with GPU support:

python -m venv venv
venv\Scripts\activate
pip install xtts-api-server
pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
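
After installing, it can help to confirm which PyTorch build is actually active before choosing a value for the -d flag. The following is a minimal sketch, not part of xtts-api-server; the function name pick_device is made up for illustration, and it relies only on the standard torch.cuda.is_available() call.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Suggested --device value: {pick_device()}")
```

If this prints cpu after you installed the cu118 wheels, the CPU-only build is probably still the one being imported in the active environment.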

Starting Server

Running python -m xtts_api_server starts the server on the default IP and port (localhost:8020).

usage: xtts_api_server [-h] [-hs HOST] [-p PORT] [-sf SPEAKER_FOLDER] [-o OUTPUT] [-t TUNNEL_URL] [-ms MODEL_SOURCE] [--lowvram] [--deepspeed] [--streaming-mode] [--stream-play-sync]

Run XTTSv2 within a FastAPI application

options:
  -h, --help show this help message and exit
  -hs HOST, --host HOST
  -p PORT, --port PORT
  -d DEVICE, --device DEVICE `cpu` or `cuda`; you can specify which video card to use, for example `cuda:0`
  -sf SPEAKER_FOLDER, --speaker_folder The folder from which the voice samples for TTS are read
  -o OUTPUT, --output Output folder
  -t TUNNEL_URL, --tunnel URL of the tunnel used (e.g. ngrok, localtunnel)
  -ms MODEL_SOURCE, --model-source ["api","apiManual","local"]
  -v MODEL_VERSION, --version You can choose any version of the model; keep in mind that if you choose model-source api, only the latest version will be loaded
  --lowvram Keeps the model in RAM and moves it to VRAM only for processing; the difference in speed is small
  --deepspeed Speeds up processing several times over; automatically downloads the necessary libraries
  --streaming-mode Enables streaming mode; it currently has certain limitations, described below
  --streaming-mode-improve Enables an improved streaming mode that consumes 2 GB more VRAM and uses a better tokenizer and more context
  --stream-play-sync Additional flag for streaming mode that plays all audio one at a time, in order, without interruption

If you want the server to listen on all network interfaces, so that other machines can reach it, use -hs 0.0.0.0

The -t or --tunnel flag is needed so that the speaker list you retrieve with a GET request contains correct links for hearing the previews. More info here
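
If you start the server from another Python program, the flags above can be assembled into an argument list. This is a minimal sketch under the assumption that the package is installed in the current interpreter; build_server_command is a hypothetical helper, and only flags listed in the options above are used.

```python
import sys
from typing import List, Optional


def build_server_command(
    host: str = "localhost",
    port: int = 8020,
    device: Optional[str] = None,
    lowvram: bool = False,
    deepspeed: bool = False,
) -> List[str]:
    """Assemble the argument list for `python -m xtts_api_server`."""
    cmd = [sys.executable, "-m", "xtts_api_server", "-hs", host, "-p", str(port)]
    if device is not None:
        cmd += ["-d", device]
    if lowvram:
        cmd.append("--lowvram")
    if deepspeed:
        cmd.append("--deepspeed")
    return cmd


if __name__ == "__main__":
    # Listen on all interfaces and use the first GPU.
    print(" ".join(build_server_command(host="0.0.0.0", device="cuda:0")))
```

To actually launch the server, the returned list can be passed to subprocess.run(cmd, check=True). Using a list rather than a single string avoids shell quoting problems with folder paths.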

Model-source defines how XTTS is loaded and used:

  1. local - loads version 2.0.2 by default, but you can specify the version via the -v flag. The model is saved to the models folder and is run using XttsConfig and inference.
  2. apiManual - loads version 2.0.2 by default, but you can specify the version via the -v flag. The model is saved to the models folder and is run using the tts_to_file function from the TTS api.
  3. api - loads the latest version of the model. The -v flag has no effect.
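
The interaction between -ms and -v described in the list above can be summarized as a small function. This is only an illustration of the documented rules, not the server's own code; effective_version and the constants are hypothetical names.

```python
from typing import Optional

MODEL_SOURCES = ("api", "apiManual", "local")
DEFAULT_VERSION = "2.0.2"  # default stated for local and apiManual


def effective_version(model_source: str, requested: Optional[str] = None) -> Optional[str]:
    """Return the version that would be loaded, or None when it is "latest" (api)."""
    if model_source not in MODEL_SOURCES:
        raise ValueError(f"model source must be one of {MODEL_SOURCES}, got {model_source!r}")
    if model_source == "api":
        # The -v flag is ignored here; the latest model is loaded.
        return None
    return requested or DEFAULT_VERSION


if __name__ == "__main__":
    for source in MODEL_SOURCES:
        print(source, "->", effective_version(source, "2.0.3") or "latest")
```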

All versions of the XTTSv2 model can be found here in the branches

The first time you run or generate, you may need to confirm that you agree to use XTTS.

About Streaming mode

Streaming mode allows you to get audio and play it back almost immediately. However, it has a number of limitations.

You can see how this mode works here and here

Now, about the limitations:

  1. It can only be used on a local computer.
  2. The audio is played from your PC.
  3. It does not work with the tts_to_file endpoint, only with tts_to_audio, which returns 1 second of silence.

You can specify the version of the XTTS model by using the -v flag.

Improved streaming mode is suitable for complex languages such as Chinese, Japanese and Hindi, or if you want the language engine to take more information into account when processing speech.

The --stream-play-sync flag plays all messages in queue order, which is useful if you use group chats. In SillyTavern you need to turn off streaming for this to work correctly.

API Docs

API Docs can be accessed from http://localhost:8020/docs
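
Besides the interactive page, FastAPI applications normally expose a machine-readable schema at /openapi.json. The sketch below lists the available routes from that schema; it assumes the server keeps FastAPI's default schema path, and list_routes and fetch_schema are illustrative names.

```python
import json
from typing import Dict, List
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "options", "head", "trace"}


def list_routes(schema: Dict) -> List[str]:
    """Return "METHOD /path" strings from an OpenAPI schema dictionary."""
    routes = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items may also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                routes.append(f"{method.upper()} {path}")
    return routes


def fetch_schema(base_url: str = "http://localhost:8020") -> Dict:
    """Download the OpenAPI schema; requires a running server."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the server running, print("\n".join(list_routes(fetch_schema()))) shows every endpoint and its HTTP method, which is a quick way to confirm exact paths before writing a client.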

Voice Samples

You can find the samples in this repository; more details are in the API documentation.

erew123 put together a pack of 40+ voices, which you can download and try out here

Selecting Folder

You can change the folders for speakers and the folder for output via the API.

Get Speakers

Once you have at least one file in your speakers folder, you can get its name via the API; after that, you only need to specify the file name in your requests.
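
A request that refers to a speaker by file name might look like the sketch below. The tts_to_audio endpoint is named in this README, but the exact path with a trailing slash and the JSON field names (text, speaker_wav, language) are assumptions here; check them against http://localhost:8020/docs before relying on them.

```python
import json
from typing import Dict
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8020"  # default address from "Starting Server"


def build_tts_payload(text: str, speaker_file: str, language: str = "en") -> Dict[str, str]:
    """Build a request body. The field names are ASSUMPTIONS; verify them in /docs."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "speaker_wav": speaker_file, "language": language}


def synthesize(payload: Dict[str, str], endpoint: str = "/tts_to_audio/") -> bytes:
    """POST the payload as JSON and return the raw response bytes (needs a running server)."""
    request = Request(
        BASE_URL + endpoint,  # the endpoint path is an assumption as well
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=120) as response:
        return response.read()
```

For example, synthesize(build_tts_payload("Hello there", "example.wav")) would return the response body, where example.wav stands for whatever file name the speaker list reported.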

Note on creating samples for quality voice cloning

The following post is a quote by user Material1276 from Reddit:

Some suggestions on making good samples

Keep them about 7-9 seconds long. Longer isn't necessarily better.

Make sure the audio is downsampled to a mono, 22050 Hz, 16-bit WAV file. Otherwise you will slow down processing by a large percentage, and it seems to cause poor-quality results (based on a few tests). 24000 Hz is the quality it outputs at anyway!

Using the latest version of Audacity, select your clip and use Tracks > Resample to set it to 22050 Hz, then Tracks > Mix > Stereo to Mono, and then File > Export Audio, saving it as a WAV of 22050 Hz.

If you need to do any audio cleaning, do it before you compress it down to the above settings (Mono, 22050Hz, 16 Bit).

Ensure the clip you use doesn't have background noise or music, e.g. lots of movies have quiet music when many of the actors are talking. Bad quality audio will have hiss that needs clearing up. The AI will pick this up, even if we don't, and to some extent use it in the simulated voice, so clean audio is key!

Try to make your clip one of nice flowing speech, like the included example files. No big pauses, gaps or other sounds. Preferably use one in which the person you are trying to copy shows a little vocal range. Example files are in here

Make sure the clip doesn't start or end with breathy sounds (breathing in/out etc).

Using AI-generated audio clips may introduce unwanted sounds, as it's already a copy/simulation of a voice, though this would need testing.
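
The format suggestions above (mono, 22050 Hz, 16 bit, roughly 7-9 seconds) can be checked automatically before a clip goes into the speakers folder. This is a minimal sketch using only Python's standard wave module; check_sample is an illustrative name, and the thresholds simply restate the suggestions from the quote.

```python
import wave
from typing import List


def check_sample(path: str) -> List[str]:
    """Return a list of problems found in a WAV sample; an empty list means it matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16 bit
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16 bit, found {sample_width * 8} bit")
    if rate != 22050:
        problems.append(f"expected 22050 Hz, found {rate} Hz")
    if not 7.0 <= duration <= 9.0:
        problems.append(f"expected 7-9 seconds, found {duration:.1f} seconds")
    return problems
```

Calling check_sample on a clip and printing the returned list gives a quick report. It only inspects the file header, so it cannot detect background noise, music or breathy sounds.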

Use Docker image with Docker Compose

A Dockerfile is provided to build a Docker image, and a docker-compose.yml file is provided to run the server with Docker Compose as a service.

You will need to set up the environment variables by copying the .env.example file to .env and filling in the values. If you want to use your own speakers, you can put them in the example folder before building the image. The example folder will be copied to the container, and the server will use it as the speaker folder.
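
The copy step can be scripted so that an existing .env is never overwritten. This is a minimal sketch; ensure_env_file is a hypothetical helper, and the default folder name docker is an assumption about where .env.example lives, so pass the correct folder if it differs.

```python
import shutil
from pathlib import Path


def ensure_env_file(docker_dir: str = "docker") -> bool:
    """Copy .env.example to .env if .env is missing. Return True when a copy was made."""
    folder = Path(docker_dir)
    example, target = folder / ".env.example", folder / ".env"
    if target.exists():
        # Never overwrite values that were already filled in.
        return False
    if not example.exists():
        raise FileNotFoundError(f"{example} not found")
    shutil.copyfile(example, target)
    return True
```

After the copy, the values in .env still have to be filled in by hand.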

You can build the image with the following command:

cd docker
docker compose build

Then you can run the server with the following command:

docker compose up # or with -d to run in background

Credit

  1. Thanks to the author Kolja Beigel for the repository RealtimeTTS; I took some of its code for my project.
  2. Thanks to erew123 for the note about creating samples and the code to download the models.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xtts_api_server-0.6.8.tar.gz (1.9 MB)

Uploaded Source

Built Distribution

xtts_api_server-0.6.8-py3-none-any.whl (144.5 kB)

Uploaded Python 3

File details

Details for the file xtts_api_server-0.6.8.tar.gz.

File metadata

  • Download URL: xtts_api_server-0.6.8.tar.gz
  • Upload date:
  • Size: 1.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.23.0

File hashes

Hashes for xtts_api_server-0.6.8.tar.gz
Algorithm Hash digest
SHA256 e36132aedc2799ec01d68719632aa10b9a1882c0c2b838ecf9458320957ae089
MD5 ff9bb4e3954e52176e73e111bf7c53ec
BLAKE2b-256 025e8afe465b625528b3f4efb61b2be8bdab8d6a972c62ffa216934513c14adb

See more details on using hashes here.
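
A downloaded file can be compared against the published SHA256 digest with Python's standard hashlib module. This is a minimal sketch; the function names are illustrative, and the file is read in chunks so that large archives do not have to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published value, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value is the SHA256 line listed above for xtts_api_server-0.6.8.tar.gz.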

File details

Details for the file xtts_api_server-0.6.8-py3-none-any.whl.

File hashes

Hashes for xtts_api_server-0.6.8-py3-none-any.whl
Algorithm Hash digest
SHA256 e47a381c86c5d158b7f37ad413272743c77f480c278f36efe24f314345ea9d3b
MD5 6f42fee55c19d537f5079d01ddf74596
BLAKE2b-256 4f49bc109b0884ddf3702a70c463cdf93e1d9531dfe3592af4c3f331fe6851b1

See more details on using hashes here.
