A high-performance inference engine specifically designed for the GPT-SoVITS text-to-speech model

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

GSV-TTS-Lite

About

This project started as a push for maximum inference performance. With the original GPT-SoVITS on an RTX 3050 (Laptop), I found that inference latency was often too high for real-time interaction because of the GPU's limited compute.

To get past these limits, GSV-TTS-Lite was developed as an inference backend based on GPT-SoVITS V2Pro. With extensive optimization, it reaches a first-packet latency of roughly 130 to 150 ms on that GPU while using about 0.8 GB of VRAM (see Performance Comparison).

Beyond performance, GSV-TTS-Lite decouples timbre from style, so the speaker's voice and the emotion can be controlled independently. It also provides subtitle timestamp alignment and voice conversion (timbre transfer).

To make integration easier, the code architecture has been significantly streamlined, and the project is published on PyPI as the `gsv-tts-lite` library, installable with a single pip command.

The currently supported languages are Chinese, Japanese, and English. The available models include v2pro and v2proplus.

Performance Comparison

[!NOTE] Test Environment: NVIDIA GeForce RTX 3050 (Laptop)

Backend	Settings	TTFT (First Packet)	RTF (Real-time Factor)	VRAM	Speedup
Original	`streaming_mode=3`	436 ms	0.381	1.6 GB	-
Lite Version	`Flash_Attn=Off`	150 ms	0.125	0.8 GB	⚡ 2.9x Speed
Lite Version	`Flash_Attn=On`	133 ms	0.108	0.8 GB	🔥 3.3x Speed

As shown, GSV-TTS-Lite cuts first-packet latency by about 2.9x to 3.3x (3.0x to 3.5x measured by RTF) while halving VRAM usage! 🚀
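
The Speedup column is consistent with the ratio of first-packet latencies, and RTF is conventionally synthesis time divided by the duration of the audio produced. A small sketch of that arithmetic using the numbers from the table (the function names are illustrative and not part of the library):

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return rtf * audio_seconds


ORIGINAL_TTFT_MS = 436  # "Original" row of the table above

for label, ttft_ms, rtf in [
    ("Flash_Attn=Off", 150, 0.125),
    ("Flash_Attn=On", 133, 0.108),
]:
    print(
        f"{label}: {speedup(ORIGINAL_TTFT_MS, ttft_ms):.1f}x TTFT speedup, "
        f"{synthesis_seconds(rtf, 10.0):.2f} s to synthesize 10 s of audio"
    )
```

An RTF below 1.0 means audio is generated faster than it plays back, which is what makes streaming playback feasible.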

Deployment (For Developers)

Prerequisites

CUDA Toolkit
Microsoft Visual C++

Installation Steps

1. Environment Configuration

It is recommended to create a virtual environment using Python >=3.10.

# Install PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
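
Before installing the package, it can help to confirm that the interpreter and the PyTorch build match the requirements above. The following is a minimal sketch; `check_environment` is a hypothetical helper name, and it only reports what it finds rather than fixing anything:

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report about the interpreter and the PyTorch install."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install should not crash the check itself.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```

If `cuda_available` is `False` even though a GPU is present, the installed PyTorch wheel is probably a CPU-only build or does not match the local CUDA driver.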

2. Install GSV-TTS-Lite

Once the environment above is ready, install the package with the following command:

pip install gsv-tts-lite==0.2.6 --prefer-binary
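
To confirm which version ended up in the environment, the standard library can query the installed distribution. This is a sketch using `importlib.metadata`; `installed_version` is a hypothetical helper, and the only project-specific value is the distribution name `gsv-tts-lite`:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "gsv-tts-lite"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "gsv-tts-lite is not installed")
```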

Quick Start

[!TIP] The program will automatically download the required pre-trained models upon the first run.

1. Basic Inference

from gsv_tts import TTS

tts = TTS()
# tts = TTS(use_bert=True) # Recommended setting for better Chinese synthesis results.
# tts = TTS(use_flash_attn=True) # Recommended setting if Flash Attention is installed.

# Load GPT model weights from the specified path into memory; loads the default model here.
tts.load_gpt_model()

# Load SoVITS model weights from the specified path into memory; loads the default model here.
tts.load_sovits_model()

# Pre-load and cache resources to significantly reduce latency during the first inference.
# tts.init_language_module("ja")
# tts.cache_spk_audio("examples/laffey.mp3")
# tts.cache_prompt_audio(
#     prompt_audio_paths="examples/AnAn.ogg",
#     prompt_audio_texts="ちが……ちがう。レイア、貴様は間違っている。",
# )

# 'infer' is the simplest and most basic inference method, suitable for short text generation.
audio = tts.infer(
    spk_audio_path="examples/laffey.mp3", # Voice reference audio (Timbre)
    prompt_audio_path="examples/AnAn.ogg", # Style reference audio (Prompt)
    prompt_audio_text="ちが……ちがう。レイア、貴様は間違っている。", # The corresponding text for the style reference audio
    text="へぇー、ここまでしてくれるんですね。", # Target text to be generated
    # gpt_model = None, # Path to the GPT model for inference; defaults to the first loaded GPT model.
    # sovits_model = None, # Path to the SoVITS model for inference; defaults to the first loaded SoVITS model.
)

audio.play()
tts.audio_queue.wait()
# tts.audio_queue.stop() # Stop playback
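
Reference paths are easy to get wrong, especially with backslashes inside ordinary Python strings. The sketch below builds the keyword arguments for `infer` with `pathlib` and fails early if a reference file is missing. `build_infer_kwargs` is a hypothetical helper, not part of the library; only the keyword names come from the example above.

```python
from pathlib import Path


def build_infer_kwargs(spk_audio, prompt_audio, prompt_text, text):
    """Validate reference files and return keyword arguments for `tts.infer`."""
    spk_audio, prompt_audio = Path(spk_audio), Path(prompt_audio)
    for path in (spk_audio, prompt_audio):
        if not path.is_file():
            raise FileNotFoundError(f"reference audio not found: {path}")
    if not text.strip():
        raise ValueError("text to synthesize is empty")
    return {
        "spk_audio_path": str(spk_audio),
        "prompt_audio_path": str(prompt_audio),
        "prompt_audio_text": prompt_text,
        "text": text,
    }
```

With a loaded `tts` object, the call then becomes `audio = tts.infer(**build_infer_kwargs(Path("examples") / "laffey.mp3", Path("examples") / "AnAn.ogg", prompt_text, text))`.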

2. Stream Inference / Subtitle Synchronization

import time
import queue
import threading
from gsv_tts import TTS

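class SubtitlesQueue:
    """Print subtitle text in step with audio playback on a background thread."""

    def __init__(self):
        self.q = queue.Queue()
        self.t = None  # Worker thread, started lazily by add().

    def process(self):
        last_i = 0            # End index of the text printed so far.
        last_t = time.time()  # Reference time; subtitle timestamps are offsets from it.

        while True:
            subtitles, text = self.q.get()

            # (None, None) is the sentinel that stops the worker.
            if subtitles is None:
                break

            for subtitle in subtitles:
                # Wait until this subtitle's start time has been reached.
                if subtitle["start_s"] > time.time() - last_t:
                    while time.time() - last_t <= subtitle["start_s"]:
                        time.sleep(0.01)

                # Print any new text, then hold until the subtitle's end time.
                if subtitle["end_s"] and subtitle["end_s"] > time.time() - last_t:
                    if subtitle["orig_idx_end"] > last_i:
                        print(text[last_i:subtitle["orig_idx_end"]], end="", flush=True)
                        last_i = subtitle["orig_idx_end"]
                        while time.time() - last_t <= subtitle["end_s"]:
                            time.sleep(0.01)

        self.t = None

    def add(self, subtitles, text):
        self.q.put((subtitles, text))
        # Start the worker on first use.
        if self.t is None:
            self.t = threading.Thread(target=self.process, daemon=True)
            self.t.start()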

tts = TTS()

# infer, infer_stream, and infer_batched all support returning subtitle timestamps; infer_stream is used here just as an example.
subtitlesqueue = SubtitlesQueue()

# infer_stream implements token-level streaming output, which significantly reduces first-token latency and enables ultra-low-latency real-time feedback.
generator = tts.infer_stream(
    spk_audio_path="examples/laffey.mp3",
    prompt_audio_path="examples/AnAn.ogg",
    prompt_audio_text="ちが……ちがう。レイア、貴様は間違っている。",
    text="へぇー、ここまでしてくれるんですね。",
    debug=False,
)

for audio in generator:
    audio.play()
    subtitlesqueue.add(audio.subtitles, audio.orig_text)

tts.audio_queue.wait()
subtitlesqueue.add(None, None)
print()
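
The mapping from subtitle entries to text can be looked at in isolation, without audio or threads. The sketch below uses the same keys as the example above (`start_s`, `end_s`, `orig_idx_end`) and assumes, as that example does, that `orig_idx_end` is an exclusive end index into the original text. The timestamps are made up for illustration; real values come from `audio.subtitles`.

```python
def subtitle_slices(subtitles, text):
    """Yield (start_s, end_s, text_slice) for each subtitle that adds new text."""
    last_i = 0
    for sub in subtitles:
        end_i = sub["orig_idx_end"]
        if end_i > last_i:
            yield sub["start_s"], sub["end_s"], text[last_i:end_i]
            last_i = end_i


# Made-up data for illustration only.
demo_text = "Hello there, world!"
demo_subtitles = [
    {"start_s": 0.00, "end_s": 0.40, "orig_idx_end": 5},
    {"start_s": 0.40, "end_s": 0.90, "orig_idx_end": 12},
    {"start_s": 0.90, "end_s": 1.50, "orig_idx_end": 19},
]

for start_s, end_s, piece in subtitle_slices(demo_subtitles, demo_text):
    print(f"{start_s:4.2f}-{end_s:4.2f}  {piece!r}")
```

Separating the slicing from the timing loop makes it easier to render subtitles somewhere other than the console, for example in a GUI label.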

3. Batched Inference

from gsv_tts import TTS

tts = TTS()

# infer_batched is optimized for long-form text and multi-sentence synthesis.
# Besides being more efficient, it supports assigning different reference audios
# to different sentences within the same batch.
audios = tts.infer_batched(
    spk_audio_paths="examples/laffey.mp3",
    prompt_audio_paths="examples/AnAn.ogg",
    prompt_audio_texts="ちが……ちがう。レイア、貴様は間違っている。",
    texts=["へぇー、ここまでしてくれるんですね。", "The old map crinkled in Leo’s trembling hands."],
)

for i, audio in enumerate(audios):
    audio.save(f"audio{i}.wav")
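
For long-form input, the `texts` list has to come from somewhere. The following is a minimal sketch of a sentence splitter for mixed Chinese, Japanese, and English text. `split_sentences` is a hypothetical pre-processing helper, not something the library is known to provide, and it is deliberately naive: it also splits on periods inside abbreviations and decimals.

```python
import re

# One or more non-terminator characters followed by any run of terminators.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the trailing punctuation."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("こんにちは。元気ですか？Fine. Thanks!"):
        print(sentence)
```

The resulting list can be passed as `texts=split_sentences(long_text)`.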

4. Voice Conversion

from gsv_tts import TTS

tts = TTS()

# Although infer_vc supports few-shot voice conversion and offers convenience, its conversion quality still has room for improvement compared to specialized voice conversion models like RVC or SVC.
audio = tts.infer_vc(
    spk_audio_path="examples/laffey.mp3",
    prompt_audio_path="examples/AnAn.ogg",
    prompt_audio_text="ちが……ちがう。レイア、貴様は間違っている。",
)

audio.play()
tts.audio_queue.wait()

5. Speaker Verification

from gsv_tts import TTS

tts = TTS(always_load_sv=True)

# verify_speaker is used to compare the speaker characteristics of two audio clips to determine if they are the same person.
similarity = tts.verify_speaker("examples/laffey.mp3", "examples/AnAn.ogg")
print("Speaker Similarity:", similarity)
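
How the score is computed is not documented here. Speaker verification systems commonly compare fixed-size speaker embeddings with cosine similarity and then apply a threshold, so the sketch below shows that general technique. It is an assumption that `verify_speaker` behaves this way, and the threshold of 0.6 is a placeholder to be tuned on your own recordings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return dot / norm


def is_same_speaker(similarity, threshold=0.6):
    """Turn a similarity score into a yes/no decision (threshold is hypothetical)."""
    return similarity >= threshold


if __name__ == "__main__":
    score = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05])
    print(round(score, 3), is_same_speaker(score))
```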

6. Other Function Interfaces

1. Model Management

`init_language_module(languages)`

Preload necessary language processing modules.

`load_gpt_model(model_paths)`

Load GPT model weights from specified paths into memory.

`load_sovits_model(model_paths)`

Load SoVITS model weights from specified paths into memory.

`unload_gpt_model(model_paths)` / `unload_sovits_model(model_paths)`

Unload models from memory to free up resources.

`get_gpt_list()` / `get_sovits_list()`

Get the list of currently loaded models.

`to_safetensors(checkpoint_path)`

Convert PyTorch checkpoint files (.pth or .ckpt) into the safetensors format.
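
When VRAM is tight, it can be convenient to load a model pair only for as long as it is needed. The sketch below wraps the load and unload calls listed above in a context manager. `loaded_models` and the model paths are hypothetical, and a `unittest.mock.Mock` stands in for a real `TTS()` instance so the snippet runs without the library.

```python
from contextlib import contextmanager
from unittest.mock import Mock, call


@contextmanager
def loaded_models(tts, gpt_path, sovits_path):
    """Load a GPT/SoVITS pair for the duration of a `with` block."""
    tts.load_gpt_model(gpt_path)
    try:
        tts.load_sovits_model(sovits_path)
        try:
            yield tts
        finally:
            tts.unload_sovits_model(sovits_path)
    finally:
        # Runs even if loading the SoVITS model failed.
        tts.unload_gpt_model(gpt_path)


tts = Mock()  # stand-in; replace with a real TTS() instance
with loaded_models(tts, "models/my_gpt.ckpt", "models/my_sovits.pth"):
    pass  # call tts.infer(...) here

print(tts.method_calls)
```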

2. Audio Cache Management

`cache_spk_audio(spk_audio_paths)`

Preprocess and cache speaker reference audio data.

`cache_prompt_audio(prompt_audio_paths, prompt_audio_texts, prompt_audio_languages)`

Preprocess and cache prompt reference audio data.

`del_spk_audio(spk_audio_list)` / `del_prompt_audio(prompt_audio_paths)`

Remove audio data from the cache.

`get_spk_audio_list()` / `get_prompt_audio_list()`

Get the list of audio data in the cache.
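
The caching calls are typically grouped into a single warm-up step at application start. The sketch below does that with the argument forms used in the Basic Inference example (a single language code, a single path, and a single prompt text). `warm_up` is a hypothetical helper, and a `unittest.mock.Mock` stands in for a real `TTS()` instance so the snippet runs without the library.

```python
from unittest.mock import Mock, call


def warm_up(tts, language, spk_audio, prompt_audio, prompt_text):
    """Preload the language module and cache both reference audios."""
    tts.init_language_module(language)
    tts.cache_spk_audio(spk_audio)
    tts.cache_prompt_audio(
        prompt_audio_paths=prompt_audio,
        prompt_audio_texts=prompt_text,
    )


tts = Mock()  # stand-in; replace with a real TTS() instance
warm_up(
    tts,
    "ja",
    "examples/laffey.mp3",
    "examples/AnAn.ogg",
    "ちが……ちがう。レイア、貴様は間違っている。",
)
print(len(tts.method_calls), "warm-up calls issued")
```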

Flash Attn

For lower latency and higher throughput, enabling Flash Attention is highly recommended. Because Flash Attention has specific compilation requirements, it is not installed automatically; install it manually for your system:

🐧 Linux / Build from Source
- Official Repo: Dao-AILab/flash-attention
🪟 Windows Users
- Pre-compiled Wheels: lldacing/flash-attention-windows-wheel

[!TIP] After installation, set use_flash_attn=True in your TTS configuration to enjoy the acceleration! 🚀
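
To keep one code path for machines with and without Flash Attention, the flag can be derived from whether the `flash_attn` package is importable. A minimal sketch, where `tts_kwargs` is a hypothetical helper and `use_flash_attn` is the constructor option shown earlier:

```python
import importlib.util


def tts_kwargs():
    """Enable Flash Attention only when the `flash_attn` package is present."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return {"use_flash_attn": has_flash_attn}


if __name__ == "__main__":
    print(tts_kwargs())
```

Usage would then be `tts = TTS(**tts_kwargs())`. Note that `find_spec` only checks that the package is installed; it does not prove the build matches the local PyTorch and CUDA versions.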

Credits

Special thanks to the following projects:

RVC-Boss/GPT-SoVITS


Release history

Version	Release date
0.4.0	May 2, 2026
0.3.18	May 2, 2026
0.3.17	May 1, 2026
0.3.14	Apr 25, 2026
0.3.13	Apr 18, 2026
0.3.12	Apr 17, 2026
0.3.11	Apr 11, 2026
0.3.10	Apr 5, 2026
0.3.9	Apr 4, 2026
0.3.8	Mar 28, 2026
0.3.7	Mar 21, 2026
0.3.6	Mar 21, 2026
0.3.5	Mar 20, 2026
0.3.4	Mar 20, 2026
0.3.3	Mar 14, 2026
0.3.2	Mar 14, 2026
0.3.1	Mar 14, 2026
0.3.0 (this version)	Mar 14, 2026
0.2.7	Mar 8, 2026
0.2.6	Mar 3, 2026
0.2.5	Mar 3, 2026
0.2.4	Mar 2, 2026
0.2.3	Feb 24, 2026
0.2.2	Feb 21, 2026
0.2.1	Feb 20, 2026
0.2.0	Feb 18, 2026

Download files

Source Distribution

gsv_tts_lite-0.3.0.tar.gz (85.3 kB)

Uploaded Mar 14, 2026 Source

Built Distribution

gsv_tts_lite-0.3.0-py3-none-any.whl (98.8 kB)

Uploaded Mar 14, 2026 Python 3

File details

Details for the file gsv_tts_lite-0.3.0.tar.gz.

File metadata

Download URL: gsv_tts_lite-0.3.0.tar.gz
Upload date: Mar 14, 2026
Size: 85.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for gsv_tts_lite-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`27bce27ff2184a12179e146467cbf8510335c42cfe5a86df1a29e9f04819354c`
MD5	`5a2ce1b90f021e1f2e6a6847dfe705a7`
BLAKE2b-256	`25c0ecb144d6b7176dbb641d80dbab0e12232ad9d1d176c9544bf66d6a6d2d8a`
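
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names are illustrative, and the expected digest is the one published for `gsv_tts_lite-0.3.0.tar.gz`.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for gsv_tts_lite-0.3.0.tar.gz
EXPECTED_SHA256 = "27bce27ff2184a12179e146467cbf8510335c42cfe5a86df1a29e9f04819354c"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("gsv_tts_lite-0.3.0.tar.gz")` should return `True` for an intact download of the source distribution.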

File details

Details for the file gsv_tts_lite-0.3.0-py3-none-any.whl.

File metadata

Download URL: gsv_tts_lite-0.3.0-py3-none-any.whl
Upload date: Mar 14, 2026
Size: 98.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for gsv_tts_lite-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8f00d9ae65ff5c551c06ae7d273ff3d823558005b52700fcf703f64d9b4fa446`
MD5	`f10365d49231682e0a87c25fab692dd6`
BLAKE2b-256	`12cc82405ecb0679fcc91743e39cf75b79f9a24e52dd21afaaa2462d9eb4cbf9`
