Facade for voice cloning and speech synthesis

voxy

To install: pip install voxy

Voxy is a flexible Python module for speech synthesis and voice cloning, with initial support for the Sesame CSM-1B model. It provides a plugin architecture that can be extended to support other models in the future.

Features

Voice cloning from audio samples
High-quality speech synthesis
Flexible input formats (file paths, bytes, streams, tensors)
Audio cleanup utilities
Automatic audio transcription (using Whisper)
Plugin architecture for different speech models
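
To make the plugin idea concrete, here is a minimal sketch of how a registry keyed by a model type string could work. It is not voxy's actual implementation: `register_engine`, `create_engine`, and `EchoEngine` are hypothetical names invented for this illustration.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a model_type string to an engine factory.
_ENGINES: Dict[str, Callable[..., object]] = {}


def register_engine(model_type: str):
    """Decorator that registers an engine class or factory under model_type."""
    def decorator(factory):
        _ENGINES[model_type] = factory
        return factory
    return decorator


def create_engine(model_type: str, **kwargs):
    """Look up the factory registered for model_type and call it."""
    try:
        factory = _ENGINES[model_type]
    except KeyError:
        known = ", ".join(sorted(_ENGINES)) or "none"
        raise ValueError(
            f"Unknown model_type {model_type!r}; registered: {known}"
        ) from None
    return factory(**kwargs)


@register_engine("echo")
class EchoEngine:
    """Toy engine used only to demonstrate the registry; it synthesizes nothing."""

    def __init__(self, device: str = "cpu"):
        self.device = device

    def generate_speech(self, text: str) -> str:
        return f"[{self.device}] {text}"


engine = create_engine("echo", device="cpu")
print(engine.generate_speech("hello"))  # prints "[cpu] hello"
```

With this pattern the caller only ever names a model type, so a new engine can be added by registering one more factory without touching calling code.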

Installation

Prerequisites

Python 3.10+
PyTorch and TorchAudio
CUDA-compatible GPU (recommended)
FFmpeg for audio processing

Install the CSM Model

The goal is to make voxy a plugin-enabled facade where you can choose your own engine (for voice cloning, voice synthesis, etc.). For now, it supports only what seems to be the best open-source model available at the time of writing: Sesame AI Lab's CSM model. The model itself is excellent, but its Python interface is (so far) poor, which is what inspired me to develop voxy in the first place.

Follow the instructions in the CSM repository to install the CSM model and its dependencies.

Quick Start

Basic Usage

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Generate speech with default voice
audio = model.generate_speech(
    text="Hello, this is a test of the CSM speech model.", 
    output_path="output.wav"
)

Voice Cloning

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Clone a voice from an audio file
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice for cloning purposes."
)

# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This is my cloned voice speaking. Isn't it amazing?",
    voice_profile=voice_profile,
    output_path="cloned_voice.wav"
)

Automatic Transcription

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Clone a voice with automatic transcription
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    # No transcript provided, will use automatic transcription
)

# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This voice was cloned using automatic transcription.",
    voice_profile=voice_profile,
    output_path="auto_transcribed_voice.wav"
)

Flexible Input Formats

The module supports various input formats:

# From file path
voice_profile1 = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="Text transcript."
)

# From bytes
with open("sample_voice.wav", "rb") as f:
    audio_bytes = f.read()
voice_profile2 = model.clone_voice(
    audio_input=audio_bytes,
    transcript="Text transcript."
)

# From file object
with open("sample_voice.wav", "rb") as f:
    voice_profile3 = model.clone_voice(
        audio_input=f,
        transcript="Text transcript."
    )

# From tensor
import torch
import torchaudio
audio_tensor, sample_rate = torchaudio.load("sample_voice.wav")
voice_profile4 = model.clone_voice(
    audio_input=audio_tensor,
    transcript="Text transcript."
)
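
Accepting several input types usually comes down to normalizing them to one representation first. The sketch below shows one way this could be done for paths, bytes, and binary streams using only the standard library. `to_audio_bytes` is a hypothetical helper, not part of voxy, and tensors are left out to keep the example free of dependencies.

```python
import io
import os
from pathlib import Path


def to_audio_bytes(audio_input):
    """Return raw file bytes for a path, a bytes-like object, or a binary stream."""
    if isinstance(audio_input, (bytes, bytearray, memoryview)):
        return bytes(audio_input)
    if isinstance(audio_input, (str, os.PathLike)):
        return Path(audio_input).read_bytes()
    if hasattr(audio_input, "read"):
        data = audio_input.read()
        if not isinstance(data, bytes):
            raise TypeError("stream must be opened in binary mode")
        return data
    raise TypeError(f"Unsupported audio input type: {type(audio_input).__name__}")


payload = b"RIFF....WAVE"  # placeholder bytes, not a valid WAV file
same = to_audio_bytes(payload) == to_audio_bytes(io.BytesIO(payload))
print(same)  # prints "True"
```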

Configuration

You can configure the default device by setting the DFLT_VOXY_DEVICE environment variable:

# Use CUDA
export DFLT_VOXY_DEVICE=cuda

# Use CPU
export DFLT_VOXY_DEVICE=cpu

# Use MPS (Apple Silicon)
export DFLT_VOXY_DEVICE=mps
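
As an illustration of how such a setting might be read, the following sketch resolves a device name from the environment. Only the variable name `DFLT_VOXY_DEVICE` and the three device values come from this README; the `resolve_device` helper, the fallback to `cpu`, and the validation step are assumptions and may not match what voxy does internally.

```python
import os

# Device names shown in the configuration examples.
ALLOWED_DEVICES = ("cuda", "cpu", "mps")


def resolve_device(environ=None, fallback="cpu"):
    """Return the device named by DFLT_VOXY_DEVICE, or fallback if unset or blank."""
    environ = os.environ if environ is None else environ
    value = environ.get("DFLT_VOXY_DEVICE", "").strip().lower()
    if not value:
        return fallback
    if value not in ALLOWED_DEVICES:
        raise ValueError(f"Unsupported DFLT_VOXY_DEVICE value: {value!r}")
    return value


print(resolve_device({"DFLT_VOXY_DEVICE": "mps"}))  # prints "mps"
print(resolve_device({}))  # prints "cpu"
```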

Advanced Usage

Audio Cleanup

The module includes an audio cleanup function that normalizes volume and removes silence:

from voxy import cleanup_audio
import torchaudio

# Load audio
audio, sample_rate = torchaudio.load("noisy_audio.wav")

# Clean up audio
cleaned_audio = cleanup_audio(
    audio=audio,
    sample_rate=sample_rate,
    normalize=True,
    remove_silence=True,
    silence_threshold=0.02,
    min_silence_duration=0.2
)

# Save cleaned audio
torchaudio.save("cleaned_audio.wav", cleaned_audio, sample_rate)
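
To show what the two cleanup steps mean in practice, here is a dependency-free sketch that operates on a plain list of samples. It is not voxy's implementation: `normalize_peak` and `remove_silence` are hypothetical helpers, peak normalization is only one possible reading of "normalizes volume", and the parameter names merely mirror the call above.

```python
def normalize_peak(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]


def remove_silence(samples, sample_rate,
                   silence_threshold=0.02, min_silence_duration=0.2):
    """Drop runs of quiet samples lasting at least min_silence_duration seconds."""
    min_run = max(1, round(min_silence_duration * sample_rate))
    kept, quiet_run = [], []
    for s in samples:
        if abs(s) < silence_threshold:
            quiet_run.append(s)
            continue
        # A loud sample ends the current quiet run; keep the run only if short.
        if len(quiet_run) < min_run:
            kept.extend(quiet_run)
        quiet_run = []
        kept.append(s)
    # Handle a quiet run that reaches the end of the signal.
    if len(quiet_run) < min_run:
        kept.extend(quiet_run)
    return kept


signal = [0.5, 0.0, 0.25, 0.0, 0.0, 0.0, -0.5]
trimmed = remove_silence(signal, sample_rate=10)  # 0.2 s equals 2 samples here
print(trimmed)  # prints "[0.5, 0.0, 0.25, -0.5]"
print(normalize_peak(trimmed))  # prints "[1.0, 0.0, 0.5, -1.0]"
```

The single quiet sample is kept because it is shorter than the minimum silence duration, while the run of three is dropped. Trimming before normalizing is a choice made for this sketch, not a documented property of `cleanup_audio`.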

Disabling Audio Cleanup

You can disable audio cleanup when cloning a voice:

voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice.",
    cleanup_audio_fn=None  # Disable audio cleanup
)

Custom Audio Cleanup

You can also provide your own audio cleanup function:

def my_custom_cleanup(audio, sample_rate, **kwargs):
    # Custom cleanup logic goes here; this placeholder returns the audio unchanged
    processed_audio = audio
    return processed_audio

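# Presumably passed through the same cleanup_audio_fn parameter used above
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice.",
    cleanup_audio_fn=my_custom_cleanup
)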

Release history

0.0.3: Mar 24, 2025
0.0.2: Mar 24, 2025 (this version)
