Facade for voice cloning and speech synthesis
voxy
To install: pip install voxy
Voxy is a flexible Python module for speech synthesis and voice cloning, with initial support for the Sesame CSM-1B model. It provides a plugin architecture that can be extended to support other models in the future.
Features
- Voice cloning from audio samples
- High-quality speech synthesis
- Flexible input formats (file paths, bytes, streams, tensors)
- Audio cleanup utilities
- Automatic audio transcription (using Whisper)
- Plugin architecture for different speech models
Installation
Prerequisites
- Python 3.10+
- PyTorch and TorchAudio
- CUDA-compatible GPU (recommended)
- FFmpeg for audio processing
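The prerequisites above can be checked before installing anything else. This is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of voxy, and it does not check for a CUDA GPU because that would require importing PyTorch.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Return a mapping of prerequisite name to whether it was found."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "torch": importlib.util.find_spec("torch") is not None,
        "torchaudio": importlib.util.find_spec("torchaudio") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```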
Install the CSM Model
The intention is to make voxy a plugin-enabled facade where you can choose your
own engine (for voice cloning, voice synthesis, etc.).
For now, it supports only what seems to be the best open-source model available
at the time of writing: Sesame AI Lab's CSM model.
They did an amazing job on the model but, so far, a terrible one on the Python
interface, which is what inspired me to develop voxy in the first place.
Follow the instructions in the CSM repository to install the CSM model and its dependencies.
Quick Start
Basic Usage
from voxy import create_speech_model
# Create a speech model
model = create_speech_model(model_type="csm")
# Generate speech with default voice
audio = model.generate_speech(
    text="Hello, this is a test of the CSM speech model.",
    output_path="output.wav"
)
Voice Cloning
from voxy import create_speech_model
# Create a speech model
model = create_speech_model(model_type="csm")
# Clone a voice from an audio file
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice for cloning purposes."
)
# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This is my cloned voice speaking. Isn't it amazing?",
    voice_profile=voice_profile,
    output_path="cloned_voice.wav"
)
Automatic Transcription
from voxy import create_speech_model
# Create a speech model
model = create_speech_model(model_type="csm")
# Clone a voice with automatic transcription
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    # No transcript provided, will use automatic transcription
)
# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This voice was cloned using automatic transcription.",
    voice_profile=voice_profile,
    output_path="auto_transcribed_voice.wav"
)
Flexible Input Formats
The module supports various input formats:
# From file path
voice_profile1 = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="Text transcript."
)
# From bytes
with open("sample_voice.wav", "rb") as f:
    audio_bytes = f.read()
voice_profile2 = model.clone_voice(
    audio_input=audio_bytes,
    transcript="Text transcript."
)
# From file object (the call must run while the file is still open)
with open("sample_voice.wav", "rb") as f:
    voice_profile3 = model.clone_voice(
        audio_input=f,
        transcript="Text transcript."
    )
# From tensor
import torch
import torchaudio
audio_tensor, sample_rate = torchaudio.load("sample_voice.wav")
voice_profile4 = model.clone_voice(
    audio_input=audio_tensor,
    transcript="Text transcript."
)
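Accepting several input types usually comes down to normalizing them to one representation early. The sketch below shows how a path, a bytes object, and a binary stream could be reduced to raw bytes. It is an illustration only: `read_audio_bytes` and `AudioSource` are hypothetical names, voxy's internal handling may differ, and tensors are left out to keep the example free of third-party dependencies.

```python
import io
import os
from pathlib import Path
from typing import BinaryIO, Union

AudioSource = Union[str, os.PathLike, bytes, bytearray, BinaryIO]


def read_audio_bytes(source: AudioSource) -> bytes:
    """Return the raw bytes behind a path, a bytes object, or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, os.PathLike)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("stream must be opened in binary mode")
        return data
    raise TypeError(f"Unsupported audio source: {type(source).__name__}")


payload = b"RIFF-demo"
# The same payload is recovered whether it arrives as bytes or as a stream.
print(read_audio_bytes(payload) == read_audio_bytes(io.BytesIO(payload)))
```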
Configuration
You can configure the default device by setting the DFLT_VOXY_DEVICE environment variable:
# Use CUDA
export DFLT_VOXY_DEVICE=cuda
# Use CPU
export DFLT_VOXY_DEVICE=cpu
# Use MPS (Apple Silicon)
export DFLT_VOXY_DEVICE=mps
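For reference, this is roughly how an environment-driven default could be resolved in Python. The sketch is an assumption about the general pattern, not voxy's code: `resolve_device` and `VALID_DEVICES` are hypothetical, and the fallback to `"cpu"` when the variable is unset is chosen here only for illustration.

```python
import os

# Device names taken from the examples above.
VALID_DEVICES = ("cuda", "cpu", "mps")


def resolve_device(default: str = "cpu") -> str:
    """Read DFLT_VOXY_DEVICE, falling back to `default` when unset or empty."""
    value = os.environ.get("DFLT_VOXY_DEVICE", "").strip().lower()
    if not value:
        return default
    if value not in VALID_DEVICES:
        raise ValueError(
            f"DFLT_VOXY_DEVICE={value!r} is not one of {VALID_DEVICES}"
        )
    return value


# The variable can also be set from Python for the current process.
os.environ["DFLT_VOXY_DEVICE"] = "cpu"
print(resolve_device())
```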
Advanced Usage
Audio Cleanup
The module includes an audio cleanup function that normalizes volume and removes silence:
from voxy import cleanup_audio
import torchaudio
# Load audio
audio, sample_rate = torchaudio.load("noisy_audio.wav")
# Clean up audio
cleaned_audio = cleanup_audio(
    audio=audio,
    sample_rate=sample_rate,
    normalize=True,
    remove_silence=True,
    silence_threshold=0.02,
    min_silence_duration=0.2
)
# Save cleaned audio
torchaudio.save("cleaned_audio.wav", cleaned_audio, sample_rate)
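To make the two cleanup steps concrete, here is a plain-Python sketch of peak normalization and silence removal on a list of samples. It is not voxy's `cleanup_audio` implementation, which works on tensors and whose exact algorithm is not described here; `peak_normalize` and `drop_long_silences` are hypothetical names, and the interpretation of the threshold and duration parameters is an assumption.

```python
from typing import List, Sequence


def peak_normalize(samples: Sequence[float], peak: float = 1.0) -> List[float]:
    """Scale samples so the largest absolute value equals `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    return [s * peak / loudest for s in samples]


def drop_long_silences(
    samples: Sequence[float],
    sample_rate: int,
    silence_threshold: float = 0.02,
    min_silence_duration: float = 0.2,
) -> List[float]:
    """Remove runs of quiet samples lasting at least `min_silence_duration` seconds."""
    min_run = max(1, round(min_silence_duration * sample_rate))
    kept: List[float] = []
    run: List[float] = []  # current run of consecutive quiet samples
    for s in samples:
        if abs(s) < silence_threshold:
            run.append(s)
            continue
        if len(run) < min_run:
            kept.extend(run)  # short pause: keep it
        run = []
        kept.append(s)
    if len(run) < min_run:
        kept.extend(run)  # trailing short pause
    return kept


signal = [0.5, 0.0, 0.0, 0.0, -0.25, 0.01, 0.5]
trimmed = drop_long_silences(signal, sample_rate=10)
print(peak_normalize(trimmed))
```

Short pauses are kept on purpose so that natural gaps between words survive; only runs at or above the minimum duration are dropped.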
Disabling Audio Cleanup
You can disable audio cleanup when cloning a voice:
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice.",
    cleanup_audio_fn=None  # Disable audio cleanup
)
Custom Audio Cleanup
You can also provide your own audio cleanup function:
def my_custom_cleanup(audio, sample_rate, **kwargs):
    # Custom cleanup logic goes here; this placeholder returns the audio unchanged
    processed_audio = audio
    return processed_audio
voice_profile = model.