Facade for voice cloning and speech synthesis

voxy

To install: pip install voxy

Voxy is a flexible Python module for speech synthesis and voice cloning, with initial support for the Sesame CSM-1B model. It provides a plugin architecture that can be extended to support other models in the future.

Features

  • Voice cloning from audio samples
  • High-quality speech synthesis
  • Flexible input formats (file paths, bytes, streams, tensors)
  • Audio cleanup utilities
  • Automatic audio transcription (using Whisper)
  • Plugin architecture for different speech models

Installation

Prerequisites

  • Python 3.10+
  • PyTorch and TorchAudio
  • CUDA-compatible GPU (recommended)
  • FFmpeg for audio processing
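To check these prerequisites quickly, a small script like the following can help. It is a sketch, not part of voxy: check_prerequisites is a hypothetical helper. It only checks that the packages can be found and that the ffmpeg executable is on PATH, without importing PyTorch. A GPU check would additionally need torch.cuda.is_available() after importing torch.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Hypothetical helper: report which listed prerequisites appear to be available."""
    return {
        # Python 3.10 or newer
        "python>=3.10": sys.version_info >= (3, 10),
        # PyTorch and TorchAudio can be found (located, not imported)
        "torch": importlib.util.find_spec("torch") is not None,
        "torchaudio": importlib.util.find_spec("torchaudio") is not None,
        # FFmpeg executable reachable on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:<14} {'ok' if ok else 'MISSING'}")
```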

Install the CSM Model

The intention is to make voxy a plugin-enabled facade, where you can choose your own engine (for voice cloning, voice synthesis, etc.). For now, though, we only support what seems to be the best open-source model out there at the time of writing: Sesame AI Lab's CSM model. They did an amazing job on the model, but (so far) a terrible one on the Python interface, which is what inspired me to develop voxy in the first place.

Follow the instructions in the CSM repository to install the CSM model and its dependencies.
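The plugin idea can be pictured as a registry that maps a model_type string (like the "csm" passed to create_speech_model in the examples) to an engine factory. The following is a minimal sketch of that pattern under stated assumptions: SpeechEngine, register_engine, make_speech_model and the toy EchoEngine are hypothetical names chosen for illustration, not voxy's actual internals.

```python
from typing import Callable, Dict, Protocol


class SpeechEngine(Protocol):
    """Hypothetical minimal interface an engine plugin would implement."""

    def clone_voice(self, audio_input, transcript=None, **kwargs): ...

    def generate_speech(self, text, voice_profile=None, output_path=None, **kwargs): ...


# Hypothetical registry mapping a model_type string to an engine factory
_ENGINES: Dict[str, Callable[..., SpeechEngine]] = {}


def register_engine(model_type: str):
    """Decorator that registers an engine factory (e.g. a class) under model_type."""

    def decorator(factory: Callable[..., SpeechEngine]) -> Callable[..., SpeechEngine]:
        if model_type in _ENGINES:
            raise ValueError(f"engine {model_type!r} is already registered")
        _ENGINES[model_type] = factory
        return factory

    return decorator


def make_speech_model(model_type: str, **kwargs) -> SpeechEngine:
    """Look up the factory registered for model_type and instantiate the engine."""
    try:
        factory = _ENGINES[model_type]
    except KeyError:
        known = ", ".join(sorted(_ENGINES)) or "none"
        raise ValueError(f"unknown model_type {model_type!r} (registered: {known})") from None
    return factory(**kwargs)


@register_engine("echo")
class EchoEngine:
    """Toy engine that only demonstrates registration; it synthesizes nothing."""

    def __init__(self, device: str = "cpu"):
        self.device = device

    def clone_voice(self, audio_input, transcript=None, **kwargs):
        return {"audio": audio_input, "transcript": transcript}

    def generate_speech(self, text, voice_profile=None, output_path=None, **kwargs):
        return f"[{self.device}] {text}"
```

A new engine would then only need to implement the two methods and register itself under its own name; callers keep using the same factory function.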

Quick Start

Basic Usage

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Generate speech with default voice
audio = model.generate_speech(
    text="Hello, this is a test of the CSM speech model.", 
    output_path="output.wav"
)

Voice Cloning

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Clone a voice from an audio file
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice for cloning purposes."
)

# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This is my cloned voice speaking. Isn't it amazing?",
    voice_profile=voice_profile,
    output_path="cloned_voice.wav"
)

Automatic Transcription

from voxy import create_speech_model

# Create a speech model
model = create_speech_model(model_type="csm")

# Clone a voice with automatic transcription
voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    # No transcript provided, will use automatic transcription
)

# Generate speech with the cloned voice
audio = model.generate_speech(
    text="This voice was cloned using automatic transcription.",
    voice_profile=voice_profile,
    output_path="auto_transcribed_voice.wav"
)
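When no transcript is passed, it has to come from speech recognition (Whisper, according to the feature list). The selection logic can be sketched as follows. resolve_transcript and its transcribe callback are hypothetical, and this is not necessarily how voxy wires Whisper in.

```python
from typing import Callable, Optional


def resolve_transcript(
    audio_path: str,
    transcript: Optional[str] = None,
    transcribe: Optional[Callable[[str], str]] = None,
) -> str:
    """Hypothetical helper: use the caller's transcript, else fall back to speech-to-text."""
    if transcript is not None and transcript.strip():
        # An explicit, non-empty transcript always wins
        return transcript.strip()
    if transcribe is None:
        raise ValueError("no transcript given and no transcriber available")
    # Fall back to the speech-to-text callback (e.g. a Whisper wrapper)
    text = transcribe(audio_path).strip()
    if not text:
        raise ValueError(f"transcriber returned empty text for {audio_path!r}")
    return text
```

With the openai-whisper package installed, one possible callback is lambda path: asr.transcribe(path)["text"], where asr = whisper.load_model("base") is created once up front so the model is not reloaded on every call.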

Flexible Input Formats

clone_voice accepts several input formats (file path, bytes, file object, or tensor). The examples below reuse the model created above:

# From file path
voice_profile1 = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="Text transcript."
)

# From bytes
with open("sample_voice.wav", "rb") as f:
    audio_bytes = f.read()
voice_profile2 = model.clone_voice(
    audio_input=audio_bytes,
    transcript="Text transcript."
)

# From file object
with open("sample_voice.wav", "rb") as f:
    voice_profile3 = model.clone_voice(
        audio_input=f,
        transcript="Text transcript."
    )

# From tensor (torchaudio.load returns the waveform and its sample rate;
# only the waveform is passed to clone_voice here)
import torchaudio
audio_tensor, sample_rate = torchaudio.load("sample_voice.wav")
voice_profile4 = model.clone_voice(
    audio_input=audio_tensor,
    transcript="Text transcript."
)
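Supporting several input types usually comes down to dispatching on the type and funnelling everything into one decoder. The sketch below does this with the standard-library wave module, so it assumes 16-bit PCM WAV data and keeps only the first channel. load_audio_input is a hypothetical helper, not voxy's loader. Already-decoded samples stand in for the tensor case; since a bare array of samples carries no sample rate, the sketch requires the caller to pass one.

```python
import io
import os
import struct
import wave
from typing import BinaryIO, List, Optional, Sequence, Tuple, Union

AudioInput = Union[str, os.PathLike, bytes, BinaryIO, Sequence[float]]


def _read_wav(source) -> Tuple[List[float], int]:
    """Decode a 16-bit PCM WAV (path or binary file object) to floats in [-1, 1)."""
    with wave.open(source, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV")
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Samples are interleaved by channel; keep only the first channel
    return [s / 32768.0 for s in ints[::n_channels]], rate


def load_audio_input(
    audio_input: AudioInput, sample_rate: Optional[int] = None
) -> Tuple[List[float], int]:
    """Hypothetical dispatcher: normalize any supported input to (samples, rate)."""
    # str must be checked before the generic sequence case (a str is a Sequence)
    if isinstance(audio_input, (str, os.PathLike)):
        return _read_wav(os.fspath(audio_input))
    if isinstance(audio_input, (bytes, bytearray)):
        return _read_wav(io.BytesIO(audio_input))
    if hasattr(audio_input, "read"):
        return _read_wav(audio_input)
    # Already-decoded samples carry no sample rate, so the caller must supply one
    if sample_rate is None:
        raise ValueError("sample_rate is required for already-decoded samples")
    return [float(s) for s in audio_input], sample_rate
```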

Configuration

You can configure the default device by setting the DFLT_VOXY_DEVICE environment variable:

# Use CUDA
export DFLT_VOXY_DEVICE=cuda

# Use CPU
export DFLT_VOXY_DEVICE=cpu

# Use MPS (Apple Silicon)
export DFLT_VOXY_DEVICE=mps
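How the environment variable interacts with hardware detection is not specified here. One plausible resolution order, given only as an assumption, is: use DFLT_VOXY_DEVICE when it is set, otherwise prefer CUDA, then MPS, then CPU. resolve_device is a hypothetical helper; the availability flags are passed in so that the sketch runs without PyTorch.

```python
import os
from typing import Mapping, Optional

VALID_DEVICES = ("cuda", "mps", "cpu")


def resolve_device(
    env: Optional[Mapping[str, str]] = None,
    cuda_available: bool = False,
    mps_available: bool = False,
) -> str:
    """Hypothetical: DFLT_VOXY_DEVICE if set, else the best available backend."""
    env = os.environ if env is None else env
    requested = env.get("DFLT_VOXY_DEVICE", "").strip().lower()
    if requested:
        # Also accept indexed devices such as "cuda:1"
        if requested.split(":", 1)[0] not in VALID_DEVICES:
            raise ValueError(f"unsupported DFLT_VOXY_DEVICE value: {requested!r}")
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

With PyTorch installed, the flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().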

Advanced Usage

Audio Cleanup

The module includes an audio cleanup function that normalizes volume and removes silence:

from voxy import cleanup_audio
import torchaudio

# Load audio
audio, sample_rate = torchaudio.load("noisy_audio.wav")

# Clean up audio
cleaned_audio = cleanup_audio(
    audio=audio,
    sample_rate=sample_rate,
    normalize=True,
    remove_silence=True,
    silence_threshold=0.02,
    min_silence_duration=0.2
)

# Save cleaned audio
torchaudio.save("cleaned_audio.wav", cleaned_audio, sample_rate)
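To make the normalize, remove_silence, silence_threshold and min_silence_duration parameters concrete, here is a minimal pure-Python sketch of one way such a cleanup could work on a list of mono samples. It is an illustration under assumptions, not voxy's cleanup_audio: it peak-normalizes to a hypothetical target_peak, then drops every run of samples quieter than silence_threshold that lasts at least min_silence_duration seconds.

```python
from typing import List, Sequence


def cleanup_samples(
    samples: Sequence[float],
    sample_rate: int,
    normalize: bool = True,
    remove_silence: bool = True,
    silence_threshold: float = 0.02,
    min_silence_duration: float = 0.2,
    target_peak: float = 0.95,  # hypothetical parameter
) -> List[float]:
    """Illustrative cleanup: peak normalization followed by long-silence removal."""
    out = [float(s) for s in samples]
    if normalize:
        peak = max((abs(s) for s in out), default=0.0)
        if peak > 0:
            gain = target_peak / peak
            out = [s * gain for s in out]
    if not remove_silence:
        return out
    # Minimum number of consecutive quiet samples that counts as removable silence
    min_run = max(1, int(min_silence_duration * sample_rate))
    kept: List[float] = []
    run: List[float] = []  # current stretch of quiet samples
    for s in out:
        if abs(s) < silence_threshold:
            run.append(s)
            continue
        # A loud sample ends the quiet stretch: keep the stretch only if it was short
        if len(run) < min_run:
            kept.extend(run)
        run = []
        kept.append(s)
    # Handle a quiet stretch that runs to the end of the clip
    if len(run) < min_run:
        kept.extend(run)
    return kept
```

Dropping long pauses outright can make speech sound clipped; keeping a few milliseconds of each removed run is a common refinement.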

Disabling Audio Cleanup

You can disable audio cleanup when cloning a voice:

voice_profile = model.clone_voice(
    audio_input="sample_voice.wav",
    transcript="This is a sample of my voice.",
    cleanup_audio_fn=None  # Disable audio cleanup
)

Custom Audio Cleanup

You can also provide your own audio cleanup function:

def my_custom_cleanup(audio, sample_rate, **kwargs):
    # Custom cleanup logic goes here; this placeholder returns the audio unchanged
    processed_audio = audio
    return processed_audio

voice_profile = model.
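As an example of a function with the (audio, sample_rate, **kwargs) shape shown above, the hypothetical trim_edges below trims quiet samples from the start and end of a clip while keeping a short pad. It works on a plain list of mono samples so that it runs without PyTorch. What object voxy actually passes to a custom cleanup function (presumably a tensor, as with cleanup_audio) is an assumption here, as is passing the function through the same cleanup_audio_fn parameter used above to disable cleanup.

```python
from typing import List, Sequence


def trim_edges(
    audio: Sequence[float],
    sample_rate: int,
    threshold: float = 0.02,
    pad_seconds: float = 0.05,
    **kwargs,  # ignore any extra options the caller passes
) -> List[float]:
    """Hypothetical custom cleanup: trim quiet edges, keeping a short pad."""
    loud = [i for i, s in enumerate(audio) if abs(s) >= threshold]
    if not loud:
        # Entirely quiet input: return it unchanged rather than an empty clip
        return list(audio)
    pad = int(pad_seconds * sample_rate)
    start = max(0, loud[0] - pad)
    end = min(len(audio), loud[-1] + 1 + pad)
    return list(audio[start:end])
```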

Download files

Source Distribution

voxy-0.0.2.tar.gz (9.7 kB)

Uploaded Source

Built Distribution

voxy-0.0.2-py3-none-any.whl (8.7 kB)

Uploaded Python 3

File details

Details for the file voxy-0.0.2.tar.gz.

File metadata

  • Download URL: voxy-0.0.2.tar.gz
  • Size: 9.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for voxy-0.0.2.tar.gz
Algorithm Hash digest
SHA256 0f8d0a1a3820457a9a5975fcd230c1e17331bec8ba41bb81188444bc57a61bd5
MD5 8591fb76503f48623ec1151668761d6f
BLAKE2b-256 323b1d6cd3a970350e908291208d171ea89e3c7616a506e7cc0cf48d638861ca

File details

Details for the file voxy-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: voxy-0.0.2-py3-none-any.whl
  • Size: 8.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.13

File hashes

Hashes for voxy-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 f508860d51a963b6eb924b90b4a1cded9242ebb00073cf2dd4b6739a030a133b
MD5 8393db14041f27590bf7f1a883db8ce2
BLAKE2b-256 d9ae1191bc0ce53dea141abd2c81436e2a1f43d82bcbc2c95df46478cd6c4097
