Create natural-sounding audio from text, clone voices and use them, or convert one voice to another. Based on the Bark model.

Project description

SpeechCraft

Text2Speech, Voice-Cloning and Voice2Voice conversion

Ever wanted to create natural-sounding speech from text, clone a voice or sound like someone else? SpeechCraft is ideal for creating voiceovers, audiobooks, or just having fun.

Features:

  • Text2speech synthesis with Suno.ai's 🐶 Bark model
    • Generate speech in different languages
    • Supports emotions & singing
  • Speaker embedding generation, a.k.a. voice cloning
  • Voice2voice synthesis: given an audio file, generate a new audio file with the voice of a different speaker.
  • Convenient, deployment-ready web API built with FastTaskAPI
  • Automatic download of models

Quick Links

Also check out the other socaity projects for generative AI.

Example generations and cloned voices

https://github.com/SocAIty/SpeechCraft/assets/7961324/dbf905ea-df37-4e52-9e93-a9833352459d

The hermine voice was generated from the voice_clone_test_voice_1.wav file, which contains around 11 seconds of clear speech.

https://github.com/SocAIty/SpeechCraft/assets/7961324/71a039c7-e665-4576-91c7-729052e05b03

Installation

The code runs fine on Windows and Linux.

With pip

# from PyPi (without web API)
pip install speechcraft
# with web API
pip install speechcraft[full]
# or from GitHub for the newest version.
pip install git+https://github.com/SocAIty/speechcraft

On Windows you will also need to install fairseq from a different source: pip install fairseq@https://github.com/Sharrnah/fairseq/releases/download/v0.12.4/fairseq-0.12.4-cp310-cp310-win_amd64.whl

To use a GPU, don't forget to install the GPU build of PyTorch for your CUDA version, e.g. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Or clone and work with the repository:

  1. Clone the repository.
  2. (Optional) Create a virtual environment with python -m venv venv and activate it with venv/Scripts/activate (Windows) or source venv/bin/activate (Linux).
  3. Install the package with pip install .
  4. Don't forget to install fairseq (on Windows) and the GPU build of PyTorch as described above.

Usage

We provide three ways to use the speech synthesis functionality.

  1. Direct module import and inference
  2. By deploying and calling the web service
  3. As part of the socaity SDK (coming soon)

Inference from script

from speechcraft import text2voice, voice2embedding, voice2voice

# simple text2speech synthesis
text = "I love society [laughs]! [happy] What a day to make voice overs with artificial intelligence."
audio_numpy, sample_rate = text2voice(text, speaker_name="en_speaker_3")

# speaker embedding generation
embedding = voice2embedding(audio_file="voice_sample_15s.wav", voice_name="hermine").save_to_speaker_lib()

# text2speech synthesis with cloned voice or embedding
audio_with_cloned_voice, sample_rate = text2voice(text, voice=embedding)  # also works with voice="hermine"

# voice2voice synthesis
cloned_audio = voice2voice(audio_file="my_audio_file.wav", voice_name_or_embedding_path="hermine")

Use the following code to convert and save the audio file with the media-toolkit module.

from media_toolkit import AudioFile
audio = AudioFile().from_np_array(audio_numpy, sr=sample_rate)
audio.save("my_new_audio.wav")

Note: The first time you use speechcraft it will download the models. These files are quite big and the download can take a while.

Web Service

(Screenshot: OpenAPI docs of the running server)

From Python:

from speechcraft.server import start_server
start_server(port=8009)

With .bat file

  1. Start the server by running the provided "start_server.bat" file, or with python bark/server.py --port 8009 (make sure PYTHONPATH is set to the root of this repository).
  2. To test the server, open http://localhost:8009/docs in your browser.

Then make POST requests to the server with your favorite tool or library. Here are some examples of how to run inference with a Python client.

Note: The first time you start the server, it will download the models. This can take a while. If this fails, you can download the files manually or with the model_downloader.py script.

How to send requests

NOTE: The web service is built with FastTaskAPI. Therefore, for each request it creates a task and returns a job id.

We highly recommend using media-toolkit for file transmission. It will make your life much easier.

import httpx
from media_toolkit import AudioFile

# text2speech
response = httpx.post("http://localhost:8009/text2voice", params={ "text" : "please contribute", "voice": "en_speaker_3"})

# voice cloning
audio = AudioFile().from_file("myfile.wav")
response = httpx.post(
    "http://localhost:8009/voice2embedding", params={ "voice_name" : "hermine"},
    files={"audio": audio.to_httpx_send_able_tuple()}
)

# voice2voice  
response = httpx.post(
    "http://localhost:8009/voice2voice", 
    params={ "voice_name" : "hermine"}, 
    files={"audio": audio.to_httpx_send_able_tuple()}
)

The following example shows how to do the same with plain Python requests.

import requests

# text2speech synthesis
response = requests.post("http://localhost:8009/text2voice", params={ "text" : "please contribute", "voice": "en_speaker_3"})

# Speaker embedding creation
with open("myfile.wav", "rb") as f:
    audio = f.read()
    
response = requests.post(
    "http://localhost:8009/voice2embedding", 
    params={ "voice_name" : "my_new_speaker"}, 
    files={"audio_file": audio}
)

# voice2voice
response = requests.post(
    "http://localhost:8009/voice2voice", 
    params={ "voice_name" : "my_new_speaker"}, 
    files={"audio_file": audio}
)

Parse the results

The response is a JSON object that includes the job id and meta information. By then sending a request to the job endpoint you can check the status and progress of the job. If the job is finished, you will get the result, including the generated audio.

import requests
from media_toolkit import AudioFile

# check the status of the job; "response" is the submission response from above
job = requests.get(f"http://localhost:8009/api/job/{response.json()['job_id']}")
# convert the result to an audio file
audio = AudioFile().from_bytes(job.json()['result'])
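
Since jobs run asynchronously, the result may not be available immediately, so you may want to poll the job endpoint. A minimal polling sketch; the "status" field name and the "finished" value are assumptions, check the actual response schema of your server:

import time
import requests
from media_toolkit import AudioFile

job_id = response.json()['job_id']
while True:
    job = requests.get(f"http://localhost:8009/api/job/{job_id}").json()
    if job.get('status') == 'finished':  # assumed field name and value
        break
    time.sleep(1)

# convert the finished result to an audio file and save it
audio = AudioFile().from_bytes(job['result'])
audio.save("result.wav")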

If you want more convenience, use fastSDK to build your client, or the socaity SDK.

Details and guidelines

🛠️ Hardware and Inference Speed

Bark has been tested and works on both CPU and GPU (pytorch 2.0+, CUDA 11.7 and CUDA 12.0).

On enterprise GPUs and PyTorch nightly, Bark can generate audio in roughly real-time. On older GPUs, the default Colab runtime, or CPU, inference can be significantly slower. For older GPUs or CPU you might want to consider using the smaller models. Details can be found in our tutorial sections here.

The full version of Bark requires around 12GB of VRAM to hold everything on GPU at the same time. To use a smaller version of the models, which should fit into 8GB VRAM, set the environment flag SUNO_USE_SMALL_MODELS=True.
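
For example, the flag can be set from Python before the models are loaded (a minimal sketch; setting the variable in your shell before starting Python works just as well):

import os
os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # must be set before speechcraft loads the models

from speechcraft import text2voice
audio_numpy, sample_rate = text2voice("Hello from the small models.", speaker_name="en_speaker_3")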

If you don't have hardware available or if you want to play with bigger versions of our models, you can also sign up for early access to our model playground here.

⚙️ Emotion and language with the Model

Bark is a fully generative text-to-audio model developed for research and demo purposes. It follows a GPT-style architecture similar to AudioLM and Vall-E and uses a quantized audio representation from EnCodec. It is not a conventional TTS model: it is capable of deviating in unexpected ways from any given script. Unlike previous approaches, the input text prompt is converted directly to audio without the intermediate use of phonemes. It can therefore generalize to arbitrary instructions beyond speech, such as music lyrics, sound effects or other non-speech sounds.

Below is a list of some known non-speech sounds, but we are finding more every day. Please let us know on Discord if you find patterns that work particularly well!

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • ... for hesitations
  • ♪ for song lyrics
  • CAPITALIZATION for emphasis of a word
  • [MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively
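
For example, these tokens can be mixed directly into the prompt (a short sketch using the text2voice function from the usage section):

from speechcraft import text2voice

text = "[sighs] What a long day ... [laughs] but it was SO worth it! ♪ la la la ♪"
audio_numpy, sample_rate = text2voice(text, speaker_name="en_speaker_3")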

Supported Languages

  • English (en)
  • German (de)
  • Spanish (es)
  • French (fr)
  • Hindi (hi)
  • Italian (it)
  • Japanese (ja)
  • Korean (ko)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)
  • Turkish (tr)
  • Chinese, simplified (zh)

To use a different language, set the voice parameter to a corresponding preset like "de_speaker_1". You can find the preset voices and languages in the assets folder.
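
For example, to generate German speech (a minimal sketch, assuming the same speaker_name parameter as in the usage section):

from speechcraft import text2voice

german_text = "Hallo! Was für ein schöner Tag für Sprachsynthese."
audio_numpy, sample_rate = text2voice(german_text, speaker_name="de_speaker_1")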

Voice Cloning guide

Make sure these things are NOT in your voice input: (in no particular order)

  • Noise (you can run a noise remover beforehand)
  • Music (there are also music remover tools; unless you want music in the background)
  • A cut-off at the end (this will cause the model to try to continue the audio)

What makes for good prompt audio? (in no particular order)

  • Around ~7 to ~15 seconds of voice data
  • Clearly spoken
  • No weird background noises
  • Only one speaker
  • Audio which ends after a sentence ends
  • A regular/common voice (these usually work best; the model can still clone complex voices, but not as well)
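
Putting the guidelines together: once you have a clean sample, cloning works as in the usage section (a sketch; voice_sample_clean.wav is a hypothetical file name):

from speechcraft import voice2embedding, text2voice

# create an embedding from a ~7-15 second clean sample and store it in the speaker library
embedding = voice2embedding(audio_file="voice_sample_clean.wav", voice_name="my_voice").save_to_speaker_lib()

# synthesize with the cloned voice
audio_numpy, sample_rate = text2voice("This is my cloned voice.", voice="my_voice")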

Disclaimer

This repository is a merge of the original bark repository and bark-voice-cloning-HuBERT-quantizer by gitmylo. The credit goes to the original authors. Like the original authors, I am not responsible for any misuse of this repository. Use at your own risk, and please act responsibly. Don't copy and publish the voice of a person without their consent.

Contribute

Any help with maintaining and extending the package is welcome. Feel free to open an issue or a pull request.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

speechcraft-0.0.2.tar.gz (2.4 MB)


Built Distribution

SpeechCraft-0.0.2-py3-none-any.whl (2.6 MB)


File details

Details for the file speechcraft-0.0.2.tar.gz.

File metadata

  • Download URL: speechcraft-0.0.2.tar.gz
  • Upload date:
  • Size: 2.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.11

File hashes

Hashes for speechcraft-0.0.2.tar.gz:

  • SHA256: b7f451eda8de62ce7aefeb82e7e18c97400c1fbfa8fa1da4f90a162f1197c1de
  • MD5: 0b36218d076b44459c2ae080ba911077
  • BLAKE2b-256: d58f4e0324b01d01f817525318f83ae34fb14f89b78e8ad0e4885477751b4b4a


File details

Details for the file SpeechCraft-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: SpeechCraft-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.11

File hashes

Hashes for SpeechCraft-0.0.2-py3-none-any.whl:

  • SHA256: 13f66cba35bfab7460c0b8b0587ac119f2d2627800ac09f970f04849beb871d0
  • MD5: 142cd59c88505f51addbaaf14ad3fa91
  • BLAKE2b-256: 441bd3c56e726a2916522f297c8d51fb116f89d6b7e6c7cdea6572e12f34bafe
