This package provides text-to-audio and text-to-music generation.
AudioLDM 2
This repo currently supports Text-to-Audio (including music) and Text-to-Speech generation.
Change Log
- 2023-08-27: Add two new checkpoints!
- 🌟 48kHz AudioLDM model: Now we support high-fidelity audio generation! Use this checkpoint simply by setting "--model_name audioldm_48k"
- 16kHz improved AudioLDM model: Trained with more data and optimized model architecture.
TODO
- Add the text-to-speech checkpoint
- Open-source the AudioLDM training code.
- Support the generation of longer audio (> 10s)
- Optimize the inference speed of the model.
- Integration with the Diffusers library
Web APP
- Prepare running environment
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
- Start the web application (powered by Gradio)
python3 app.py
- A link will be printed out; open it in your browser to play with the demo.
Commandline Usage
Installation
Prepare running environment
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
If you plan to play around with text-to-speech generation, please also make sure you have installed espeak. On Linux you can do it by
sudo apt-get install espeak
Run the model in commandline
- Generate sound effects or music based on a text prompt
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
- Generate sound effects or music based on a list of text prompts
audioldm2 -tl batch.lst
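The list file passed to -tl is expected to contain one text prompt per line, with each line generating its own audio clip. A batch.lst might look like this (the prompts below are illustrative):

```text
Musical constellations twinkling in the night sky, forming a cosmic melody.
A dog barking in the distance while rain falls on a tin roof.
```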
- Generate speech based on (1) the transcription and (2) the description of the speaker
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"
audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"
Text-to-Speech uses the audioldm2-speech-gigaspeech checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set --model_name audioldm2-speech-ljspeech.
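Combining the flags above, a TTS invocation with the LJSpeech checkpoint might look like the following (the prompt and transcription are illustrative):

```
audioldm2 --model_name audioldm2-speech-ljspeech -t "A female reporter is speaking" --transcription "Wish you have a good day"
```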
Random Seed Matters
The model may sometimes not perform well (e.g., the output sounds weird or low quality) when running on different hardware. In that case, try adjusting the random seed to find the one that works best for your hardware.
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
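One way to sweep seeds is to write out one audioldm2 command per candidate seed, each saving to its own folder, and then run the ones you want. This is a minimal sketch; the seed values, prompt, and output paths are illustrative:

```shell
# Write one audioldm2 command per candidate seed into a small script,
# saving each result under its own folder.
prompt="Musical constellations twinkling in the night sky, forming a cosmic melody."
for seed in 0 42 1234; do
  echo "audioldm2 --seed $seed -t \"$prompt\" -s ./output/seed_$seed"
done > run_seeds.sh
# Review run_seeds.sh, then run it with: bash run_seeds.sh
```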
Pretrained Models
You can choose the model checkpoint by setting --model_name:
# CUDA
audioldm2 --model_name "audioldm_48k" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
# MPS
audioldm2 --model_name "audioldm_48k" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
We currently provide the following checkpoints:
- audioldm_48k (default): This checkpoint can generate high-fidelity sound effects and music.
- audioldm2-full: Generate both sound effects and music with the AudioLDM 2 architecture.
- audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
- audioldm2-full-large-1150k: Larger version of audioldm2-full.
- audioldm2-music-665k: Music generation.
- audioldm2-speech-gigaspeech (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.
- audioldm2-speech-ljspeech: Text-to-Speech, trained on LJSpeech Dataset.
We currently support 3 devices:
- cpu
- cuda
- mps (note that the computation requires about 20 GB of RAM)
Other options
usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
[--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
[-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
[--seed SEED]
optional arguments:
-h, --help show this help message and exit
-t TEXT, --text TEXT Text prompt to the model for audio generation
--transcription TRANSCRIPTION
Transcription used for speech synthesis
-tl TEXT_LIST, --text_list TEXT_LIST
A file that contains text prompt to the model for audio generation
-s SAVE_PATH, --save_path SAVE_PATH
The path to save model output
--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
The checkpoint to use
-d DEVICE, --device DEVICE
The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
-b BATCHSIZE, --batchsize BATCHSIZE
Generate how many samples at the same time
--ddim_steps DDIM_STEPS
The sampling step for DDIM
-gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
Guidance scale (large => better quality and relevance to the text; small => better diversity)
-n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
Automatic quality control. This controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
heavier computation
--seed SEED Changing this value (any integer) will lead to a different generation result.
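Putting several of these options together, a typical invocation might look like the following (the prompt, save path, and option values are illustrative, not recommended settings):

```
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody." --ddim_steps 200 -gs 3.5 -n 3 -b 1 -s ./output --seed 1234
```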
Cite this work
If you found this tool useful, please consider citing
@article{liu2023audioldm2,
title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
journal={arXiv preprint arXiv:2308.05734},
year={2023}
}
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023}
}