This package is written for text-to-audio/music generation.

AudioLDM 2

This repo currently supports text-to-audio (including music) and text-to-speech generation.

Change Log

2023-08-27: Added two new checkpoints!
- 🌟 48kHz AudioLDM model: High-fidelity audio generation is now supported. To use this checkpoint, set --model_name audioldm_48k
- 16kHz improved AudioLDM model: Trained with more data and an optimized model architecture.

TODO

Add the text-to-speech checkpoint
Open-source the AudioLDM training code.
Support the generation of longer audio (> 10s)
Optimize the inference speed of the model.
Integration with the Diffusers library

Web APP

Prepare running environment

conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2

Start the web application (powered by Gradio)

python3 app.py

A link will be printed to the terminal. Open it in your browser to try the model.

Commandline Usage

Installation

Prepare running environment

# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git

If you plan to use text-to-speech generation, please also make sure espeak is installed. On Linux you can install it with:

sudo apt-get install espeak

Run the model in commandline

Generate a sound effect or music from a text prompt

audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

Generate sound effects or music from a list of text prompts

audioldm2 -tl batch.lst
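
The layout of batch.lst is not described here. The sketch below assumes one prompt per line, which is an assumption rather than a documented format, and uses a hypothetical helper named write_prompt_list to create such a file before passing it to -tl.

```python
from pathlib import Path


def write_prompt_list(prompts, path="batch.lst"):
    """Write prompts to a list file, one per line (assumed format)."""
    # Drop empty entries and surrounding whitespace so every line is a prompt.
    cleaned = [p.strip() for p in prompts if p.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty prompt is required")
    out = Path(path)
    out.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    write_prompt_list([
        "Musical constellations twinkling in the night sky, forming a cosmic melody.",
        "Rain falling on a tin roof",
    ])
    # Then run: audioldm2 -tl batch.lst
```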

Generate speech based on (1) the transcription and (2) the description of the speaker

audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"

Text-to-speech uses the audioldm2-speech-gigaspeech checkpoint by default. To run TTS with the checkpoint pretrained on LJSpeech, set --model_name audioldm2-speech-ljspeech.

Random Seed Matters

The model may sometimes perform poorly (the output sounds weird or low quality) after switching to different hardware. In that case, try different random seeds and keep the one that works best on your hardware.

audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
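
Trying several seeds by hand is tedious, so a small script can print one command per seed. This is a minimal sketch: the helper name seed_sweep_commands is hypothetical, and it only uses the --seed, --model_name, and -t flags shown in this document. It builds command strings and does not run the model.

```python
import shlex


def seed_sweep_commands(prompt, seeds, model_name=None):
    """Return one audioldm2 command line per seed for the same prompt."""
    commands = []
    for seed in seeds:
        args = ["audioldm2", "--seed", str(seed)]
        if model_name is not None:
            args += ["--model_name", model_name]
        args += ["-t", prompt]
        # shlex.join quotes the prompt so the line can be pasted into a shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in seed_sweep_commands("A dog barking", [0, 42, 1234]):
        print(command)
```

Listen to each result and keep the seed that sounds best on your machine.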

Pretrained Models

You can choose the model checkpoint with --model_name:

# CUDA
audioldm2 --model_name "audioldm_48k" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm_48k" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

We provide seven checkpoints to choose from:

audioldm_48k (default): Generates high-fidelity sound effects and music.
audioldm2-full: Generates both sound effects and music with the AudioLDM 2 architecture.
audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
audioldm2-full-large-1150k: A larger version of audioldm2-full.
audioldm2-music-665k: Music generation.
audioldm2-speech-gigaspeech (default for TTS): Text-to-speech, trained on the GigaSpeech dataset.
audioldm2-speech-ljspeech: Text-to-speech, trained on the LJSpeech dataset.

We currently support 3 devices:

cpu
cuda
mps (note that the computation requires about 20 GB of RAM)
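
A mistyped checkpoint or device name is easy to miss, so it can help to check both before launching a run. The sketch below is illustrative only: build_command is a hypothetical helper, and its allowed values are copied from the checkpoint and device lists above rather than queried from the tool.

```python
import shlex

# Names copied from the lists in this document (not read from the package).
CHECKPOINTS = {
    "audioldm_48k",
    "audioldm2-full",
    "audioldm_16k_crossattn_t5",
    "audioldm2-full-large-1150k",
    "audioldm2-music-665k",
    "audioldm2-speech-gigaspeech",
    "audioldm2-speech-ljspeech",
}
DEVICES = {"cpu", "cuda", "mps"}


def build_command(text, model_name="audioldm_48k", device="cpu"):
    """Validate the checkpoint and device, then return a shell command line."""
    if model_name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {model_name!r}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device!r}")
    args = ["audioldm2", "--model_name", model_name, "--device", device, "-t", text]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_command("Rain falling on a tin roof", "audioldm2-full", "cuda"))
```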

Other options

  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
                 [--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
                 [--seed SEED]

  optional arguments:
    -h, --help            show this help message and exit
    -t TEXT, --text TEXT  Text prompt to the model for audio generation
    --transcription TRANSCRIPTION
                        Transcription used for speech synthesis
    -tl TEXT_LIST, --text_list TEXT_LIST
                          A file that contains text prompt to the model for audio generation
    -s SAVE_PATH, --save_path SAVE_PATH
                          The path to save model output
    --model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
                          The checkpoint you gonna use
    -d DEVICE, --device DEVICE
                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
    -b BATCHSIZE, --batchsize BATCHSIZE
                          Generate how many samples at the same time
    --ddim_steps DDIM_STEPS
                          The sampling step for DDIM
    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                          Guidance scale (Large => better quality and relavancy to text; Small => better diversity)
    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                          Automatic quality control. This number control the number of candidates (e.g., generate three audios and choose the best to show you). A Larger value usually lead to better quality with
                          heavier computation
    --seed SEED           Change this value (any integer number) will lead to a different generation result.

Cite this work

If you find this tool useful, please consider citing:

@article{liu2023audioldm2,
  title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
  journal={arXiv preprint arXiv:2308.05734},
  year={2023}
}

@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
}

Release history

0.1.0 (this version): Aug 27, 2023
0.0.9: Aug 5, 2023
0.0.8: Aug 5, 2023
0.0.7: Aug 5, 2023
0.0.6: Aug 5, 2023
0.0.5: Aug 5, 2023
0.0.4: Aug 5, 2023
0.0.3: Aug 5, 2023
0.0.2: Aug 4, 2023
0.0.1: Aug 4, 2023

Download files

Source Distribution

audioldm2-0.1.0.tar.gz (2.9 MB), uploaded Aug 27, 2023

Built Distribution

audioldm2-0.1.0-py3-none-any.whl (2.9 MB), uploaded Aug 27, 2023, Python 3

File details

Details for the file audioldm2-0.1.0.tar.gz.

File metadata

Download URL: audioldm2-0.1.0.tar.gz
Upload date: Aug 27, 2023
Size: 2.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for audioldm2-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d77beb8cf5a671500f52642e3409a59f37a37b936f66b4630303db3dabbcd478`
MD5	`183ff52015e3ac673a09f4162adb362e`
BLAKE2b-256	`957b0aa708c22e2ac8a27337eeabfd6b2eecb780a06e4935118b41f9354a19ae`

File details

Details for the file audioldm2-0.1.0-py3-none-any.whl.

File metadata

Download URL: audioldm2-0.1.0-py3-none-any.whl
Upload date: Aug 27, 2023
Size: 2.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for audioldm2-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`53040826ac578aeda99fc3da58c3e325381a7d06b62ddf277d4ede9c36131eba`
MD5	`de921ad1eaeee0dd43b0bd4be1314f7d`
BLAKE2b-256	`5ab9dceeff14f431c6e071ff4ea29ee039ad336dbafc71245a8c69bb7511c177`
