
Music2Latent

Encode and decode audio samples to/from compressed representations! Useful for efficient generative modeling applications and for other downstream tasks.

Read the ISMIR 2024 paper here. Listen to audio samples here.

Under the hood, Music2Latent uses a Consistency Autoencoder model to efficiently encode and decode audio samples.

44.1 kHz audio is encoded into a sequence of latents at ~10 Hz, each with 64 channels. 48 kHz audio can also be encoded, resulting in a ~12 Hz sequence. A generative model can then be trained on these embeddings, or they can be used for other downstream tasks.
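
For intuition, here is the back-of-the-envelope latent shape implied by these rates (a sketch; the exact frame rate is determined by the model's internal hop size):

# Approximate latent shape for a 30-second, 44.1 kHz mono clip
seconds = 30
latent_rate = 10       # ~10 Hz latents at 44.1 kHz (~12 Hz at 48 kHz)
latent_channels = 64

print((1, latent_channels, seconds * latent_rate))  # roughly (1, 64, 300)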

Music2Latent was trained on both music and speech. Refer to the paper for more details.

Installation

pip install music2latent

The model weights will be downloaded automatically the first time the code is run for inference.

How to Use (Inference)

To encode and decode audio samples to/from latent embeddings:

import librosa
from music2latent import EncoderDecoder

audio_path = librosa.example('trumpet')
wv, sr = librosa.load(audio_path, sr=44100)  # Music2Latent supports 48 kHz audio as well

encdec = EncoderDecoder()

latent = encdec.encode(wv)
# latent has shape (batch_size/audio_channels, dim (64), sequence_length)

wv_rec = encdec.decode(latent)

# Listen to the reconstructed audio
# import IPython
# IPython.display.display(IPython.display.Audio(wv_rec.squeeze(), rate=sr))
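
To save the reconstruction to disk, here is a minimal sketch using soundfile (one of the listed dependencies), assuming wv_rec is a torch tensor (drop .cpu().numpy() if it is already a NumPy array):

import soundfile as sf

# Write the reconstruction as a WAV file at the original sample rate
sf.write('reconstruction.wav', wv_rec.squeeze().cpu().numpy(), samplerate=sr)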

To extract encoder features (before the bottleneck) for downstream tasks:

features = encdec.encode(wv, extract_features=True)
# 'features' will have more channels than 'latent' but cannot be decoded.
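
A common downstream recipe is to mean-pool these features over time into a fixed-size clip embedding (a sketch, assuming the last axis is time, as in the latent shape above):

# Mean-pool over the time axis to get one embedding vector per clip
clip_embedding = features.mean(-1)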

Loading Custom Trained Models

The EncoderDecoder class loads our pre-trained model by default. To use a model you trained yourself, set the load_path_inference_default variable in hparams_inference.py to the path of your checkpoint.
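
For example (hypothetical checkpoint path):

# music2latent/hparams_inference.py
load_path_inference_default = 'checkpoints/my_model.pt'  # path to your own checkpoint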

music2latent supports more advanced usage, including GPU memory management controls. Please refer to tutorial.ipynb.

Training

Make sure your environment is set up with the dependencies listed in requirements.txt. Music2Latent relies on numpy, soundfile, huggingface_hub, torch>=2.5.0, laion-clap, torchaudio, librosa, scipy.
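
For example, from the repository root:

pip install -r requirements.txt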

1. Configuration

Music2Latent uses a Python-based configuration system. Instead of separate .yaml or .json files, you create a Python file (e.g., config.py) that overrides default settings.

Default Hyperparameters: All the default hyperparameters are defined in music2latent/hparams.py. You don't need to copy all of these into your configuration file; specify only the ones you want to change.

Example Configuration File (config.py):

# config.py (example)

batch_size = 16                                                             # batch size
lr = 0.0001                                                                 # learning rate
total_iters = 800000                                                        # total iterations

data_paths = ['/media/datasets/dataset1', '/media/datasets/dataset2']       # list of paths of training datasets (use a single-element list for a single dataset). Audio files will be recursively searched in these paths and in their sub-paths
data_path_test = '/media/datasets/test_dataset'                             # path of samples used for FAD testing (e.g. musiccaps)

You always need to specify data_paths and data_path_test: the former is a list of paths to your training datasets, the latter the path to the test set used to compute the Fréchet Audio Distance (FAD) during training.

Important Hyperparameters:

  • batch_size: Batch size for training.
  • lr: Initial learning rate.
  • lr_decay: Learning rate decay schedule (cosine, linear, inverse_sqrt, or None).
  • total_iters: Total number of training iterations.
  • data_paths: A list of paths to your training datasets. The code recursively searches for .wav and .flac files (or other extensions you specify in data_extensions).
  • data_fractions: A list of sampling weights, specifying how often to sample from each dataset in data_paths. If None, datasets are sampled uniformly (see the example after this list).
  • data_path_test: The path to your test dataset, used for calculating the Fréchet Audio Distance (FAD) during training.
  • compile_model: Whether to use torch.compile for potential speedups (see below).
  • multi_gpu: Enable multi-GPU training with torchrun.
  • accumulate_gradients: Accumulates gradients over multiple batches before updating. This lets you use larger effective batch sizes without exceeding GPU memory.
  • checkpoint_path: Directory where checkpoints (saved models) are stored.
  • load_path: Load checkpoint from this path to resume training.
  • num_workers: Number of workers the dataloader will use.
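
As an example of the weighted sampling controlled by data_fractions (hypothetical paths and weights):

# config.py (weighted sampling sketch)
data_paths = ['/media/datasets/music', '/media/datasets/speech']
data_fractions = [0.8, 0.2]   # ~80% of training samples drawn from the first dataset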

See music2latent/hparams.py for all available hyperparameters and their default values. You can override any of these in your config.py.

Also, see the configs/config.py file for an example configuration file containing all the hyperparameters and a description of each. You can copy this file and modify it to suit your needs.

2. Launching a Training Run

To start a training run, use the launch.py script with the --config argument:

python launch.py --config path/to/your/config.py

Checkpoints: During training, checkpoints will be stored in the directory specified by the checkpoint_path hyperparameter (default: checkpoints). The best checkpoint (lowest FAD) and/or the latest checkpoint will be kept during training.

TensorBoard: Training progress (loss, FAD, audio samples, etc.) is logged using TensorBoard. You can view these logs by running:

tensorboard --logdir=<your_checkpoint_path>

Replace <your_checkpoint_path> with the actual path to your checkpoint directory (by default this is checkpoints).

3. Multi-GPU Training

To use multiple GPUs, use torchrun.

Example (using 3 GPUs):

CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nnodes=1 --nproc_per_node=3 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 launch.py --config my_config.py

  • CUDA_VISIBLE_DEVICES=0,1,2: This makes GPUs 0, 1, and 2 visible to your training process. Adjust this based on your system's GPU configuration.
  • --nnodes=1: We're running on a single machine (node).
  • --nproc_per_node=3: We're using 3 GPUs (processes per node).
  • --rdzv_backend=c10d: Specifies the rendezvous backend (how the processes find each other).
  • --rdzv_endpoint=localhost:0: Specifies the rendezvous endpoint; port 0 lets the system pick a free port.
  • launch.py: Our training script.
  • --config my_config.py: Your configuration file. Make sure to set multi_gpu = True in your config.py.

Important: When using torchrun, each GPU processes a batch size equal to batch_size divided by the number of GPUs, rounded down. For example, with batch_size = 16 on 3 GPUs, each GPU processes 5 samples (16 / 3 ≈ 5.33).
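
In code, the division described above (a sketch):

# Per-GPU batch size with batch_size = 16 on 3 GPUs
batch_size = 16
num_gpus = 3
per_gpu_batch = batch_size // num_gpus   # 16 // 3 = 5
# Pick a batch_size divisible by the GPU count to avoid dropping samples.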

4. Model Compilation (torch.compile)

Music2Latent supports torch.compile, a feature introduced in PyTorch 2.0 that can significantly speed up training.

  • Enabling Compilation: Set compile_model = True in your configuration file (it's True by default).
  • First Run: The first time you run with compile_model = True, PyTorch will compile your model. This can take a significant amount of time (often 10 minutes or more, depending on your hardware). The compiled model will be cached, so subsequent runs will be much faster.
  • Cache Directory: The compiled model is cached in the directory specified by torch_compile_cache_dir (default: tmp/torch_compile).
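
The corresponding configuration entries, with the defaults described above:

# config.py (compilation settings; defaults shown)
compile_model = True                           # enable torch.compile
torch_compile_cache_dir = 'tmp/torch_compile'  # compilation cache directory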

5. Resuming Training

To resume training from a checkpoint, set the load_path parameter in your config.py to the path of the checkpoint file you want to resume from. Also, consider setting load_optimizer = False if you encounter issues resuming.

# config.py (for resuming)
load_path = "checkpoints/my_run/model_fid_X_loss_X_iters_X.pt"
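# load_optimizer = False  # uncomment if resuming fails due to optimizer state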

License

This library is released under the CC BY-NC 4.0 license. Please refer to the LICENSE file for more details.

This work was conducted by Marco Pasini during his PhD at Queen Mary University of London, in partnership with Sony Computer Science Laboratories Paris. This work was supervised by Stefan Lattner and George Fazekas.
