Contrastive Language-Audio Pretraining Model from LAION

CLAP

CLAP (Contrastive Language-Audio Pretraining) follows the design of CLIP (Contrastive Language-Image Pretraining). The CLAP architecture is shown below.

The Contrastive Language-Audio Pretraining Model Architecture
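
As a rough illustration of what "contrastive" means for a CLIP-style model, the sketch below computes a symmetric contrastive loss over a small batch of paired audio and text embeddings. It is a plain-Python sketch, not the training code from this repository: the function names, the fixed temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(audio_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over paired (audio, text) embeddings.

    The temperature value is an assumption chosen for this sketch.
    """
    audio = [normalize(a) for a in audio_embeds]
    text = [normalize(t) for t in text_embeds]
    n = len(audio)
    # sim[i][j] is the scaled cosine similarity between audio i and text j.
    sim = [[sum(x * y for x, y in zip(a, t)) / temperature for t in text] for a in audio]
    # Audio-to-text: row i should select column i.
    a2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-audio: column j should select row j.
    t2a = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy embeddings: matched pairs point in similar directions, swapped pairs do not.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.1], [0.1, 3.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 3.0], [2.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each audio clip is most similar to its own caption and large when the pairing is wrong, which is the signal that pulls matching audio and text embeddings together during pretraining.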

The repository contains code for the following paper:

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

About this project

This is an open-source LAION project that aims at learning better audio understanding and collecting more audio data. We adopt the codebase of open_clip for this project. The major open-source contributors of this project are (in equal contribution): Yusong Wu, Tianyu Zhang, Ke Chen.

Many thanks to @cfoster0 for allowing us to use his repo name.

Environment Installation

To set up the same environment that we use, run the following commands:

conda create -n clap python=3.10
conda activate clap
git clone https://github.com/LAION-AI/CLAP.git
cd CLAP
# Alternatively, install PyTorch by following the official instructions (https://pytorch.org/get-started/locally/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Dataset format

We use training data in webdataset format. For details of our dataset, please see https://github.com/LAION-AI/audio-dataset.

An example of our dataset format can be found here. It contains the full ESC50 dataset, split according to the first fold of the 5-fold split.
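
For readers unfamiliar with webdataset, a shard is an ordinary tar archive in which files sharing the same basename form one sample. The sketch below builds a tiny shard in memory and groups its members back into samples using only the standard library. The `.flac` and `.json` extensions, the `text` metadata field, and the sample keys are assumptions made for illustration; check the audio-dataset repository for the actual layout.

```python
import io
import json
import tarfile
from collections import defaultdict


def add_member(tar, name, payload):
    """Write one in-memory file into an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_shard(samples):
    """Pack (key, audio_bytes, metadata) triples into a tar shard held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, audio_bytes, metadata in samples:
            add_member(tar, f"{key}.flac", audio_bytes)  # extension is an assumption
            add_member(tar, f"{key}.json", json.dumps(metadata).encode("utf-8"))
    buffer.seek(0)
    return buffer


def read_shard(buffer):
    """Group tar members by basename so that each key maps to one sample."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=buffer, mode="r") as tar:
        for member in tar.getmembers():
            key, _, extension = member.name.partition(".")
            grouped[key][extension] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder bytes stand in for real audio; the captions are made up.
shard = build_shard([
    ("clip_0001", b"fake-audio-1", {"text": ["a dog barking"]}),
    ("clip_0002", b"fake-audio-2", {"text": ["rain on a window"]}),
])
samples = read_shard(shard)
print(sorted(samples))
print(json.loads(samples["clip_0001"]["json"])["text"])
```

In practice the webdataset library performs this grouping while streaming shards, so a sketch like this is mainly useful for inspecting a shard by hand.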

Training, Fine-tuning and Evaluation

The training, fine-tuning, and evaluation (zero-shot and retrieval) scripts are in the experiment_scripts folder. These are the scripts we used to train our model on a SLURM cluster, so you will need to adapt them to your own environment. For example, on a single machine with multiple GPUs, you might want to use torchrun instead of srun to run the script. To train on a single-GPU machine, use CUDA_VISIBLE_DEVICES=0 python -m ... instead of srun. We use Weights and Biases for experiment logging, so you need to configure Weights and Biases in your environment.

Loading Model and Inference

Please refer to infer_demo.py for a complete example of using our model to compute audio and text embeddings. The core code is shown below.

# See infer_demo.py for the remaining imports and for where model, model_cfg,
# tokenizer, get_audio_features, int16_to_float32 and float32_to_int16 come from.
import librosa
import torch

def infer_audio():
    '''
    Set hyperparameters and load the pretrained model here (see infer_demo.py).
    '''

    # Load the waveform with shape (T,), resampled to 48000 Hz.
    audio_waveform, sr = librosa.load('/home/la/kechen/Research/KE_CLAP/ckpt/test_clap_long.wav', sr=48000)
    # Quantize: round-trip the samples through int16 and back to float32.
    audio_waveform = int16_to_float32(float32_to_int16(audio_waveform))
    audio_waveform = torch.from_numpy(audio_waveform).float()
    audio_dict = {}

    # The 'fusion' truncate mode can be changed to 'rand_trunc' when running in non-fusion mode.
    audio_dict = get_audio_features(
        audio_dict, audio_waveform, 480000,
        data_truncating='fusion',
        data_filling='repeatpad',
        audio_cfg=model_cfg['audio_cfg']
    )
    # A list can be passed to the model to process many audio tracks at once (i.e. a batch).
    audio_embed = model.get_audio_embedding([audio_dict])
    print(audio_embed.size())

def infer_text():
    '''
    Set hyperparameters and load the pretrained model here (see infer_demo.py).
    '''

    # Load the text; it can be a list (i.e. a batch).
    text_data = ["I love the contrastive learning", "I love the pretrain model"]
    # Tokenize for RoBERTa. To tokenize for another text encoder, please refer to data.py#L43-90.
    text_data = tokenizer(text_data)

    text_embed = model.get_text_embedding(text_data)
    print(text_embed.size())
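
The quantize step in the excerpt above round-trips the waveform through 16-bit integers. The sketch below shows what such a round trip does to sample values, in plain Python without NumPy. The two `_sketch` functions are hypothetical stand-ins written for illustration; the repository's own helpers may differ in detail, and the scale factor of 32767 with clipping to [-1, 1] is an assumption.

```python
def float32_to_int16_sketch(samples):
    """Clip to [-1, 1] and scale to the 16-bit range, truncating toward zero."""
    return [int(max(-1.0, min(1.0, s)) * 32767.0) for s in samples]


def int16_to_float32_sketch(samples):
    """Map 16-bit integers back to floats in [-1, 1]."""
    return [s / 32767.0 for s in samples]


waveform = [0.0, 0.25, -0.5, 1.2, -1.0]
clipped = [max(-1.0, min(1.0, s)) for s in waveform]
restored = int16_to_float32_sketch(float32_to_int16_sketch(waveform))

# After the round trip, every sample lies on the 16-bit grid, and out-of-range
# values such as 1.2 have been clipped.
max_error = max(abs(a - b) for a, b in zip(restored, clipped))
print(restored)
print(max_error < 1 / 32767)

# The length passed to get_audio_features in the excerpt above is 480000 samples,
# which equals 10 seconds of audio at 48 kHz.
print(10 * 48000)
```

The round trip makes a float waveform match what would be obtained from 16-bit audio, at the cost of an error smaller than one quantization step per sample.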

Pretrained Models

The pretrained checkpoints can be found here. Please refer to the previous section for how to load and run the checkpoints.

The checkpoint listed here for each model setting is the one with the highest average mAP score during training. The average mAP score is calculated by averaging 4 scores: A-->T mAP@10 on AudioCaps, T-->A mAP@10 on AudioCaps, A-->T mAP@10 on Clotho, and T-->A mAP@10 on Clotho.
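
The selection criterion can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code from this repository: `map_at_10` assumes exactly one relevant item per query (datasets with several captions per clip need a more general definition), and the dictionary keys and rank values are hypothetical placeholders rather than measured results.

```python
def map_at_10(ranks):
    """mAP@10 when each query has exactly one relevant item (an assumption).

    `ranks` holds the 0-based position of the relevant item for each query.
    A hit inside the top 10 scores 1 / (rank + 1); a miss scores 0.
    """
    scores = [1.0 / (rank + 1) if rank < 10 else 0.0 for rank in ranks]
    return sum(scores) / len(scores)


def average_map(scores):
    """Average the four retrieval scores used to pick a checkpoint."""
    required = {"audiocaps_a2t", "audiocaps_t2a", "clotho_a2t", "clotho_t2a"}
    missing = required - scores.keys()
    if missing:
        raise ValueError(f"missing scores: {sorted(missing)}")
    return sum(scores[name] for name in required) / len(required)


# Placeholder ranks, not measured results.
example = {
    "audiocaps_a2t": map_at_10([0, 1, 12]),
    "audiocaps_t2a": map_at_10([0, 0, 3]),
    "clotho_a2t": map_at_10([4, 9, 10]),
    "clotho_t2a": map_at_10([1, 1, 1]),
}
print(round(average_map(example), 4))
```

Under this criterion, the checkpoint saved at the epoch with the largest `average_map` value would be the one kept for each model setting.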

Reproducibility

An example of the preprocessed Clotho dataset in webdataset format can be downloaded here (by downloading, you agree to the license described in the Clotho dataset). The audio encoder pretrained on 48kHz AudioSet can be found here, where HTSAT-fullset-imagenet-map=0.467.ckpt is the checkpoint used to initialize our HTSAT audio encoder. You should get similar results by loading the audio encoder checkpoint and training on the same dataset. Because most of the datasets have copyright restrictions, we unfortunately cannot share the other preprocessed datasets directly. The captions generated by the keyword-to-caption model for AudioSet can be found here.

Citation

If you find this project and the LAION-Audio-630K dataset useful, please cite our paper:

@article{wu2022large,
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal = {arXiv preprint arXiv:2211.06687},
  year = {2022},
}

Acknowledgements

This project is a work in progress, so the codebase and model might not be perfect or bug-free. We greatly appreciate any contribution or issue raised. If you find a bug or have a suggestion, please feel free to open an issue or contact us. If you would like to actively contribute to this project, please join the LAION Discord.

Release history

Version	Release date
1.1.7	May 4, 2025
1.1.6	Jul 9, 2024
1.1.5	Jul 9, 2024
1.1.4	Apr 19, 2023
1.1.3	Apr 10, 2023
1.1.2	Apr 1, 2023
1.1.1	Apr 1, 2023
1.1.0	Mar 14, 2023
1.0.0	Mar 1, 2023
0.1.1	Feb 28, 2023
0.1.0	Feb 28, 2023
0.0.9	Feb 28, 2023 (this version)
0.0.8	Feb 27, 2023
0.0.7	Feb 27, 2023
0.0.6	Feb 15, 2023
0.0.5	Feb 15, 2023
0.0.4	Feb 15, 2023
0.0.3	Feb 15, 2023
0.0.2	Feb 15, 2023
0.0.1	Feb 14, 2023

Download files

Source Distribution

laion_clap-0.0.9.tar.gz (1.5 MB)

Uploaded Feb 28, 2023 Source

Built Distribution

laion_clap-0.0.9-py3-none-any.whl (1.5 MB)

Uploaded Feb 28, 2023 Python 3

File details

Details for the file laion_clap-0.0.9.tar.gz.

File metadata

Download URL: laion_clap-0.0.9.tar.gz
Upload date: Feb 28, 2023
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for laion_clap-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`b770be5121ccb08e3bf7bc2ed940d192746cf7c3991b2066b0be6fdb2e639825`
MD5	`ecdbec7a6ffa330666503a43886095ee`
BLAKE2b-256	`0fbf0ad4035cf0dc2cf75aaaab9a81afa71630c636a488c20bfc3909ec1d843a`

File details

Details for the file laion_clap-0.0.9-py3-none-any.whl.

File metadata

Download URL: laion_clap-0.0.9-py3-none-any.whl
Upload date: Feb 28, 2023
Size: 1.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for laion_clap-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0a7a11b24c87b18d638b7c03cb2670e004d2a6f5997424366d6deb8040da893`
MD5	`41ccdc267a4865b7a8ac9f090a269958`
BLAKE2b-256	`5a363a6b1ce52d9a41cf64c3cad0fe1acc146a3819616ab95b37d7b73703c5a8`

laion-clap 0.0.9
