Contrastive Language-Audio Pretraining Model from LAION
CLAP
Contrastive Language-Audio Pretraining, known as CLAP. Following the CLIP (Contrastive Language-Image Pretraining) architecture, CLAP trains an audio encoder and a text encoder contrastively, so that paired audio clips and captions are mapped close together in a shared embedding space.
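As a rough illustration of the training objective, here is a minimal sketch of a CLIP-style symmetric contrastive loss on a toy batch of paired embeddings (the batch size, embedding dimension, and fixed temperature are illustrative; the actual model uses a learnable logit scale and the encoders described in the paper):
import torch
import torch.nn.functional as F

# Toy batch of 8 paired audio/text embeddings, L2-normalized onto the unit sphere.
audio_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)
temperature = 0.07  # fixed here; a learnable logit scale in practice

# Similarity of every audio clip to every caption in the batch.
logits = audio_emb @ text_emb.t() / temperature
labels = torch.arange(8)  # the i-th audio matches the i-th caption

# Symmetric cross-entropy over the audio-to-text and text-to-audio directions.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2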
The repository contains code for the following paper, accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023: Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation.
About this project
This is an open-source project in LAION that aims at learning better audio understanding and collecting more audio data. We adopt the codebase of open_clip for this project. The major open-source contributors (in equal contribution) are: Yusong Wu, Tianyu Zhang, Ke Chen.
Many thanks to @cfoster0 for allowing us to use his repo name.
Quick Start
We provide the library for our CLAP model:
pip install laion_clap
Then you can follow the usage below or refer to unit_test.py.
import librosa
import laion_clap
model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt()
# Directly get audio embeddings from audio files
audio_file = [
'/home/la/kechen/Research/KE_CLAP/ckpt/test_clap_short.wav',
'/home/la/kechen/Research/KE_CLAP/ckpt/test_clap_long.wav'
]
audio_embed = model.get_audio_embedding_from_filelist(x = audio_file)
print(audio_embed)
print(audio_embed.shape)
# Get audio embeddings from audio data
audio_data, _ = librosa.load('/home/la/kechen/Research/KE_CLAP/ckpt/test_clap_short.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_embed = model.get_audio_embedding_from_data(x = audio_data)
print(audio_embed)
print(audio_embed.shape)
# Get text embeddings from texts:
text_data = ["I love the contrastive learning", "I love the pretrain model"]
text_embed = model.get_text_embedding(text_data)
print(text_embed)
print(text_embed.shape)
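Since the audio and text embeddings live in a shared space, you can rank captions against a clip by cosine similarity, e.g. for zero-shot classification. A minimal sketch, assuming the default numpy outputs from the snippet above (we normalize explicitly, since the returned embeddings are not guaranteed to be unit-length):
import numpy as np

# Score both example captions against the first audio clip.
a = audio_embed[0] / np.linalg.norm(audio_embed[0])
t = text_embed / np.linalg.norm(text_embed, axis=1, keepdims=True)
scores = t @ a
print(scores)  # higher score = closer audio-text match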
Environment Installation
If you want to inspect and reuse our model in your own project rather than using the pip library directly, you need to install the same environment we use. Please run the following commands:
conda create -n clap python=3.10
conda activate clap
git clone https://github.com/LAION-AI/CLAP.git
cd CLAP
# you can also install pytorch by following the official instruction (https://pytorch.org/get-started/locally/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
Dataset format
We use training data in webdataset format. For details of our dataset please see https://github.com/LAION-AI/audio-dataset.
You can find an example of our dataset format here. It contains the full ESC50 dataset, split according to the first 5-fold split.
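For a rough idea of how to inspect such a shard, here is a minimal sketch; the shard filename and the flac/json key layout are assumptions based on the audio-dataset repo, not guaranteed:
import io
import json
import soundfile as sf
import webdataset as wds

# Iterate samples from a (hypothetical) shard; sample keys mirror the file extensions in the tar.
for sample in wds.WebDataset("esc50-train-000000.tar"):
    meta = json.loads(sample["json"])                # caption / label metadata
    audio, sr = sf.read(io.BytesIO(sample["flac"]))  # decoded waveform
    print(meta.get("text"), audio.shape, sr)
    break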
Training, Fine-tuning and Evaluation
Please find the scripts for training, fine-tuning and evaluation (zero-shot and retrieval) in the experiment_scripts folder.
The scripts included there are the ones we used to train our model on a SLURM cluster; you will need to adapt them to your own environment.
For example, in a single-machine multi-GPU setting, you might want to use torchrun instead of srun to launch the script. To train on a single GPU machine, use CUDA_VISIBLE_DEVICES=0 python -m ... instead of srun.
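For instance, a single-node multi-GPU launch might look like the following (the GPU count is illustrative; the trailing arguments are whatever the original srun line passes):
torchrun --nproc_per_node=4 -m ...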
We use Weights and Biases for experiment logging; you need to configure it in your environment before running the scripts.
Core Code
Please refer to main.py, train.py, data.py, and model.py to quickly get familiar with our model.
Pretrained Models
The pretrained checkpoints can be found here. Please refer to the Quick Start section above for how to load and run the checkpoints.
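If you download a checkpoint file manually, load_ckpt also accepts an explicit path. A minimal sketch (the path is hypothetical; pick a checkpoint that matches your enable_fusion setting):
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt('/path/to/630k-audioset-best.pt')  # hypothetical path to a downloaded checkpoint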
The checkpoint listed here for each model setting is the one with the highest average mAP score in training. The average mAP score is the mean of four scores: A-->T mAP@10 on AudioCaps, T-->A mAP@10 on AudioCaps, A-->T mAP@10 on Clotho, and T-->A mAP@10 on Clotho.
Reproducibility
An example of the preprocessed Clotho dataset in webdataset format can be downloaded here (by downloading, you agree to the license described in the Clotho dataset). The audio encoder pretrained on 48kHz AudioSet can be found here, where HTSAT-fullset-imagenet-map=0.467.ckpt is the checkpoint used to initialize our HTSAT audio encoder. You should get similar results by loading the audio encoder checkpoint and training on the same dataset.
Because most of the datasets have copyright restrictions, we unfortunately cannot directly share the other preprocessed datasets. The captions generated by the keyword-to-caption model for AudioSet can be found here.
Citation
If you find this project and the LAION-Audio-630K dataset useful, please cite our paper:
@inproceedings{laionclap2023,
title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
author = {Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
year = {2023}
}
@inproceedings{htsatke2022,
author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
year = {2022}
}
Acknowledgements
This project is a work in progress, so the codebase and model might not be perfect or bug-free. We will very much appreciate any kind of contribution or any issue raised. If you find a bug or have any suggestion, please feel free to open an issue or contact us. If you would like to actively contribute to this project, please join the LAION Discord.