A collection of PyTorch audio datasets for speech and music applications

AudioLoader

AudioLoader is a collection of PyTorch audio datasets that are not yet available in the official PyTorch and torchaudio dataset collections. I am building various one-click-ready audio datasets for my research, and I hope they will also benefit other people.

Currently supported datasets:

  1. Multilingual LibriSpeech (MLS)
  2. MAPS

TODO:

  1. MAESTRO
  2. MusicNet

Installation

pip install git+https://github.com/KinWaiCheuk/AudioDatasets.git

Multilingual LibriSpeech

Introduction

This is a custom PyTorch Dataset for Multilingual LibriSpeech (MLS).

Multilingual LibriSpeech (MLS) covers 8 languages. This ready-to-use PyTorch Dataset class lets users set up the dataset simply by instantiating the MultilingualLibriSpeech class. The original dataset puts all utterance labels into a single .txt file, which makes label loading slow for larger languages such as English. This custom Dataset class automatically splits the labels into smaller files.

Usage

To use this dataset for the first time, set download=True.

dataset = MultilingualLibriSpeech('../Speech', 'mls_polish', 'train', download=True)

This will download, unzip, and split the labels. To download the opus version of the dataset, simply add the suffix _opus, e.g. mls_polish_opus.
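For example, the opus variant can be loaded with the same call:

dataset = MultilingualLibriSpeech('../Speech', 'mls_polish_opus', 'train', download=True)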

dataset[i] returns a dictionary containing:

{'path': '../Speech/mls_polish_opus/test/audio/8758/8338/8758_8338_000066.opus',
 'waveform': tensor([[ 1.8311e-04,  1.5259e-04,  1.5259e-04,  ...,  1.5259e-04,
           9.1553e-05, -3.0518e-05]]),
 'sample_rate': 48000,
 'utterance': 'i zaczynają z wielką ostrożnością rozdzierać jedwabistą powłokę w tem miejscu gdzie się znajduje głowa poczwarki gdyż młoda mrówka tak jest niedołężną że nawet wykluć się ze swego więzienia nie może bez obcej pomocy wyciągnąwszy ostrożnie więźnia który jest jeszcze omotany w rodzaj pieluszki',
 'speaker_id': 8758,
 'chapter_id': 8338,
 'utterance_id': 66}
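
For instance, a minimal sketch reading one sample through the keys shown above:

sample = dataset[0]
waveform, sr = sample['waveform'], sample['sample_rate']
print(sample['utterance'])  # the transcript string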

Other functionalities

  1. extract_limited_train_set

dataset.extract_limited_train_set()

Extracts the 9hr and 1hr train sets into a new folder called limited_train. This is useful for researchers working on low-resource training; the extracted split can then be loaded as sketched below.
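
Once extracted, the new split can presumably be loaded like any other split, since limited_train is one of the accepted split names listed below:

dataset = MultilingualLibriSpeech('../Speech', 'mls_polish', 'limited_train')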

  2. extract_labels

dataset.extract_labels(split_name, num_threads=0, IPA=False)

Splits the single label .txt file into smaller per-chapter .txt files, which dramatically improves label loading efficiency. When setting up the dataset for the first time, self.extract_labels('train'), self.extract_labels('dev'), and self.extract_labels('test') are called automatically.

split_name: train, dev, test, limited_train

num_threads: Default 0. Determines how many threads are used to split the labels. Useful for larger datasets such as English.

IPA: Default False. Set to True to extract IPA labels. Useful for phoneme recognition. Requires phonemizer and espeak.
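
For example, extracting IPA labels for the train split with 4 threads (a sketch using the signature above; phonemizer and espeak must be installed):

dataset.extract_labels('train', num_threads=4, IPA=True)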

MAPS

Introduction

The MAPS dataset contains 9 folders; each folder contains 30 full music recordings and the aligned MIDI annotations. The two folders ENSTDkAm and ENSTDkCl contain real acoustic recordings obtained from a YAMAHA Disklavier; the rest are synthesized audio clips. This ready-to-use PyTorch Dataset class automatically sets up most of this for you.

Usage

To use this dataset for the first time, set download=True.

dataset = MAPS('./Folder', groups='all', download=True)

This will download, unzip, and extract the .tsv labels.

dataset[i] returns a dictionary containing:

{'path': '../MusicDataset/MAPS/AkPnBcht/MUS/MAPS_MUS-hay_40_1_AkPnBcht.wav',
 'sr': 44100,
 'audio': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 'midi': array([[  2.078941,   2.414137,  67.      ,  52.      ],
        [  2.078941,   2.414137,  59.      ,  43.      ],
        [  2.078941,   2.414137,  55.      ,  43.      ],
        ...,
        [394.169767, 394.867987,  59.      ,  56.      ],
        [394.189763, 394.867987,  62.      ,  56.      ],
        [394.209759, 394.867987,  67.      ,  62.      ]])}

Each row of midi represents one MIDI note and contains [start_time, end_time, midi_pitch, velocity].
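
For example, a minimal sketch unpacking the first note of a sample (using only the keys shown above):

start_time, end_time, pitch, velocity = dataset[0]['midi'][0]
print(f'pitch {int(pitch)} from {start_time:.2f}s to {end_time:.2f}s at velocity {int(velocity)}')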

The original audio clips are all stereo; users might want to convert them to mono tracks first, as sketched below. Alternatively, the .resample() method can also be used to resample the audio and convert it to mono.
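
A minimal sketch of the manual conversion, assuming the (2, num_samples) tensor layout shown above: averaging the two channels yields a mono track.

mono_audio = dataset[0]['audio'].mean(dim=0, keepdim=True)  # (2, N) -> (1, N)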

Getting a batch of audio segments

To generate a batch of audio segments and piano rolls, collect_batch(x, hop_size, sequence_length) should be used as the collate_fn of the PyTorch DataLoader. The hop_size passed to collect_batch should be the same as the spectrogram hop_size, so that the resulting piano roll aligns with the spectrogram.

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4, collate_fn=lambda x: collect_batch(x, hop_size, sequence_length))
for batch in loader:
    audios = batch['audio'].to(device)
    frames = batch['frame'].to(device)
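
As an illustration of why the hop sizes must match, a hedged sketch continuing from the loop above (the sample_rate and n_mels values here are placeholders):

import torchaudio

spec_layer = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=128, hop_length=hop_size).to(device)
spec = spec_layer(audios)  # shape: (batch, n_mels, num_frames)
# With a matching hop_size, each spectrogram frame should line up with
# one time step of the piano roll in batch['frame'].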

Other functionalities

  1. resample

dataset.resample(sr, output_format='flac')
dataset = MAPS('./Folder', groups='all', ext_audio='.flac')

Resamples the audio clips to the target sample rate sr and converts them to the target format output_format. This method requires pydub. After resampling, create another MAPS instance (as shown above) so that the new audio files are loaded instead of the original .wav files.

  2. extract_tsv

dataset.extract_tsv()

Converts the MIDI files into .tsv files for easy loading.
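
Each resulting .tsv file can then be read back into the [start_time, end_time, midi_pitch, velocity] array shown earlier, e.g. with numpy. A hedged sketch; the file path is a placeholder and the presence of a header row is an assumption:

import numpy as np

midi = np.loadtxt('path/to/label.tsv', delimiter='\t', skiprows=1)  # skiprows=1 assumes a header row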

