Holistic Evaluation of Audio Representations (HEAR) 2021 -- Preprocessing Pipeline
hear-preprocess
Dataset preprocessing code for the HEAR 2021 NeurIPS competition.
Unless you are a HEAR organizer or want to contribute a task, you won't need this repo. Use hear-eval-kit to evaluate your embedding models on these tasks.
Cloud Usage
See hear-eval's README.spotty for information on how to use spotty.
Installation
pip3 install hearpreprocess
Tested with Python 3.7 and 3.8. Python 3.9 is not officially supported because pip3 installs are very finicky, but it might work.
Development
Clone repo:
git clone https://github.com/neuralaudio/hear-preprocess
cd hear-preprocess
Add secret task submodule:
git submodule init
git submodule update
NOTE: Secret tasks are not available to participants. You should skip the above step.
Install in development mode:
pip3 install -e ".[dev]"
Make sure you have pre-commit hooks installed:
pre-commit install
Running tests:
python3 -m pytest
Preprocessing
You probably don't need to do this unless you are implementing the HEAR challenge.
If you want to run preprocessing yourself:
- You will need ffmpeg>=4.2 installed (possibly from conda-forge).
- You will need soxr support, which might require package libsox-fmt-ffmpeg or installing from source.
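Before starting a long run, it can help to confirm that the ffmpeg on your PATH meets the 4.2 minimum. The following is a minimal sketch, not part of hearpreprocess: the helper names `parse_ffmpeg_version` and `ffmpeg_is_recent_enough` are hypothetical, and the check assumes the usual `ffmpeg version X.Y...` banner printed by `ffmpeg -version`. It does not check soxr support.

```python
import re
import shutil
import subprocess

MIN_FFMPEG = (4, 2)  # minimum version stated in the requirements above


def parse_ffmpeg_version(banner):
    """Extract (major, minor) from the banner printed by `ffmpeg -version`.

    Returns None when no numeric version is found (for example, some
    snapshot builds report a git revision instead of a release number).
    """
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ffmpeg_is_recent_enough(minimum=MIN_FFMPEG):
    """Return True if an ffmpeg binary on PATH reports at least `minimum`."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    version = parse_ffmpeg_version(result.stdout)
    return version is not None and version >= minimum
```

Calling `ffmpeg_is_recent_enough()` returns False both when ffmpeg is missing and when its version cannot be parsed, so a False result means "check manually" rather than "definitely too old".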
This will take about 2 user-CPU-hours for the open tasks. 100 GB free disk space is required while processing. Final output is 11 GB.
These Luigi pipelines are used to preprocess the evaluation tasks into a common format for downstream evaluation.
To run the preprocessing pipeline for all available tasks:
python3 -m hearpreprocess.runner all
You can also just run individual tasks:
python3 -m hearpreprocess.runner [speech_commands|nsynth_pitch|office_events]
NOTE: To run the pipeline on secret tasks, initialize, update, and install the hear2021-secret-tasks submodule. This repository is not available to participants. If the submodule is set up:
- The aforementioned commands will work for secret tasks as well.
- Running with the all option will trigger all available open and secret tasks.
- To run individual tasks, use the corresponding task name. The secret task names are also hidden and are listed in the hear2021-secret-tasks submodule.
Each pipeline will download and preprocess each dataset according to the following DAG:
- DownloadCorpus
- ExtractArchive
- ExtractMetadata: Create splits over the entire corpus and find the label metadata for them.
- SubsampleSplit (subsample each split) => MonoWavTrimCorpus => SubsampledData (symlinks)
- SubsampledData => {SubsampledMetadata, ResampleSubcorpus}
- SubsampledMetadata => MetadataVocabulary
- FinalizeCorpus
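The dependency structure above can be sketched as a plain dictionary and ordered with a topological sort. This is only an illustration of the DAG, not the actual Luigi task definitions: the edges are read off the list above, and the inputs of FinalizeCorpus are an assumption because the list does not state them.

```python
from collections import deque

# Task -> tasks it depends on. Edges follow the list above; the
# dependencies of FinalizeCorpus are an assumption for illustration.
REQUIRES = {
    "DownloadCorpus": [],
    "ExtractArchive": ["DownloadCorpus"],
    "ExtractMetadata": ["ExtractArchive"],
    "SubsampleSplit": ["ExtractMetadata"],
    "MonoWavTrimCorpus": ["SubsampleSplit"],
    "SubsampledData": ["MonoWavTrimCorpus"],
    "SubsampledMetadata": ["SubsampledData"],
    "ResampleSubcorpus": ["SubsampledData"],
    "MetadataVocabulary": ["SubsampledMetadata"],
    "FinalizeCorpus": ["ResampleSubcorpus", "MetadataVocabulary"],
}


def execution_order(requires):
    """Return one valid run order using Kahn's topological sort."""
    # Number of unmet dependencies per task.
    pending = {task: len(deps) for task, deps in requires.items()}
    # Reverse edges: which tasks become closer to runnable when a task finishes.
    dependents = {task: [] for task in requires}
    for task, deps in requires.items():
        for dep in deps:
            dependents[dep].append(task)

    ready = deque(task for task, count in pending.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(requires):
        raise ValueError("dependency cycle detected")
    return order
```

Luigi resolves this ordering itself from each task's requirements; the sketch only makes explicit that SubsampledMetadata and ResampleSubcorpus can run independently once SubsampledData exists.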
In terms of sampling:
- We create a 60/20/20 split if train/valid/test does not exist.
- We cap each split at 3/1/1 hours of audio, defined as
- If further sampling is done for the small version, it selects a particular number of audio samples per task.
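The first two sampling rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function names, the seeded shuffle, and the "stop at the first file that would exceed the cap" policy are all hypothetical, since the text above does not specify how files are assigned to splits or how the cap is enforced.

```python
import random

SPLIT_PERCENT = {"train": 60, "valid": 20}  # test receives the remainder
MAX_HOURS = {"train": 3, "valid": 1, "test": 1}


def make_splits(filenames, seed=0):
    """Shuffle deterministically and cut into 60/20/20 train/valid/test."""
    files = sorted(filenames)  # sort first so the result ignores input order
    random.Random(seed).shuffle(files)
    n_train = len(files) * SPLIT_PERCENT["train"] // 100
    n_valid = len(files) * SPLIT_PERCENT["valid"] // 100
    return {
        "train": files[:n_train],
        "valid": files[n_train:n_train + n_valid],
        "test": files[n_train + n_valid:],
    }


def cap_split(files, durations, max_hours):
    """Keep files in order until adding the next one would exceed the cap.

    `durations` maps each filename to its length in seconds.
    """
    budget = max_hours * 3600.0
    kept, total = [], 0.0
    for name in files:
        if total + durations[name] > budget:
            break
        kept.append(name)
        total += durations[name]
    return kept
```

For example, `cap_split(splits["train"], durations, MAX_HOURS["train"])` would trim a training split to at most three hours of audio.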
These commands will download and preprocess the entire dataset. An intermediary directory defined by the option luigi-dir (default _workdir) will be created, and then a final directory defined by the option tasks-dir (default tasks) will contain the completed dataset.
Options:
  --num-workers INTEGER  Number of CPU workers to use when running. If not
                         provided all CPUs are used.
  --sample-rate INTEGER  Perform resampling only to this sample rate. By
                         default we resample to 16000, 22050, 44100, 48000.
  --small FLAG           If passed, the task will run on a small version of
                         the data.
  --work-dir STRING      Temporary directory to save all the intermediate
                         tasks (will not be deleted afterwards). It will
                         require as much disk space as the final output,
                         if not more. By default this is set to _workdir in
                         the module root directory.
  --tasks-dir STRING     Path to dir to store the final task outputs.
                         By default this is set to tasks in the module root
                         directory.
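When scripting several runs, the options listed above can be assembled into a command line programmatically. The helper below is a hypothetical sketch (`runner_command` is not part of hearpreprocess); it uses only the flags listed above and assumes they may follow the task name.

```python
def runner_command(task, num_workers=None, sample_rate=None, small=False,
                   work_dir=None, tasks_dir=None):
    """Build the argument list for `python3 -m hearpreprocess.runner`."""
    cmd = ["python3", "-m", "hearpreprocess.runner", task]
    if num_workers is not None:
        cmd += ["--num-workers", str(num_workers)]
    if sample_rate is not None:
        cmd += ["--sample-rate", str(sample_rate)]
    if small:
        cmd.append("--small")  # flag without a value
    if work_dir is not None:
        cmd += ["--work-dir", str(work_dir)]
    if tasks_dir is not None:
        cmd += ["--tasks-dir", str(tasks_dir)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with directory names that contain spaces.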
To check the stats of an audio directory:
python3 -m hearpreprocess.audio_dir_stats {input folder} {output json file}
Stats include: audio_count, audio_samplerate_count, and the mean, median, and certain (10, 25, 75, 90) percentile durations. This gives a quick overview of the audio files in a folder and helps in deciding the preprocessing configuration.
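To illustrate what these duration statistics mean, here is a minimal sketch that computes them from a list of durations in seconds. It is not the audio_dir_stats implementation: apart from audio_count, the output key names are hypothetical, and linear interpolation between neighbouring values is an assumed percentile convention.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    low_value = sorted_values[lower]
    return low_value + (sorted_values[upper] - low_value) * fraction


def duration_stats(durations):
    """Summarise clip durations: count, mean, median, selected percentiles."""
    ordered = sorted(durations)
    stats = {
        "audio_count": len(ordered),
        "mean": statistics.mean(ordered),      # hypothetical key name
        "median": statistics.median(ordered),  # hypothetical key name
    }
    for q in (10, 25, 75, 90):
        stats["p%d" % q] = percentile(ordered, q)  # hypothetical key names
    return stats
```

A wide gap between the 10th and 90th percentiles suggests clips of very different lengths, which is the kind of signal that informs how much to trim or pad during preprocessing.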
The pipeline will also generate some stats of the original and preprocessed data sets, e.g.:
speech_commands-v0.0.2/01-ExtractArchive/test_stats.json
speech_commands-v0.0.2/01-ExtractArchive/train_stats.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_test.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_train.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_valid.json
Faster preprocessing, for development
The small flag runs the preprocessing pipeline on a small version of each dataset stored at Downsampled HEAR Open Tasks. This is used for development and continuous integration tests for the pipeline.
These small versions of the data can be generated deterministically with the following command:
python3 -m hearpreprocess.sampler <taskname>
NOTE: The --small flag is used to run the task on a small version of the dataset for development.
Hashes for hearpreprocess-2021.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b6f48d3d7f531edff2b48d86a71a31b33ce56b13d568bf18a2fd7d0fe05f15b6
MD5 | b7149f147a66ed18fd3996d7a8054735
BLAKE2b-256 | 6e3b83023e18f469660bf9a28b369f08dfc1aefc549b430fc6db56aefb763d55