
Holistic Evaluation of Audio Representations (HEAR) 2021 -- Preprocessing Pipeline


HEAR2021

hear-preprocess

Dataset preprocessing code for the HEAR 2021 NeurIPS competition.

Unless you are a HEAR organizer or want to contribute a task, you won't need this repo. Use hear-eval-kit to evaluate your embedding models on these tasks.

This preprocessing is slow and disk-intensive but safe and careful.

Cloud Usage

See hear-eval's README.spotty for information on how to use spotty.

Installation

pip3 install hearpreprocess

Tested with Python 3.7 and 3.8. Python 3.9 is not officially supported because pip3 installs are very finicky, but it might work.

Development

Clone repo:

git clone https://github.com/neuralaudio/hear-preprocess
cd hear-preprocess

Add secret task submodule:

git submodule init
git submodule update --remote

NOTE: Secret tasks are not available to participants. You should skip the above step.

Install in development mode:

pip3 install -e ".[dev]"

Make sure you have pre-commit hooks installed:

pre-commit install

Running tests:

python3 -m pytest

Preprocessing

You probably don't need to do this unless you are implementing the HEAR challenge.

If you want to run preprocessing yourself:

  • You will need ffmpeg>=4.2 installed (possibly from conda-forge).
  • You will need soxr support, which might require package libsox-fmt-ffmpeg or installing from source.

With --mode default, preprocessing takes several hours for the open tasks. 150 GB of free disk space is required while processing. The final output is 11 GB.

With --mode all (speech_commands full and nsynth 50h), preprocessing takes 16.5 hours on an n1-standard-8 instance. It needs 560 GB of working disk, including the final output. The final output is 138 GB.
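Since a run can fail late if the disk fills up, it may be worth checking free space first. The sketch below is not part of hearpreprocess; the function names are hypothetical, and it assumes the figures quoted above are decimal gigabytes.

```python
import shutil

# Disk figures quoted above, keyed by the value passed to --mode.
# Assumption: "GB" means 10**9 bytes.
REQUIRED_FREE_GB = {"default": 150, "all": 560}


def enough_space(free_bytes: float, mode: str) -> bool:
    """Return True if free_bytes covers the space quoted for this mode."""
    return free_bytes / 1e9 >= REQUIRED_FREE_GB[mode]


def has_enough_disk(path: str, mode: str) -> bool:
    """Check the filesystem that holds `path` (hypothetical helper)."""
    return enough_space(shutil.disk_usage(path).free, mode)
```

For example, `has_enough_disk(".", "default")` checks the filesystem of the current directory before a default-mode run.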

These Luigi pipelines are used to preprocess the evaluation tasks into a common format for downstream evaluation.

To run the preprocessing pipeline for all available tasks:

python3 -m hearpreprocess.runner all

Upload to private bucket:

gsutil -m cp hear-*.tar.gz gs://hear2021-private/

Upload to open bucket:

gsutil -m cp hear-*dcase2016_task2*.tar.gz gs://hear2021/open-tasks/
gsutil -m cp hear-*speech_commands*.tar.gz gs://hear2021/open-tasks/
gsutil -m cp hear-*nsynth_pitch*.tar.gz gs://hear2021/open-tasks/

Small open tasks can be put in the cloud as follows:

gsutil -m cp hear-*dcase2016_task2*small*.tar.gz gs://hear2021/small/
gsutil -m cp hear-*speech_commands*small*.tar.gz gs://hear2021/small/
gsutil -m cp hear-*nsynth_pitch*small*.tar.gz gs://hear2021/small/

You can also just run individual tasks:

python3 -m hearpreprocess.runner [speech_commands|nsynth_pitch|office_events]

NOTE: To run the pipeline on secret tasks, first initialize, update, and install the hear2021-secret-tasks submodule. This repository is not available to participants. If the submodule is set up:

  • The aforementioned commands will work for secret tasks as well.
  • Running with the task name all will run every available open and secret task.
  • To run individual tasks, use the corresponding task name. The secret task names are also hidden and are listed in the hear2021-secret-tasks submodule.

Each pipeline will download and preprocess each dataset according to the following DAG:

  • DownloadCorpus
  • ExtractArchive
  • ExtractMetadata: Create splits over the entire corpus and find the label metadata for them.
  • SubcorpusSplit (subsample each split) => MonoWavSplit => TrimPadSplit => SubcorpusData (symlinks)
  • SubcorpusData => {SubcorpusMetadata, ResampleSubcorpus}
  • SubcorpusMetadata => MetadataVocabulary
  • FinalCombine => TarCorpus => FinalizeCorpus
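The ordering implied by this list can be sketched in plain Python, without Luigi. This is an illustration only: the task names come from the list above, but the edges ExtractMetadata to SubcorpusSplit and the two inputs of FinalCombine are assumptions, since the list does not state them. `run_order` is a hypothetical helper.

```python
from typing import Dict, List

# task -> tasks it requires. Edges marked "assumed" are not stated above.
REQUIRES: Dict[str, List[str]] = {
    "DownloadCorpus": [],
    "ExtractArchive": ["DownloadCorpus"],
    "ExtractMetadata": ["ExtractArchive"],
    "SubcorpusSplit": ["ExtractMetadata"],  # assumed
    "MonoWavSplit": ["SubcorpusSplit"],
    "TrimPadSplit": ["MonoWavSplit"],
    "SubcorpusData": ["TrimPadSplit"],
    "SubcorpusMetadata": ["SubcorpusData"],
    "ResampleSubcorpus": ["SubcorpusData"],
    "MetadataVocabulary": ["SubcorpusMetadata"],
    "FinalCombine": ["ResampleSubcorpus", "MetadataVocabulary"],  # assumed
    "TarCorpus": ["FinalCombine"],
    "FinalizeCorpus": ["TarCorpus"],
}


def run_order(target: str) -> List[str]:
    """List every upstream task before the task that needs it (depth-first)."""
    order: List[str] = []
    seen = set()

    def visit(task: str) -> None:
        if task in seen:
            return
        seen.add(task)
        for dep in REQUIRES[task]:
            visit(dep)
        order.append(task)  # appended only after all of its requirements

    visit(target)
    return order
```

Calling `run_order("FinalizeCorpus")` returns all thirteen tasks, starting with DownloadCorpus and ending with FinalizeCorpus. Luigi resolves the same kind of ordering from each task's `requires()` method.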

In terms of sampling:

  • We create a 60/20/20 split if train/valid/test does not exist.
  • We cap each split at 3/1/1 hours of audio, defined as
  • If further small sampling happens, that chooses a particular number of audio samples per task.
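A deterministic split and cap along these lines could look like the sketch below. It is not the hearpreprocess implementation: hashing filenames with MD5 is just one common way to get a reproducible split, the function names are hypothetical, and the cap is read here as 3 hours for train and 1 hour each for valid and test.

```python
import hashlib
from typing import List, Tuple


def _digest(filename: str) -> str:
    """Stable hex digest of a filename; identical on every run and machine."""
    return hashlib.md5(filename.encode("utf-8")).hexdigest()


def assign_split(filename: str) -> str:
    """Map a filename to train/valid/test with roughly 60/20/20 odds."""
    bucket = int(_digest(filename), 16) % 100
    if bucket < 60:
        return "train"
    if bucket < 80:
        return "valid"
    return "test"


# Assumed reading of the 3/1/1 cap, in hours per split.
MAX_HOURS = {"train": 3.0, "valid": 1.0, "test": 1.0}


def cap_split(files: List[Tuple[str, float]], max_hours: float) -> List[Tuple[str, float]]:
    """Keep (filename, seconds) pairs, in hash order, while they fit in max_hours."""
    budget = max_hours * 3600.0
    kept: List[Tuple[str, float]] = []
    total = 0.0
    for name, seconds in sorted(files, key=lambda f: _digest(f[0])):
        if total + seconds > budget:
            break
        kept.append((name, seconds))
        total += seconds
    return kept
```

Because the order depends only on the filenames, rerunning the sketch on the same corpus selects the same files.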

These commands will download and preprocess the entire dataset. An intermediate directory, set by the option luigi-dir (default _workdir), will be created. A final directory, set by the option tasks-dir (default tasks), will then contain the completed dataset.

Options:
  --num-workers INTEGER  Number of CPU workers to use when running. If not
                         provided all CPUs are used.
  --sample-rate INTEGER  Perform resampling only to this sample rate. By
                         default we resample to 16000, 22050, 44100, 48000.
  --tmp-dir TEXT         Temporary directory to save all the intermediate
                         tasks (will not be deleted afterwords). (default:
                         _workdir/)
  --tasks-dir TEXT       Directory to save the final task output (default:
                         tasks/)
  --tar-dir TEXT         Directory to save the tar'ed output (default: .)
  --mode TEXT            default, all, or small mode for each task.
  --help                 Show this message and exit.

To check the stats of an audio directory:

python3 -m hearpreprocess.audio_dir_stats {input folder} {output json file}

Stats include: audio_count, audio_samplerate_count, and the mean, median, and selected (10, 25, 75, 90) percentile durations. This gives a quick overview of the audio files in a folder and helps in deciding the preprocessing configuration.
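To make these statistics concrete, here is a minimal sketch that computes them from durations and sample rates that have already been read from the files. It is not the audio_dir_stats code. Only the keys audio_count and audio_samplerate_count come from the description above; the other key names, the function names, and the interpolation method are assumptions.

```python
import statistics
from collections import Counter
from typing import Dict, List, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in 0..100) of a sorted, non-empty list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def summarize(durations: List[float], sample_rates: List[int]) -> Dict[str, object]:
    """Summarize per-file durations (seconds) and sample rates (Hz)."""
    ordered = sorted(durations)
    stats: Dict[str, object] = {
        "audio_count": len(ordered),
        "audio_samplerate_count": dict(Counter(sample_rates)),
        "mean": statistics.mean(ordered),      # hypothetical key name
        "median": statistics.median(ordered),  # hypothetical key name
    }
    for q in (10, 25, 75, 90):
        stats["p%d" % q] = percentile(ordered, q)  # hypothetical key names
    return stats
```

A wide gap between the 10th and 90th percentile durations suggests that a trim or pad length needs to be chosen with care.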

The pipeline will also generate some stats of the original and preprocessed data sets, e.g.:

speech_commands-v0.0.2/01-ExtractArchive/test_stats.json
speech_commands-v0.0.2/01-ExtractArchive/train_stats.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_test.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_train.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_valid.json

Faster preprocessing, for development

The small flag runs the preprocessing pipeline on a small version of each dataset stored at Downsampled HEAR Open Tasks. This is used for development and continuous integration tests for the pipeline.

These small versions of the data can be generated deterministically with the following command:

python3 -m hearpreprocess.sampler <taskname>

NOTE: --mode small runs the task on a small version of the dataset for development.
