Python3 library for converting between various audio dataset formats.
Project description
The audio-dataset-converter library converts audio datasets between various dataset formats. Filters can be supplied as well, e.g., for cleaning up the data.
Dataset formats:
classification: ADAMS (r/w), sub-dir (r/w), TXT (r/w)
speech: ADAMS (r/w), CommonVoice (r/w), Festvox (r/w), Huggingface Audiofolder (r/w), TXT (r/w)
Examples can be found here:
https://github.com/waikato-llm/audio-dataset-converter-examples
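As a rough illustration of how a conversion is put together, the following sketch assembles an adc-convert command line that reads one speech format and writes another. The tool name and the plugin names (from-festvox-sp, to-commonvoice-sp) come from this page; the `-i`/`-o` flags, the paths, and the helper function `build_convert_command` are assumptions made for illustration only, so check each plugin's help output for the actual options.

```python
import shlex
from typing import List


def build_convert_command(reader: str, reader_args: List[str],
                          writer: str, writer_args: List[str]) -> List[str]:
    """Assemble an argument list: tool name, reader plugin, then writer plugin."""
    return ["adc-convert", reader, *reader_args, writer, *writer_args]


cmd = build_convert_command(
    "from-festvox-sp", ["-i", "./festvox/*.txt"],              # -i is an assumed flag
    "to-commonvoice-sp", ["-o", "./commonvoice/samples.tsv"],  # -o is an assumed flag
)

# Show the command as it could be typed in a shell.
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` once the package is installed. Filters, if needed, would go between the reader and the writer.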
Changelog
0.1.0 (2025-10-31)
split-records filter now allows specifying the meta-data field in which to store the split name
the tee meta-filter can now forward or drop the incoming data based on a meta-data evaluation
the sub-process filter can be used for processing data with a sub-flow of filters, which can be made conditional on a meta-data evaluation
the metadata-from-name filter can work on the path now as well (must be present)
switched to kasperl library for base API and generic pipeline plugins
added @abc.abstractmethod decorator where appropriate
the adc-exec tool now uses all remaining parameters as the pipeline components rather than having to specify them via the -p/--pipeline parameter, making it easy to simply prefix the adc-exec command to an existing adc-convert command-line
added the text-file and csv-file generators that work off files to populate the variable(s)
added support for class lister with ignored classes
adc-exec can load pipelines from file now as well, useful when dealing with large pipelines
added --load_pipeline option to adc-convert
added from-text-file reader and to-text-file writer
readers now locate files the first time the read() method gets called rather than in the initialize() method, to allow more dynamic placeholders
added block, stop filters for controlling the flow of data (via meta-data conditions)
added email support with get-email reader and send-email writer
added list-files reader for listing files in a directory
added list-to-sequence stream filter that forwards list items one by one
added console writer for outputting the data on stdout that is coming through
added watch-dir meta-reader that uses the watchdog library to react to file-system events rather than using fixed-interval polling like poll-dir
added delete-files writer
added copy-files filter
added support for caching plugins via ADC_CLASS_CACHE environment variable
added to-metadata writer that outputs the meta-data of an audio file
added attach-metadata filter that loads meta-data from a directory and attaches it to the data passing through
added annotation-to-storage and annotation-from-storage filters
annotation data is now being type-checked when setting it
requiring seppl>=0.3.0 now
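The flow-control additions in this release (list-to-sequence, block, stop) can be pictured as small stream transformations. The sketch below uses plain Python generators and is not the library's API: the record layout, the function names, and the exact semantics (block drops matching records, stop ends the stream at the first match) are assumptions chosen for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical record layout: {"name": ..., "meta": {...}}
Record = Dict[str, Any]
Condition = Callable[[Dict[str, Any]], bool]


def list_to_sequence(batches: Iterable[List[Record]]) -> Iterator[Record]:
    """Forward the items of each incoming list one by one."""
    for batch in batches:
        yield from batch


def block(records: Iterable[Record], condition: Condition) -> Iterator[Record]:
    """Drop every record whose meta-data satisfies the condition (assumed semantics)."""
    for record in records:
        if not condition(record["meta"]):
            yield record


def stop(records: Iterable[Record], condition: Condition) -> Iterator[Record]:
    """End the stream at the first record whose meta-data satisfies the condition (assumed semantics)."""
    for record in records:
        if condition(record["meta"]):
            return
        yield record


batches = [
    [{"name": "a.wav", "meta": {"split": "train"}},
     {"name": "b.wav", "meta": {"split": "test"}}],
    [{"name": "c.wav", "meta": {"split": "train"}}],
]


def is_test(meta: Dict[str, Any]) -> bool:
    return meta.get("split") == "test"


blocked = [r["name"] for r in block(list_to_sequence(batches), is_test)]
stopped = [r["name"] for r in stop(list_to_sequence(batches), is_test)]
print(blocked)  # ['a.wav', 'c.wav']
print(stopped)  # ['a.wav']
```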
0.0.4 (2025-07-15)
requiring seppl>=0.2.20 now for improved help requests in adc-convert tool
0.0.3 (2025-07-10)
added set-placeholder filter for dynamically setting (temporary) placeholders at runtime
added --resume_from option to relevant readers that allows resuming the data processing from the first file that matches this glob expression (e.g., */012345.wav)
requiring seppl>=0.2.17 now for resume, split group, skippable plugin support and avoiding deprecated use of pkg_resources
to-adams-sp writer now uses -t short flag for the transcript like the from-adams-sp reader
added the from-multi meta-reader that combines multiple base readers and returns their output
added the to-multi meta-writer that forwards the data to multiple base writers
using wai_common instead of wai.common now
added split_group parameter to splittable writers (stream/batch)
fixed the construction of the error messages in the pyfunc reader/filter/writer classes
added metadata-to-placeholder filter to transfer meta-data files into placeholders
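The resume behaviour described for this release (skip files until the first one matching a glob, then process everything from there) can be sketched in a few lines. This is a minimal illustration built on the standard library's `fnmatch`, not the library's own implementation; the function name `resume_from` and the file names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, Iterator


def resume_from(paths: Iterable[str], pattern: str) -> Iterator[str]:
    """Skip paths until the first one matching the glob, then yield all that follow."""
    resumed = False
    for path in paths:
        if not resumed and fnmatch(path, pattern):
            resumed = True
        if resumed:
            yield path


files = [
    "audio/012343.wav",
    "audio/012344.wav",
    "audio/012345.wav",
    "audio/012346.wav",
]
remaining = list(resume_from(files, "*/012345.wav"))
print(remaining)  # ['audio/012345.wav', 'audio/012346.wav']
```

If no file matches the pattern, this sketch yields nothing; how the actual readers handle that case is not stated here.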
0.0.2 (2025-03-14)
added setuptools as dependency
switched to underscores in project name
added discard-by-name filter
requiring seppl>=0.2.13 now
added support for aliases
added placeholder support to tools: adc-convert, adc-exec
added placeholder support to readers: from-adams-ac, from-subdir-ac, from-txt-ac, from-adams-sp, from-commonvoice-sp, from-festvox-sp, from-hf-audiofolder-sp, from-txt-sp, from-data, poll-dir, from-pyfunc
added placeholder support to writers: to-adams-ac, to-subdir-ac, to-txt-ac, to-adams-sp, to-commonvoice-sp, to-festvox-sp, to-hf-audiofolder-sp, to-txt-sp, to-audioinfo, to-data
0.0.1 (2024-07-05)
initial release
Download files
Source Distribution
File details
Details for the file audio_dataset_converter-0.1.0.tar.gz.
File metadata
- Download URL: audio_dataset_converter-0.1.0.tar.gz
- Upload date:
- Size: 41.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f12c569164427ec5caaa14e2e2982473895ef76e8fca448eb9f1c1edc3d0eee4 |
| MD5 | d8c942fa4106965743ca63ec955fedaf |
| BLAKE2b-256 | 1801dbf7ba6f965d2fa0e2b70b39bf1ae747a644ef8477936d4b273659712d57 |