Internal dataset builder for the Phoneme Discovery benchmark

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

Create the data splits and the annotation files

  • Get the dataset audio from: https://datacollective.mozillafoundation.org/datasets/cmflnuzw71qkz8x3kil3tgjvk

  • Get the TextGrid files from: https://huggingface.co/datasets/pacscilab/VoxCommunis/tree/main/textgrids

  • Test languages: Japanese (jp), Spanish (es), Mandarin (zh-CN), Basque (eu)

  • To add: German, English, French

  • Dev languages: Turkish, Ukrainian, Tamil, Thai

    1. Convert audio files to WAV and resample them
    2. Prepare the dataset splits based on the validate file and the sets from cm
    3. Merge all the output files to get a better distribution
    4. Order speakers by speech duration
    5. Split into balanced sets: 2 hours each for dev and test, and all the rest for train
    6. Align files
    7. Correct phones
    8. Phonebase item files
    9. Triphone item files
    10. Clean and correct item files
    11. Copy audio files for dev and test
    12. Create subfolders and copy audio files for the train set
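
Steps 4 and 5 (ordering speakers by speech duration, then filling dev and test up to 2 hours each) can be sketched as below. This is only an illustration and not the package's actual code: the function name `split_by_speaker`, the `(speaker_id, utterance_id, duration_seconds)` tuple layout, the shortest-first ordering, and the greedy fill strategy are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_speaker(utterances, dev_hours=2.0, test_hours=2.0):
    """Assign whole speakers to dev, test or train (hypothetical helper).

    utterances: iterable of (speaker_id, utterance_id, duration_seconds).
    Returns (splits, filled): speaker ids per set, and the seconds of
    speech placed in dev and test.
    """
    # Total speech duration per speaker.
    totals = defaultdict(float)
    for speaker, _utterance, duration in utterances:
        totals[speaker] += duration

    # Step 4: order speakers by speech duration.
    # Assumption: shortest first, so dev and test receive many speakers.
    ordered = sorted(totals, key=lambda s: (totals[s], s))

    # Step 5: fill dev and test up to their budgets, the rest goes to train.
    budgets = {"dev": dev_hours * 3600, "test": test_hours * 3600}
    filled = {"dev": 0.0, "test": 0.0}
    splits = {"dev": [], "test": [], "train": []}
    for speaker in ordered:
        # Held-out sets that can still take this speaker without overflow.
        candidates = [
            name for name in ("dev", "test")
            if filled[name] + totals[speaker] <= budgets[name]
        ]
        if candidates:
            # Keep dev and test balanced: pick the emptier one.
            target = min(candidates, key=lambda name: filled[name])
            filled[target] += totals[speaker]
        else:
            target = "train"
        splits[target].append(speaker)
    return splits, filled


if __name__ == "__main__":
    demo = [
        ("a", "a1", 3000.0), ("a", "a2", 600.0),
        ("b", "b1", 1800.0),
        ("c", "c1", 3600.0),
        ("d", "d1", 7200.0),
        ("e", "e1", 900.0),
    ]
    # Small 1-hour budgets so the toy data exercises all three sets.
    splits, filled = split_by_speaker(demo, dev_hours=1.0, test_hours=1.0)
    print(splits)
    print(filled)
```

Because each speaker is assigned as a whole, no speaker appears in more than one set. The budgets act as upper bounds, so dev and test may end slightly under the target duration.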

Download files

Source Distribution

discophon_builder-0.0.2.tar.gz (15.7 kB)

Built Distribution

discophon_builder-0.0.2-py3-none-any.whl (17.3 kB)

File details

Details for the file discophon_builder-0.0.2.tar.gz.

File metadata

  • Download URL: discophon_builder-0.0.2.tar.gz
  • Size: 15.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.14 (CI, Ubuntu 24.04)

File hashes

Hashes for discophon_builder-0.0.2.tar.gz
Algorithm Hash digest
SHA256 2ee122e9695c99098555b09692fea91a3265e631ae18c157c6fdebdb7e10a05e
MD5 90a834e09566946b1a045150a19aef72
BLAKE2b-256 06f5cc08a44ead5b55b089b1f9b16b3587505b0b3903714ca59b7dc53701da37

File details

Details for the file discophon_builder-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: discophon_builder-0.0.2-py3-none-any.whl
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.14 (CI, Ubuntu 24.04)

File hashes

Hashes for discophon_builder-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1fa148d5f59a9b34160b079604fe605be13d066eaee1c90cae299b90146335a0
MD5 030923fe47e87acf299d9fe12fa9b0d5
BLAKE2b-256 7d6bf7c71f89aea72e60b168b4a2106bbff018361d9cc965b94b267642ff4b76
