Internal dataset builder for the Phoneme Discovery benchmark

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

Create the data splits and the annotation files

  • Get the dataset audio from: https://datacollective.mozillafoundation.org/datasets/cmflnuzw71qkz8x3kil3tgjvk

  • Get the TextGrid files from: https://huggingface.co/datasets/pacscilab/VoxCommunis/tree/main/textgrids

  • Test languages: Japanese (jp), Spanish (es), Mandarin (zh-CN), Basque (eu)

  • To add: German, English, French

  • Dev languages: Turkish, Ukrainian, Tamil, Thai

  1. Convert to WAV and resample the audio files
  2. Prepare the dataset splits based on the validate file and the sets from cm
  3. Merge all the output files to get the best distribution
  4. Order speakers by speech duration
  5. Split into balanced sets: 2 hours each for dev and test, all the rest to train
  6. Align files
  7. Correct phones
  8. Phonebase item files
  9. Triphone item files
  10. Clean and correct item files
  11. Copy audio files for dev and test
  12. Create subfolders and copy audio files for the train set
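
Steps 4 and 5 can be illustrated with a minimal sketch. This is not the package's actual implementation: the function names `order_speakers` and `split_speakers`, the `(speaker, path, seconds)` tuple layout, the shortest-first ordering, and the greedy fill strategy are all assumptions made for illustration. The only detail taken from the list above is the 2-hour target for dev and test, with everything else going to train.

```python
from collections import defaultdict

TARGET_SECONDS = 2 * 3600  # 2 hours each for dev and test (step 5)


def order_speakers(utterances):
    """Sum speech duration per speaker and sort speakers, shortest first.

    `utterances` is an iterable of (speaker, path, seconds) tuples;
    this layout is hypothetical.
    """
    totals = defaultdict(float)
    for speaker, _path, seconds in utterances:
        totals[speaker] += seconds
    # Sort by duration, then by speaker id so the order is deterministic.
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


def split_speakers(ordered, target=TARGET_SECONDS):
    """Assign whole speakers to dev and test until each reaches `target`.

    Speakers are never shared between sets. The last speaker added to a
    held-out set may push it past the target; remaining speakers go to train.
    """
    splits = {"dev": [], "test": [], "train": []}
    filled = {"dev": 0.0, "test": 0.0}
    for speaker, seconds in ordered:
        # Held-out sets that have not reached their target yet.
        open_sets = [name for name in ("dev", "test") if filled[name] < target]
        if open_sets:
            # Give the speaker to the emptier of the open sets.
            name = min(open_sets, key=lambda n: filled[n])
            splits[name].append(speaker)
            filled[name] += seconds
        else:
            splits["train"].append(speaker)
    return splits, filled
```

Alternating between the two held-out sets keeps dev and test close in total duration. A real pipeline might also balance other properties or trim the overshoot, which this sketch does not attempt.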

Download files

Source Distribution

discophon_builder-0.0.1.tar.gz (15.7 kB view details)

Uploaded Source

Built Distribution

discophon_builder-0.0.1-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file discophon_builder-0.0.1.tar.gz.

File metadata

  • Download URL: discophon_builder-0.0.1.tar.gz
  • Upload date:
  • Size: 15.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.14 {"installer":{"name":"uv","version":"0.9.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for discophon_builder-0.0.1.tar.gz
  • SHA256: 178d4de69dd282be1393eb2c4b3009d606c9d9a4667aa9e2e073da74b0d4eb8a
  • MD5: 7c1e73ddb9fc44ae826b3744d8148825
  • BLAKE2b-256: 0e809d64c20c0e663897142e896968374c700c8dcb0530ab63b82ae16cc6d5da

File details

Details for the file discophon_builder-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: discophon_builder-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.14 {"installer":{"name":"uv","version":"0.9.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for discophon_builder-0.0.1-py3-none-any.whl
  • SHA256: 8c349086d1f708e109fb12f3e0dac12acf719a86d1fd4822850215af0d8debfd
  • MD5: 4708e4f3ab80c10e4122811b0b7be416
  • BLAKE2b-256: 1760d608400f78eab8a9b56e60db67197d9710494f2d5fed1889324359ff51f7
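
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches` are hypothetical and are not part of discophon_builder.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("discophon_builder-0.0.1.tar.gz", "178d4de6...")` with the full digest from the list above should return `True` for an intact download, assuming the file sits in the current directory.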
