Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
speech-dataset-parser
Speech datasets consist of pairs of .TextGrid and .wav files. Each TextGrid needs to contain a tier in which every symbol occupies its own interval, e.g., T|h|i|s| |i|s| |a| |t|e|x|t|.
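The pairing of recordings and annotations can be sketched with the standard library alone. This is a minimal illustration, not the library's own code: the helper name `iter_recording_pairs` is hypothetical, and it assumes that a .wav file and its .TextGrid share the same file stem and sit in the same folder.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_recording_pairs(root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (wav, TextGrid) path pairs found anywhere below `root`.

    Hypothetical helper: assumes both files of a pair share the same stem
    and folder. TextGrids without a matching .wav file are skipped.
    """
    for grid_path in sorted(root.rglob("*.TextGrid")):
        wav_path = grid_path.with_suffix(".wav")
        if wav_path.is_file():
            yield wav_path, grid_path
```

Calling `list(iter_recording_pairs(Path("...")))` on a dataset folder would return only complete pairs, which is a quick way to spot recordings that lack an annotation before parsing.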
Generic Format
The format is as follows: {Dataset name}/{Speaker name};{Speaker gender};{Speaker language}[;{Speaker accent}]/[Subfolder(s)]/{Recordings as .wav- and .TextGrid-pairs}
Example: LJ Speech/Linda Johnson;2;eng;North American/wavs/...
Speaker names can be any string (excluding ; symbols).
Genders are defined via their ISO/IEC 5218 Code.
Languages are defined via their ISO 639-2 Code.
Accents are optional and can be any string (excluding ; symbols).
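To make the speaker folder convention concrete, here is a small sketch that splits a folder name such as `Linda Johnson;2;eng;North American` into its fields. The names `SpeakerInfo` and `parse_speaker_folder` are hypothetical and not part of the library, and representing a missing accent as `None` is an assumption of this sketch. The gender table lists the four codes defined by ISO/IEC 5218.

```python
from dataclasses import dataclass
from typing import Optional

# Codes defined by ISO/IEC 5218.
GENDER_CODES = {0: "not known", 1: "male", 2: "female", 9: "not applicable"}


@dataclass(frozen=True)
class SpeakerInfo:  # hypothetical name, not the library's class
    name: str
    gender: int
    language: str
    accent: Optional[str] = None  # assumption: None when no accent is given


def parse_speaker_folder(folder_name: str) -> SpeakerInfo:
    """Split '{name};{gender};{language}[;{accent}]' into its fields."""
    parts = folder_name.split(";")
    if len(parts) not in (3, 4):
        raise ValueError(f"Expected 3 or 4 ';'-separated fields: {folder_name!r}")
    name, gender_text, language = parts[:3]
    gender = int(gender_text)  # raises ValueError if not a number
    if gender not in GENDER_CODES:
        raise ValueError(f"Not an ISO/IEC 5218 code: {gender}")
    # ISO 639-2 codes consist of three letters, e.g. 'eng'.
    if len(language) != 3 or not language.isalpha():
        raise ValueError(f"Not a three-letter language code: {language!r}")
    accent = parts[3] if len(parts) == 4 else None
    return SpeakerInfo(name, gender, language, accent)
```

Because ; is the field separator, a speaker name or accent containing it would shift the fields, which is why the format excludes that symbol.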
Installation
pip install speech-dataset-parser --user
Library Usage
from speech_dataset_parser import parse_dataset
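# folder... and grid-tier-name... are placeholders for the dataset folder and the name of the TextGrid tier
entries = list(parse_dataset(folder..., grid-tier-name...))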
The resulting entries list contains dataclass instances with these properties:
symbols: Tuple[str, ...]
intervals: Tuple[float, ...]
symbols_language: str
speaker_name: str
speaker_accent: str
speaker_gender: int
audio_file_abs: Path
min_time: float
max_time: float
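The following sketch shows how entries with the properties listed above might be consumed. `EntrySketch` is a hypothetical stand-in that only mirrors the listed fields; the library's actual class name is not given here. The sketch also assumes that `min_time` and `max_time` are the start and end of the annotated range in seconds, and the example values are made up for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class EntrySketch:  # hypothetical stand-in mirroring the listed properties
    symbols: Tuple[str, ...]
    intervals: Tuple[float, ...]
    symbols_language: str
    speaker_name: str
    speaker_accent: str
    speaker_gender: int
    audio_file_abs: Path
    min_time: float
    max_time: float


def text_of(entry: EntrySketch) -> str:
    """Join the per-interval symbols back into one string."""
    return "".join(entry.symbols)


def duration_of(entry: EntrySketch) -> float:
    """Length of the annotated range (assumed to be in seconds)."""
    return entry.max_time - entry.min_time


# Illustrative values only, not taken from a real dataset.
example = EntrySketch(
    symbols=tuple("This is a text"),
    intervals=tuple(0.1 * (i + 1) for i in range(14)),
    symbols_language="eng",
    speaker_name="Linda Johnson",
    speaker_accent="North American",
    speaker_gender=2,
    audio_file_abs=Path("/tmp/ljs/example.wav"),
    min_time=0.0,
    max_time=1.4,
)
```

Since the helpers only read attributes, they should work the same way on the instances yielded by `parse_dataset`, provided the assumptions above hold.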
CLI Usage
dataset-converter-cli [-h] [-v] {convert-ljs} ...
CLI Features
convert-ljs: convert LJ Speech dataset to a generic dataset
CLI Example
# Convert LJ Speech dataset with symbolic links to the audio files
dataset-converter-cli convert-ljs \
"/data/datasets/LJSpeech-1.1" \
"/tmp/ljs" \
--tier "Symbols" \
--symlink
Dependencies
- tqdm
- TextGrid>=1.5
- ordered_set>=4.1.0
Roadmap
- Supporting conversion of more datasets
- Adding tests
License
MIT License
Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
Citation
If you want to cite this repo, you can use this BibTeX entry:
@misc{tssdp22,
author = {Taubert, Stefan},
title = {speech-dataset-parser},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/stefantaubert/speech-dataset-parser}}
}