
A Python library that generates speech data with transcriptions by collecting data from YouTube.

Project description

Youtube Speech Data Generator


A Python library to generate speech datasets. Youtube Speech Data Generator also handles almost all of the preprocessing needed to build a speech dataset with transcriptions, and makes sure the result follows the directory structure used by most text-to-speech architectures.

Installation

Make sure ffmpeg is installed and available on the system path.

$ pip install youtube-tts-data-generator
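Since ffmpeg must be discoverable on the system path, a quick check can catch a missing install early. A minimal sketch using only the standard library (the `ffmpeg_available` helper is illustrative, not part of this package):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on the system PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before generating a dataset")
```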

Minimal start for creating the dataset

from youtube_tts_data_generator import YTSpeechDataGenerator

# First create a YTSpeechDataGenerator instance:

generator = YTSpeechDataGenerator(dataset_name='elon')

# Now create a '.txt' file that contains a list of links to YouTube videos containing speech.
# NOTE - Make sure you choose videos with subtitles.

generator.prepare_dataset('links.txt')
# The above takes care of creating your dataset, generating a metadata file and trimming silence from the audios.

Usage

  • Initializing the generator: generator = YTSpeechDataGenerator(dataset_name='your_dataset')

    • Parameters:
      • dataset_name:
        • The name you'd like to give the dataset.
        • A directory structure like this will be created:
          ├───your_dataset
          │   ├───txts
          │   └───wavs
          └───your_dataset_prep
              ├───concatenated
              ├───downloaded
              └───split
          
      • output_type:
        • The type of the metadata to be created after the dataset has been generated.
        • Supported types: csv/json
        • Default output type is set to csv
        • The csv file follows the format of LJ Speech Dataset
        • The json file follows this format:
          {
              "your_dataset1.wav": "This is an example text",
              "your_dataset2.wav": "This is an another example text",
          }
          
      • keep_audio_extension:
        • Whether to keep the audio file extension in the metadata file
        • Default value is set to False
  • Methods:

    • download():
      • Downloads videos from YouTube along with their subtitles and saves the audio as wav files.
      • Parameters:
        • links_txt:
          • Path to the '.txt' file that contains the URLs for the videos.
      • Using this method is optional. If you skip it, place all the audio and subtitle files in the 'your_dataset_prep/downloaded' directory.
      • Then create a file called 'files.txt' and place it under 'your_dataset_prep/downloaded' as well. 'files.txt' should follow this format:
        filename,subtitle,trim_min_begin,trim_min_end
        audio.wav,subtitle.srt,0,0
        audio2.wav,subtitle.vtt,5,6
        
      • Create a '.txt' file that contains a list of YouTube videos that contain speech.
      • Example - generator.download('links.txt')
    • split_audios():
      • This method splits all the wav files into smaller chunks according to the duration of the text in the subtitles.
      • Saves transcriptions as '.txt' file for each of the chunks.
      • Example - generator.split_audios()
    • concat_audios():
      • Since the split audios are based on the duration of their subtitles, they can be very short. This method joins the split files into longer, usable clips.
      • Example - generator.concat_audios()
    • finalize_dataset():
      • Trims silence from the joined audios (necessary since the data has been collected from YouTube) and generates the final dataset once all preprocessing is finished.
      • Parameters:
        • min_audio_length:
          • The minimum length of speech to keep; shorter clips are ignored.
          • The default value is 7.
        • max_audio_length:
          • The maximum length of speech to keep; longer clips are ignored.
          • The default value is 14.
      • Example - generator.finalize_dataset(min_audio_length=6)
    • get_total_audio_length():
      • Returns the total duration of preprocessed speech data collected by the generator.
      • Example - generator.get_total_audio_length()
    • prepare_dataset():
      • A wrapper method for download(), split_audios(), concat_audios() and finalize_dataset().
      • If you do not wish to use the above methods, you can directly call prepare_dataset(). It will handle all your data generation.
      • Parameters:
        • links_txt:
          • Path to the '.txt' file that contains the URLs for the videos.
        • download_youtube_data:
          • Whether to download audios from YouTube.
          • Default value is True
        • min_audio_length:
          • The minimum length of speech to keep; shorter clips are ignored.
          • The default value is 7.
        • max_audio_length:
          • The maximum length of speech to keep; longer clips are ignored.
          • The default value is 14.
      • Example - generator.prepare_dataset(links_txt='links.txt', download_youtube_data=True, min_audio_length=6)
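If you stage audio and subtitle files yourself instead of calling download(), the 'files.txt' manifest described above can be written with the standard library. A sketch using the example rows from this README (the audio and subtitle filenames are placeholders):

```python
import csv
from pathlib import Path

# Rows follow the documented manifest format:
# filename, subtitle, trim_min_begin, trim_min_end
entries = [
    ("audio.wav", "subtitle.srt", 0, 0),
    ("audio2.wav", "subtitle.vtt", 5, 6),
]

downloaded = Path("your_dataset_prep") / "downloaded"
downloaded.mkdir(parents=True, exist_ok=True)

manifest = downloaded / "files.txt"
with open(manifest, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "subtitle", "trim_min_begin", "trim_min_end"])
    writer.writerows(entries)
```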

Final dataset structure

Once the dataset has been created, the structure under 'your_dataset' directory should look like:

your_dataset
├───txts
│   ├───your_dataset1.txt
│   └───your_dataset2.txt
├───wavs
│   ├───your_dataset1.wav
│   └───your_dataset2.wav
└───metadata.csv/alignment.json
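The two metadata formats can be sketched with the standard library. The transcriptions below are placeholders, and the csv sketch assumes LJ Speech's pipe-delimited "id|transcription" layout with the audio extension stripped (keep_audio_extension=False):

```python
import csv
import io
import json

transcripts = {
    "your_dataset1.wav": "This is an example text",
    "your_dataset2.wav": "This is another example text",
}

# alignment.json: a flat {filename: transcription} mapping.
alignment_json = json.dumps(transcripts, indent=4)

# metadata.csv: pipe-delimited rows in the LJ Speech style,
# with the '.wav' extension dropped from each id.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")
for fname, text in transcripts.items():
    writer.writerow([fname.removesuffix(".wav"), text])
metadata_csv = buf.getvalue()
```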

NOTE - audio.py is largely based on Real Time Voice Cloning



