
Generates a dataset for Turkish speech recognition.

Project description

ArdicSrtCollector

MIT license Python PyPI

ArdicSrtCollector was developed to generate a Turkish speech recognition dataset. It takes two parameters: a txt file listing YouTube video URLs and the name of a folder in which to store the generated files. For each YouTube video URL, it downloads the audio file, extracts the subtitles in SRT format, and saves both as new files on disk. It then crops the audio file (using FFmpeg) according to the start and end time of each subtitle, creating a new audio file per subtitle, and saves the corresponding subtitle text as a new txt file.
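The cropping step can be illustrated with a minimal sketch. This is not the package's own code: the names `Cue`, `parse_srt`, and `ffmpeg_crop_command` are hypothetical, and the sketch only assumes the standard SRT timing line (`HH:MM:SS,mmm --> HH:MM:SS,mmm`) and ffmpeg's `-ss`/`-to` options. It parses the subtitle cues and builds one ffmpeg command per cue without running it.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    """One subtitle entry (hypothetical helper type, not part of the package)."""
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Split SRT text into cues; cues are separated by blank lines."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the subtitle text.
                text = " ".join(lines[i + 1:])
                if text:
                    cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def ffmpeg_crop_command(audio_path: str, cue: Cue, out_path: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that cuts one cue's audio."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", audio_path,          # input audio
        "-ss", f"{cue.start:.3f}",  # start of the segment, in seconds
        "-to", f"{cue.end:.3f}",    # end of the segment, in seconds
        out_path,
    ]
```

Each command could then be passed to `subprocess.run(cmd, check=True)`, with `cue.text` written to a txt file alongside the cropped audio. How the package itself names and encodes its output files is not specified here.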

Installation

  1. Install ffmpeg (it is required for cropping the audio).
  2. Run $ pip install ardicsrtcollector.

Usage

1- From the terminal

ardicsrtcollector [-h] [-sv SAVE_PATH] -ufp URL_FILE_PATH

To convert the Youtube URL to mp3 and srt file.

optional arguments:
  -h, --help            show this help message and exit
  -sv SAVE_PATH, --save_path SAVE_PATH
                        Path to save converted files (default: downloads_convert)
  -ufp URL_FILE_PATH, --url_file_path URL_FILE_PATH
                        A file which contains youtube URLs
Example

Run in the terminal: ardicsrtcollector -ufp urls.txt
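For readers who want to wrap the tool in their own script, the help text above corresponds to a command-line interface that could be reproduced with `argparse` roughly as follows. This is a sketch reconstructed from the help output only; the function name `build_parser` is hypothetical, and the package's actual parser may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="ardicsrtcollector",
        description="To convert the Youtube URL to mp3 and srt file.",
    )
    parser.add_argument(
        "-sv", "--save_path",
        default="downloads_convert",
        help="Path to save converted files (default: %(default)s)",
    )
    parser.add_argument(
        "-ufp", "--url_file_path",
        required=True,  # the usage line shows this flag without brackets
        help="A file which contains youtube URLs",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url_file_path, args.save_path)
```

Omitting `-sv` leaves `save_path` at `downloads_convert`, matching the default stated in the help text.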

2- As an imported package, as shown below.

from ardicsrtcollector.youtube_srt_mp3 import YoutubeSrtMp3

YoutubeSrtMp3(urls_file_path="urls.txt", save_dir="save_path").convert()

The URL file should contain one YouTube URL per line, as follows.

https://www.youtube.com/watch?v=ENwtC8LgPcw
https://www.youtube.com/watch?v=ENwtC8LgPcw
...
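Before handing a URL file to the tool, it can be useful to check its contents. The sketch below is an assumption-laden helper, not part of the package: `read_video_urls` is a hypothetical name, and whether ArdicSrtCollector itself validates or de-duplicates URLs is not stated. It keeps only `watch?v=` style links and drops blank lines, placeholders, and repeated video IDs.

```python
from typing import Iterable, List
from urllib.parse import parse_qs, urlparse


def read_video_urls(lines: Iterable[str]) -> List[str]:
    """Return unique watch URLs in their original order (hypothetical helper)."""
    seen_ids = set()
    urls = []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue  # skip blank lines
        parsed = urlparse(url)
        # The video ID is carried in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
        if parsed.scheme not in ("http", "https") or not video_id:
            continue  # skip anything that is not a watch?v=... link
        if video_id in seen_ids:
            continue  # skip a video that was already listed
        seen_ids.add(video_id)
        urls.append(url)
    return urls
```

A file can be passed directly, for example `with open("urls.txt", encoding="utf-8") as fh: urls = read_video_urls(fh)`. Note that this sketch would also skip shortened `youtu.be` links, since they carry no `v` parameter.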

License

Apache License 2.0
