Python script for extracting, cleaning, and tokenizing YouTube video transcripts as preprocessing for machine learning.

Project description

Tube-Data: YouTube Video Transcript Extractor

Tube-Data is a Python script, distributed as the tubelearns package, for extracting and cleaning YouTube video transcripts as a preprocessing step for machine learning. It collects transcript text from YouTube videos for natural language processing tasks such as sentiment analysis, and as text data for speech recognition work.

Features

  • Extracts video transcripts from YouTube videos.
  • Saves cleaned transcripts into separate text files.
  • Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
  • Streamlines the dataset collection process for machine learning applications.
  • New feature: tokenization and punctuation removal for text preprocessing.

Installation

You can install TubeLearns using pip:

pip install tubelearns

Usage

Playlist Grabbing

from tubelearns import Acquisition

# Initialize the Acquisition class
model = Acquisition()

# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.playlist_grab(playlist_url, name="raw_data")

Extract Video Links from Playlist

# Extract video links from a YouTube playlist
# (model is the Acquisition instance created above)
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.play2text(playlist_url)
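
The playlist_url placeholder above stands in for a real playlist link. YouTube playlist URLs normally carry the playlist ID in a list query parameter, so a small check before calling the library can catch a pasted video URL early. The sketch below uses only the standard library; playlist_id_from_url is a hypothetical helper for illustration and is not part of TubeLearns.

```python
from urllib.parse import urlparse, parse_qs


def playlist_id_from_url(url):
    """Return the value of the `list` query parameter, or None if absent.

    Hypothetical helper, not part of the tubelearns package.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


# Example with a made-up playlist ID
print(playlist_id_from_url("https://www.youtube.com/playlist?list=PL123"))
```

If the helper returns None, the URL probably points at a single video rather than a playlist.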

Tokenization and Cleaning

from tubelearns.tokenizers import Tokenization, Cleaning

# Initialize the Tokenization and Cleaning classes
tokenizer = Tokenization()
cleaner = Cleaning()

# Tokenize the text, then remove punctuation from the tokenized data
text_data = "Your input text here."
tokenized_data = tokenizer.tokenize_raw(text_data)
cleaned_data = cleaner.punct_raw(tokenized_data)
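
To make the two steps concrete, here is a minimal standard-library sketch of what tokenization followed by punctuation removal typically looks like. It is an illustration only and does not claim to reproduce how tokenize_raw or punct_raw work internally; simple_tokenize and strip_punctuation are hypothetical names.

```python
import string


def simple_tokenize(text):
    """Split text into whitespace-separated tokens (hypothetical helper)."""
    return text.split()


def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (token.translate(table) for token in tokens)
    return [token for token in stripped if token]


tokens = simple_tokenize("Hello, world! This is a transcript.")
cleaned = strip_punctuation(tokens)
print(cleaned)  # ['Hello', 'world', 'This', 'is', 'a', 'transcript']
```

Splitting on whitespace first and cleaning afterwards keeps the two stages independent, which mirrors the separate Tokenization and Cleaning classes shown above.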

Refer to the TubeLearns documentation for detailed usage instructions and examples.

Contributing

If you'd like to contribute to TubeLearns or report issues, please check out the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments


Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to get in touch.


Download files

Download the file for your platform.

Source Distribution

tubelearns-1.1.6.tar.gz (5.9 kB)

Uploaded Source

Built Distribution

tubelearns-1.1.6-py3-none-any.whl (5.8 kB)

Uploaded Python 3

File details

Details for the file tubelearns-1.1.6.tar.gz.

File metadata

  • Download URL: tubelearns-1.1.6.tar.gz
  • Upload date:
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.6.tar.gz

Algorithm    Hash digest
SHA256       7bf84ea4715dac749bf11e94d72888e7e9e3c128d1a1b7cd300b3a2c5b1f154e
MD5          13df43015d06eb5d823dc80670675052
BLAKE2b-256  26f3da69292bbc69f4bf3599e5f6f4d4f63c950d47b858dc1f6e7be1e4725b86

File details

Details for the file tubelearns-1.1.6-py3-none-any.whl.

File metadata

  • Download URL: tubelearns-1.1.6-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.6-py3-none-any.whl

Algorithm    Hash digest
SHA256       0608928e07a46505a3eec1dbb60df483cf6bd84cb8eebf8ab84c09ede6860ac6
MD5          b64f08365d80790b9ce92f9cc2fa68bc
BLAKE2b-256  f922168da7babaa9f851e5f42925cdcea2c5aca23a8a278a3532f2bd56757420
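
A downloaded file can be checked against the SHA256 digests listed above with Python's hashlib module. The following is a minimal sketch; sha256_of_file and matches_published_hash are hypothetical helper names, and the file path in the usage comment assumes the wheel was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the wheel is in the current directory):
# matches_published_hash(
#     "tubelearns-1.1.6-py3-none-any.whl",
#     "0608928e07a46505a3eec1dbb60df483cf6bd84cb8eebf8ab84c09ede6860ac6",
# )
```

A return value of False means the file differs from the one published, so it should be downloaded again before installing.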
