
Python script for extracting, cleaning, and tokenizing YouTube video transcripts for preprocessing in machine learning.

Project description

TubeLearns: YouTube Video Transcript Extractor

TubeLearns is a Python script for extracting and cleaning YouTube video transcripts as a preprocessing step in machine learning. It streamlines the collection of text data from YouTube videos for tasks such as natural language processing, sentiment analysis, and speech recognition.

Features

  • Extracts video transcripts from YouTube videos.
  • Saves cleaned transcripts into separate text files.
  • Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
  • Streamlines the dataset collection process for machine learning applications.
  • New feature: tokenization and punctuation removal for text preprocessing and cleaning.

Installation

You can install TubeLearns using pip:

pip install tubelearns
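
To confirm that the install succeeded, one option is to ask the standard library for the installed version of the distribution. This is a minimal sketch: the helper name `installed_version` is hypothetical and not part of TubeLearns, and it only assumes the distribution name `tubelearns` from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "tubelearns" is the distribution name used in the pip command above.
    print(installed_version("tubelearns"))
```

A result of `None` suggests that pip installed the package into a different Python environment than the one running the check.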

Usage

Playlist Grabbing

from tubelearns import Acquisition

# Initialize the Acquisition class
model = Acquisition()

# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.PlaylistGrab(playlist_url, name="raw_data")
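
A YouTube playlist URL normally carries the playlist ID in a `list` query parameter, so it can be worth checking for it before calling `PlaylistGrab`. The sketch below uses only the standard library and does not contact YouTube. The helper name `playlist_id` is hypothetical, and TubeLearns may perform its own validation internally.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def playlist_id(url: str) -> Optional[str]:
    """Return the value of the `list` query parameter, or None if it is missing."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


if __name__ == "__main__":
    # Placeholder ID for illustration only; not a real playlist.
    url = "https://www.youtube.com/playlist?list=PLxxxxxxxx"
    print(playlist_id(url))
```

A URL for which this returns `None`, such as the placeholder in the example above, is probably not a playlist link.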

Extract Video Links from Playlist

from tubelearns import Acquisition

# Extract video links from a YouTube playlist
model = Acquisition()
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.Play2Text(playlist_url)

Tokenization and Cleaning

from tubelearns.tokenizers import Tokenization, Cleaning

# Initialize the Tokenization and Cleaning classes
tokenizer = Tokenization()
cleaner = Cleaning()

# Tokenize the text, then clean the tokenized output
text_data = "Your input text here."
tokenized_data = tokenizer.TokenizeRaw(text_data)
cleaned_data = tokenizer.PunctList(tokenized_data)
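
For readers unfamiliar with these two steps, the following standard-library sketch illustrates what tokenization and punctuation removal typically involve. It is not the TubeLearns implementation: the function names `tokenize` and `strip_punctuation` are hypothetical, and the library's own splitting and cleaning rules may differ.

```python
import string
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into whitespace-delimited tokens."""
    return text.split()


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (token.translate(table) for token in tokens)
    return [token for token in cleaned if token]


if __name__ == "__main__":
    tokens = tokenize("Your input text here.")
    print(strip_punctuation(tokens))  # ['Your', 'input', 'text', 'here']
```

Note that this simple approach also strips apostrophes and hyphens inside words, which may or may not be desirable for a given dataset.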

Refer to the TubeLearns documentation for detailed usage instructions and examples.

Contributing

If you'd like to contribute to TubeLearns or report issues, please check out the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments


Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to get in touch.


Download files


Source Distribution

tubelearns-2.1.0.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

tubelearns-2.1.0-py3-none-any.whl (6.8 kB)

Uploaded Python 3

File details

Details for the file tubelearns-2.1.0.tar.gz.

File metadata

  • Download URL: tubelearns-2.1.0.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for tubelearns-2.1.0.tar.gz

Algorithm    Hash digest
SHA256       d884ec2870005914107d6755df783723cd328bfbf48a3b85753b234d222e4a57
MD5          b321541d5dd3693c2d234b85905e0cdf
BLAKE2b-256  a65571245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07
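
The SHA256 value listed for the source distribution can be checked against a downloaded copy with Python's `hashlib`. This is a minimal sketch: the helper names `sha256_of` and `matches` are hypothetical, and the example assumes the archive was saved to the current directory under its original file name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumes the sdist was downloaded to the current directory.
    archive = Path("tubelearns-2.1.0.tar.gz")
    expected = "d884ec2870005914107d6755df783723cd328bfbf48a3b85753b234d222e4a57"
    if archive.exists():
        print(matches(archive, expected))
```

The same functions work for the wheel by substituting its file name and the SHA256 value listed for it.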


File details

Details for the file tubelearns-2.1.0-py3-none-any.whl.

File metadata

  • Download URL: tubelearns-2.1.0-py3-none-any.whl
  • Upload date:
  • Size: 6.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for tubelearns-2.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       8bc045bce373fbaf0492d06672e02a1f4a79783b5c4485601a27772f37902d47
MD5          a450abbfa4cc302f2727cc77bd5f8383
BLAKE2b-256  0ad1f39b8562c2512b70add81e712f60df10e6ff26230893a655148da707f9a4

