Python script for extracting, cleaning, and tokenizing YouTube video transcripts for preprocessing in machine learning.

Project description

Tube-Data: YouTube Video Transcript Extractor

Tube-Data, installed as the tubelearns package, is a Python script that extracts and cleans YouTube video transcripts for preprocessing in machine learning. It streamlines the collection of transcript text from YouTube videos for natural language processing tasks such as sentiment analysis, as well as for speech recognition and similar work.

Features

  • Extracts video transcripts from YouTube videos.
  • Saves cleaned transcripts into separate text files.
  • Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
  • Streamlines the dataset collection process for machine learning applications.
  • New feature: tokenization and punctuation removal for text preprocessing.
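
When preparing a list of URLs for batch processing, it can help to normalize and deduplicate the list first. The sketch below uses only the Python standard library and is not part of TubeLearns; the helper names `video_id` and `unique_video_ids` are hypothetical, and it assumes the common `watch?v=` and `youtu.be/` URL forms.

```python
from urllib.parse import parse_qs, urlparse

# Hosts accepted for the long "watch?v=" URL form (an assumption of this sketch).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id(url):
    """Return the video ID from a watch or short-link URL, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        # Long links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def unique_video_ids(urls):
    """Return video IDs in first-seen order, skipping duplicates and bad URLs."""
    seen = []
    for url in urls:
        vid = video_id(url)
        if vid and vid not in seen:
            seen.append(vid)
    return seen
```

Filtering the list this way avoids requesting the same transcript twice when a video appears under both URL forms.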

Installation

You can install TubeLearns using pip:

pip install tubelearns
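
To confirm that the install succeeded, one option is to query the installed distribution's version from Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None when the package is missing.
print(installed_version("tubelearns"))
```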

Usage

Playlist Grabbing

from tubelearns import Acquisition

# Initialize the Acquisition class
model = Acquisition()

# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.playlist_grab(playlist_url, name="raw_data")

Extract Video Links from Playlist

# Extract video links from a YouTube playlist (reuses the `model` instance created above)
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.play2text(playlist_url)

Tokenization and Cleaning

from tubelearns.tokenizers import Tokenization, Cleaning

# Initialize the tokenizer and the cleaner
tokenizer = Tokenization()
cleaner = Cleaning()

# Tokenize the text, then strip punctuation from the tokens
text_data = "Your input text here."
tokenized_data = tokenizer.tokenize_raw(text_data)
cleaned_data = cleaner.punct_raw(tokenized_data)
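
For intuition about what these two steps do, the sketch below tokenizes on whitespace and strips ASCII punctuation using only the standard library. It is not the TubeLearns implementation: this page does not document the return types of `tokenize_raw` or `punct_raw`, and the function names `tokenize` and `strip_punctuation` are hypothetical.

```python
import string


def tokenize(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (token.translate(table) for token in tokens)
    return [token for token in cleaned if token]


tokens = tokenize("Hello, world! It's 3 p.m.")
print(strip_punctuation(tokens))  # ['Hello', 'world', 'Its', '3', 'pm']
```

Note that stripping every punctuation character also removes apostrophes and the periods inside abbreviations, which may or may not suit a given model.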

Refer to the TubeLearns documentation for detailed usage instructions and examples.

Contributing

If you'd like to contribute to TubeLearns or report issues, please check out the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments


Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to get in touch.

Download files

Source Distribution

tubelearns-1.1.7.tar.gz (5.9 kB view details)

Uploaded Source

Built Distribution

tubelearns-1.1.7-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file tubelearns-1.1.7.tar.gz.

File metadata

  • Download URL: tubelearns-1.1.7.tar.gz
  • Upload date:
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.7.tar.gz
Algorithm Hash digest
SHA256 fc4e32ddc80ee4523dac81e1ad2f9677ec68babafe0a16da100c2317d240b97f
MD5 8e309d3206365691ffb04f32bf365943
BLAKE2b-256 cb669d72022aec4300d83cd49f9c497cf769111e773d913fbfefdebfa2692da1
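
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("tubelearns-1.1.7.tar.gz", "fc4e32ddc80ee4523dac81e1ad2f9677ec68babafe0a16da100c2317d240b97f")` should return `True` for an unmodified copy of the source distribution saved in the current directory.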

File details

Details for the file tubelearns-1.1.7-py3-none-any.whl.

File metadata

  • Download URL: tubelearns-1.1.7-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 3c6ee50d740aa0925b41348f7f79c4a43fb9293827068b7a40db395260ccb6a9
MD5 126745b953a42eda0a383e9f56de2bde
BLAKE2b-256 a03e9230bb997cf9b948658b607ee24b8f27c808f57ebbfada07caa43666049e
