Python script for extracting, cleaning, and tokenizing YouTube video transcripts for preprocessing in machine learning.
Project description
Tube-Data: YouTube Video Transcript Extractor
Tube-Data (published on PyPI as tubelearns) is a Python script for extracting and cleaning YouTube video transcripts as a preprocessing step for machine learning. It collects transcript text from YouTube videos for use in natural language processing tasks such as sentiment analysis, and in related work such as speech recognition.
Features
- Extracts video transcripts from YouTube videos.
- Saves cleaned transcripts into separate text files.
- Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
- Streamlines the dataset collection process for machine learning applications.
- New Feature: Tokenization and Punctuation Removal for text preprocessing and cleaning.
Installation
You can install TubeLearns using pip:
pip install tubelearns
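To confirm that the install succeeded, the Python standard library can report the installed version. This is a minimal sketch: the helper name installed_version is illustrative and not part of TubeLearns.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "tubelearns" is the distribution name used in the pip command above.
print(installed_version("tubelearns"))
```

A result of None means pip did not install the package into the interpreter that is running the check.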
Usage
Playlist Grabbing
from tubelearns import Acquisition
# Initialize the Acquisition class
model = Acquisition()
# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.playlist_grab(playlist_url, name="raw_data")
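When several playlists need to be collected, the call above can be wrapped in a loop that records failures instead of stopping at the first one. The sketch below is an assumption-laden illustration: grab_all is a hypothetical helper, it accepts any callable with the same (url, name=...) shape as playlist_grab, and it assumes that giving each playlist its own name keeps the outputs separate, which this README does not confirm.

```python
def grab_all(playlist_urls, grab, name_prefix="raw_data"):
    """Call grab(url, name=...) for each URL and collect failures by URL."""
    failures = {}
    for index, url in enumerate(playlist_urls):
        try:
            # Assumption: a distinct name per playlist avoids overwriting output.
            grab(url, name=f"{name_prefix}_{index}")
        except Exception as exc:
            # Broad on purpose: the library's error types are not documented here.
            failures[url] = exc
    return failures


# Possible usage with the Acquisition instance created above:
# failures = grab_all([first_playlist_url, second_playlist_url], model.playlist_grab)
```

Passing the grabbing function as an argument keeps the helper independent of TubeLearns, so it can be exercised with a stand-in function before any network access happens.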
Extract Video Links from Playlist
# Extract video links from a YouTube playlist
# (reuses the model instance created under Playlist Grabbing)
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.play2text(playlist_url)
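The playlist_url values above are placeholders. YouTube playlist URLs normally carry the playlist identifier in a list query parameter, so a quick check before calling the library can catch a pasted URL that lacks one. This is a sketch using only the standard library; playlist_id is a hypothetical helper, and whether TubeLearns itself requires this URL form is an assumption.

```python
from urllib.parse import parse_qs, urlparse


def playlist_id(url):
    """Return the value of the `list` query parameter, or None if it is missing."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


# PL_EXAMPLE is a made-up identifier used only for illustration.
print(playlist_id("https://www.youtube.com/playlist?list=PL_EXAMPLE"))  # PL_EXAMPLE
print(playlist_id("https://www.youtube.com/your_playlist_url"))         # None
```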
Tokenization and Cleaning
from tubelearns.tokenizers import Tokenization, Cleaning
# Initialize the Tokenization and Cleaning classes
tokenizer = Tokenization()
cleaner = Cleaning()
# Tokenize the text, then pass the tokens through punctuation cleaning
text_data = "Your input text here."
tokenized_data = tokenizer.tokenize_raw(text_data)
cleaned_data = cleaner.punct_raw(tokenized_data)
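For readers unfamiliar with the two steps, the sketch below shows one simple way tokenization and punctuation removal can work, using only the standard library. It is not the TubeLearns implementation: the functions tokenize and strip_punctuation are illustrative names, and the library's actual splitting and cleaning rules may differ.

```python
import string


def tokenize(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (token.translate(table) for token in tokens)
    return [token for token in stripped if token]


tokens = tokenize("Your input text here.")
print(tokens)                     # ['Your', 'input', 'text', 'here.']
print(strip_punctuation(tokens))  # ['Your', 'input', 'text', 'here']
```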
Refer to the TubeLearns documentation for detailed usage instructions and examples.
Contributing
If you'd like to contribute to TubeLearns or report issues, please check out the GitHub repository.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Acknowledgments
Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to get in touch.
File details
Details for the file tubelearns-1.1.6.tar.gz.
File metadata
- Download URL: tubelearns-1.1.6.tar.gz
- Upload date:
- Size: 5.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7bf84ea4715dac749bf11e94d72888e7e9e3c128d1a1b7cd300b3a2c5b1f154e
MD5 | 13df43015d06eb5d823dc80670675052
BLAKE2b-256 | 26f3da69292bbc69f4bf3599e5f6f4d4f63c950d47b858dc1f6e7be1e4725b86
File details
Details for the file tubelearns-1.1.6-py3-none-any.whl.
File metadata
- Download URL: tubelearns-1.1.6-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0608928e07a46505a3eec1dbb60df483cf6bd84cb8eebf8ab84c09ede6860ac6
MD5 | b64f08365d80790b9ce92f9cc2fa68bc
BLAKE2b-256 | f922168da7babaa9f851e5f42925cdcea2c5aca23a8a278a3532f2bd56757420
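A downloaded file can be checked against the SHA256 digests listed above with the standard hashlib module. The sketch below assumes the file was saved locally under its original name; sha256_of and matches are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Possible usage, assuming the wheel sits in the current directory:
# matches("tubelearns-1.1.6-py3-none-any.whl",
#         "0608928e07a46505a3eec1dbb60df483cf6bd84cb8eebf8ab84c09ede6860ac6")
```

Reading in chunks keeps memory use constant regardless of file size, although the files listed here are only a few kilobytes.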