Python script for extracting, cleaning, and tokenizing YouTube video transcripts for preprocessing in machine learning.
Project description
Tube-Data: YouTube Video Transcript Extractor
Tube-Data is a Python script for extracting and cleaning YouTube video transcripts as a preprocessing step for machine learning. It streamlines collecting text data from YouTube videos for natural language processing tasks such as sentiment analysis, as well as for speech recognition and similar work.
Features
- Extracts video transcripts from YouTube videos.
- Saves cleaned transcripts into separate text files.
- Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
- Streamlines the dataset collection process for machine learning applications.
- New feature: tokenization and punctuation removal for text preprocessing and cleaning.
Installation
Install the package and its dependencies with pip:
pip install tubelearns
Usage
Extract Transcripts from a List of Video URLs
from tubelearns import text_link
# Provide a path to a text file containing YouTube video URLs.
text_link('path_to_file.txt', name='output_folder_name')
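Before handing a URL file to text_link, it can help to check what the file actually contains. The sketch below is independent of tubelearns and uses only the standard library. It assumes the file holds one URL per line, which the description above does not state explicitly; the helper name `read_url_list` and the host list are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hosts treated as YouTube links in this sketch (an assumption, not a tubelearns rule).
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}


def read_url_list(path):
    """Return the unique YouTube URLs in a text file, in their original order.

    Assumes one URL per line. Blank lines and non-YouTube hosts are skipped.
    """
    urls = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if urlparse(url).netloc.lower() not in YOUTUBE_HOSTS:
            continue  # skip anything that is not a YouTube link
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Comparing `len(read_url_list('path_to_file.txt'))` with the number of transcript files produced is a quick way to spot videos that were skipped.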
Extract Transcript from a Single Video URL
from tubelearns import url_grab
# Provide a single YouTube video URL.
url_grab('video_url', name='output_folder_name')
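The 'video_url' placeholder stands for a full video link. As a rough illustration of what such a link carries, the following standard-library sketch pulls the video ID out of a few common URL shapes. It does not show how url_grab parses its argument; the function name `video_id_from_url` and the set of supported URL shapes are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlparse

# Hosts handled by this sketch (an assumption, not taken from tubelearns).
_LONG_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}


def video_id_from_url(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == "youtu.be":
        # Short links keep the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in _LONG_HOSTS:
        if parsed.path == "/watch":
            # Standard links keep the ID in the `v` query parameter.
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None
```

An ID extracted this way could also serve as a predictable value for the name argument when processing many single videos.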
Extract Transcripts from a YouTube Playlist
from tubelearns import playlist_grab
# Provide the URL of a YouTube playlist.
playlist_grab('playlist_url', name='output_folder_name')
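A playlist link is usually recognizable by its `list` query parameter. The short sketch below checks for that parameter before a URL is passed to playlist_grab. It is a standalone illustration with a hypothetical helper name, not part of the tubelearns API.

```python
from urllib.parse import parse_qs, urlparse


def playlist_id_from_url(url):
    """Return the value of the `list` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("list")
    return values[0] if values else None
```

A return value of None suggests the link points at a single video rather than a playlist.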
Cleaning and Punctuation Removal
from tubelearns import Cleaning
# Initialize the Cleaning class
cleaner = Cleaning()
# Clean and remove punctuation from text
content = "Hey! hope you good"
cleaned_text = cleaner.punct_raw(content)
print(cleaned_text)
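To give a sense of what punctuation removal involves, here is a minimal standard-library sketch. It is not the implementation behind punct_raw, whose exact output may differ; the function name `strip_punctuation` is hypothetical, and only ASCII punctuation is handled.

```python
import string


def strip_punctuation(text):
    """Remove ASCII punctuation and collapse any leftover runs of whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


print(strip_punctuation("Hey! hope you good"))  # Hey hope you good
```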
Tokenization
from tubelearns import Tokenization
# Initialize the Tokenization class
tokenizer = Tokenization()
# Tokenize text
content = "Hello sam. How are you."
tokenized_text = tokenizer.tokenize_raw(content)
print(tokenized_text)
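For comparison, a simple regex tokenizer can be written with the standard library alone. This sketch splits text into word tokens and single punctuation tokens; it is an assumption about one possible tokenization scheme, and tokenize_raw may split the same input differently. The name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello sam. How are you."))
# ['Hello', 'sam', '.', 'How', 'are', 'you', '.']
```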
Development Status
This project is currently in the planning stage.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions
Contributions are welcome! Please feel free to open issues or submit pull requests.
Contact
For any inquiries or feedback, please contact KabilPreethamK.
Download files
File details
Details for the file tubelearns-1.1.2.tar.gz.
File metadata
- Download URL: tubelearns-1.1.2.tar.gz
- Upload date:
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 71dd0d3a7f04f734a9867350ff9f801e9f8c30920d56f119bdb4851bcf01250c
MD5 | 9da60dab04640d5453d929b3a67c43c9
BLAKE2b-256 | dbe7b63a235bf4658f0f26b7a1638d9bd1236c149247ed93ce9fe2dc5e21087a
File details
Details for the file tubelearns-1.1.2-py3-none-any.whl.
File metadata
- Download URL: tubelearns-1.1.2-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | ac4a92f2d6a12977508979a2ee94ae99a93ebf80ca5cb6049b4ecb87da8949db
MD5 | 4e3e904f852797d3fff9621b53514c0d
BLAKE2b-256 | 615e32be76309abba3c9180d0825ec196247df542c62906d0b5af252f2de1a4a
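The published digests can be used to check a manually downloaded file. The sketch below computes a file's SHA256 with the standard library and compares it with the values listed for the two 1.1.2 files. The helper names are hypothetical, and the local path is whatever location the file was saved to.

```python
import hashlib

# SHA256 digests as listed for the tubelearns 1.1.2 release files.
EXPECTED_SHA256 = {
    "tubelearns-1.1.2.tar.gz":
        "71dd0d3a7f04f734a9867350ff9f801e9f8c30920d56f119bdb4851bcf01250c",
    "tubelearns-1.1.2-py3-none-any.whl":
        "ac4a92f2d6a12977508979a2ee94ae99a93ebf80ca5cb6049b4ecb87da8949db",
}


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Check a local file against the digest listed for `filename`."""
    return sha256_of_file(path) == EXPECTED_SHA256[filename]
```

For example, `matches_published_hash("downloads/tubelearns-1.1.2.tar.gz", "tubelearns-1.1.2.tar.gz")` returns True only when the local copy matches the listed digest.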