Python script for extracting, cleaning, and tokenizing YouTube video transcripts for preprocessing in machine learning.

Tube-Data: YouTube Video Transcript Extractor

Tube-Data is a Python script for extracting and cleaning YouTube video transcripts as a preprocessing step for machine learning. It streamlines collecting text data from YouTube videos, which makes it suitable for natural language processing tasks such as sentiment analysis, as well as for speech recognition and similar work.

Features

  • Extracts video transcripts from YouTube videos.
  • Saves cleaned transcripts into separate text files.
  • Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
  • Streamlines the dataset collection process for machine learning applications.
  • New feature: tokenization and punctuation removal for text preprocessing and cleaning.

Installation

Install the package, along with its dependencies, using pip:

pip install tubelearns

Usage

Extract Transcripts from a List of Video URLs

from tubelearns import text_link

# Provide a path to a text file containing YouTube video URLs.
text_link('path_to_file.txt', name='output_folder_name')
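
The example above expects a text file of video URLs. The following is a minimal sketch of how such a file might be prepared. It assumes the file holds one URL per line, which this description does not confirm. The helper write_url_list and the placeholder video IDs are hypothetical and are not part of tubelearns.

```python
import tempfile
from pathlib import Path


def write_url_list(urls, path):
    """Write the non-empty URLs to `path`, one per line (assumed format)."""
    cleaned = [u.strip() for u in urls if u.strip()]
    path = Path(path)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


# Placeholder URLs; replace with real video links.
urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "  https://www.youtube.com/watch?v=VIDEO_ID_2  ",
    "",
]

# A temporary folder keeps the demo from touching the working directory.
url_file = write_url_list(urls, Path(tempfile.mkdtemp()) / "path_to_file.txt")

# With tubelearns installed, the file could then be passed on:
# from tubelearns import text_link
# text_link(str(url_file), name='output_folder_name')
```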

Extract Transcript from a Single Video URL

from tubelearns import url_grab

# Provide a single YouTube video URL.
url_grab('video_url', name='output_folder_name')
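
Since url_grab takes a raw URL, it can help to check that the string looks like a YouTube video link before calling it. The sketch below uses only the standard library. The function video_id_from_url is a hypothetical helper, not part of tubelearns, and it covers only the common watch and youtu.be URL shapes.

```python
from urllib.parse import urlparse, parse_qs

# Host names accepted by this sketch; other YouTube URL forms are not handled.
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id_from_url(url):
    """Return the video ID from a watch or youtu.be URL, or None otherwise."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in WATCH_HOSTS and parsed.path == "/watch":
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


candidate = "https://www.youtube.com/watch?v=VIDEO_ID"
if video_id_from_url(candidate) is None:
    print("Not a recognised YouTube video URL:", candidate)
```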

Extract Transcripts from a YouTube Playlist

from tubelearns import playlist_grab

# Provide the URL of a YouTube playlist.
playlist_grab('playlist_url', name='output_folder_name')
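
After any of the extraction calls above, the transcripts can be loaded back for further processing. This sketch assumes that the output folder contains one .txt file per video, which is inferred from the feature list rather than confirmed. load_transcripts is a hypothetical helper, and a temporary folder stands in for 'output_folder_name'.

```python
import tempfile
from pathlib import Path


def load_transcripts(folder):
    """Return {file stem: text} for every .txt file directly inside `folder`."""
    return {
        p.stem: p.read_text(encoding="utf-8")
        for p in sorted(Path(folder).glob("*.txt"))
    }


# Demo with a throwaway folder and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "video_a.txt").write_text("first transcript", encoding="utf-8")
    Path(tmp, "video_b.txt").write_text("second transcript", encoding="utf-8")
    transcripts = load_transcripts(tmp)

print(sorted(transcripts))  # ['video_a', 'video_b']
```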

Cleaning and Punctuation Removal

from tubelearns import Cleaning

# Initialize the Cleaning class
cleaner = Cleaning()

# Clean and remove punctuation from text
content = "Hey! hope you good"
cleaned_text = cleaner.punct_raw(content)
print(cleaned_text)
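
For readers who want to see what punctuation removal typically involves, here is a standard-library sketch. It is not the implementation behind Cleaning.punct_raw, whose exact behaviour is not documented here. strip_punctuation is a hypothetical stand-in that removes ASCII punctuation and collapses whitespace.

```python
import string


def strip_punctuation(text):
    """Delete ASCII punctuation, then collapse runs of whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


print(strip_punctuation("Hey! hope you good"))  # Hey hope you good
```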

Tokenization

from tubelearns import Tokenization

# Initialize the Tokenization class
tokenizer = Tokenization()

# Tokenize text
content = "Hello sam. How are you."
tokenized_text = tokenizer.tokenize_raw(content)
print(tokenized_text)
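
As a rough illustration of tokenization, the sketch below splits text into word and punctuation tokens with a regular expression. It is an assumption-laden stand-in: Tokenization.tokenize_raw may split differently (for example, by sentence), and simple_tokenize is a hypothetical name that is not part of tubelearns.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello sam. How are you."))
# ['Hello', 'sam', '.', 'How', 'are', 'you', '.']
```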

Development Status

This project is currently in the planning stage.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions

Contributions are welcome! Please feel free to open issues or submit pull requests.

Contact

For any inquiries or feedback, please contact KabilPreethamK.


Download files

Source Distribution

tubelearns-1.1.2.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

tubelearns-1.1.2-py3-none-any.whl (7.7 kB)

Uploaded Python 3

File details

Details for the file tubelearns-1.1.2.tar.gz.

File metadata

  • Download URL: tubelearns-1.1.2.tar.gz
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.2.tar.gz
Algorithm Hash digest
SHA256 71dd0d3a7f04f734a9867350ff9f801e9f8c30920d56f119bdb4851bcf01250c
MD5 9da60dab04640d5453d929b3a67c43c9
BLAKE2b-256 dbe7b63a235bf4658f0f26b7a1638d9bd1236c149247ed93ce9fe2dc5e21087a

File details

Details for the file tubelearns-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: tubelearns-1.1.2-py3-none-any.whl
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for tubelearns-1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 ac4a92f2d6a12977508979a2ee94ae99a93ebf80ca5cb6049b4ecb87da8949db
MD5 4e3e904f852797d3fff9621b53514c0d
BLAKE2b-256 615e32be76309abba3c9180d0825ec196247df542c62906d0b5af252f2de1a4a
