Skip to main content

A NLP based model that ranks transcripts from a given YouTube video with the help of TF-IDF scores from a single trancript.

Project description

Sapphire

Made by CSX Labs License - MIT Stage - Beta GitHub Workflow Status

An NLP based YouTube video scoring model

Project Logo

Table of Contents

Mission

To score YouTube videos based on information rigor.

Resources

Interested in contributing to or testing this project? Check out our contributing guide.

For a more in-depth breakdown of the requirements, see our requirements.

Overview

Sapphire is an innovative open source tool that leverages TF-IDF scoring and frequency distributions to score transcripts of a YouTube video. By analyzing the statistics of important words in the transcript, Sapphire is able to determine the depth and Emphasis covered in a video, making it an effective evaluation system for YouTube videos. This unique tool aims to bridge the gap between text scoring and token analysis, offering a powerful solution for content creators and researchers alike.

Why?

There are several commonly used algorithms and software packages that utilize Natural Language Processing (NLP) scoring techniques to sequence and score YouTube videos. However, many of these algorithms do not take into consideration important factors, such as the weightage of vocabulary in the video's transcript, when determining the level of emphasis provided. Currently, YouTube primarily relies on the "watch time" algorithm, which considers viewership statistics to score videos. While this algorithm is useful for determining based on video popularity, it is not designed to assess comprehensiveness. Therefore, there is a need for an algorithm that can analyze video content and accurately determine the level of emphasis based on factors such as vocabulary richness and depth of the transcript to evaluate a YouTube video.

Requirements

Vision: An algorithm that scores YouTube videos based on information rigor

Youtube transcription

Sapphire needs to retrieve the content of a YouTube video. Here are the steps below of how it is performed.

  • Raw transcript of the YouTube video is obtained using wrappers such as youtube-transcript-api.
  • The obtained raw transcript is sent for pre-processing for refinement.
  • Pre-processing involves eliminating unnecessary elements such as timestamps, speaker labels, and noise from the transcript.

Text pre-processing

  • To produce an accurate score for the YouTube video, the text needs to undergo preprocessing and polishing.
  • The preprocessing involves cleaning the text by removing special characters and converting it to lowercase for consistency.
  • After preprocessing, the text is tokenized into individual words or phrases to prepare for TF-IDF (Term Frequency-Inverse Document Frequency) analysis.
  • In addition to text processing, the length of the video is checked to determine the portion of the transcript that should be used.
  • Based on the optimal bounds provided, the most important part of the video's transcript is selected for further evaluation.
  • The selected portion of the transcript is then subjected to vocabulary weighting to assess its significance and impact.

TF-IDF evaluators

TF-IDF (Term Frequency-Inverse Document Frequency) evaluation is a crucial part of the algorithm for scoring YouTube videos based on their content.TF-IDF involves assigning scores to words based on their contribution to the context of the text.

  • Traditionally, TF-IDF is calculated using documents or transcripts from multiple sources to determine the Inverse Document Frequency (IDF) part of the score.
  • In this algorithm, a simplified approach is used by performing lexical analysis on the transcript of the YouTube video itself.
  • The process begins by tokenizing the pre-processed transcript into sentences.
  • The algorithm analyzes the words within these sentences and scores them based on their frequency (Term Frequency or TF) within the specific video's context.
  • The words are also scored based on their rarity (Inverse Document Frequency or IDF) within that transcript.
  • This allows the algorithm to determine the weight or significance of each word in the given transcript.
  • TF-IDF scores assigned to the words are used to determine their importance or contribution to the content of the video.
  • Words with higher TF-IDF scores are considered more relevant and significant.
  • Words with lower scores are considered less important.
  • TF-IDF evaluation helps in evaluating YouTube videos based on the emphasis or relevance of their content in a specific context.

Score Assessment

  • The evaluation process comprises several steps, and the second most crucial step involves the meticulous implementation of an assessor.
  • The primary role of the assessor is to filter out words based on their TF-IDF (Term Frequency-Inverse Document Frequency) score, a metric that quantifies the importance of a word in a document relative to a corpus of other documents.
  • To enhance the efficiency of the TF-IDF method, the documents undergo preprocessing by means of tokenization, which involves splitting them into individual sentences.
  • The assessor establishes an average standard for the word weight score, acting as a threshold for determining which words are deemed valuable and which ones are deemed unnecessary.
  • Words with a score higher than the average are retained for further analysis, while those with a score equal to or lower than the average are discarded.
  • The Keyword Qualifier plays a vital role in the ranking process by aiding in the determination of the average score.
  • The Keyword Qualifier ensures that only words meeting or surpassing the average score are considered for further analysis, while words below the average are eliminated from consideration.
  • This approach is highly significant as it systematically eliminates superfluous words, enabling the scoring process to concentrate solely on words that possess value and relevance.
  • By excluding words with lower scores, the assessing process can prioritize words that are more pertinent and informative within the context of the given task, such as keyword ranking for search engine optimization or text classification for information retrieval.
  • To summarize, the meticulous implementation of an assessor, incorporating a Keyword Qualifier that sets an average standard for the TF-IDF score and filters out words based on this threshold, constitutes a critical step in the evaluation process.
  • This step ensures that only words of substantive value are considered, thereby resulting in more precise, accurate, and relevant calculations.

Current Status

The current status of the package is that it is in its Beta stage. The evaluations are currently being optimized to enhance performance. Notably, Sapphire now possesses the capability to evaluate transcripts of video lengths up to one day in mere seconds.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

csxl_sapphire-0.0.1.tar.gz (43.7 kB view details)

Uploaded Source

Built Distribution

CSXL_Sapphire-0.0.1-py3-none-any.whl (49.7 kB view details)

Uploaded Python 3

File details

Details for the file csxl_sapphire-0.0.1.tar.gz.

File metadata

  • Download URL: csxl_sapphire-0.0.1.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for csxl_sapphire-0.0.1.tar.gz
Algorithm Hash digest
SHA256 6da4560bb3aa5102a2b9929e208a366e5ebeaacf839641430a3d08d2ec8f165c
MD5 fc0a9ddad660c64258ddc64d265e2e63
BLAKE2b-256 fb56b5192cf4c4d2e56a8d9d7385ccae6b3f03a130270bf5ca56b23f18d6f24d

See more details on using hashes here.

File details

Details for the file CSXL_Sapphire-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for CSXL_Sapphire-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 1d410385716c5448070e960e736922bcdee5466fec6f0869dcf5cadf11e79ddc
MD5 cbfc1670e5a742a6a0a042e008d3bbb7
BLAKE2b-256 3b6df5f910e00ef95013fdcf3fda8afa38e14f0f42a6f0c0c554f6f621f437d4

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page