Interacting with YouTube Data API and running analysis using NLP

pyYouTubeAnalysis

pyYouTubeAnalysis interacts with the YouTube Data API to pull data and analyze it using statistics and Natural Language Processing (NLP). It contains NLP implementations of text cleaning aimed at the noise typical of social media data, key-phrase extraction using NLTK, and Named-Entity Recognition (NER) on a list of strings. It also generates plots, word clouds, and a PDF analysis report automatically.

Setup

1. Install with pip, then download the spaCy English model:

pip install pyYouTubeAnalysis
python -m spacy download en_core_web_sm

2. Or clone the project from GitHub and run the following for setup:

git clone git@github.com:jsingh811/pyYouTubeAnalysis.git
cd pyYouTubeAnalysis
pip install -e .
python -m spacy download en_core_web_sm

Demos

To see YouTube data extraction examples, see the section YouTube Data Fetching.

To see NER extraction examples, see the section Extracting Locations.

To see Key-phrase extraction examples, see the section Extracting Keyphrases from Text.

To see data cleaning examples for removing emojis and URLs from text, see the section Removing Emojis and URLs from Text.

To see report generation with statistical and NLP analysis, see the section Report Generation.

YouTube Data Fetching

Command Line Usage

python run_crawl.py -t "<YouTube API key (39 chars long)>" -k "travel vlog" -sd "2020-01-01T00:00:00Z" -ed "2020-01-02T00:00:00Z" -climit 5 -path "/Users/abc/Documents"

Input Arguments

path (-path): Path to the directory you want to save the data in
keyword (-k): Keyword to search videos for
start-date (-sd): Starting publish date of video to search. Format YYYY-MM-DDThh:mm:ssZ
end-date (-ed): Ending publish date of video to search. Format YYYY-MM-DDThh:mm:ssZ
token (-t): YouTube API access token
comments (-c): Whether you want to fetch comment text for the videos
comment-limit (-climit): Per video comment limit to fetch
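
As a rough illustration of how the flags above fit together, the following sketch parses the same options with `argparse`. It is not the package's actual `run_crawl.py` code: the `build_parser` and `timestamp` helpers are hypothetical, and treating `-c` as a boolean switch is an assumption, since the list above does not say whether it takes a value.

```python
import argparse
from datetime import datetime


def timestamp(value):
    """Check the YYYY-MM-DDThh:mm:ssZ format and return the string unchanged."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        raise argparse.ArgumentTypeError(
            "expected YYYY-MM-DDThh:mm:ssZ, got {!r}".format(value)
        )
    return value


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Crawl YouTube videos and comments")
    parser.add_argument("-path", dest="path", help="Directory to save the data in")
    parser.add_argument("-k", dest="keyword", help="Keyword to search videos for")
    parser.add_argument("-sd", dest="start_date", type=timestamp, help="Starting publish date")
    parser.add_argument("-ed", dest="end_date", type=timestamp, help="Ending publish date")
    parser.add_argument("-t", dest="token", help="YouTube API access token")
    # Assumption: -c is a switch that takes no value
    parser.add_argument("-c", dest="comments", action="store_true", help="Fetch comment text")
    parser.add_argument("-climit", dest="comment_limit", type=int, help="Per video comment limit")
    return parser


args = build_parser().parse_args([
    "-t", "<YouTube API key>",
    "-k", "travel vlog",
    "-sd", "2020-01-01T00:00:00Z",
    "-ed", "2020-01-02T00:00:00Z",
    "-climit", "5",
    "-path", "/Users/abc/Documents",
])
print(args.keyword, args.start_date, args.comment_limit)
```

Validating the dates while parsing means a malformed timestamp is rejected before any API request is made.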

Import and Use

import json
from pyYouTubeAnalysis import (run_crawl, crawler)

keyword = "travel"
start_date = "2020-01-01T00:00:00Z"
end_date = "2020-01-02T00:00:00Z"
comment_limit = 5
api_token = "<YouTube API key (39 chars long)>"
path = "/Users/abc/Documents"
api = crawler.YouTubeCrawler(key=api_token)

# Fetch data from the API
videos, comments = run_crawl.get_videos_and_comments(
    api,
    keyword=keyword,
    start_date=start_date,
    end_date=end_date,
    comment_limit=comment_limit
)

# Save the fetched data on disk. Colons are stripped from the dates
# so that they do not end up in the file names.
prefix = "_".join([
    keyword,
    start_date.replace(":", ""),
    end_date.replace(":", "")
])
with open("/".join([path, prefix + "_video_details.json"]), "w") as f:
    json.dump(videos, f, indent=2)
with open("/".join([path, prefix + "_comment_details.json"]), "w") as f:
    json.dump(comments, f, indent=2)

Sample output

The generated ...video_details.json file contains a list of dictionaries, in the format shown in this file.

The generated ...comment_details.json file contains a list of dictionaries, in the format shown in this file.
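
Since both files hold a list of dictionaries, a quick way to inspect one is to count its records and collect the keys they use. The sketch below does that without assuming any field names; `summarize_records` is a hypothetical helper, and the demo records are made up purely to make the example runnable.

```python
import json
import os
import tempfile


def summarize_records(filepath):
    """Return the record count and the sorted union of keys across all records."""
    with open(filepath) as f:
        records = json.load(f)
    keys = sorted({key for record in records for key in record})
    return len(records), keys


# Made-up records for demonstration; the real files define their own fields
sample = [{"id": "a1", "title": "first"}, {"id": "b2", "tags": ["x"]}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_video_details.json")
    with open(demo_path, "w") as f:
        json.dump(sample, f, indent=2)
    count, keys = summarize_records(demo_path)

print(count, keys)
```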

Extracting Locations

The following examples extract locations from the comments file generated above.

Command Line Usage

Assuming you are in the parent folder pyYouTubeAnalysis after cloning and setting up the project, the following sample command can be used. Please alter -filepath accordingly.

python extract_locations.py -filepath "/Users/abc/Documents/travel_comment_details.json"

Import and Use

from pyYouTubeAnalysis import extract_locations

filepath = "/Users/abc/Documents/travel_comment_details.json"

comments = extract_locations.read_comment_text(filepath)
locations = extract_locations.extract_locations(comments)

Sample output

The locations_....json file generated by the command line example, and the variable locations in the import and use example, both hold a dictionary with location names as keys and their occurrence counts as values, in the format shown in this file.
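
Because the result maps location names to counts, it can be ranked directly with `collections.Counter`. This is a small sketch rather than part of the package: `top_locations` is a hypothetical helper, and the counts below are invented for illustration, not real output.

```python
from collections import Counter


def top_locations(location_counts, n=3):
    """Return the n most frequent locations as (name, count) pairs."""
    return Counter(location_counts).most_common(n)


# Made-up counts for illustration only
locations = {"Paris": 4, "Tokyo": 7, "Lima": 1, "Cairo": 2}
print(top_locations(locations, n=2))
```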

Extracting Keyphrases from Text

Import and Use

from pyYouTubeAnalysis.phrases import KeyPhraseGenerator

documents = [
    """Did you know about this conference in Miami? It is about Natural
    Language Processing techniques as applied to messy data.""",
    "I really enjoyed the chocolate cheesecake yesterday!"
]

kp = KeyPhraseGenerator()

phrases = kp.extract_keyphrases(documents)
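
To give a feel for what key-phrase extraction involves, here is a deliberately simple, standard-library sketch that splits text into candidate phrases at stopwords and punctuation. It is not how `KeyPhraseGenerator` works (the package uses NLTK, and its method is not described here); the `candidate_phrases` helper and the tiny stopword list are assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {
    "a", "about", "an", "and", "as", "did", "i", "in",
    "is", "it", "really", "the", "this", "to", "you",
}


def candidate_phrases(text):
    """Split text into runs of words separated by stopwords or punctuation."""
    # Words become tokens; every other non-space character is its own token
    tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
    phrases, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalpha():
            # A stopword or punctuation mark closes the current phrase
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("I really enjoyed the chocolate cheesecake yesterday!"))
```

A real extractor would typically go further, for example by keeping only noun phrases or by scoring candidates, which is where a library such as NLTK comes in.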

Removing Emojis and URLs from Text

Import and Use

from pyYouTubeAnalysis import cleaner

document = " emoji was here -> 😃 , and url was here -> https://github.com"

# Remove emojis
emoji_removed = cleaner.remove_emojis(document)

# Remove URLs
url_removed = cleaner.remove_urls(document)
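
For readers curious about what this kind of cleaning looks like, the sketch below strips URLs and a few common emoji ranges with regular expressions. It is an assumption-laden stand-in, not the implementation behind `cleaner`: `strip_urls`, `strip_emojis`, and both patterns are hypothetical, and the emoji pattern covers only some Unicode blocks.

```python
import re

# Matches http:// or https:// followed by any run of non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

# Covers several common emoji blocks only; not an exhaustive emoji definition
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def strip_urls(text):
    """Remove http(s) URLs from the text."""
    return URL_PATTERN.sub("", text)


def strip_emojis(text):
    """Remove characters that fall in the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)


document = " emoji was here -> 😃 , and url was here -> https://github.com"
print(repr(strip_emojis(document)))
print(repr(strip_urls(document)))
```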

Report Generation

This functionality crawls YouTube and gathers statistical plots, word clouds, and location analysis into a single PDF. The files generated as part of this can be found in this folder.

Command Line Usage

Assuming you are in the parent folder pyYouTubeAnalysis after cloning and setting up the project, the following sample command can be used. Please alter -path accordingly.

python report.py -path "/Users/abc/Documents" -k "travel vlog" -sd "2020-01-01T00:00:00Z" -ed "2021-03-31T00:00:00Z" -analysis "monthly,yearly" -t "<YouTube API key (39 chars long)>"

Import and Use

from pyYouTubeAnalysis.report import ReportGenerator
from pyYouTubeAnalysis import run_crawl, crawler

keyword = "travel vlog"
start_date = "2020-01-01T00:00:00Z"
end_date = "2021-03-31T00:00:00Z"
analysis_type = ["yearly", "monthly"]
api_token = "<YouTube API key (39 chars long)>"
path = "/Users/abc/Documents"

rgen = ReportGenerator(path, keyword, start_date, end_date, analysis_type)

api = crawler.YouTubeCrawler(key=api_token)
# Fetch data from the api
[videos, comments] = run_crawl.get_videos_and_comments(
    api, keyword=keyword, start_date=start_date, end_date=end_date, comment_limit=10
)
print("\nFetched data\n")
rgen.get_and_plot_stats(videos)
rgen.plot_trending_tags(videos)
rgen.plot_comment_locations(comments)
print("\nFetched plots\n")
output_path = rgen.export_to_pdf()
print("\nGenerated pdf here {}\n".format(output_path))
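
The `monthly` and `yearly` analysis types suggest grouping videos by publish period. How `ReportGenerator` computes its statistics is not described here, so the following is only a sketch of the general idea: it counts timestamps in the YYYY-MM-DDThh:mm:ssZ format per month or per year. `count_by_period` and `PERIOD_FORMATS` are hypothetical names, and the timestamps are made up.

```python
from collections import Counter
from datetime import datetime

# Output label format for each analysis type
PERIOD_FORMATS = {"monthly": "%Y-%m", "yearly": "%Y"}


def count_by_period(timestamps, period):
    """Count timestamps per month or per year, keyed by a period label."""
    label_format = PERIOD_FORMATS[period]
    parsed = (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") for ts in timestamps)
    return dict(Counter(dt.strftime(label_format) for dt in parsed))


# Made-up publish dates for illustration only
published = [
    "2020-01-01T00:00:00Z",
    "2020-01-15T10:00:00Z",
    "2020-02-01T00:00:00Z",
    "2021-03-31T00:00:00Z",
]
print(count_by_period(published, "monthly"))
print(count_by_period(published, "yearly"))
```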

Citation

Please cite this software as follows.

APA

Singh, J. (2021). jsingh811/pyYouTubeAnalysis: YouTube Data Requests and Natural Language Processing on Text (v1.1) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5044556

BibTeX

@software{https://doi.org/10.5281/zenodo.5044556,
  doi = {10.5281/ZENODO.5044556},
  url = {https://zenodo.org/record/5044556},
  author = {Singh, Jyotika},
  title = {jsingh811/pyYouTubeAnalysis: YouTube Data Requests and Natural Language Processing on Text},
  publisher = {Zenodo},
  year = {2021},
  copyright = {Open Access}
}
