Clark University package for crawling YouTube and cleaning the collected data

Project description

videolab-youtube-crawler

This is an integrated tool for crawling YouTube data and decoding the returned JSON. See the use cases below for examples.

This tool is developed by Shuo Niu at Clark University.

PyPI page: https://pypi.org/project/videolab-youtube-crawler/

Installing

To install:

pip install videolab-youtube-crawler

The crawler needs several other packages to function. You may need to install them manually:

pip install youtube-transcript-api

pip install google-api-python-client

pip install yt-dlp

pip install deepmultilingualpunctuation

Also install the following packages if they are not already in your virtual environment (venv):

pip install pytz

pip install pandas

pip install asyncio

pip install isodate

pip install configparser

pip install httplib2
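
Before running the crawler, it can help to check which of these packages are importable in the current environment. The following is a minimal sketch and is not part of videolab-youtube-crawler. The helper name find_missing_modules is hypothetical, and the import names are assumed to be the usual module names for the pip packages above (for example, google-api-python-client is imported as googleapiclient).

```python
import importlib.util

# Import names assumed for the pip packages listed above.
REQUIRED_MODULES = [
    "youtube_transcript_api",       # pip: youtube-transcript-api
    "googleapiclient",              # pip: google-api-python-client
    "yt_dlp",                       # pip: yt-dlp
    "deepmultilingualpunctuation",  # pip: deepmultilingualpunctuation
    "pytz",
    "pandas",
    "isodate",
    "httplib2",
]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any name it reports can then be installed with the matching pip command above.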

Upgrading

To upgrade:

pip install videolab-youtube-crawler --upgrade

After upgrading, go to your project folder and delete config.ini if it already exists.

YouTube API Key

  • Go to https://cloud.google.com/, click console, and create a project. Under Credentials, copy the API key.
  • In your project folder, create a "DEVELOPER_KEY.txt" file (must be this file name) and paste your API key.
  • You can use multiple API keys by putting them on different lines in DEVELOPER_KEY.txt.
  • The crawler uses up the quota of one key, then tries the next one, until the quotas of all keys are used up.
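
The key file layout and the fallback behavior described above can be sketched as follows. This is an illustration only and not the crawler's actual implementation. The names load_developer_keys, call_with_key_fallback, and QuotaExceeded are hypothetical, and the sketch assumes a request function that raises QuotaExceeded when a key has no quota left.

```python
from pathlib import Path


class QuotaExceeded(Exception):
    """Hypothetical signal that an API key has no quota left."""


def load_developer_keys(path="DEVELOPER_KEY.txt"):
    """Read API keys from a text file, one key per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_key_fallback(keys, request):
    """Call request(key) with each key in order until one succeeds."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceeded:
            continue  # this key is exhausted; move on to the next one
    raise QuotaExceeded("all API keys have exhausted their quota")
```

With two keys in the file, a request that fails on the first key with QuotaExceeded is retried with the second, and the error is raised only once every key has been tried.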

Example usage

Case 1: Crawl videos by keywords.

import videolab_youtube_crawler as ytcrawler

# Create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.VideoSearcher()
searcher.search("Generative AI",
                start_day=1,
                start_month=1,
                start_year=2023,
                end_day=31,
                end_month=12,
                end_year=2023,
                day_span=30)
searcher.search("Artificial Intelligence",
                start_day=1,
                start_month=1,
                start_year=2023,
                end_day=31,
                end_month=12,
                end_year=2023,
                day_span=30)
                
searcher.merge_to_workfile(destination="DATA/list_video.csv")

# Get all video and channel information
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile="DATA/list_video.csv")
video_crawler.json_to_csv()

# Get all channel information from the list generated by merge_to_workfile()
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_of_videos(video_list_workfile="DATA/list_video.csv")
channel_crawler.json_to_csv()
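
The source does not document exactly how day_span is applied. Assuming it sets the length, in days, of each search window between the start and end dates, the windows for the calls above could be computed as in the sketch below. The function split_into_windows is hypothetical and is not part of the package.

```python
from datetime import date, timedelta


def split_into_windows(start, end, day_span):
    """Split the inclusive range [start, end] into windows of at most day_span days."""
    if day_span < 1:
        raise ValueError("day_span must be at least 1")
    windows = []
    window_start = start
    while window_start <= end:
        # The last window is shortened so it never runs past the end date.
        window_end = min(window_start + timedelta(days=day_span - 1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    for first_day, last_day in split_into_windows(date(2023, 1, 1), date(2023, 12, 31), 30):
        print(first_day.isoformat(), "to", last_day.isoformat())
```

Under this assumption, the year 2023 with day_span=30 yields 13 windows: twelve of 30 days and a final one of 5 days.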

Case 2: Search a list of channels by search keys, then crawl all videos belonging to those channels.

import videolab_youtube_crawler as ytcrawler

# Create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.ChannelSearcher()
searcher.search("ADHD")
searcher.search("PTSD")
searcher.merge_to_workfile(destination="DATA/channel_list.csv")

# Get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="DATA/channel_list.csv")
channel_crawler.json_to_csv()

# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile='DATA/channel_list.csv')

# Merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()

# Crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile='DATA/list_video.csv')
video_crawler.json_to_csv()
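
The comment above recommends filtering the channel list before crawling all videos. One way to do that is to keep only channels below a video-count threshold, as in the sketch below. The column name videoCount and the helper filter_channels are assumptions made for illustration; check the header of your own channel CSV and adjust count_column accordingly.

```python
import csv


def filter_channels(in_path, out_path, max_videos, count_column="videoCount"):
    """Copy rows whose video count is at most max_videos; return the number kept."""
    kept = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        for row in reader:
            try:
                count = int(row[count_column])
            except (KeyError, ValueError, TypeError):
                continue  # skip rows with a missing or non-numeric count
            if count <= max_videos:
                writer.writerow(row)
                kept += 1
    return kept
```

The filtered file could then be passed as channel_list_workfile to crawl_all_videos_of_channels, assuming the crawler accepts any CSV with the same columns as the original list.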

Case 3: You already have a list of channels and want to crawl all videos of those channels.

import pandas as pd

import videolab_youtube_crawler as ytcrawler

ytcrawler.init_config()

# Get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="speech_disability_channels.csv")
channel_crawler.json_to_csv()

# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile='speech_disability_channels.csv')

# Merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()

# Crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile='DATA/list_video.csv')
video_crawler.json_to_csv()

To collect comments, subtitles, and streams of the videos:

# Get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=2)
comment_crawler.json_to_csv()

# Get all subtitles from the list generated by merge_to_workfile()
# Since the default YouTube captions may not contain punctuation, use split_caption_by_sentences to add punctuation and split
# the captions by sentences. It will automatically align the timestamps.
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list(videos_to_collect="DATA/list_video.csv")

# Collect video streams.
video_downloader = ytcrawler.VideoDownloader()
video_downloader.download_videos_in_list(video_list_workfile="DATA/list_video.csv", quality='worst', audio=True)
video_downloader.clean_files()
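
The subtitle comment above mentions splitting captions by sentences while keeping timestamps aligned. The source does not show how the package does this, so the following is only an illustration of the idea and not the package's method. It assumes punctuation has already been restored and that each caption segment is a dict with the hypothetical keys start and text. Sentence boundaries snap to segment boundaries, which is coarser than a word-level alignment.

```python
def group_segments_into_sentences(segments):
    """Merge caption segments into sentences, keeping each sentence's first start time."""
    sentences = []
    current_text = []
    current_start = None
    for segment in segments:
        text = segment["text"].strip()
        if not text:
            continue  # ignore empty segments
        if current_start is None:
            current_start = segment["start"]
        current_text.append(text)
        # A segment ending in sentence-final punctuation closes the sentence.
        if text.endswith((".", "?", "!")):
            sentences.append({"start": current_start, "text": " ".join(current_text)})
            current_text = []
            current_start = None
    if current_text:  # trailing text without closing punctuation
        sentences.append({"start": current_start, "text": " ".join(current_text)})
    return sentences
```

Each returned sentence carries the start time of the first segment that contributed to it, so it can still be located in the video.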


Download files

Source Distribution

videolab_youtube_crawler-1.2.3.tar.gz (19.4 kB)

Uploaded Source

Built Distribution

videolab_youtube_crawler-1.2.3-py3-none-any.whl (26.7 kB)

Uploaded Python 3

File details

Details for the file videolab_youtube_crawler-1.2.3.tar.gz.

File hashes

Hashes for videolab_youtube_crawler-1.2.3.tar.gz
Algorithm Hash digest
SHA256 6243f22374e851abe754d113fc6516e4ad1e6d31426a17dba35da6fd626c8b0f
MD5 0351a211173af740726e75d1beed3e31
BLAKE2b-256 db123091a7e8a6b117201cb99ac42352f16145c7e0350f35016631fae9876d81

File details

Details for the file videolab_youtube_crawler-1.2.3-py3-none-any.whl.

File hashes

Hashes for videolab_youtube_crawler-1.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 399d28d691e0eb95031255794640a996ddc1b907d66ac2910c6315e8254347de
MD5 b663e9324e020e43e15ec2f96e681ace
BLAKE2b-256 7d8b27e1fc0d90406603514fcf5dddc9cc348778f5cada4880964ca9e19f5bd5
