Clark University package for crawling YouTube and cleaning the collected data
Project description
clarku-youtube-crawler
Clark University YouTube crawler and JSON decoder for YouTube JSON responses. Please read the documentation in DOCS.
PyPI page: https://pypi.org/project/videolab-youtube-crawler/
Installing
To install, run
pip install videolab-youtube-crawler
The crawler needs several other packages to function. All dependencies are declared in the package, so they should be installed automatically.
If a requirement is still missing, download requirements.txt, navigate to the folder that contains it, and run
pip install -r requirements.txt
Or install the packages individually:
youtube_transcript_api: pip install youtube-transcript-api
googleapiclient: pip install google-api-python-client
yt_dlp: pip install yt-dlp
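A quick way to see whether any of the three packages above is missing is to look up their import names before running the crawler. This is a minimal sketch using only the standard library; the function name `missing_modules` is made up for illustration and is not part of videolab-youtube-crawler.

```python
import importlib.util

# Import names of the three packages listed above (these differ from the pip names).
REQUIRED_MODULES = ["youtube_transcript_api", "googleapiclient", "yt_dlp"]


def missing_modules(module_names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any name it reports can then be installed with the matching pip command from the list above.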
Upgrading
To upgrade, run
pip install videolab-youtube-crawler --upgrade
Then go to the project folder and delete config.ini if it is already there.
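If you upgrade often, the cleanup step can be scripted. The sketch below assumes config.ini sits directly in the project folder, as described above; the helper name `remove_stale_config` is hypothetical and not provided by the package.

```python
from pathlib import Path


def remove_stale_config(project_dir="."):
    """Delete config.ini in project_dir if present; return True if a file was removed."""
    config_path = Path(project_dir) / "config.ini"
    if config_path.is_file():
        config_path.unlink()
        return True
    return False


if __name__ == "__main__":
    if remove_stale_config("."):
        print("Removed old config.ini")
    else:
        print("No config.ini found, nothing to do")
```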
YouTube API Key
- Go to https://cloud.google.com/, click Console, and create a project. Under Credentials, copy the API key.
- In your project folder, create a file named "DEVELOPER_KEY.txt" (it must have exactly this name) and paste your API key into it.
- You can use multiple API keys by putting each one on its own line in DEVELOPER_KEY.txt.
- The crawler uses up the quota of one key and then tries the next one, until the quotas of all keys are used up.
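The key rotation described above can be pictured roughly as follows. This is only an illustrative sketch, not the crawler's actual code: `QuotaExceeded`, `load_keys`, and `call_with_key_rotation` are hypothetical names, and `request` stands for any function that performs one API call with a given key.

```python
from pathlib import Path


class QuotaExceeded(Exception):
    """Hypothetical signal that a key has no quota left."""


def load_keys(path="DEVELOPER_KEY.txt"):
    """Read one API key per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_key_rotation(keys, request):
    """Try request(key) with each key in order until one succeeds."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceeded:
            continue  # this key is exhausted, move on to the next one
    raise QuotaExceeded("all API keys have used up their quota")
```

With two keys in the file, a call that fails with `QuotaExceeded` on the first key is retried with the second; only when every key fails does the error reach the caller.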
Example usage
Case 1: Crawl videos by keywords.
import videolab_youtube_crawler as ytcrawler
# create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.VideoSearcher()
searcher.search("Human Computer Interaction",
                start_day=24,
                start_month=8,
                start_year=2022,
                end_day=26,
                end_month=8,
                end_year=2022)
searcher.search("Artificial Intelligence",
                start_day=24,
                start_month=9,
                start_year=2022,
                end_day=26,
                end_month=9,
                end_year=2022)
searcher.merge_to_workfile(destination="DATA/list_video.csv")
# get all video information from the list generated by merge_to_workfile()
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile="DATA/list_video.csv")
video_crawler.json_to_csv()
# get all channel information from the list generated by merge_to_workfile()
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_of_videos(video_list_workfile="DATA/list_video.csv")
channel_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=2)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
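The search calls in Case 1 spell out each date as six separate keyword arguments. A small helper can build them from two `datetime.date` objects and catch a reversed range early. This is a sketch; `search_window` is a hypothetical helper, and only the keyword names are taken from the example above.

```python
from datetime import date


def search_window(start, end):
    """Turn two dates into the keyword arguments used by search() in Case 1."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return {
        "start_day": start.day,
        "start_month": start.month,
        "start_year": start.year,
        "end_day": end.day,
        "end_month": end.month,
        "end_year": end.year,
    }


if __name__ == "__main__":
    # Same window as the first search in Case 1.
    print(search_window(date(2022, 8, 24), date(2022, 8, 26)))
```

The result could then be passed along as `searcher.search("Human Computer Interaction", **search_window(date(2022, 8, 24), date(2022, 8, 26)))`.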
Case 2: Search a list of channels by search keys, then crawl all videos belonging to those channels.
import videolab_youtube_crawler as ytcrawler
# create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.ChannelSearcher()
searcher.search("ADHD")
searcher.search("PTSD")
searcher.merge_to_workfile(destination="DATA/channel_list.csv")
# get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="DATA/channel_list.csv")
channel_crawler.json_to_csv()
# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile="DATA/channel_list.csv")
# merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()
# crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile="DATA/list_video.csv")
video_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=2)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
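The comment in Case 2 recommends filtering the channel list before crawling all videos. One way to do that without touching the crawler is to write a reduced copy of the CSV and pass that copy as channel_list_workfile. The sketch below uses only the standard csv module; `filter_csv_rows` is a hypothetical helper, and the column you filter on depends on what your channel list actually contains.

```python
import csv


def filter_csv_rows(source, destination, keep):
    """Copy the rows for which keep(row) is true; return the number of rows kept."""
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = [row for row in reader if keep(row)]
    with open(destination, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, `filter_csv_rows("DATA/channel_list.csv", "DATA/channel_list_filtered.csv", lambda row: "ADHD" in row["title"])` would keep only channels whose title mentions ADHD, assuming the file has a `title` column (check the header of your own file first).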
Case 3: You already have a list of channels and want to crawl all videos of the channels in that list.
import pandas as pd
import videolab_youtube_crawler as ytcrawler
ytcrawler.init_config()
# get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="speech_disability_channels.csv")
channel_crawler.json_to_csv()
# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile="speech_disability_channels.csv")
# merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()
# crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile='DATA/list_video.csv')
video_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=10)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
# download the videos in the workfile, first with audio=False, then with audio=True into a custom folder
video_downloader = ytcrawler.VideoDownloader()
video_downloader.download_videos_in_list(video_list_workfile="DATA/list_video.csv", quality='worst', audio=False,
                                         batch=200)
video_downloader.download_videos_in_list(video_list_workfile="DATA/list_video.csv", quality='worst', audio=True,
                                         video_data_dir='test')
Hashes for videolab_youtube_crawler-1.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | eff0b2516310276cf3b4f7d25d6136a4ecd01d1137c3a8e3e3fad01e8b69981a
MD5 | d78e6defe8ff6ca934cc9255d6249d31
BLAKE2b-256 | 4cb3c4e329f7e9d11c67c4734ba560c78a8673acea12175b4cbfe7dbadacae82