Clark University package for crawling YouTube and cleaning the data
Project description
clarku-youtube-crawler
Clark University YouTube crawler and decoder for YouTube JSON. Please read the documentation in DOCS.
PyPI page: https://pypi.org/project/videolab-youtube-crawler/
Installing
To install:
pip install videolab-youtube-crawler
The crawler depends on several other packages. You may need to install them separately:
pip install youtube-transcript-api
pip install google-api-python-client
pip install yt-dlp
Also install the following packages if they are not already in your venv:
pip install pytz
pip install pandas
pip install asyncio
pip install isodate
pip install configparser
pip install httplib2
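To see which of the dependencies above are still missing before running the crawler, a small check like the following may help. It is only a sketch: the import names are the usual ones for these pip packages (for example, google-api-python-client is imported as googleapiclient), and the helper name `missing_modules` is made up for this illustration and is not part of the crawler.

```python
from importlib.util import find_spec

# Import names for the pip packages listed above. The import name can
# differ from the pip name (google-api-python-client -> googleapiclient).
REQUIRED_MODULES = [
    "youtube_transcript_api",
    "googleapiclient",
    "yt_dlp",
    "pytz",
    "pandas",
    "isodate",
    "httplib2",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

Any module reported as missing can then be installed with the matching pip command from the list above.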
Upgrading
To upgrade:
pip install videolab-youtube-crawler --upgrade
After upgrading, go to your project folder and delete config.ini if it exists.
YouTube API Key
- Go to https://cloud.google.com/, open the Console, and create a project. Under Credentials, copy the API key.
- In your project folder, create a file named "DEVELOPER_KEY.txt" (the file name must match exactly) and paste your API key into it.
- To use multiple API keys, put each key on its own line in DEVELOPER_KEY.txt.
- The crawler uses one key until its quota is exhausted, then moves on to the next key, until every key's quota is used up.
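The key-rotation behavior described above can be pictured with the following sketch. This is not the crawler's actual implementation: `load_keys`, `call_with_rotation`, and the `QuotaExhausted` exception are hypothetical names used only to illustrate reading one key per line and falling back to the next key when one runs out of quota.

```python
from pathlib import Path


class QuotaExhausted(Exception):
    """Hypothetical signal that a key has no quota left."""


def load_keys(path="DEVELOPER_KEY.txt"):
    """Read one API key per line, ignoring blank lines and surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def call_with_rotation(keys, request):
    """Call request(key) with each key in order until one succeeds.

    `request` is any callable that raises QuotaExhausted when the given
    key has run out of quota (an assumption made for this sketch).
    """
    for key in keys:
        try:
            return request(key)
        except QuotaExhausted:
            continue  # this key is used up, try the next one
    raise QuotaExhausted("all keys are out of quota")
```

With two keys in the file, a request that fails on the first key because of quota is retried with the second, and an error is raised only once every key has been tried.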
Example usage
Case 1: Crawl videos by keywords.
import videolab_youtube_crawler as ytcrawler
# create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.VideoSearcher()
searcher.search("Human Computer Interaction",
                start_day=24,
                start_month=8,
                start_year=2022,
                end_day=26,
                end_month=8,
                end_year=2022)
searcher.search("Artificial Intelligence",
                start_day=24,
                start_month=9,
                start_year=2022,
                end_day=26,
                end_month=9,
                end_year=2022)
searcher.merge_to_workfile(destination="DATA/list_video.csv")
# get metadata for all videos in the list generated by merge_to_workfile()
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile="DATA/list_video.csv")
video_crawler.json_to_csv()
# get all channel information from the list generated by merge_to_workfile()
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_of_videos(video_list_workfile="DATA/list_video.csv")
channel_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=2)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
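Because search() takes the date range as six separate integers, it can be convenient to build them from datetime.date values. The helper below is a hypothetical convenience, not part of the package. It only assumes the keyword names shown in the calls above.

```python
from datetime import date


def date_range_kwargs(start, end):
    """Map two datetime.date values onto the keyword arguments used by search() above."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return {
        "start_day": start.day,
        "start_month": start.month,
        "start_year": start.year,
        "end_day": end.day,
        "end_month": end.month,
        "end_year": end.year,
    }


kwargs = date_range_kwargs(date(2022, 8, 24), date(2022, 8, 26))
print(kwargs)
# The result can then be unpacked into the call shown in Case 1:
# searcher.search("Human Computer Interaction", **kwargs)
```

Building the arguments from date objects also catches impossible dates (such as day 31 in a 30-day month) before any API quota is spent.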
Case 2: Search for channels by keywords, then crawl all videos belonging to those channels.
import videolab_youtube_crawler as ytcrawler
# create folders for search videos
ytcrawler.init_config()
searcher = ytcrawler.ChannelSearcher()
searcher.search("ADHD")
searcher.search("PTSD")
searcher.merge_to_workfile(destination="DATA/channel_list.csv")
# get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="DATA/channel_list.csv")
channel_crawler.json_to_csv()
# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile='DATA/channel_list.csv')
# merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()
# crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile='DATA/list_video.csv')
video_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=2)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
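The comment in Case 2 recommends filtering the channel list before crawling all videos. One way to do that is to write a reduced copy of the CSV and pass that file to crawl_all_videos_of_channels(). The sketch below uses only the standard library. The function name `filter_csv_rows` is hypothetical, and the column name in the usage comment is an assumption, since the columns of channel_list.csv are not described here. Check the header of your own file first.

```python
import csv


def filter_csv_rows(source, destination, keep):
    """Copy the rows for which keep(row) is true from source to destination.

    Each row is passed to `keep` as a dict keyed by the CSV header.
    Returns the number of rows written.
    """
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = [row for row in reader if keep(row)]
    with open(destination, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example usage, assuming a (hypothetical) "title" column exists:
# filter_csv_rows("DATA/channel_list.csv",
#                 "DATA/channel_list_filtered.csv",
#                 lambda row: "ADHD" in row["title"])
```

The filtered file keeps the original header, so it can be used wherever the original channel list was expected.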
Case 3: You already have a list of channels and want to crawl all videos of the channels in the list:
import pandas as pd
import videolab_youtube_crawler as ytcrawler
ytcrawler.init_config()
# get channel metadata in a given channel list
channel_crawler = ytcrawler.ChannelCrawler()
channel_crawler.crawl_channel_in_list(channel_list_workfile="speech_disability_channels.csv")
channel_crawler.json_to_csv()
# This step usually returns too many videos. Filter the channel list before crawling.
channel_crawler.crawl_all_videos_of_channels(channel_list_workfile='speech_disability_channels.csv')
# merge the searched videos into a workfile
video_searcher = ytcrawler.VideoSearcher()
video_searcher.merge_to_workfile()
# crawl all videos in the workfile
video_crawler = ytcrawler.VideoCrawler()
video_crawler.crawl_videos_in_list(video_list_workfile='DATA/list_video.csv')
video_crawler.json_to_csv()
# get all comments from the list generated by merge_to_workfile()
comment_crawler = ytcrawler.CommentCrawler()
comment_crawler.crawl_comments_of_videos(video_list_workfile="DATA/list_video.csv", comment_page=10)
comment_crawler.json_to_csv()
# get all subtitles from the list generated by merge_to_workfile()
subtitle_crawler = ytcrawler.SubtitleCrawler()
subtitle_crawler.crawl_subtitles_in_list()
# download the videos in the list; see the documentation in DOCS for the
# quality, audio, batch, and video_data_dir options
video_downloader = ytcrawler.VideoDownloader()
video_downloader.download_videos_in_list(video_list_workfile="DATA/list_video.csv", quality='worst', audio=False,
                                         batch=200)
video_downloader.download_videos_in_list(video_list_workfile="DATA/list_video.csv", quality='worst', audio=True,
                                         video_data_dir='test')
Download files
Source Distribution
Hashes for videolab_youtube_crawler-1.1.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 3cd03634314c273a4b2a64dfdb51403d3697524be4276f0dbc717994058b8b11
MD5 | a34945b60a3884c37a6d1c2ab08bfe84
BLAKE2b-256 | 2331c3a08f2ea57cf829b7360d4619ba6f00fc336fd03fdf46e685f1e3e0da7a
Built Distribution
Hashes for videolab_youtube_crawler-1.1.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d0f7763f69dd965989aa1811ca4bb1a776558d074aa3f005fcf50a7e89ed782b
MD5 | c744a6a1e5e294bfbe07ec53c172d391
BLAKE2b-256 | 5dc922cf99f5ff6ec572669b5154e0e0dc8435fdc0c15af69231c2362e85176a
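If you download one of the release files by hand, its SHA256 digest can be compared against the value listed above. The following is a minimal sketch using Python's standard hashlib module. The function names are illustrative, and the file path in the usage comment assumes the file was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the wheel was saved next to this script):
# matches_sha256(
#     "videolab_youtube_crawler-1.1.3-py3-none-any.whl",
#     "d0f7763f69dd965989aa1811ca4bb1a776558d074aa3f005fcf50a7e89ed782b",
# )
```

A result of False means the file differs from the published release and should not be installed.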