Clark University package for crawling YouTube and cleaning the collected data
clarku-youtube-crawler
A YouTube crawler and a decoder for YouTube JSON, from Clark University. Please read the documentation in DOCS.
PyPI page: https://pypi.org/project/clarku-youtube-crawler/
Installing
To install:
pip install clarku-youtube-crawler
The crawler depends on several other packages. They are declared as dependencies, so pip should install them automatically.
If any are still missing, download requirements.txt, navigate to the folder that contains it, and run:
pip install -r requirements.txt
Upgrading
To upgrade:
pip install clarku-youtube-crawler --upgrade
Then go to the project folder and delete config.ini if it already exists.
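After installing or upgrading, it can help to confirm which version pip actually installed. The sketch below uses only the Python standard library; `installed_version` is a hypothetical helper name chosen for illustration and is not part of clarku-youtube-crawler.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="clarku-youtube-crawler"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "clarku-youtube-crawler is not installed")
```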
YouTube API Key
- Go to https://cloud.google.com/, click Console, and create a project. Under Credentials, copy the API key.
- In your project folder, create a file named "DEVELOPER_KEY.txt" (the name must match exactly) and paste your API key into it.
- To use multiple API keys, put each one on its own line in DEVELOPER_KEY.txt.
- The crawler uses one key until its quota is exhausted, then moves on to the next, until every key's quota is used up.
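The key rotation described in the list above can be pictured roughly as follows. This is a minimal sketch, not the package's actual implementation: `QuotaExhausted`, `load_keys`, and `call_with_rotation` are hypothetical names, and `request` stands in for whatever function performs one API call with a given key.

```python
from pathlib import Path


class QuotaExhausted(Exception):
    """Hypothetical signal that one API key has no quota left."""


def load_keys(path="DEVELOPER_KEY.txt"):
    """Read one API key per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_rotation(keys, request):
    """Try request(key) with each key in order until one succeeds."""
    for key in keys:
        try:
            return request(key)
        except QuotaExhausted:
            continue  # this key is spent, move on to the next one
    raise RuntimeError("All API keys have exhausted their quota")
```

Keeping one key per line means that adding quota is only a matter of appending a line to the file.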
Example usage
Case 1: Crawl videos by keywords.
import clarku_youtube_crawler as cu
# Crawl all JSONs
crawler = cu.RawCrawler()
crawler.build("low visibility")
crawler.crawl("low visibility", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.crawl("blind", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.merge_to_workfile()
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all(save_to='low visibility/all_videos.json')
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='low visibility/all_videos.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build("low visibility")
subtitleCrawler.crawl_csv(
    videos_to_collect="low visibility/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="low visibility/subtitles/"
)
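The crawl() calls in Case 1 take the start date as three separate integers. If your dates already live in `datetime.date` objects, a small helper can build those keyword arguments. This is a sketch only: `crawl_date_kwargs` is a hypothetical name, and it assumes that start_date means the day of the month, as the example values (14, 12, 2020) suggest.

```python
from datetime import date


def crawl_date_kwargs(start, day_count):
    """Map a date and a day count onto the crawl() keyword names used in Case 1."""
    if day_count < 1:
        raise ValueError("day_count must be at least 1")
    return {
        "start_date": start.day,  # assumed to be the day of the month
        "start_month": start.month,
        "start_year": start.year,
        "day_count": day_count,
    }


kwargs = crawl_date_kwargs(date(2020, 12, 14), day_count=5)
# With the package installed: crawler.crawl("low visibility", **kwargs)
```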
Case 2: Crawl videos from a list of IDs given in the videoId column of an input CSV.
import clarku_youtube_crawler as cu
crawler = cu.RawCrawler()
work_dir = "blind"
crawler.build(work_dir)
# Replace videos_to_collect.csv with your own CSV file and name its video ID column with video_id.
# Each video ID must be ":" followed by the YouTube video ID, e.g. ":wl4m1Rqmq-Y".
crawler.crawl_videos_in_list(
    video_list_workfile="videos_to_collect.csv",
    comment_page_count=1,
    search_key="blind",
    video_id="videoId"
)
crawler.merge_all(save_to='all_raw_data.json')
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='all_raw_data.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect="videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="YouTube_CSV/subtitles/"
)
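Case 2 expects every video ID in the CSV to carry a leading ":". One way to prepare such a file from plain YouTube IDs is sketched below with the standard csv module. `write_video_list` is a hypothetical helper, not part of the package, and the single-column layout is an assumption; your real file may have more columns.

```python
import csv


def write_video_list(video_ids, path="videos_to_collect.csv", column="videoId"):
    """Write one row per video ID, adding the leading ":" when it is missing."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        for video_id in video_ids:
            prefixed = video_id if video_id.startswith(":") else ":" + video_id
            writer.writerow([prefixed])
```

For example, `write_video_list(["wl4m1Rqmq-Y"])` produces a file whose videoId column contains ":wl4m1Rqmq-Y".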
Case 3: Search a list of channels by search keys, then crawl all videos belonging to those channels.
import clarku_youtube_crawler as cu
chCrawler = cu.ChannelCrawler()
work_dir = "low visibility"
chCrawler.build(work_dir)
# You can search different channels. All results will be merged
chCrawler.search_channel("low visibility")
chCrawler.search_channel("blind")
chCrawler.merge_to_workfile()
chCrawler.crawl()
# Crawl videos posted by the selected channels. channels_to_collect.csv records which search keys found each channel.
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all(save_to=work_dir + '/all_videos.json')
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
Case 4: Crawl all videos of the channels in a list you already have.
import clarku_youtube_crawler as cu
work_dir = 'disability'
chCrawler = cu.ChannelCrawler()
chCrawler.build(work_dir)
# Replace channels_to_collect.csv with your own file and name its channel ID column with channel_header.
# Channel IDs are the original YouTube channel IDs.
# This call downloads all videos of the listed channels.
chCrawler.crawl(
    filename='channels_to_collect.csv',
    channel_header="channelId",
    comment_page_count=3  # 100 comments per page, so 3 pages crawl the first 300 comments
)
# Crawl videos posted by the channels
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
# crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')
# Crawl subtitles from CSV
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
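The Case 4 example notes that each comment page holds 100 comments, so comment_page_count=3 covers the first 300. To go the other way, from a target number of comments to a page count, a small calculation is enough. `pages_for_comments` is a hypothetical helper, and the page size of 100 is taken from the comment in the example rather than verified against the API.

```python
import math

COMMENTS_PER_PAGE = 100  # figure taken from the comment in the Case 4 example


def pages_for_comments(target_comments, per_page=COMMENTS_PER_PAGE):
    """Return the smallest page count that covers target_comments comments."""
    if target_comments <= 0:
        return 0
    return math.ceil(target_comments / per_page)
```

The result can then be passed as comment_page_count, for example `comment_page_count=pages_for_comments(250)`.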
Source Distribution
Hashes for clarku_youtube_crawler-2.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | deb44e40838e23f7b72489ab48000348c9d09ce1c7a6e6a3aa56a8ab523362ec
MD5 | 98ee998f3d3a95d0514f6930c6cf9b07
BLAKE2b-256 | 6500bbf0df7b987f8851d94c2a95c8ff8cf72dd72176bfaf9622b72bee393cb1
Built Distribution
Hashes for clarku_youtube_crawler-2.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a43da66b3de1b4cf77759f84056ca56a627ca0a41d79c84052994c65aa98e78e
MD5 | 8266bf92800ecc3fd85d4b586f27e1cc
BLAKE2b-256 | 72d4abd19f3318c510d65750796f1f2b8fe709a3af1723581c2dbe6fbbc4ffc6
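If you download one of these files manually, you can compare its SHA256 digest against the value listed above. The sketch below uses only hashlib from the standard library; `sha256_of_file` and `matches_sha256` are hypothetical helper names, and the file path you pass in is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of file size.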