Clark University, Package for YouTube crawler and cleaning data
Project description
clarku-youtube-crawler
Clark University YouTube crawler and JSON decoder for YouTube JSON data. Please read the documentation in DOCS.
PyPI page: https://pypi.org/project/clarku-youtube-crawler/
Installing
To install:
pip install clarku-youtube-crawler
The crawler depends on several other packages. All dependencies are already declared, so they should install automatically. If any requirement is still missing, download requirements.txt, navigate to the folder that contains it, and run
pip install -r requirements.txt
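To confirm that the install worked, a quick check along these lines may help. This is only a sketch using the standard library; `installed_version` is a hypothetical helper name, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="clarku-youtube-crawler"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if pip installed the package, otherwise None.
print(installed_version())
```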
Upgrading
To upgrade:
pip install clarku-youtube-crawler --upgrade
Then go to the project folder and delete config.ini if it already exists.
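The config.ini cleanup can also be scripted. The following is a minimal sketch, assuming config.ini sits directly in the project folder; `remove_stale_config` is a hypothetical helper name, not something the package provides.

```python
from pathlib import Path


def remove_stale_config(project_dir="."):
    """Delete config.ini in project_dir if present; return True if a file was removed."""
    config = Path(project_dir) / "config.ini"
    if config.is_file():
        config.unlink()
        return True
    return False
```

Calling `remove_stale_config()` from the project folder after an upgrade does nothing when the file is not there.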
YouTube API Key
- Go to https://cloud.google.com/, click Console, and create a project. Under Credentials, copy the API key.
- In your project folder, create a file named DEVELOPER_KEY.txt (the file name must match exactly) and paste your API key into it.
- You can use multiple API keys by putting each one on its own line in DEVELOPER_KEY.txt.
- The crawler uses up the quota of one key, then moves on to the next, until the quota of every key is exhausted.
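The key file layout and the "try the next key" behaviour described above can be pictured with a small sketch. This is not the package's internal code; `load_api_keys`, `QuotaExhausted`, and `call_with_key_rotation` are hypothetical names used only for illustration.

```python
from pathlib import Path


class QuotaExhausted(Exception):
    """Hypothetical signal that the current key has no quota left."""


def load_api_keys(path="DEVELOPER_KEY.txt"):
    """Return the non-empty lines of the key file, one API key per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_key_rotation(keys, request):
    """Call request(key) with each key in order, moving on when quota runs out."""
    for key in keys:
        try:
            return request(key)
        except QuotaExhausted:
            continue  # this key is used up, so try the next one
    raise QuotaExhausted("all API keys have exhausted their quota")
```

Here `request` stands for any function that performs one API call with the given key and raises `QuotaExhausted` when that key is spent.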
Example usage
Case 1: Crawl videos by keywords.
import clarku_youtube_crawler as cu
# Crawl all JSONs
crawler = cu.RawCrawler()
crawler.build("low visibility")
crawler.crawl("low visibility", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.crawl("blind", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.merge_to_workfile()
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all(save_to='low visibility/all_videos.json')
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='low visibility/all_videos.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build("low visibility")
subtitleCrawler.crawl_csv(
    videos_to_collect="low visibility/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="low visibility/subtitles/"
)
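When the start date comes from a `datetime.date` object, the date arguments used in Case 1 can be built in one place. The sketch below only reuses the keyword names shown in the `crawl` call above; `crawl_window` is a hypothetical helper, and it makes no assumption about how the crawler interprets `day_count`.

```python
from datetime import date


def crawl_window(start, day_count):
    """Turn a date and a day count into the keyword arguments used by crawl()."""
    return {
        "start_date": start.day,
        "start_month": start.month,
        "start_year": start.year,
        "day_count": day_count,
    }


kwargs = crawl_window(date(2020, 12, 14), day_count=5)
# crawler.crawl("low visibility", **kwargs)  # same arguments as the call in Case 1
```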
Case 2: Crawl videos from a list of IDs given in the videoId column of an input CSV.
import clarku_youtube_crawler as cu
crawler = cu.RawCrawler()
work_dir = "blind"
crawler.build(work_dir)
# Replace videos_to_collect.csv with your own CSV file. Use video_id to name the column that holds the video IDs.
# Each video ID must be ":" followed by the YouTube video ID, e.g. ":wl4m1Rqmq-Y"
crawler.crawl_videos_in_list(
    video_list_workfile="videos_to_collect.csv",
    comment_page_count=1,
    search_key="blind",
    video_id="videoId"
)
crawler.merge_all(save_to='all_raw_data.json')
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='all_raw_data.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect="videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="YouTube_CSV/subtitles/"
)
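The input CSV for Case 2 can be produced from a plain list of video IDs. The following sketch writes a single videoId column and adds the leading ":" where it is missing. `write_video_list` is a hypothetical helper, and it assumes the crawler needs no columns other than the ID column.

```python
import csv


def write_video_list(video_ids, path="videos_to_collect.csv", column="videoId"):
    """Write one row per video ID, prefixing ':' when it is not already there."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])  # header row naming the ID column
        for vid in video_ids:
            writer.writerow([vid if vid.startswith(":") else ":" + vid])
```

For example, `write_video_list(["wl4m1Rqmq-Y"])` creates videos_to_collect.csv with the header `videoId` and the row `:wl4m1Rqmq-Y`.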
Case 3: Search a list of channels by search keys, then crawl all videos belonging to those channels.
import clarku_youtube_crawler as cu
chCrawler = cu.ChannelCrawler()
work_dir = "low visibility"
chCrawler.build(work_dir)
# You can run several channel searches. All results will be merged
chCrawler.search_channel("low visibility")
chCrawler.search_channel("blind")
chCrawler.merge_to_workfile()
chCrawler.crawl()
# Crawl videos posted by the selected channels. channels_to_collect.csv records which search keys found each channel
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos_visibility.json')
# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
Case 4: You already have a list of channels. You want to crawl all videos of the channels in the list:
import clarku_youtube_crawler as cu
work_dir = 'disability'
chCrawler = cu.ChannelCrawler()
chCrawler.build(work_dir)
chCrawler.crawl(filename='channels_to_collect.csv', channel_header="channelId")
# Crawl videos posted by the selected channels
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=10)  # 100 comments per page, so 10 pages crawl up to 1000 comments
crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')
# Crawl subtitles from CSV
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
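The comment in Case 4 notes that each page holds 100 comments, so 10 pages yield 1000. To pick comment_page_count for a different target, the arithmetic can be sketched as follows. `pages_for` is a hypothetical helper, and the page size is taken from that comment rather than verified against the API.

```python
import math

COMMENTS_PER_PAGE = 100  # figure taken from the comment in Case 4


def pages_for(comment_target):
    """Smallest page count that covers at least comment_target comments."""
    return math.ceil(comment_target / COMMENTS_PER_PAGE)


# crawler.crawl_videos_in_list(comment_page_count=pages_for(1000))
```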
Download files
Source Distribution
Hashes for clarku_youtube_crawler-2.0.6.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 4732ae97ac9f1ab21bf1fbbd989ffee02ca5379cd5c6f4a267f30d4ae8e050a4
MD5 | 5c025bab73d50a4114dd812b5a755047
BLAKE2b-256 | 55e3c06db238fff1bb9bccb97edcf6d24117e5cd3055ec315c4d2cb99cd94fbf
Built Distribution
Hashes for clarku_youtube_crawler-2.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | faf8f3b53119caa2398d611d7d89635cb9371e91b035b60b28669f2c859400fe
MD5 | 7678a2f196185f5ba02364bda1e739d9
BLAKE2b-256 | bbae8e91e967f3f9bcf7c7d3a8d9f662cf659dd27bb012d062aaf9011f009c4c