Clark University package for crawling YouTube and cleaning the crawled data
clarku-youtube-crawler
Clark University YouTube crawler and JSON decoder for YouTube JSON. Please read the documentation in DOCS.
PyPI page: https://pypi.org/project/clarku-youtube-crawler/
Installing
To install, run:
pip install clarku-youtube-crawler
The crawler needs several other packages to function. They are declared as dependencies of the package, so pip should install them automatically.
If any requirements are missing, download requirements.txt, navigate to the folder that contains it, and run:
pip install -r requirements.txt
YouTube API Key
- Go to https://cloud.google.com/, click Console, and create a project. Under Credentials, copy the API key.
- In your project folder, create a file named DEVELOPER_KEY.txt (the name must match exactly) and paste your API key into it.
- To use multiple API keys, put each key on its own line in DEVELOPER_KEY.txt.
- The crawler uses up the quota of one key, then moves on to the next, until the quotas of all keys are used up.
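The key handling described above can be pictured with the following minimal sketch. It is not the crawler's actual implementation: `QuotaExceeded`, `load_keys`, and `call_with_rotation` are hypothetical names introduced only for illustration, and `request` stands in for whatever API call is being made.

```python
from pathlib import Path


class QuotaExceeded(Exception):
    """Hypothetical signal that the current key has no quota left."""


def load_keys(path="DEVELOPER_KEY.txt"):
    """Return the non-empty lines of the key file, one API key per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def call_with_rotation(keys, request):
    """Try request(key) with each key in order until one succeeds."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceeded:
            continue  # this key is used up, move on to the next one
    raise RuntimeError("all API keys have exhausted their quota")
```

Under this reading, the order of the lines in DEVELOPER_KEY.txt determines the order in which keys are consumed.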
Example usage
Case 1: Crawl videos by keyword:
# your_script.py
import clarku_youtube_crawler as cu
test = cu.RawCrawler()
test.build()
test.crawl("any_keyword", start_date=14, start_month=12, start_year=2020, day_count=2)
test.merge_to_workfile()
test.crawl_videos_in_list(comment_page_count=1)
test.merge_all(save_to='FINAL_merged_all.json')
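To make the date arguments in Case 1 concrete, here is a small sketch of one plausible reading: `start_date` is the day of the month, and `day_count` is the number of consecutive days covered, starting on the start date. This interpretation is an assumption, `crawl_days` is a hypothetical helper that is not part of the package, and DOCS remains the authority on what the crawler actually does.

```python
from datetime import date, timedelta


def crawl_days(start_year, start_month, start_date, day_count):
    """List the calendar days in the assumed crawl window."""
    start = date(start_year, start_month, start_date)
    return [start + timedelta(days=offset) for offset in range(day_count)]


for day in crawl_days(2020, 12, 14, 2):
    print(day.isoformat())
# prints 2020-12-14, then 2020-12-15
```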
Case 2: Crawl videos from a list of IDs given in the videoId column of an input CSV:
import clarku_youtube_crawler as cu
test = cu.RawCrawler()
test.build()
test.crawl_videos_in_list(video_list_workfile="video_list.csv", comment_page_count=1, search_key="any_key")
test.merge_all(save_to='all.json')
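If you do not have an input file for Case 2 yet, a sketch like the following could produce one. Only the videoId column name comes from the description above. `write_video_list` is a hypothetical helper, the IDs are placeholders, and it is assumed that a single-column CSV is sufficient.

```python
import csv


def write_video_list(video_ids, path="video_list.csv", header="videoId"):
    """Write one video ID per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([header])
        for video_id in video_ids:
            writer.writerow([video_id])


if __name__ == "__main__":
    # Placeholder IDs; replace them with real YouTube video IDs.
    write_video_list(["VIDEO_ID_1", "VIDEO_ID_2"])
```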
Case 3: Crawl all videos of the channels found by a keyword search, i.e. every video of any channel that has at least one video on a specific topic. Run Case 1 first to generate the raw video JSON file (FINAL_merged_all.json). The code below finds all unique channels in that JSON file, builds a list of channels, and then crawls all videos of every channel in the list.
import clarku_youtube_crawler as cu
channel = cu.ChannelCrawler()
channel.build()
channel.setup_channel(filename='FINAL_merged_all.json', subscriber_cutoff=0, keyword="")
channel.crawl()
channel.crawl_videos_in_list(comment_page_count=3)  # 100 comments per page, so 3 pages crawl the first 300 comments
channel.merge_all(save_to='file_of_merged_json.json')
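The "find all unique channels" step in Case 3 amounts to de-duplicating channel IDs across video records. The sketch below illustrates the idea on in-memory data. The layout of FINAL_merged_all.json is not described here, so the record shape (a channelId nested under snippet, as in YouTube Data API video resources) is an assumption, and `unique_channel_ids` is a hypothetical helper rather than part of the package.

```python
def unique_channel_ids(videos):
    """Return channel IDs in order of first appearance, without duplicates."""
    ids = (video["snippet"]["channelId"] for video in videos)
    return list(dict.fromkeys(ids))  # dicts keep insertion order


sample = [
    {"id": "v1", "snippet": {"channelId": "chan-A"}},
    {"id": "v2", "snippet": {"channelId": "chan-B"}},
    {"id": "v3", "snippet": {"channelId": "chan-A"}},
]
print(unique_channel_ids(sample))  # ['chan-A', 'chan-B']
```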
Case 4: You already have a list of channels and want to crawl all videos of the channels in that list:
import clarku_youtube_crawler as cu
channel = cu.ChannelCrawler()
channel.build()
channel.crawl(filename='channel_list.csv', channel_header="channelId")
channel.crawl_videos_in_list(comment_page_count=3)  # 100 comments per page, so 3 pages crawl the first 300 comments
channel.merge_all(save_to='file_of_merged_json.json')
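Before starting a long crawl from your own list, it can help to confirm that the CSV really has the column you pass as `channel_header`. The following is a minimal sketch of such a check using only the standard library. `read_id_column` is a hypothetical helper, not something the package provides.

```python
import csv


def read_id_column(path, header):
    """Return the non-empty values of one named column of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if header not in (reader.fieldnames or []):
            raise ValueError(f"{path} has no column named {header!r}")
        ids = []
        for row in reader:
            value = (row[header] or "").strip()
            if value:  # skip rows where the ID cell is empty
                ids.append(value)
        return ids
```

For example, `read_id_column('channel_list.csv', 'channelId')` raises a `ValueError` right away if the header is misspelled, instead of failing partway through a crawl.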
To crawl all subtitles of videos in a CSV list:
import clarku_youtube_crawler as cu
subtitle = cu.SubtitleCrawler()
subtitle.build()
subtitle.crawl_csv(filename='video_list.csv', video_header='videoId')
To convert the video JSON to CSV:
import clarku_youtube_crawler as cu
decoder = cu.JSONDecoder()
decoder.load_json("FINAL_merged_all.json")
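For readers unfamiliar with what a JSON-to-CSV conversion involves in general, here is a standard-library sketch that flattens nested records into dotted column names. It is not how JSONDecoder works internally and its output columns will not match the package's. `flatten` and `records_to_csv` are hypothetical helpers, and the sample records are invented.

```python
import csv
import io


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def records_to_csv(records):
    """Return CSV text with one row per record and one column per flat key."""
    rows = [flatten(record) for record in records]
    columns = list(dict.fromkeys(col for row in rows for col in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = [{"id": "v1", "snippet": {"title": "T"}}, {"id": "v2"}]
print(records_to_csv(records))
```

Records that lack a key simply get an empty cell in that column.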
Hashes for clarku_youtube_crawler-1.3.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 0c8c7cedadf214d51c83233c81447314ed0ebc6a405411c57791c4275f52e8ab
MD5 | f7e822dce0ae0b1c47dd6c83166bbe60
BLAKE2b-256 | ccb8c02cc94c99186bf13eee686edbc5bf7ffe5f7f0a2b08ae8cfe521fd3c40d
Hashes for clarku_youtube_crawler-1.3.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | fbdabff2b7f591ce385e06b816be4e1809d4f7608e188836b8ede5d6865054f5
MD5 | 2e6a0e45b9f271b0fed14b62e4dcda76
BLAKE2b-256 | 7218f9db9b65463e96c6fe1a8dbc9ff89424661fcbdacf5764b8467fefd4fc67