
A Clark University package for crawling YouTube and cleaning the collected data

Project description

clarku-youtube-crawler

Clark University YouTube crawler and decoder for YouTube JSON. Please read the documentation in DOCS.

PyPI page: https://pypi.org/project/clarku-youtube-crawler/

Installing

To install:

pip install clarku-youtube-crawler

The crawler depends on several other packages. All dependencies are declared in the package, so a plain pip install should pull them in automatically; if any are missing, download requirements.txt, navigate to the folder containing it, and run

pip install -r requirements.txt

Upgrading

To upgrade:

pip install clarku-youtube-crawler --upgrade

After upgrading, go to your project folder and delete config.ini if it is already there.

YouTube API Key

  • Go to https://cloud.google.com/, open the Console, and create a project. Under Credentials, copy the API key.
  • In your project folder, create a file named "DEVELOPER_KEY.txt" (the file name must match exactly) and paste your API key into it.
  • You can use multiple API keys by putting them on separate lines in DEVELOPER_KEY.txt, as in the example below.
  • The crawler will use up the quota of one key and then try the next one, until all quotas are exhausted.
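
For example, a DEVELOPER_KEY.txt holding two keys would look like the following (the values are placeholders, not real keys):

AIzaSy-EXAMPLE-KEY-ONE
AIzaSy-EXAMPLE-KEY-TWO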

Example usage

Case 1: crawl videos by keywords

import clarku_youtube_crawler as cu

# Crawl raw video JSONs for each search key
crawler = cu.RawCrawler()
crawler.build("low visibility")  # sets up the working directory
crawler.crawl("low visibility", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.crawl("blind", start_date=14, start_month=12, start_year=2020, day_count=5)
crawler.merge_to_workfile()  # merge search results into videos_to_collect.csv
crawler.crawl_videos_in_list(comment_page_count=1)  # one comment page = up to 100 comments
crawler.merge_all(save_to='low visibility/all_videos.json')

# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='low visibility/all_videos.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build("low visibility")
subtitleCrawler.crawl_csv(
    videos_to_collect="low visibility/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir="low visibility/subtitles/"
)
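
After merge_to_workfile(), the videos queued for crawling are listed in videos_to_collect.csv under the working directory. A minimal sanity check before the subtitle step (assuming pandas is installed; the crawler itself does not require this):

import pandas as pd

work = pd.read_csv("low visibility/videos_to_collect.csv")
print(len(work), "videos queued")
print(work["videoId"].head())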

Case 2: crawl videos from a list of IDs given by the videoId column of an input CSV

import clarku_youtube_crawler as cu

crawler = cu.RawCrawler()
work_dir = "blind"
crawler.build(work_dir)

# Point video_list_workfile at your own CSV and set video_id to the column holding the IDs.
# Video IDs must be ":" + the YouTube video ID, e.g., ":wl4m1Rqmq-Y" (see the helper sketch below).

crawler.crawl_videos_in_list(video_list_workfile="videos_to_collect.csv",
                             comment_page_count=1,
                             search_key="blind",
                             video_id="videoId"
                             )
crawler.merge_all(save_to='all_raw_data.json')
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file='all_raw_data.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect="videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=f"YouTube_CSV/subtitles/"
)
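
If your source list contains plain YouTube IDs, a small helper (hypothetical, not part of the package) can write them into the expected format with the ":" prefix:

import csv

plain_ids = ["wl4m1Rqmq-Y"]  # plain YouTube video IDs

with open("videos_to_collect.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["videoId"])  # the column name passed as video_id above
    for vid in plain_ids:
        writer.writerow([":" + vid])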

Case 3: Search a list of channels by search keys, then crawl all videos belonging to those channels.

import clarku_youtube_crawler as cu

chCrawler = cu.ChannelCrawler()
work_dir = "low visibility"
chCrawler.build(work_dir)
# You can search with several keys; all channel results will be merged
chCrawler.search_channel("low visibility")
chCrawler.search_channel("blind")
chCrawler.merge_to_workfile()
chCrawler.crawl()

# Crawl videos posted by the selected channels. channels_to_collect.csv records which search key found each channel
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=1)
crawler.merge_all()

# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos_visibility.json')

# Crawl subtitles from CSV
# If you don't need subtitles, delete the following lines
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir+"/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir+"/subtitles/"
)
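
To review the selected channels before crawling their videos, a quick sketch (assuming pandas, and assuming channels_to_collect.csv is written under the working directory; check the actual header, since column names may differ):

import pandas as pd

channels = pd.read_csv("low visibility/channels_to_collect.csv")
print(channels.columns.tolist())  # inspect the real column names first
print(channels.head())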

Case 4: you already have a list of channels and want to crawl all videos posted by those channels:

import clarku_youtube_crawler as cu

work_dir = 'disability'
chCrawler = cu.ChannelCrawler()
chCrawler.build(work_dir)

# channel_header names the CSV column that holds the channel IDs
chCrawler.crawl(filename='mturk_test.csv', channel_header="Input.channelId")

# Crawl videos posted by selected channels
crawler = cu.RawCrawler()
crawler.build(work_dir)
crawler.merge_to_workfile(file_dir=work_dir + "/video_search_list/")
crawler.crawl_videos_in_list(comment_page_count=10)  # 100 comments per page; 10 pages collect up to 1,000 comments

crawler.merge_all()
# Convert JSON to CSV
decoder = cu.JSONDecoder()
decoder.json_to_csv(data_file=work_dir + '/all_videos.json')

# Crawl subtitles from CSV
subtitleCrawler = cu.SubtitleCrawler()
subtitleCrawler.build(work_dir)
subtitleCrawler.crawl_csv(
    videos_to_collect=work_dir + "/videos_to_collect.csv",
    video_id="videoId",
    sub_title_dir=work_dir + "/subtitles/"
)
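
The input CSV for this case only needs a header row with the channel-ID column; a minimal sketch of building mturk_test.csv (the channel ID below is a placeholder):

import csv

with open("mturk_test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Input.channelId"])  # the column passed as channel_header above
    writer.writerow(["UCxxxxxxxxxxxxxxxxxxxxxx"])  # placeholder channel ID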

Download files

Download the file for your platform.

Source Distribution

clarku_youtube_crawler-2.1.3.tar.gz (16.8 kB)

Uploaded Source

Built Distribution

clarku_youtube_crawler-2.1.3-py3-none-any.whl (19.0 kB)

Uploaded Python 3

File details

Details for the file clarku_youtube_crawler-2.1.3.tar.gz.

File metadata

  • Download URL: clarku_youtube_crawler-2.1.3.tar.gz
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.27.1 setuptools/60.8.2 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for clarku_youtube_crawler-2.1.3.tar.gz

  • SHA256: be3a0072b4db082c753eb170f249d82f78690bc5636d30ebb973e3e458a30da5
  • MD5: c7393dd6945a830534fb5c72675c2af5
  • BLAKE2b-256: b9629017def4482727c467e8502ac3c0b70d3dded5bd32e1d619a63366df0d8c

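To verify a downloaded file against the hashes above, a quick check with Python's hashlib (the file name is assumed to match the listing):

import hashlib

expected = "be3a0072b4db082c753eb170f249d82f78690bc5636d30ebb973e3e458a30da5"
with open("clarku_youtube_crawler-2.1.3.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "MISMATCH")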

File details

Details for the file clarku_youtube_crawler-2.1.3-py3-none-any.whl.

File metadata

  • Download URL: clarku_youtube_crawler-2.1.3-py3-none-any.whl
  • Size: 19.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.27.1 setuptools/60.8.2 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for clarku_youtube_crawler-2.1.3-py3-none-any.whl

  • SHA256: fdd3cdb0adaac5ede7afa621291ddcc092584871678d2e40cd33fd5d16c4ab3f
  • MD5: 40aae42d881bc3a8f16bc6fe1340af18
  • BLAKE2b-256: 03930d83dfbedd816977e6dcc14addacaf05301c01b2cd82f22309b500ff5d58

