Extract YouTube video titles and URLs with end-to-end web scraping API + automate Selenium webdriver dependency set up

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Python Quick Start

Python 3.6+ setup (required if not already installed)

This package uses f-strings, which were introduced in Python 3.6, so it requires Python 3.6+.
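If you are unsure which interpreter version you have, a small check like the following can confirm it before you install anything. This is only a sketch: the helper name `meets_minimum_python` is made up for illustration and is not part of yt-videos-list. It deliberately avoids f-strings so that it also runs on older interpreters.

```python
import sys

def meets_minimum_python(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if __name__ == '__main__':
    if meets_minimum_python():
        print('This interpreter is new enough (3.6+)')
    else:
        print('Please install Python 3.6 or newer')
```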

If you have an older version of Python, you can download Python 3.9.1 (follow links below) and follow the instructions to set up Python for your machine. If you want to install a different version, visit the Python Downloads page and select the version you want.

macOS 64-bit installer
Windows x86-64 executable installer
Windows x86 executable installer
Gzipped source tarball (most useful for Linux)

Permissions for first run

This is required to make sure you can download and install the required Selenium binary dependencies.

On Windows: make sure you open Command Prompt or PowerShell (both work) in "Run as Administrator" mode

shortcut: ⊞ Win + X + A

On Unix-based machines (MacOS, Linux): make sure you have read and write access to /usr/local/bin/

If you're not sure, open a terminal and run sudo chown $USER /usr/local/bin/
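Before changing ownership, you can check whether you already have the access described above. The sketch below uses only the standard library; the function name `has_read_write_access` is illustrative and not something the package provides.

```python
import os

def has_read_write_access(directory):
    """Return True if `directory` exists and the current user can read and write it."""
    return os.path.isdir(directory) and os.access(directory, os.R_OK | os.W_OK)

if __name__ == '__main__':
    target = '/usr/local/bin/'
    if has_read_write_access(target):
        print(target + ' is readable and writable, no chown needed')
    else:
        print('No read/write access to ' + target)
```

If the check reports no access, the chown command above is the fix this README suggests.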

Installing the package

After you install Python 3.6+ and ensure you have the required permissions as needed, enter the following in your command line:

# if something isn't working properly, try rerunning this
# the problem may have been fixed with a newer version

pip3 install -U yt-videos-list     # MacOS/Linux
pip  install -U yt-videos-list     # Windows

Running the package from the python interpreter

python3     # MacOS/Linux
python      # Windows

from yt_videos_list import ListCreator


my_driver = 'firefox' # SUBSTITUTE DRIVER YOU WANT (options below)
lc = ListCreator(driver=my_driver, scroll_pause_time=0.8)


lc.create_list_for(url='https://www.youtube.com/user/schafer5')
lc.create_list_for(url='https://www.youtube.com/channel/UC8butISFwT-Wl7EV0hUK0BQ', log_silently=True)
# Set `log_silently` to `True` to mute program logging to the console.
# The program will then log the program status and any program information
# only to the log file for the channel being scraped
# (this is useful when scraping multiple channels at once with multi-threading).
# By default, the program logs to both the log file for the channel being scraped AND the console.


# see the new files that were just created:
import os
os.system('ls -lt | head')                      # MacOS/Linux
os.system('dir /O-D | find "_videos_list"')     # Windows

# for more information on using the module:
help(lc)

driver options include:
- 'firefox'
- 'opera'
- 'safari' (MacOS only)
- 'chrome'
- 'brave'
- 'edge' (Windows only!)
increase scroll_pause_time for laggy internet and decrease scroll_pause_time for fast internet

If you already scraped a channel and the channel uploaded a new video, simply rerun this program on that channel and this package updates your files to include the newer video(s)!
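The `os.system` calls in the interpreter session above are platform-specific. A cross-platform way to list the newest output files might look like the sketch below. It assumes the generated file names contain `_videos_list`, which is inferred from the Windows `find "_videos_list"` filter above rather than from the package's documentation; the function name is hypothetical.

```python
import glob
import os

def newest_output_files(directory='.', pattern='*_videos_list*', limit=10):
    """Return up to `limit` matching file paths, most recently modified first."""
    paths = glob.glob(os.path.join(directory, pattern))
    paths.sort(key=os.path.getmtime, reverse=True)
    return paths[:limit]

if __name__ == '__main__':
    for path in newest_output_files():
        print(path)
```

Rerunning this after rescraping a channel should show the updated files at the top of the list.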

Scraping multiple channels from a file simultaneously with multi-threading

Add the URL of every channel you want to extract information from to a txt file, with each URL on its own line.

e.g. channels.txt

https://www.youtube.com/channel/UCSHZKyawb77ixDdsGog4iWA
https://www.youtube.com/c/WorldScienceFestival/playlists
https://www.youtube.com/c/RSAConference/videos
https://www.youtube.com/channel/UCtC8aQzdEHAmuw8YvtH1CcQ/videos
https://www.youtube.com/channel/UCQSrdt0-Iu8qVEiJyzhrfdQ/videos
https://www.youtube.com/user/TEDxTalks/videos
https://www.youtube.com/user/TEDxYouth
https://www.youtube.com/user/TEDPrizeChannel/videos
https://www.youtube.com/user/TEDInstitute/videos
https://www.youtube.com/user/TEDPartners/channels
https://www.youtube.com/c/TheVerge/channels
https://www.youtube.com/c/mitocw/channels
https://www.youtube.com/c/stanford/channels
https://www.youtube.com/c/khanacademy/channels
https://www.youtube.com/c/TEDEdStudentTalks/channels
https://www.youtube.com/c/TED/channels
https://www.youtube.com/c/TEDFellow/videos
https://www.youtube.com/c/tedededucatortalks/videos
https://www.youtube.com/c/TEDTranslators/videos
https://www.youtube.com/c/TEDEspanol/videos
https://www.youtube.com/teded/featured
https://www.youtube.com/c/IBMSecurity/channels
https://www.youtube.com/user/symantec/channels
https://www.youtube.com/c/QuantamagazineOrgNews/videos
https://www.youtube.com/c/Splunkofficial/channels

Enter the python interpreter:

python3     # MacOS/Linux
python      # Windows

import time
import threading   # python standard library built-in package, no download necessary
from yt_videos_list import ListCreator

my_driver = 'firefox'
lc = ListCreator(driver=my_driver, scroll_pause_time=0.8)

number_of_threads         = 4 # CHANGE TO DESIRED NUMBER OF CONCURRENT THREADS
path_to_channel_urls_file = 'channels.txt'

with open(path_to_channel_urls_file, 'r', encoding='utf-8') as file:
    threads = []
    for line in file:
        url = line.strip() # remove the trailing newline and surrounding whitespace
        if not url:
            continue # skip blank lines
        while threading.active_count() >= number_of_threads + 1: # add 1 since main thread counts as a thread
            time.sleep(5) # wait 5 seconds before checking to see if a previously running thread completed
        thread = threading.Thread(target=lc.create_list_for, args=(url, True))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join() # After we iterate through every line in the file, we call the join() method
    # on EVERY thread (not just the last one started) so python doesn't exit the multi-threaded
    # environment prematurely, before the program finishes writing all the channel information to the files!

See Thread about multi-threading with yt_videos_list for more information!
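As an alternative to managing `threading.Thread` objects by hand, the same pattern can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`, which caps the number of concurrent workers and waits for all of them to finish. This is not the package's own approach: the helper names `read_channel_urls` and `scrape_all` are hypothetical, and the scraping call is passed in as a plain callable so the helpers do not depend on yt-videos-list directly.

```python
from concurrent.futures import ThreadPoolExecutor

def read_channel_urls(path):
    """Return the non-empty lines of the file at `path`, stripped of whitespace."""
    with open(path, 'r', encoding='utf-8') as file:
        return [line.strip() for line in file if line.strip()]

def scrape_all(urls, scrape_one, number_of_threads=4):
    """Call scrape_one(url) for every URL using at most `number_of_threads` threads.

    Returns a list of (url, error) pairs for the calls that raised an exception.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=number_of_threads) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future, url in futures.items():
            error = future.exception()  # blocks until this call has finished
            if error is not None:
                failures.append((url, error))
    return failures

# Intended usage with this package (not executed here):
#   from yt_videos_list import ListCreator
#   lc = ListCreator(driver='firefox', scroll_pause_time=0.8)
#   urls = read_channel_urls('channels.txt')
#   failures = scrape_all(urls, lambda url: lc.create_list_for(url=url, log_silently=True))
```

Collecting failures per URL makes it easier to rerun only the channels that did not complete.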

Explicitly downloading all Selenium dependencies

Ideal if you use Selenium for other projects 😎

Make sure you already have the yt-videos-list package installed (follow directions above for getting set up), then run the following:

pip3 install -U yt-videos-list # MacOS/Linux: ensure latest package
python3                        # MacOS/Linux: enter python interpreter
pip install -U yt-videos-list  # Windows:     ensure latest package
python                         # Windows:     enter python interpreter

from yt_videos_list.download import selenium_webdriver_dependencies
selenium_webdriver_dependencies.download_all()

That's all! 🤓

More API information

NOTE that you can also access all the information below from the Python interpreter by entering

import yt_videos_list
help(yt_videos_list)

# default options for the ListCreator object

ListCreator(
  txt=True,
  csv=True,
  md=True,
  reverse_chronological=True,
  headless=False,
  scroll_pause_time=0.8,
  driver='firefox'
  )

There are a number of optional arguments you can specify when instantiating the ListCreator object. The values shown above are the defaults, but in case you want more flexibility, you can specify the:

driver argument:
- Firefox (default): driver='firefox'
- Opera: driver='opera'
- Safari (MacOS only): driver='safari'
- Chrome: driver='chrome'
- Brave: driver='brave'
- Edge (Windows only): driver='edge'
txt, csv, md file type argument:
- True (default) - create a file for the specified type
- False - do not create a file for the specified type
  - txt=True (default) OR txt=False
  - csv=True (default) OR csv=False
  - md=True (default) OR md=False
reverse_chronological argument:
- True (default) - write the files in order from most recent video to the oldest video
- False - write the files in order from oldest video to the most recent video
  - reverse_chronological=True (default) OR reverse_chronological=False
headless argument:
- False (default) - run the driver with an open Selenium instance for viewing
- True - run the driver in "invisible" mode
  - headless=False (default) OR headless=True
scroll_pause_time argument:
- any float value greater than 0 (default 0.8)
  - The value you provide is how long the program waits before trying to scroll the videos list page down for the channel you want to scrape. For fast internet connections, you may want to reduce the value, and for slow connections you may want to increase the value.
  - scroll_pause_time=0.8 (default)
  - CAUTION: reducing this value too much will result in the program not capturing all the videos, so be careful! Experiment :)
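The platform restrictions and the "greater than 0" rule listed above can be checked before constructing a ListCreator. The sketch below is an illustration only: `drivers_for` and `validate_scroll_pause_time` are hypothetical helpers, not part of the package, and they encode nothing beyond what the option list states.

```python
import platform

ALL_DRIVERS = ('firefox', 'opera', 'safari', 'chrome', 'brave', 'edge')

def drivers_for(system=None):
    """Return the driver names usable on `system` ('Darwin', 'Windows', 'Linux', ...)."""
    system = system or platform.system()
    excluded = set()
    if system != 'Darwin':
        excluded.add('safari')  # listed above as MacOS only
    if system != 'Windows':
        excluded.add('edge')    # listed above as Windows only
    return [name for name in ALL_DRIVERS if name not in excluded]

def validate_scroll_pause_time(value):
    """Return `value` as a float, raising ValueError unless it is greater than 0."""
    value = float(value)
    if value <= 0:
        raise ValueError('scroll_pause_time must be greater than 0')
    return value

if __name__ == '__main__':
    print(drivers_for())
    print(validate_scroll_pause_time(0.8))
```

The validated values could then be passed along as `ListCreator(driver=..., scroll_pause_time=...)`.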

Cloning and running locally

To clone the repository and install the most up-to-date version of the package, which may not yet be available in the latest release on PyPI, run:

git clone https://github.com/Shail-Shouryya/yt_videos_list.git

cd yt_videos_list/python # MacOS/Linux
pip3 install .           # MacOS/Linux

cd yt_videos_list\python # Windows
pip install .            # Windows

To make your own changes to the yt_videos_list python package and run the changes locally:

# make changes to the codebase in the
# ===> /dev <=== directory
python3 minifier.py           # MacOS/Linux
pip3 install .                # MacOS/Linux

python minifier.py            # Windows
pip install .                 # Windows

NOTE that the changes you make to the codebase SHOULD BE MADE in the yt_videos_list/python/dev directory!!

The code in the yt_videos_list/python/yt_videos_list directory is minified with
- leading indents stripped to the minimum (1 space for each nested scope)
- whitespace for padding (e.g. extra spaces to align variable assignments) stripped
- comments stripped
As a result, the code in the yt_videos_list/python/yt_videos_list directory is NOT human readable, and the yt_videos_list/python/dev directory should be used for development instead!
- the minifier.py module performs all the code preprocessing and packages the code from yt_videos_list/python/dev into the final version seen in the yt_videos_list/python/yt_videos_list directory
- so running minifier.py before installing the local package with pip install . (Windows) or pip3 install . (MacOS/Linux) is essential!

Running tests

Make sure you're in the yt_videos_list/python directory, then run:

tests\run_tests.bat     # Windows
####       Any shell on   MacOS/Linux
bash tests/run_tests.sh # this works
csh  tests/run_tests.sh # this works
dash tests/run_tests.sh # this works
ksh  tests/run_tests.sh # this also works
tcsh tests/run_tests.sh # this works too
zsh  tests/run_tests.sh # this works as well
# you can try other shells and
# they should work too, since
# there's no special syntax in
# the run_tests.sh file

If you found this interesting or useful, please consider starring this repo so other people can more easily find and use this. Thanks!

Project details

Release history

0.6.7

Nov 11, 2023

0.6.6

Dec 5, 2022

0.6.5

Nov 28, 2022

0.6.4

Aug 10, 2022

0.6.3

Nov 28, 2021

0.6.2

Sep 12, 2021

0.6.1

Sep 7, 2021

0.6.0

Jul 19, 2021

0.5.9

Jun 28, 2021

0.5.8

May 24, 2021

0.5.7

May 17, 2021

0.5.6

May 10, 2021

0.5.5

Apr 26, 2021

0.5.4

Feb 22, 2021

0.5.3

Feb 1, 2021

This version

0.5.2

Jan 9, 2021

0.5.1

Jan 6, 2021

0.5.0

Jan 5, 2021

0.4.7

Nov 2, 2020

0.4.6

Oct 6, 2020

0.4.5

Sep 5, 2020

0.4.4

Aug 14, 2020

0.4.3

Jul 5, 2020

0.4.2

Jun 15, 2020

0.4.1

Jun 10, 2020

0.4.0

Jun 3, 2020

0.3.9

May 28, 2020

0.3.8

May 14, 2020

0.3.7

May 10, 2020

0.3.6

May 8, 2020

0.3.5

May 6, 2020

0.3.4

May 4, 2020

0.3.3

May 3, 2020

0.3.2

May 3, 2020

0.3.1

Apr 26, 2020

0.3.0

Apr 25, 2020

0.2.17

Apr 25, 2020

0.2.16

Dec 1, 2019

0.2.15

Dec 1, 2019

0.2.14

Nov 27, 2019

0.2.13

Nov 27, 2019

0.2.12

Nov 25, 2019

0.2.11

Nov 25, 2019

0.2.10

Nov 25, 2019

0.2.9

Nov 22, 2019

0.2.8

Nov 22, 2019

0.2.7

Nov 21, 2019

0.2.6

Nov 19, 2019

0.2.5

Nov 19, 2019

0.2.4

Nov 15, 2019

0.2.3

Nov 14, 2019

0.2.2

Nov 12, 2019

0.2.1

Nov 7, 2019

0.1.6

Nov 7, 2019

0.1.5

Oct 24, 2019

0.1.4

Oct 18, 2019

0.1.3

Oct 6, 2019

0.1.2

Sep 24, 2019

0.1.1

Sep 24, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yt_videos_list-0.5.2.tar.gz (30.9 kB view hashes)

Uploaded Jan 9, 2021 Source

Hashes for yt_videos_list-0.5.2.tar.gz

Algorithm	Hash digest
SHA256	`9cc0c7b3c85baa8e1e1756c1187694cbd988211e723d085da9bae35a24a29ae9`
MD5	`81fa011268cb1f9a67e7cc989fed4a27`
BLAKE2b-256	`2bba2ec568086a996c00fe21776f0eeae0d950e5a5b60f615a922888e0f31ac6`
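To check a downloaded archive against the SHA256 digest in the table above, you can compute the digest locally with `hashlib`. This is a minimal sketch; the function names are illustrative, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, 'rb') as file:
        # read in chunks so large files do not need to fit in memory
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Intended usage (assumes the archive was downloaded to the current directory):
#   expected = '9cc0c7b3c85baa8e1e1756c1187694cbd988211e723d085da9bae35a24a29ae9'
#   print(matches_published_hash('yt_videos_list-0.5.2.tar.gz', expected))
```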