A Python package to scrape YouTube comments using Selenium and BeautifulSoup

YouTube Comment Scraper

YouTubeCommentScraper is a Python package for scraping comments from YouTube videos. It drives a browser with Selenium and parses the loaded page with BeautifulSoup. You can run the browser in headless mode, set the timeout for element loading, adjust the pause between scroll actions, enable logging to a file, and optionally return the page source along with the comments.

Features

  • Headless Mode: Run the browser in headless mode (optional).
  • Customizable Timeouts: Set the timeout for waiting for elements to load.
  • Automatic Scrolling: Automatically scrolls the page until all comments are loaded.
  • Logging Support: Enable logging to a file for tracking activities.
  • Return Page Source: Optionally return the page source along with the comments.
  • BeautifulSoup Integration: Extract comments using BeautifulSoup for robust parsing.

Installation

To install the package, use the following command:

pip install youtube-comments-scrapper

Dependencies

This package requires the following dependencies:

  • selenium
  • webdriver-manager
  • beautifulsoup4
  • lxml (optional but recommended for faster HTML parsing)

If you need to install the dependencies manually (optional), run:

pip install selenium webdriver-manager beautifulsoup4 lxml

Usage

1. Basic Usage: Scraping Comments

Here's a simple example to scrape comments from a YouTube video:

from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)

print("Comments:", comments)
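
The returned list can be written to disk for later analysis. The sketch below is illustrative only: `save_comments` is a hypothetical helper, not part of the package, and it assumes each comment is a plain string (the element type is not stated here).

```python
import json
from pathlib import Path


def save_comments(comments, path):
    """Write non-empty comment strings to a UTF-8 JSON file; return how many were written."""
    # Assumption: each comment is a plain string. Non-string items are skipped.
    cleaned = [c.strip() for c in comments if isinstance(c, str) and c.strip()]
    Path(path).write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return len(cleaned)


# Usage (hypothetical file name), given the list returned by scrape_comments():
# save_comments(comments, "comments.json")
```

Stripping whitespace and dropping empty entries keeps the output file tidy if the page yields blank comment nodes.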

2. Scraping Comments with Logging Enabled

Enable logging to track the actions performed during scraping:

from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=True, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)

print("Comments:", comments)

This will generate a log file (youtube_scraper.log) in the current directory.

3. Returning Page Source Along with Comments

To get the page's HTML source back along with the comments:

from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=True)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments, page_source = scraper.scrape_comments(video_url)

print("Comments:", comments)
print("Page Source:", page_source)
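
Printing the full page source is rarely useful because of its size. As a lighter alternative, the sketch below pulls just the `<title>` text out of an HTML string using only the standard library. `TitleParser` and `extract_title` are hypothetical names for illustration and are not part of the package. With beautifulsoup4 installed you could do the same in fewer lines.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only accumulate text while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_title(page_source):
    """Return the stripped <title> text of an HTML document, or '' if absent."""
    parser = TitleParser()
    parser.feed(page_source)
    parser.close()
    return parser.title.strip()


# Usage, given the page_source string returned alongside the comments:
# print("Title:", extract_title(page_source))
```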

4. Custom Scroll Pause Time

You can control how long the scraper pauses between scroll actions using the scroll_pause_time parameter:

from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=2.0, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)

print("Comments:", comments)
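
To see why the pause matters, here is a minimal sketch of a generic scroll-until-stable loop. It is not the package's actual implementation. `scroll_until_stable`, `get_height`, `scroll_to_bottom`, and `max_rounds` are hypothetical names. The page is injected as two callables so the loop can be tried without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=1.5, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls performed."""
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_to_bottom()
        rounds += 1
        time.sleep(pause)  # give newly requested comments time to render
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new appeared, so assume the end was reached
        last_height = new_height
    return rounds


# With a Selenium WebDriver instance named `driver`, the callables could be:
# get_height = lambda: driver.execute_script(
#     "return document.documentElement.scrollHeight")
# scroll_to_bottom = lambda: driver.execute_script(
#     "window.scrollTo(0, document.documentElement.scrollHeight);")
```

In a loop like this, a pause that is too short can end the scroll early on a slow connection, because the height check runs before new comments arrive. A longer pause trades speed for completeness.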

5. Scraping Comments Without Scrolling

If you only want to scrape the comments that load without scrolling:

from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url, scroll=False)

print("Comments:", comments)

6. Logging Custom Messages

You can log custom messages using the built-in log_info, log_warning, and log_error methods:

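from youtube_comments_scraper import YouTubeCommentScraper

scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=True, return_page_source=False)

scraper.log_info("This is an info log message.")
scraper.log_warning("This is a warning message.")
scraper.log_error("This is an error message.")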

Class Reference

YouTubeCommentScraper

__init__(self, headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)

  • headless (bool): Run the browser in headless mode. Default is True.
  • timeout (int): The maximum time to wait for elements to load. Default is 10 seconds.
  • scroll_pause_time (float): The pause time between scroll actions. Default is 1.5 seconds.
  • enable_logging (bool): Whether to enable logging to a file. Default is False.
  • return_page_source (bool): Whether to return the page source along with comments. Default is False.

scrape_comments(self, video_url, scroll=True)

  • video_url (str): The URL of the YouTube video.
  • scroll (bool): Whether to scroll the page to load all comments. Default is True.

Returns:

  • A tuple (comments, page_source) if return_page_source is True, otherwise just the list of comments.
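
Because the return shape depends on return_page_source, calling code may want to normalize it. The following is a small sketch of one way to do that. `unpack_result` is a hypothetical helper, not part of the package, and it relies only on the return behavior described above.

```python
def unpack_result(result):
    """Return (comments, page_source); page_source is None when it was not requested."""
    # A 2-tuple means return_page_source=True was used.
    if isinstance(result, tuple) and len(result) == 2:
        comments, page_source = result
        return list(comments), page_source
    # Otherwise the result is just the list of comments.
    return list(result), None


# Usage with either configuration:
# comments, page_source = unpack_result(scraper.scrape_comments(video_url))
```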

log_info(self, message)

  • Logs an informational message.

log_warning(self, message)

  • Logs a warning message.

log_error(self, message)

  • Logs an error message.

License

This project is licensed under the MIT License.
