A Python package to scrape YouTube comments using Selenium and BeautifulSoup
YouTube Comment Scraper
YouTubeCommentScraper is a Python package that scrapes comments from YouTube videos using Selenium and BeautifulSoup. The scraper is configurable: you can run the browser in headless mode, set the timeout for waiting on elements, adjust the pause between scroll actions, enable logging to a file, and optionally return the page source along with the comments.
Features
- Headless Mode: Run the browser in headless mode (optional).
- Customizable Timeouts: Set the timeout for waiting for elements to load.
- Automatic Scrolling: Automatically scrolls the page until all comments are loaded.
- Logging Support: Enable logging to a file for tracking activities.
- Return Page Source: Optionally return the page source along with the comments.
- BeautifulSoup Integration: Extract comments using BeautifulSoup for robust parsing.
Installation
To install the package, use the following command:
pip install youtube-comments-scrapper
Dependencies
This package requires the following dependencies:
- selenium
- webdriver-manager
- beautifulsoup4
- lxml (optional but recommended for faster HTML parsing)
If you prefer to install these dependencies yourself, you can optionally run the following command:
pip install selenium webdriver-manager beautifulsoup4 lxml
Usage
1. Basic Usage: Scraping Comments
Here's a simple example to scrape comments from a YouTube video:
from youtube_comments_scraper import YouTubeCommentScraper
scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)
print("Comments:", comments)
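Since the scraper launches a browser for every call, it can help to check the URL before scraping. The following is a minimal sketch using only the standard library; extract_video_id is a hypothetical helper and not part of this package, and it only handles the common watch and youtu.be URL shapes.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a YouTube watch or youtu.be URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Watch URLs carry the ID in the "v" query parameter.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        return None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=Ycg48pVp3SU"))
```

A caller could skip scrape_comments whenever this helper returns None.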
2. Scraping Comments with Logging Enabled
Enable logging to track the actions performed during scraping:
from youtube_comments_scraper import YouTubeCommentScraper
scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=True, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)
print("Comments:", comments)
This will generate a log file (youtube_scraper.log) in the current directory.
3. Returning Page Source Along with Comments
To get the page's HTML source back along with the comments, set return_page_source=True:
from youtube_comments_scraper import YouTubeCommentScraper
scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=True)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments, page_source = scraper.scrape_comments(video_url)
print("Comments:", comments)
print("Page Source:", page_source)
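Printing a full page source to the console is rarely practical, so you may prefer to write both results to disk. The sketch below assumes that comments is a JSON-serializable list (for example, a list of strings) and that page_source is a string; save_results and the file names comments.json and page.html are illustrative choices, not part of the package.

```python
import json
import tempfile
from pathlib import Path


def save_results(comments, page_source, out_dir):
    """Write comments to comments.json and the HTML to page.html in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    comments_path = out / "comments.json"
    html_path = out / "page.html"
    # ensure_ascii=False keeps non-Latin text readable in the JSON file.
    comments_path.write_text(
        json.dumps(list(comments), ensure_ascii=False, indent=2), encoding="utf-8"
    )
    html_path.write_text(page_source, encoding="utf-8")
    return comments_path, html_path


# Demo with stand-in data instead of a real scrape.
with tempfile.TemporaryDirectory() as tmp:
    comments_file, html_file = save_results(
        ["Great video!", "Merci, très utile"], "<html></html>", tmp
    )
    print(comments_file.name, html_file.name)
```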
4. Custom Scroll Pause Time
You can control how long the scraper pauses between scroll actions using the scroll_pause_time parameter:
from youtube_comments_scraper import YouTubeCommentScraper
scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=2.0, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url)
print("Comments:", comments)
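To see why the pause matters, here is a sketch of a common scroll-until-stable loop. It is an assumption about how such scrapers typically work, not the package's actual implementation: the loop scrolls, waits for the pause, and stops once the page height no longer grows. If the pause is too short, the height may be read before new comments render and the loop can stop early. The names scroll_until_stable, get_height, scroll_to_bottom, and max_rounds are hypothetical.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=1.5,
                        max_rounds=100, sleep=time.sleep):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause)  # give lazily loaded comments time to render
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new loaded, so stop
        last_height = new_height
    return max_rounds  # safety cap for pages that keep growing


# Demo with a fake page that grows three times and then stops.
heights = iter([1000, 2000, 3000, 4000, 4000])
state = {"height": next(heights)}


def fake_scroll():
    state["height"] = next(heights)


rounds = scroll_until_stable(
    lambda: state["height"], fake_scroll, pause=0, sleep=lambda seconds: None
)
print(rounds)
```

With Selenium, the two callables would typically wrap driver.execute_script, reading document.documentElement.scrollHeight and calling window.scrollTo respectively.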
5. Scraping Comments Without Scrolling
To scrape only the comments that load without scrolling, pass scroll=False:
from youtube_comments_scraper import YouTubeCommentScraper
scraper = YouTubeCommentScraper(headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)
video_url = "https://www.youtube.com/watch?v=Ycg48pVp3SU"
comments = scraper.scrape_comments(video_url, scroll=False)
print("Comments:", comments)
6. Logging Custom Messages
You can log custom messages on an existing scraper instance using the built-in log_info, log_warning, and log_error methods:
scraper.log_info("This is an info log message.")
scraper.log_warning("This is a warning message.")
scraper.log_error("This is an error message.")
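For readers curious how optional file logging of this kind can be built, here is a minimal sketch on top of the standard logging module. It is not the package's code: OptionalFileLogger, its close method, the logger name, and the message format are all assumptions made for illustration. Only the method names log_info, log_warning, and log_error and the default file name youtube_scraper.log come from this README.

```python
import logging
import os
import tempfile


class OptionalFileLogger:
    """Hypothetical helper, not the package's code: logs only when enabled."""

    def __init__(self, enable_logging=False, log_file="youtube_scraper.log"):
        self._logger = logging.getLogger("scraper_sketch")
        self._logger.setLevel(logging.INFO)
        self._logger.propagate = False  # keep messages out of the root logger
        self.close()  # drop handlers left over from an earlier instance
        if enable_logging:
            handler = logging.FileHandler(log_file, encoding="utf-8")
            handler.setFormatter(
                logging.Formatter("%(asctime)s %(levelname)s %(message)s")
            )
            self._logger.addHandler(handler)
        else:
            # NullHandler silently discards every message.
            self._logger.addHandler(logging.NullHandler())

    def log_info(self, message):
        self._logger.info(message)

    def log_warning(self, message):
        self._logger.warning(message)

    def log_error(self, message):
        self._logger.error(message)

    def close(self):
        """Flush and detach all handlers so the log file is released."""
        for handler in list(self._logger.handlers):
            self._logger.removeHandler(handler)
            handler.close()


# Demo: write three messages to a log file in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "youtube_scraper.log")
log = OptionalFileLogger(enable_logging=True, log_file=log_path)
log.log_info("This is an info log message.")
log.log_warning("This is a warning message.")
log.log_error("This is an error message.")
log.close()

with open(log_path, encoding="utf-8") as fh:
    print(fh.read())
```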
Class Reference
YouTubeCommentScraper
__init__(self, headless=True, timeout=10, scroll_pause_time=1.5, enable_logging=False, return_page_source=False)
- headless (bool): Run the browser in headless mode. Default is True.
- timeout (int): The maximum time to wait for elements to load. Default is 10 seconds.
- scroll_pause_time (float): The pause time between scroll actions. Default is 1.5 seconds.
- enable_logging (bool): Whether to enable logging to a file. Default is False.
- return_page_source (bool): Whether to return the page source along with comments. Default is False.
scrape_comments(self, video_url, scroll=True)
- video_url (str): The URL of the YouTube video.
- scroll (bool): Whether to scroll the page to load all comments. Default is True.
Returns:
- A tuple (comments, page_source) if return_page_source is True, otherwise just the list of comments.
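Because the return shape depends on return_page_source, code that accepts scrapers configured either way may want to normalize the result. The sketch below is a hypothetical helper (split_result is not part of the package). It assumes the comments are returned as a list, so that a 2-tuple can only mean the (comments, page_source) form.

```python
def split_result(result):
    """Return (comments, page_source); page_source is None when absent."""
    if isinstance(result, tuple) and len(result) == 2:
        # Shape returned when return_page_source is True.
        comments, page_source = result
        return comments, page_source
    # Shape returned when return_page_source is False: just the comments.
    return result, None


# Demo with stand-in values for both return shapes.
print(split_result(["first", "second"]))
print(split_result((["first"], "<html></html>")))
```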
log_info(self, message)
- Logs an informational message.
log_warning(self, message)
- Logs a warning message.
log_error(self, message)
- Logs an error message.
License
This project is licensed under the MIT License.
File details
Details for the file youtube-comments-scrapper-0.4.0.tar.gz.
File metadata
- Download URL: youtube-comments-scrapper-0.4.0.tar.gz
- Upload date:
- Size: 4.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | beb4b979939b9632f97339d31ea93f34af783d498fabb564904409958086ed2d |
| MD5 | 76d96490c918f1d0d874168db354f08d |
| BLAKE2b-256 | 2997943c941832f9d87b29a1c26d9ffb303b8386f17cd34017bc35a9cd97ef62 |
File details
Details for the file youtube_comments_scrapper-0.4.0-py3-none-any.whl.
File metadata
- Download URL: youtube_comments_scrapper-0.4.0-py3-none-any.whl
- Upload date:
- Size: 5.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38dd376b2a56e5aa7ac8575eb3b07bc725a315cafa581cc35fb1da1873f4ac3e |
| MD5 | 8d7603e4a1b90959b306088ce2ecdc8e |
| BLAKE2b-256 | 0c92c58a21236f9d73370e2137b4c1a526e8cfb3c5e03b40e19762cb6948cc51 |