Scraping high intensity content sites
Project description
SiteScraper
This repository contains the following methods:
yt_vedio_data()
This method uses Selenium and the Firefox web driver to scrape the title, view count, and upload time of YouTube videos from a given URL. It returns a dictionary with the keys 'title', 'views', and 'when' and their corresponding values.
yt_vedio_comment()
This method uses Selenium and the Firefox web driver to scrape the text, like count, and posting time of YouTube comments from a given URL. It returns a Pandas DataFrame with the columns 'comment_text', 'likes', and 'comment_time'.
yt_vedio_links()
This method uses Selenium and the Chrome web driver to scrape video links from YouTube. Pass the URL as an argument; the method returns a list of links.
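The 'views' and 'likes' values are scraped as display text rather than numbers. Assuming they look like '1.2M views' or '3,456' (the exact format is not documented here and may vary by locale), a small helper could convert them to integers before analysis. The name parse_count is hypothetical and is not part of SiteScraper; this is only a minimal sketch.

```python
import re

# Multipliers for the abbreviated counts that YouTube is assumed to display.
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert text such as '1.2M views' or '3,456' to an int.

    Returns None when the text contains no number (for example 'No views').
    """
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)([KMB])?\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(round(number * _SUFFIXES.get(suffix, 1)))
```

With a DataFrame built from the scraped data, the helper could be applied column-wise, for example dataframe['views'].map(parse_count).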
To use these methods, you will need to have Python 3 installed, along with the following libraries: pandas, selenium, and geckodriver-autoinstaller.
To install the required libraries, you can use pip:
pip install pandas selenium geckodriver-autoinstaller
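To confirm that the installation worked before running a scrape, the import names can be checked from Python. The sketch below assumes the usual import names for these packages (geckodriver-autoinstaller is imported as geckodriver_autoinstaller); the helper missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names corresponding to the pip packages listed above.
REQUIRED_MODULES = ["pandas", "selenium", "geckodriver_autoinstaller"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```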
To run the methods, import the SiteScraper module and create an instance of the 'yt_vedio' class, as the examples below do:
import SiteScraper as ss
Example code to extract video data such as the title, view count, and upload time:
import SiteScraper as ss
import pandas as pd
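# Create the scraper object, then scrape the given URL.
scraper = ss.yt_vedio()
# Method name as written in this example; the method list above spells it yt_vedio_data().
new_data = scraper.yt_vedios_data('https://www.youtube.com/@xxxxxxxxxxx')
# Convert the returned dictionary to a DataFrame and save it as CSV.
dataframe = pd.DataFrame(new_data)
dataframe.to_csv('xxxxx.csv', index=False)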
Example code to extract comments from a given YouTube video:
import SiteScraper as ss
import pandas as pd
# Create the scraper object, then scrape the comments of the given video.
scraper = ss.yt_vedio()
new_data = scraper.yt_vedio_comment('https://www.youtube.com/watch?v=xxxxxxxx')
dataframe = pd.DataFrame(new_data)
dataframe.to_csv('youtube_comments.csv', index=False)
Example code to extract the links of all videos posted at a given YouTube URL:
import SiteScraper as ss
import pandas as pd
# Create the scraper object, then collect the video links.
scraper = ss.yt_vedio()
data = scraper.yt_vedio_links('https://www.youtube.com/xxxxxxx')
df = pd.DataFrame(data)
df.to_csv('xxxx.csv', index=False)
Note that you will need to replace the placeholder URL with the actual YouTube URL that you want to scrape. New methods will be committed for Twitter and Reddit.
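The list returned by yt_vedio_links() may need tidying before it is saved. Assuming it can contain duplicates, empty entries, or non-video URLs (this is not stated in the description above), a sketch like the following keeps one canonical watch URL per video. The function unique_video_links is hypothetical and is not part of SiteScraper.

```python
from urllib.parse import parse_qs, urlparse


def unique_video_links(links):
    """Keep one canonical watch URL per video ID, preserving first-seen order."""
    seen = set()
    result = []
    for link in links:
        if not link:
            continue  # skip None and empty strings
        parsed = urlparse(link)
        video_ids = parse_qs(parsed.query).get("v")
        if parsed.path != "/watch" or not video_ids:
            continue  # not a standard watch URL
        video_id = video_ids[0]
        if video_id not in seen:
            seen.add(video_id)
            result.append(f"https://www.youtube.com/watch?v={video_id}")
    return result
```

The cleaned list can then be passed to pd.DataFrame in place of the raw result.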
Note: To run SiteScraper in Google Colab, run the following commands in notebook cells:
# Run these commands in a cell
!apt-get update
!apt-get install chromium chromium-driver
!pip install selenium
%%shell
# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF
# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg
# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF
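Once Chromium and its driver are installed, Selenium in Colab generally has to run the browser headless, because no display is available. The sketch below shows one common set of Chrome flags. Whether SiteScraper lets you pass your own options or driver is not documented here, so treat this as a standalone illustration; COLAB_CHROME_FLAGS and build_chrome_options are hypothetical names.

```python
# Flags commonly used to run Chrome/Chromium in a container without a display.
COLAB_CHROME_FLAGS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def build_chrome_options(flags=COLAB_CHROME_FLAGS):
    """Return a Selenium ChromeOptions object with the given flags applied."""
    # Imported here so the flag list can be inspected without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in flags:
        options.add_argument(flag)
    return options


# Usage in a Colab cell, after the installation steps above:
# from selenium import webdriver
# driver = webdriver.Chrome(options=build_chrome_options())
```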
Source Distribution
File details
Details for the file SiteScraper-2.3.4.tar.gz.
File metadata
- Download URL: SiteScraper-2.3.4.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 988675129156b745c9982499b9408664d0749b3eec794cd0c945d4fc19ede3e0
MD5 | bc4ec7ea1c5f0edbb90d275cdc099bf2
BLAKE2b-256 | 38e0260cde28a67423dd0e6f87bdc599398e83185f3baff392275796a7270bc9