Skip to main content

Scraping high intensity content sites

Project description

SiteScraper

This repository contains the following methods:

yt_vedio_data()

This method uses Selenium and Firefox web driver to scrape YouTube videos' title, views, and upload time from a given URL. It returns a dictionary with keys 'title', 'views', and 'when', and corresponding values.

yt_vedio_comment()

This method uses Selenium and Firefox web driver to scrape YouTube comments' text, likes, and time posted from a given URL. It returns a Pandas DataFrame with columns 'comment_text', 'likes', and 'comment_time', and corresponding values.

yt_vedio_links()

This method uses selenium and chrome web driver to scrape out links from youtube. The url should be passed as an argument. The method returns a list.

To use these methods, you will need to have Python 3 installed, along with the following libraries: pandas, selenium, and geckodriver-autoinstaller.

To install the required libraries, you can use pip:

pip install pandas selenium geckodriver-autoinstaller

To run the methods, you will need to import the SiteScraper module and create an instance of the 'yt_vedio' or 'yt_vedio_comment' class:

import SiteScraper as ss

Example code to extract vedio data like likes title and the date when it was posted.


import SiteScraper as ss
import pandas as pd 
df = ss.yt_vedio()
new_data = df.yt_vedios_data('https://www.youtube.com/@xxxxxxxxxxx')
dataframe = pd.DataFrame(new_data)
dataframe.to_csv('xxxxx.csv', index=False)

Example code to extract comments from a given youtube vedio.


import SiteScraper as ss
import pandas as pd 
df = ss.yt_vedio()
new_data = df.yt_vedio_comment('https://www.youtube.com/watch?v=xxxxxxxx')
dataframe = pd.DataFrame(new_data)
dataframe.to_csv('youtube_comments.csv', index=False)

Example code to extract links of all the vedios posted on youtube.



import SiteScraper as ss
import pandas as pd 
d=ss.yt_vedio()
data=d.yt_vedio_links('https://www.youtube.com/xxxxxxx')
df=pd.DataFrame(data)
df.to_csv('xxxx.csv',index=False)


Note that you will need to replace the URL with the actual YouTube URL that you want to scrape. New methods will be commited for twitter and reddit

Note: This is for running SiteScraper in google colab :


#run this command in the cell 
!apt-get update
!apt-get install chromium  chromium-driver
!pip install selenium


%%shell

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SiteScraper-2.3.4.tar.gz (4.4 kB view details)

Uploaded Source

File details

Details for the file SiteScraper-2.3.4.tar.gz.

File metadata

  • Download URL: SiteScraper-2.3.4.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for SiteScraper-2.3.4.tar.gz
Algorithm Hash digest
SHA256 988675129156b745c9982499b9408664d0749b3eec794cd0c945d4fc19ede3e0
MD5 bc4ec7ea1c5f0edbb90d275cdc099bf2
BLAKE2b-256 38e0260cde28a67423dd0e6f87bdc599398e83185f3baff392275796a7270bc9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page