A Python library designed for scraping data from the SCP wiki.

SCP Scraper

A small Python library designed for scraping data from the SCP wiki. It was made with AI training (namely NLP models) and dataset collection (for things like categorizing SCPs for external projects) in mind, and it has arguments that make it easy to use in those applications.

Below you will find installation instructions and examples of the ways in which you can use this library. I hope you find it as useful as I have!

Sample Code

Installation

scpscraper can be installed with pip. Here's the command I recommend using; the --upgrade flag means you consistently get the latest version.

pip3 install --upgrade scpscraper
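
If you want to confirm which version ended up installed, here is a minimal sketch using only the standard library's importlib.metadata (Python 3.8+). The helper name installed_version is my own for this illustration and is not part of scpscraper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


current = installed_version("scpscraper")
if current is None:
    print("scpscraper is not installed yet; run the pip command above")
else:
    print(f"scpscraper {current} is installed")
```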

The Basics

Importing the Library

# Before we begin, we obviously have to import scpscraper.
import scpscraper

Grabbing an SCP's Name

# Let's use 3001 (Red Reality) as an example.
name = scpscraper.get_scp_name(3001)

print(name) # Outputs "Red Reality"
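
If you need names for several SCPs, a small loop is enough. The sketch below is only an illustration: collect_names and fake_fetch_name are hypothetical helpers, not part of scpscraper, and I'm assuming a failed lookup shows up as an exception, which may not match how the library actually behaves. The fetch function is passed in as an argument so the example runs without touching the wiki.

```python
def collect_names(scp_ids, fetch_name):
    """Map each SCP number to its name, or to None if the lookup fails."""
    names = {}
    for scp_id in scp_ids:
        try:
            names[scp_id] = fetch_name(scp_id)
        except Exception as error:  # assumption: failures surface as exceptions
            print(f"Could not fetch SCP-{scp_id}: {error!r}")
            names[scp_id] = None
    return names


def fake_fetch_name(scp_id):
    """Offline stand-in for a real lookup; raises KeyError for unknown ids."""
    known = {3001: "Red Reality"}
    return known[scp_id]


demo = collect_names([3001, 9999999], fake_fetch_name)
print(demo)
```

With the library installed, you would pass scpscraper.get_scp_name in place of fake_fetch_name.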

Grabbing as many details as possible about an SCP

# Again using 3001 as an example
info = scpscraper.get_scp(3001)

print(info) # Outputs a dictionary with the
# name, object id, rating, page content by section, etc.
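
Since the result is a plain dictionary, one convenient thing to do with it is save it as JSON for later. This is a rough sketch: the keys in example_info are placeholders I made up, so print the real dictionary to see its actual keys, and I'm assuming its values are JSON-friendly (anything that isn't gets converted to a string by default=str).

```python
import json
import tempfile
from pathlib import Path


def save_info(info, path):
    """Write an SCP info dictionary to a UTF-8 JSON file."""
    text = json.dumps(info, indent=2, ensure_ascii=False, default=str)
    Path(path).write_text(text, encoding="utf-8")


def load_info(path):
    """Read a dictionary back from a JSON file written by save_info."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder data; the real dictionary's keys may differ.
example_info = {"id": 3001, "name": "Red Reality", "tags": ["example"]}

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "scp-3001.json"
    save_info(example_info, target)
    restored = load_info(target)

print(restored == example_info)
```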

The Fun Stuff

Grabbing an SCP's page-content div HTML

For reference, the page-content div contains what the article's author actually wrote, without the surrounding Wikidot page markup.

# Once again, 3001 is the example
scp = scpscraper.get_single_scp(3001)

# Grab the page-content div specifically
content = scp.find('div', id='page-content')

print(content) # Outputs "<div id="page-content"> ... </div>"
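
If you only want the text of the div and not its tags, here is a minimal sketch that works on the HTML as a string (for example str(content)) using only the standard library. TextExtractor and html_to_text are names I made up for this illustration, and the sample markup is invented. If the object you get back supports BeautifulSoup's get_text(), that is probably the simpler route.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text fragments found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Invented sample markup standing in for a real page-content div.
sample = (
    '<div id="page-content">'
    "<p><strong>Item #:</strong> SCP-XXXX</p>"
    "<p>Example text.</p>"
    "</div>"
)
print(html_to_text(sample))
```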

Scraping HTML or information from multiple SCPs

# Grab info on SCPs 000-099
scpscraper.scrape_scps(0, 100)

# Same as above, but only grabbing Keter-class SCPs
scpscraper.scrape_scps(0, 100, tags=['keter'])

# Grab 000-099 in a format that can be used to train AI
scpscraper.scrape_scps(0, 100, ai_dataset=True)

# Scrape the page-content div's HTML from SCP-000 to SCP-099.
# Only including this as an example, but scrape_scps_html() has
# all the same options as scrape_scps().
scpscraper.scrape_scps_html(0, 100)
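
The comments above describe (0, 100) as covering SCPs 000-099, which suggests the second number is exclusive, like Python's range(). The sketch below just spells out which pages that would mean. It assumes that reading of the bounds is right, and scp_labels is a hypothetical helper, not part of the library.

```python
def scp_labels(start, stop):
    """List the SCP designations for start <= number < stop, zero-padded to 3 digits."""
    return [f"SCP-{number:03d}" for number in range(start, stop)]


labels = scp_labels(0, 100)
print(labels[0], labels[-1], len(labels))  # SCP-000 SCP-099 100
```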

Google Colaboratory Only Usage

Because of the google.colab module included in Google Colaboratory, we can do a few extra things there that we can't otherwise.

Mount your Google Drive to the Colaboratory VM

# Mounts it to the directory /content/drive/
scpscraper.gdrive.mount()

Scrape SCP info/HTML and copy to your Google Drive afterwards

# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.scrape_scps(0, 100, copy_to_drive=True)

scpscraper.scrape_scps_html(0, 100, copy_to_drive=True)

Copy other files to/from your Google Drive

# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.gdrive.copy_to_drive('example.txt')

scpscraper.gdrive.copy_from_drive('example.txt')
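
Since these features depend on the google.colab module, a script that should also run elsewhere can check for it first. This is a small sketch of that check; in_colab is a hypothetical helper name, and the scpscraper call is left as a comment so the snippet runs without the library installed.

```python
def in_colab():
    """Return True when the google.colab module can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colaboratory)
    except ImportError:
        return False
    return True


use_drive = in_colab()
print("Running in Colaboratory:", use_drive)

# Example use, assuming your Drive is already mounted when use_drive is True:
# scpscraper.scrape_scps(0, 100, copy_to_drive=use_drive)
```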

Planned Updates

Future updates may make it easy and viable to scrape data from any website, allowing for easy mass collection of data.

Link to GitHub Repo

Please consider checking it out! You can report issues, request features, contribute to this project, etc. in the GitHub Repo. That is the best way to reach me for issues/feedback relating to this project.

https://github.com/JaonHax/scpscraper/
