A Python library designed for scraping data from the SCP wiki.
SCP Scraper
A small Python library designed for scraping data from the SCP wiki. It was made with AI training (namely NLP models) and dataset collection (for things like categorizing SCPs for external projects) in mind, and it provides arguments that make it easier to use in those applications.
Below you will find installation instructions and examples of how to use this library. I hope you find this as useful as I have!
Sample Code
Installation
scpscraper can be installed via pip install. Here's the command I recommend using, so you consistently have the latest version.
pip3 install --upgrade scpscraper
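To confirm which version pip actually installed, a small check like the one below can help. This is only a sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is made up for illustration and is not part of scpscraper.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version, or None if scpscraper is not installed yet.
print(installed_version("scpscraper"))
```

If this prints None, the install command above has not been run in the current environment.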
The Basics
Importing the Library
# Before we begin, we obviously have to import scpscraper.
import scpscraper
Grabbing an SCP's Name
# Let's use 3001 (Red Reality) as an example.
name = scpscraper.get_scp_name(3001)
print(name) # Outputs "Red Reality"
Grabbing as many details as possible about an SCP
# Again using 3001 as an example
info = scpscraper.get_scp(3001)
print(info) # Outputs a dictionary with the
# name, object id, rating, page content by section, etc.
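Since the returned dictionary can hold long page content, it may be easier to look at a short preview of each entry first. The sketch below works on any dictionary; the sample record and its key names are hypothetical stand-ins, and the real keys returned by get_scp() may differ.

```python
def summarize_record(record, max_len=40):
    """Return one 'key: preview' line per entry, truncating long values."""
    lines = []
    for key, value in record.items():
        preview = repr(value)
        if len(preview) > max_len:
            preview = preview[: max_len - 3] + "..."
        lines.append(f"{key}: {preview}")
    return lines


# Hypothetical stand-in for a scraped record; key names are assumptions.
sample = {
    "name": "Red Reality",
    "id": 3001,
    "content": {"Description": "placeholder text " * 10},
}

for line in summarize_record(sample):
    print(line)
```

In practice you would pass the dictionary returned by scpscraper.get_scp(3001) instead of the sample.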
The Fun Stuff
Grabbing an SCP's page-content div HTML
For reference, the page-content div contains what the user actually wrote, without all the extra Wikidot external stuff.
# Once again, 3001 is the example
scp = scpscraper.get_single_scp(3001)
# Grab the page-content div specifically
content = scp.find('div', id='page-content')
print(content) # Outputs <div id="page-content"> ... </div>
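To illustrate what isolating the page-content div means, here is a standard-library-only sketch that runs on a made-up HTML snippet. It does not use scpscraper or contact the wiki; SAMPLE_HTML, PageContentExtractor, and extract_page_content are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class PageContentExtractor(HTMLParser):
    """Collects the text inside <div id="page-content"> and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the target; 0 means outside
        self.chunks = []  # non-empty text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside page-content
        elif dict(attrs).get("id") == "page-content":
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


# Made-up page: only the middle div is "what the user actually wrote".
SAMPLE_HTML = """
<html><body>
<div id="header">Wikidot navigation</div>
<div id="page-content">
  <p><strong>Item #:</strong> SCP-XXXX</p>
  <div class="footnote">Nested note</div>
  <p>Description text.</p>
</div>
<div id="footer">Site footer</div>
</body></html>
"""


def extract_page_content(html):
    """Return the page-content text as a single space-joined string."""
    parser = PageContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


text = extract_page_content(SAMPLE_HTML)
print(text)  # Item #: SCP-XXXX Nested note Description text.
```

The header and footer text never appear in the result, which is the same separation the page-content div gives you on a real page.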
Scraping HTML or information from multiple SCPs
# Grab info on SCPs 000-099
scpscraper.scrape_scps(0, 100)
# Same as above, but only grabbing Keter-class SCPs
scpscraper.scrape_scps(0, 100, tags=['keter'])
# Grab 000-099 in a format that can be used to train AI
scpscraper.scrape_scps(0, 100, ai_dataset=True)
# Scrape the page-content div's HTML from SCP-000 to SCP-099
# Only including this as an example, but scrape_scps_html() has
# all the same options as scrape_scps().
scpscraper.scrape_scps_html(0, 100)
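The comments above suggest that the two numbers behave like a half-open range, so (0, 100) covers SCP-000 through SCP-099. Assuming that reading is right, this sketch lists the designations a given pair of arguments would cover; scp_designations is a hypothetical helper, not a library function.

```python
def scp_designations(start, stop):
    """Designations for the half-open range [start, stop), zero-padded to 3 digits."""
    return [f"SCP-{number:03d}" for number in range(start, stop)]


batch = scp_designations(0, 100)
print(batch[0], batch[-1], len(batch))  # SCP-000 SCP-099 100
```

Numbers with four digits are left as they are, so 3001 becomes SCP-3001.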
Google Colaboratory Only Usage
Because of the google.colab module included in Google Colaboratory, we can do a few extra things there that we can't otherwise.
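If the same notebook or script may run both inside and outside Colaboratory, it can help to check for the google.colab module before calling anything Colab-specific. The sketch below uses a plain import test; running_in_colab is a hypothetical helper name, not something scpscraper provides.

```python
def running_in_colab():
    """Return True only when the google.colab module can be imported."""
    try:
        import google.colab  # noqa: F401  (normally only present inside Colab)
    except ImportError:
        return False
    return True


print("Colab features available:", running_in_colab())
```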
Mount your Google Drive to the Colaboratory VM
# Mounts it to the directory /content/drive/
scpscraper.gdrive.mount()
Scrape SCP info/HTML and copy to your Google Drive afterwards
# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.scrape_scps(0, 100, copy_to_drive=True)
scpscraper.scrape_scps_html(0, 100, copy_to_drive=True)
Copy other files to/from your Google Drive
# Requires your Google Drive to be mounted at the directory /content/drive/
scpscraper.gdrive.copy_to_drive('example.txt')
scpscraper.gdrive.copy_from_drive('example.txt')
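Because these calls require the drive to be mounted at /content/drive/, a quick directory check beforehand can give a clearer message than a failed copy. This is a minimal sketch using only the standard library; drive_is_mounted is a hypothetical helper, and it only checks that the directory exists, not that it is really a Google Drive mount.

```python
import os

# Mount point stated in the comments above.
DRIVE_MOUNT_POINT = "/content/drive"


def drive_is_mounted(mount_point=DRIVE_MOUNT_POINT):
    """Return True if the expected mount point exists as a directory."""
    return os.path.isdir(mount_point)


if drive_is_mounted():
    print("Drive directory found.")
else:
    print("Drive directory not found; mount your Google Drive first.")
```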
Planned Updates
Future updates may make it easy to scrape data from any website, allowing for mass collection of data.
Link to GitHub Repo
Please consider checking it out! You can report issues, request features, contribute to this project, etc. in the GitHub Repo. That is the best way to reach me for issues/feedback relating to this project.
Download files
File details
Details for the file scpscraper-1.0.1.tar.gz.
File metadata
- Download URL: scpscraper-1.0.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 66564549e96e8a47061b9f46aa1dbd0cf067da10da71c0bb5102e052cee695f2
MD5 | 6fab2ac12e84536b5f19e7b617908232
BLAKE2b-256 | f578da3c7c4bafed046b702b969d0952b53b6862706e2074650d098acf12c4f1
File details
Details for the file scpscraper-1.0.1-py3-none-any.whl.
File metadata
- Download URL: scpscraper-1.0.1-py3-none-any.whl
- Upload date:
- Size: 11.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2613190565efdfc77c167be7af87a06b10e592aa8d78e3359a655fbe59d05575
MD5 | 027408fdaf7c9d2cd15517e6f4bbb437
BLAKE2b-256 | 63b7a1242840e018d7d657e6a073129ad03d1257125dbdb74ae66d808e539dcd
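If you download one of these files by hand, the published digests can be compared against the file on disk. The sketch below uses hashlib from the standard library and the SHA256 value listed above for scpscraper-1.0.1.tar.gz; the function names and the local file path are assumptions made for this example.

```python
import hashlib

# SHA256 digest listed above for scpscraper-1.0.1.tar.gz
EXPECTED_SHA256 = "66564549e96e8a47061b9f46aa1dbd0cf067da10da71c0bb5102e052cee695f2"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()
```

After downloading, calling matches_published_hash("scpscraper-1.0.1.tar.gz") with the path to your local copy should return True for an unmodified file. For the wheel, pass its SHA256 digest as the expected argument instead.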