A package designed to help with scraping post data from imageboards built using Tinyboard.

What is Tinyscraper?

Tinyscraper is a Python package for scraping post data from imageboards built with Tinyboard (and Tinyboard forks, such as vichan) for corpus analysis. It aims to be useful both to people who already build web scraping tools and to people who are not comfortable working in Python.

Installation

To use Tinyscraper, install it from PyPI.

[!TIP]
It is a good idea to install Tinyscraper in a virtual environment!

python3 -m venv venv
source venv/bin/activate

To install the Tinyscraper package from PyPI:

pip install tinyscraper

You can also install the Tinyscraper package from the GitHub repository:

pip install git+https://github.com/wkuratana/tinyscraper.git
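
To confirm that the install worked from inside Python, you can ask the standard library for the installed version. This is a minimal sketch; `installed_version` is a hypothetical helper name, not part of Tinyscraper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "tinyscraper") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the active environment.
        return None


# Prints a version string if Tinyscraper is installed, otherwise None.
print(installed_version())
```

If this prints `None`, check that the virtual environment you installed into is the one currently active.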

Usage

[!IMPORTANT]
Tinyscraper is designed to be polite and respectful. It does not, and will not in the future, offer any options that could overload a website or ignore explicit anti-scraping requests.

Tinyscraper's default behavior is as follows:

  1. Visit the URL passed as an argument.
  2. If the URL links to a specific thread, scrape that thread. If the URL links to a homepage or a catalog page, follow each thread link on the page and scrape each thread.
  3. Write out each scraped thread to its own JSON file in a local data/ directory.

You can use optional arguments to change the output file type (JSON, JSONL, CSV, XML), the output directory, and the naming scheme of the output file. You do not need to tell Tinyscraper whether the URL links to a thread, homepage, or catalog page; it determines that itself.
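
With the default settings, a finished run leaves one JSON file per thread in data/. The following is a minimal sketch for reading those files back into Python for analysis. It assumes only that each file is valid JSON and makes no assumptions about the field names inside the scraped records. `load_threads` is a hypothetical helper, not part of Tinyscraper.

```python
import json
from pathlib import Path


def load_threads(directory="data"):
    """Parse every .json file in `directory` into {filename: parsed content}."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing has been scraped into this directory yet.
        return {}
    threads = {}
    for path in sorted(root.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            threads[path.name] = json.load(handle)
    return threads


threads = load_threads("data")
print(f"Loaded {len(threads)} thread file(s)")
```

Inspect one loaded entry before writing any analysis code, since the record layout is defined by Tinyscraper rather than by this sketch.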

Once installed, you can check to see the available arguments at any time using:

tinyscraper --help

Simple Scraping

The simplest command you can run is:

tinyscraper <url>

(Replace <url> with an actual URL.)

You can use any absolute URL that links to a homepage, to a catalog page, or to a specific thread (of a Tinyboard-based imageboard).

Optional Arguments

Tinyscraper's default file naming convention is thread_<thread_id>_tinyboard.json.

To change the filename entirely, use --filename or -fn:

tinyscraper --filename <name> <url>

[!NOTE] Do not add the filename extension to any filename you use. See how to change the file type below.

[!WARNING] You can only use --filename or -fn if you pass a URL to a specific thread, not a homepage or catalog page. See how else you can modify filenames below.

To change the suffix used in the default file naming convention (which is tinyboard by default), use --filename_suffix or -fns:

tinyscraper --filename_suffix <suffix> <url>

To change the output file type, use --filename_extension or -fne:

tinyscraper --filename_extension <json|jsonl|csv|xml> <url>

To change the output directory, use --directory or -d:

tinyscraper --directory <path> <url>
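
If you drive the CLI from a Python script, it can help to assemble the argument list in one place. The sketch below uses only the flags documented above. `build_command` is a hypothetical helper, and the example URL is a placeholder rather than a real imageboard.

```python
import shlex

# File types listed for --filename_extension.
ALLOWED_EXTENSIONS = {"json", "jsonl", "csv", "xml"}


def build_command(url, filename=None, suffix=None, extension=None, directory=None):
    """Assemble a tinyscraper command line as a list of arguments."""
    command = ["tinyscraper"]
    if filename is not None:
        # Only valid when `url` points at a specific thread.
        command += ["--filename", filename]
    if suffix is not None:
        command += ["--filename_suffix", suffix]
    if extension is not None:
        if extension not in ALLOWED_EXTENSIONS:
            raise ValueError(f"unsupported file type: {extension!r}")
        command += ["--filename_extension", extension]
    if directory is not None:
        command += ["--directory", directory]
    command.append(url)
    return command


command = build_command("https://example.org/", extension="csv", directory="out")
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` once Tinyscraper is installed.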

Example

tinyscraper -fns testchan -fne jsonl -d test_data <url>

The command above visits the URL, then scrapes the thread data into a test_data/ folder. The output file is named thread_<thread_id>_testchan.jsonl.
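
To locate an output file from a script, you can rebuild its path from the naming convention described above. This is a sketch under that convention only. `expected_output_path` is a hypothetical helper, and the thread ID 12345 is made up for illustration.

```python
from pathlib import Path


def expected_output_path(thread_id, suffix="tinyboard", extension="json", directory="data"):
    """Build the path implied by the thread_<thread_id>_<suffix>.<ext> convention."""
    return Path(directory) / f"thread_{thread_id}_{suffix}.{extension}"


path = expected_output_path(12345, suffix="testchan", extension="jsonl", directory="test_data")
print(path.as_posix())  # test_data/thread_12345_testchan.jsonl
```

This does not cover files written with --filename, which replaces the default name entirely.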

API

Tinyscraper is designed to be used through both its CLI and its API. You can also install and use Tinyscraper as a dependency for other projects. Please reference the docstrings in api.py if you would like to use Tinyscraper without its CLI.
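
One way to read those docstrings without opening the source is to list them with the standard library. This sketch assumes that api.py is importable as `tinyscraper.api`, which is not confirmed here. `summarize_docstrings` is a hypothetical helper, and the demonstration runs against the standard `json` module so that it works without Tinyscraper installed.

```python
import importlib
import inspect


def summarize_docstrings(module_name):
    """Return {name: first docstring line} for a module's public callables."""
    module = importlib.import_module(module_name)
    summary = {}
    for name, member in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue  # skip private and dunder names
        doc = inspect.getdoc(member)
        summary[name] = doc.strip().splitlines()[0] if doc and doc.strip() else ""
    return summary


# Demonstrated on a standard-library module; swap in "tinyscraper.api"
# (an assumed module path) once the package is installed.
for name, first_line in summarize_docstrings("json").items():
    print(f"{name}: {first_line}")
```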

Tinyscraper is built on Scrapy. If different functionality would better suit your use case, it is straightforward to modify.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

This project is licensed under the GPL-3.0 license.
