Ninisite Scraper: fetches all pages of a Ninisite discussion and formats them as Org-mode, Markdown, or JSON
#+TITLE: get-the-nini: Ninisite Post Scraper
A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.
* *Code*: [[file:get_the_nini/main.py]]
* *Purpose*: This tool is designed to archive and analyze discussion threads from ninisite.com, converting them into portable and easy-to-read formats.
* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
* *Installation*
This tool can be installed from PyPI using pip.
**Prerequisites**
1. **Python 3**: Ensure you have Python 3 installed.
2. **Pandoc**: The `pypandoc` library is used for converting HTML to other formats. You must have Pandoc installed and available on your system's PATH. Please see the [[https://pandoc.org/installing.html][Pandoc installation instructions]].
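Before installing, it can help to confirm that both prerequisites are reachable from your shell. The snippet below is a minimal sketch: `check_tool` is a hypothetical helper name introduced here for illustration and is not part of get-the-nini. It relies only on the POSIX `command -v` builtin.

#+begin_src sh
# Hypothetical helper (not part of get-the-nini): report whether a
# command can be found on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

# The two prerequisites named above.
check_tool python3
check_tool pandoc
#+end_src

If `pandoc` is reported as missing, install it first; otherwise the HTML conversion step is likely to fail at run time.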
**Install with pip**
To install the package, run the following command in your terminal:
#+begin_src sh
pip install get-the-nini
#+end_src
Or install the latest version from git:
#+begin_src sh :eval never
pip install 'git+https://github.com/NightMachinery/get_the_nini.git'
#+end_src
* *Usage*
Once installed, the script can be run from the command line, providing a topic ID or a full URL.
**Syntax**
#+begin_src sh
get-the-nini [OPTIONS] <TOPIC_ID_OR_URL>
#+end_src
**Examples**
1. **Scrape by Topic ID (Default Org-mode output)**
This command will scrape the discussion for topic ID `11473285` and save it to an automatically generated file named `ninisite_11473285.org`.
#+begin_src sh
get-the-nini 11473285
#+end_src
2. **Scrape using a full URL**
#+begin_src sh
get-the-nini "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src
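Both invocations above refer to the same topic. As a rough illustration of how a URL and a bare ID can be treated interchangeably, the sketch below normalizes either form to a numeric topic ID. The function name `topic_id_from_arg` is hypothetical, and this is not the tool's actual parsing code; it only assumes the URL shape shown in the example above.

#+begin_src sh
# Hypothetical helper (not get-the-nini's own code): accept either a bare
# topic ID or a URL of the form .../discussion/topic/<ID>/ and print the ID.
topic_id_from_arg() {
  case "$1" in
    *[!0-9]*)
      # Contains non-digits, so treat it as a URL and extract the ID.
      printf '%s\n' "$1" |
        sed -n 's#.*/discussion/topic/\([0-9][0-9]*\).*#\1#p'
      ;;
    *)
      # Digits only: already a topic ID.
      printf '%s\n' "$1"
      ;;
  esac
}

topic_id_from_arg "https://www.ninisite.com/discussion/topic/11473285/"
topic_id_from_arg 11473285
#+end_src

A helper like this can be handy in wrapper scripts, for example to predict the default output name `ninisite_<ID>.org` before running the scraper.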
3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
get-the-nini 11473285 -o output.md
#+end_src
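To make the idea of extension-based inference concrete, here is a minimal sketch of such a mapping. It is an assumption-laden illustration, not the tool's implementation: the function name `infer_format` and the format names `markdown` and `org` are hypothetical (only `json` appears as a `--format` value in this document), and falling back to Org-mode simply mirrors the documented default.

#+begin_src sh
# Hypothetical sketch: map an output filename to a format name.
# The names "markdown" and "org" are assumptions made for illustration.
infer_format() {
  case "$1" in
    *.md)   echo "markdown" ;;
    *.json) echo "json" ;;
    *)      echo "org" ;;   # Org-mode is the documented default
  esac
}

infer_format output.md
infer_format ninisite_11473285.json
infer_format ninisite_11473285.org
#+end_src

When the file name does not make the intended format obvious, passing `--format` explicitly avoids relying on inference.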
4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
get-the-nini 11473285 --format json -o - > ninisite_11473285.json
#+end_src
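Because `-o -` writes to standard output, the JSON can also be piped into another program instead of a file. The sketch below defines a hypothetical helper, `preview_json`, that pretty-prints the first lines of whatever JSON it receives using Python's standard `json.tool` module. The `--no-ensure-ascii` flag (available in Python 3.9 and later) keeps non-ASCII text such as Persian readable instead of escaping it. No assumption is made about the structure of the scraper's JSON.

#+begin_src sh
# Hypothetical helper (not part of get-the-nini): pretty-print JSON from
# stdin and show only the first N lines (default 20).
# Requires Python 3.9+ for --no-ensure-ascii.
preview_json() {
  python3 -m json.tool --no-ensure-ascii | head -n "${1:-20}"
}

# Intended use with the scraper (needs network access, so left commented):
#   get-the-nini 11473285 --format json -o - | preview_json

# Self-contained demonstration with a small inline document:
printf '%s' '{"a": 1}' | preview_json
#+end_src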
* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.
**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]
**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]
**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
File details
Details for the file get_the_nini-0.1.3.tar.gz.
File metadata
- Download URL: get_the_nini-0.1.3.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.10.6 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0e9f41d804d1793ae6526b523e3fe31c49458b1cc98bc12efe1069cbf9c4e4c |
| MD5 | 17a9a8b809d6641ccf7fdda85a28a4ff |
| BLAKE2b-256 | 1423272ad6940a38452acf88dc9cc45a70d3d3845a8e081212cdcc6036e150a3 |
File details
Details for the file get_the_nini-0.1.3-py3-none-any.whl.
File metadata
- Download URL: get_the_nini-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.10.6 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e7b68588ca5258c5466807b81ae390ccb16a5bbae91945873c9c5a61eb3804b |
| MD5 | 7c103ed01cf9766a1962f9ff191bb906 |
| BLAKE2b-256 | cb2246de07567607dd4e1c5d49ed73758285c1e0c5e94eae19898706a7bb755d |