Ninisite Scraper: Fetches all pages of a Ninisite discussion and formats them as Org-mode, Markdown, or JSON
#+TITLE: get-the-nini: Ninisite Post Scraper
A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.
* *Code*: [[file:get_the_nini/main.py]]
* *Purpose*: This tool archives discussion threads from ninisite.com by converting them into portable, easy-to-read formats suitable for later reading and analysis.
* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
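As an illustration of what automatic pagination involves, here is a minimal sketch. It is not taken from the tool's source: `fetch_page` is a hypothetical callable that returns the list of posts on a given page, and the loop assumes that an empty page marks the end of the topic.

#+begin_src python
from typing import Callable, Iterator, List


def iter_posts(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield posts page by page until a page comes back empty.

    fetch_page is a hypothetical stand-in for the real HTTP + parsing step.
    max_pages is a safety limit so a misbehaving fetcher cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:  # assumption: an empty page means there are no more pages
            break
        yield from posts


if __name__ == "__main__":
    # Fake two-page topic used only for demonstration.
    fake_topic = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    for post in iter_posts(lambda n: fake_topic.get(n, [])):
        print(post)
#+end_src

Writing the loop as a generator lets the caller start processing posts before the last page has been fetched.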
* *Installation*
This tool can be installed from PyPI using pip.
**Prerequisites**
1. **Python 3**: Ensure you have Python 3 installed.
2. **Pandoc**: The `pypandoc` library is used for converting HTML to other formats. You must have Pandoc installed and available on your system's PATH. Please see the [Pandoc installation instructions](https://pandoc.org/installing.html).
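To check the Pandoc prerequisite before installing, a short script along these lines can help. It uses only the standard library; the helper name `pandoc_path` is made up for this example and is not part of get-the-nini.

#+begin_src python
import shutil
from typing import Optional


def pandoc_path() -> Optional[str]:
    """Return the full path of the pandoc executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


if __name__ == "__main__":
    path = pandoc_path()
    if path:
        print(f"pandoc found at {path}")
    else:
        print("pandoc not found on PATH; see https://pandoc.org/installing.html")
#+end_src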
**Install with pip**
To install the package, run the following command in your terminal:
#+begin_src sh
pip install get-the-nini
#+end_src
Or install the latest version from git:
#+begin_src sh :eval never
pip install 'git+https://github.com/NightMachinery/get_the_nini.git'
#+end_src
* *Usage*
Once installed, the script can be run from the command line, providing a topic ID or a full URL.
**Syntax**
#+begin_src sh
get-the-nini [OPTIONS] <TOPIC_ID_OR_URL>
#+end_src
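Accepting either a topic ID or a full URL amounts to normalizing the argument to an ID. The following is a hedged sketch of one way to do that, not the tool's actual code: the function name `parse_topic_id` is hypothetical, and the URL pattern is inferred from the example URL shown in this README.

#+begin_src python
import re

# Assumption: topic URLs contain "/discussion/topic/<digits>", as in the README example.
TOPIC_URL_RE = re.compile(r"/discussion/topic/([0-9]+)")


def parse_topic_id(arg: str) -> str:
    """Return the numeric topic ID from either a bare ID or a topic URL."""
    arg = arg.strip()
    if re.fullmatch(r"[0-9]+", arg):
        return arg  # already a bare topic ID
    match = TOPIC_URL_RE.search(arg)
    if match is None:
        raise ValueError(f"cannot find a topic ID in {arg!r}")
    return match.group(1)


if __name__ == "__main__":
    print(parse_topic_id("11473285"))
    print(parse_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))
#+end_src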
**Examples**
1. **Scrape by Topic ID (Default Org-mode output)**
This command will scrape the discussion for topic ID `11473285` and save it to an automatically generated file named `ninisite_11473285.org`.
#+begin_src sh
get-the-nini 11473285
#+end_src
2. **Scrape using a full URL**
#+begin_src sh
get-the-nini "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src
3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
get-the-nini 11473285 -o output.md
#+end_src
4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
get-the-nini 11473285 --format json -o - > ninisite_11473285.json
#+end_src
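Once the JSON output has been saved, it can be loaded with Python's standard `json` module. The sketch below deliberately makes no assumption about the schema (see the linked example file for the real field names) and only reports the top-level shape; `describe_json` is a hypothetical helper name.

#+begin_src python
import json


def describe_json(text: str) -> str:
    """Summarize the top-level structure of a JSON document without assuming a schema."""
    data = json.loads(text)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return type(data).__name__


# Example use, once the file from the command above exists:
#   with open("ninisite_11473285.json", encoding="utf-8") as f:
#       print(describe_json(f.read()))
#+end_src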
* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.
**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]
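The idea behind streaming output is to write each page's posts as soon as they arrive, instead of collecting the whole topic in memory first. The sketch below shows that pattern only; the heading layout and the `author` and `text` field names are invented for illustration and do not reflect the tool's real Org-mode output.

#+begin_src python
import io
from typing import Iterable, List, TextIO


def stream_org(title: str, pages: Iterable[List[dict]], out: TextIO) -> int:
    """Write posts to `out` page by page and return the number of posts written.

    The post layout here is hypothetical; only the write-then-flush pattern matters.
    """
    out.write(f"#+TITLE: {title}\n")
    count = 0
    for posts in pages:
        for post in posts:
            out.write(f"* {post['author']}\n{post['text']}\n")
            count += 1
        out.flush()  # make the page visible to readers of the file right away
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    demo_pages = [[{"author": "a", "text": "first"}], [{"author": "b", "text": "second"}]]
    written = stream_org("Demo topic", demo_pages, buffer)
    print(f"{written} posts written")
#+end_src

Because `pages` can be any iterable, including a generator that fetches pages lazily, the output file grows while scraping is still in progress.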
**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]
**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
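Because the format can come either from `--format` or from the output file's extension, the selection logic might look roughly like the sketch below. This is an assumption-laden illustration, not the tool's code: the function name, the rule that an explicit format takes precedence over the extension, and the format name `markdown` are all hypothetical (only `json` appears as a `--format` value in this README). The fallback to Org-mode follows the documented default.

#+begin_src python
from pathlib import Path
from typing import Optional

# Extensions taken from the format list above; the "markdown" label is an assumption.
EXTENSION_TO_FORMAT = {".org": "org", ".md": "markdown", ".json": "json"}


def infer_format(output: Optional[str], explicit: Optional[str] = None) -> str:
    """Pick an output format from an explicit choice, a file extension, or the default."""
    if explicit:
        return explicit  # assumption: an explicit format overrides the extension
    if output and output != "-":  # "-" means stdout, so there is no extension to inspect
        fmt = EXTENSION_TO_FORMAT.get(Path(output).suffix.lower())
        if fmt:
            return fmt
    return "org"  # Org-mode is the documented default


if __name__ == "__main__":
    print(infer_format("output.md"))
    print(infer_format("-", explicit="json"))
    print(infer_format(None))
#+end_src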