Ninisite Scraper: Fetches all pages of a Ninisite discussion and formats them as Org-mode, Markdown, or JSON

#+TITLE: get-the-nini: Ninisite Post Scraper

A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.

* *Code*: [[file:get_the_nini/main.py]]

* *Purpose*: This tool archives discussion threads from ninisite.com for later reading and analysis, converting them into portable, easy-to-read formats.

* *Features*
- Scrapes entire discussion threads by topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or for watching progress live.
- Progress bar during page fetching.
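
To illustrate what automatic pagination with streaming involves, here is a minimal sketch. It is not the tool's actual implementation: `fetch_page` is a hypothetical callable standing in for whatever downloads and parses one page, and the sketch assumes that an empty page marks the end of the thread.

```python
from typing import Callable, Iterator, List


def iter_posts(
    fetch_page: Callable[[int], List[str]], max_pages: int = 10_000
) -> Iterator[str]:
    """Yield posts page by page until an empty page is returned.

    `fetch_page` is a hypothetical stand-in: given a 1-based page number,
    it returns the posts found on that page (an empty list past the end).
    """
    for page_number in range(1, max_pages + 1):
        posts = fetch_page(page_number)
        if not posts:  # assumption: an empty page means the thread has ended
            return
        # Yielding as we go lets a caller write output while later pages load.
        yield from posts


# Fake three-page thread, used only to exercise the loop without network access.
fake_thread = {1: ["post 1", "post 2"], 2: ["post 3"], 3: ["post 4"]}
all_posts = list(iter_posts(lambda n: fake_thread.get(n, [])))
print(all_posts)
```

Because `iter_posts` is a generator, a caller can write each post as soon as its page arrives, which is the general idea behind streaming output.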

* *Installation*
This tool can be installed from PyPI using pip.

**Prerequisites**
1. **Python 3**: Ensure you have Python 3 installed.
2. **Pandoc**: The `pypandoc` library is used for converting HTML to other formats. You must have Pandoc installed and available on your system's PATH. Please see the [Pandoc installation instructions](https://pandoc.org/installing.html).
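
If you are unsure whether Pandoc is on your PATH, a quick check like the following can help. This is a small standalone sketch using only the Python standard library; it is not part of the tool.

```python
import shutil


def pandoc_path():
    """Return the full path of the `pandoc` executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


path = pandoc_path()
if path is None:
    print("pandoc not found on PATH; see https://pandoc.org/installing.html")
else:
    print(f"pandoc found at {path}")
```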

**Install with pip**
To install the package, run the following command in your terminal:
#+begin_src sh
pip install get-the-nini
#+end_src

Or install the latest version from git:
#+begin_src sh :eval never
pip install 'git+https://github.com/NightMachinery/get_the_nini.git'
#+end_src

* *Usage*
Once installed, run the script from the command line, passing a topic ID or a full URL.

**Syntax**
#+begin_src sh
get-the-nini [OPTIONS] <TOPIC_ID_OR_URL>
#+end_src
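
As a rough illustration of how a single argument can be either a topic ID or a URL, the sketch below normalizes both forms to an integer ID. The function name `parse_topic_id` is hypothetical, and the URL pattern is an assumption based on the example URL shown in this README; the tool's own parsing may differ.

```python
import re

# Assumed URL shape, taken from the example URL in this README:
#   https://www.ninisite.com/discussion/topic/<TOPIC_ID>/
_TOPIC_URL_RE = re.compile(r"/discussion/topic/([0-9]+)")


def parse_topic_id(arg: str) -> int:
    """Return the numeric topic ID from a bare ID or a topic URL (hypothetical helper)."""
    arg = arg.strip()
    if re.fullmatch(r"[0-9]+", arg):
        return int(arg)  # the argument is already a bare topic ID
    match = _TOPIC_URL_RE.search(arg)
    if match is None:
        raise ValueError(f"not a topic ID or topic URL: {arg!r}")
    return int(match.group(1))


print(parse_topic_id("11473285"))
print(parse_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))
```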

**Examples**

1. **Scrape by Topic ID (Default Org-mode output)**
This command scrapes the discussion for topic ID `11473285` and saves it to an automatically named file, `ninisite_11473285.org`.
#+begin_src sh
get-the-nini 11473285
#+end_src

2. **Scrape using a full URL**
#+begin_src sh
get-the-nini "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src

3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
get-the-nini 11473285 -o output.md
#+end_src

4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
get-the-nini 11473285 --format json -o - > ninisite_11473285.json
#+end_src
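
The format selection described in examples 3 and 4 can be sketched as follows. This is an illustration under assumptions, not the tool's code: the helper name `resolve_format` is hypothetical, and only `json` appears as a `--format` value in this README, so the names `org` and `markdown` are guesses. The sketch lets an explicit format win, then falls back to the file extension, then to the Org-mode default.

```python
from pathlib import Path
from typing import Optional

# Only "json" is documented as a --format value in this README; "org" and
# "markdown" are assumed names used for illustration.
EXTENSION_TO_FORMAT = {".org": "org", ".md": "markdown", ".json": "json"}
DEFAULT_FORMAT = "org"  # the README states that Org-mode is the default


def resolve_format(output_path: str, explicit_format: Optional[str] = None) -> str:
    """Pick an output format: explicit choice first, then extension, then default."""
    if explicit_format is not None:
        return explicit_format
    suffix = Path(output_path).suffix.lower()  # "" for "-" (stdout)
    return EXTENSION_TO_FORMAT.get(suffix, DEFAULT_FORMAT)


print(resolve_format("output.md"))
print(resolve_format("-", "json"))
```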

* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.

**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]

**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]

**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
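
For programmatic analysis, a reasonable first step is to inspect the JSON output before relying on any particular field. The sketch below does that without assuming the real schema: `describe_top_level` is a hypothetical helper, and the `sample` document is an invented stand-in, not actual output of the tool.

```python
import json


def describe_top_level(document):
    """Map each top-level key to the type name of its value (hypothetical helper)."""
    if isinstance(document, dict):
        return {key: type(value).__name__ for key, value in document.items()}
    return {"<root>": type(document).__name__}


# Invented stand-in for a scraped topic; the real field names may differ.
sample = json.loads('{"title": "example", "posts": [{"content": "hello"}]}')
summary = describe_top_level(sample)
print(summary)
```

To inspect a real file, replace the `sample` line with `json.load` on a file opened from the path passed to `-o`.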
