Ninisite Scraper: Fetches all pages of a Ninisite discussion and formats them as Org-mode, Markdown, or JSON

#+TITLE: get-the-nini: Ninisite Post Scraper

A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.

* *Code*: [[file:get_the_nini/main.py]]

* *Purpose*: This tool is designed to archive and analyze discussion threads from ninisite.com, converting them into portable and easy-to-read formats.

* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
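
The pagination handling listed above can be pictured as a loop that requests one page after another until a page comes back with no posts. The sketch below is an illustration only, not the tool's actual implementation: `fetch_page` is a hypothetical callable standing in for the real HTTP request and HTML parsing, and the stop condition (an empty page) is an assumption.

```python
from typing import Callable, Iterator, List


def iter_posts(
    fetch_page: Callable[[int], List[str]], max_pages: int = 1000
) -> Iterator[str]:
    """Yield posts page by page until a page has no posts.

    fetch_page is a hypothetical stand-in: it takes a 1-based page number
    and returns the posts found on that page (an empty list past the end).
    """
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:  # assumed stop condition: an empty page ends the topic
            return
        yield from posts


if __name__ == "__main__":
    # Fake three-page topic, used only to exercise the loop.
    fake_topic = {1: ["post 1", "post 2"], 2: ["post 3"], 3: ["post 4"]}
    for post in iter_posts(lambda n: fake_topic.get(n, [])):
        print(post)
```

Because the function is a generator, a caller can write each post out as soon as it arrives, which is the same idea as the streaming Org-mode output.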

* *Requirements*
This project is written in Python 3. It requires the following libraries:
- `requests`
- `beautifulsoup4`
- `pypandoc`
- `tqdm`
- `pytz`

**Note**: `pypandoc` is a wrapper for **Pandoc**. You must have Pandoc installed and available in your system's PATH for HTML-to-Org/Markdown conversion to work.
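
Before the first run, it can help to confirm that Pandoc is reachable. The following is a minimal sketch using only the standard library; the helper name `pandoc_available` is made up for this example and is not part of the tool.

```python
import shutil


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable is found on the PATH."""
    return shutil.which("pandoc") is not None


if __name__ == "__main__":
    if pandoc_available():
        print("Pandoc found; HTML conversion should work.")
    else:
        print("Pandoc not found on PATH; install it before running the scraper.")
```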

* *Usage*
Run the script from the command line, passing either a topic ID or a full URL.

**Syntax**
#+begin_src sh
python get_the_nini/main.py [OPTIONS] <TOPIC_ID_OR_URL>
#+end_src
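
Since the argument may be either a bare ID or a URL, the input has to be normalized to a topic ID at some point. One possible way to do that is sketched below. The function name `extract_topic_id` is hypothetical, and the `/topic/<digits>` pattern is an assumption based on the URL shape used in this README, not a description of the tool's own parsing.

```python
import re


def extract_topic_id(arg: str) -> str:
    """Return the numeric topic ID from a bare ID or a topic URL."""
    arg = arg.strip()
    if re.fullmatch(r"[0-9]+", arg):
        return arg  # already a bare topic ID
    # Assumed URL shape: .../discussion/topic/<digits>/...
    match = re.search(r"/topic/([0-9]+)", arg)
    if match is None:
        raise ValueError(f"could not find a topic ID in {arg!r}")
    return match.group(1)


if __name__ == "__main__":
    print(extract_topic_id("11473285"))
    print(extract_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))
```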

**Examples**

1. **Scrape by Topic ID (Default Org-mode output)**
This command will scrape the discussion for topic ID `11473285` and save it to an automatically generated file named `ninisite_11473285.org`.
#+begin_src sh
python get_the_nini/main.py 11473285
#+end_src

2. **Scrape using a full URL**
#+begin_src sh
python get_the_nini/main.py "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src

3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
python get_the_nini/main.py 11473285 -o output.md
#+end_src

4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
python get_the_nini/main.py 11473285 --format json -o - > ninisite_11473285.json
#+end_src
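
The examples above mention that the format can come from `--format` or from the output file's extension, with Org-mode as the default. A rough sketch of that decision follows. It is illustrative only: the function name, the internal format names other than `json`, and the choice to let an explicit format win over the extension are all assumptions.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format table; the format names here are illustrative assumptions.
EXTENSION_FORMATS = {".org": "org", ".md": "markdown", ".json": "json"}


def resolve_format(output: Optional[str], explicit: Optional[str] = None) -> str:
    """Pick an output format from an explicit choice, a file extension, or the default."""
    if explicit:
        return explicit  # assumed precedence: an explicit format wins
    if output and output != "-":
        suffix = Path(output).suffix.lower()
        if suffix in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[suffix]
    return "org"  # Org-mode is the documented default


if __name__ == "__main__":
    print(resolve_format("output.md"))
    print(resolve_format("-", "json"))
    print(resolve_format(None))
```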

* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.

**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]

**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]

**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
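
As a starting point for programmatic analysis, the JSON file can be loaded with Python's standard library. The sketch below deliberately assumes nothing about the schema: it only reports the top-level keys and the type of each value, so the real field names can be read off the output. The helper name `describe_json` is hypothetical.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict


def describe_json(path: str) -> Dict[str, str]:
    """Map each top-level key of a JSON file to the type name of its value."""
    data: Any = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        # Not an object at the top level; report the root type instead.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    # Usage: python describe.py ninisite_11473285.json
    if len(sys.argv) > 1:
        for key, kind in describe_json(sys.argv[1]).items():
            print(f"{key}: {kind}")
```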
