Ninisite Scraper: fetches all pages of a Ninisite discussion and formats them as Org-mode, Markdown, or JSON
#+TITLE: get-the-nini: Ninisite Post Scraper
A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.
* *Code*: [[file:get_the_nini/main.py]]
* *Purpose*: This tool is designed to archive and analyze discussion threads from ninisite.com, converting them into portable and easy-to-read formats.
* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
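The automatic pagination listed above can be pictured as a loop that requests page 1, 2, 3, and so on until a page comes back empty. The following is a minimal sketch under that assumption, not the tool's actual code: `fetch_page` is a hypothetical callable standing in for the real HTTP request and HTML parsing, and the stop condition used in `main.py` may differ (for example, it might read the page count from the site's pager).

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int], List[str]],
    max_pages: int = 10_000,
) -> Iterator[List[str]]:
    """Yield the posts of each page until an empty page is returned.

    `fetch_page` is a hypothetical callable: given a 1-based page number,
    it returns the list of posts found on that page.
    """
    for page_number in range(1, max_pages + 1):
        posts = fetch_page(page_number)
        if not posts:  # assumed stop condition: an empty page ends the topic
            return
        yield posts


if __name__ == "__main__":
    # Fake three-page topic, used only to demonstrate the loop.
    fake_topic = {1: ["post a", "post b"], 2: ["post c"], 3: ["post d"]}
    pages = iter_pages(lambda n: fake_topic.get(n, []))
    for number, posts in enumerate(pages, start=1):
        print(f"page {number}: {len(posts)} post(s)")
```

Because the function is a generator, a caller can wrap it in `tqdm` for a progress bar or write each page out as soon as it arrives. The `max_pages` cap is a safety guard added for this sketch so that a misbehaving fetcher cannot loop forever.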
* *Requirements*
This project is written in Python 3. It requires the following libraries:
- `requests`
- `beautifulsoup4`
- `pypandoc`
- `tqdm`
- `pytz`
**Note**: `pypandoc` is a wrapper for **Pandoc**. You must have Pandoc installed and available in your system's PATH for HTML-to-Org/Markdown conversion to work.
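Before running the scraper, it can help to confirm that the `pandoc` executable is actually reachable. The sketch below uses only the standard library's `shutil.which`; the helper name `pandoc_available` is made up for illustration and is not part of get-the-nini.

```python
import shutil


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable can be found on PATH."""
    return shutil.which("pandoc") is not None


if __name__ == "__main__":
    if pandoc_available():
        print("pandoc found at", shutil.which("pandoc"))
    else:
        print("pandoc not found on PATH; HTML-to-Org/Markdown conversion will fail")
```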
* *Usage*
The script is run from the command line, providing a topic ID or a full URL.
**Syntax**
#+begin_src sh
python get_the_nini/main.py [OPTIONS] <TOPIC_ID_OR_URL>
#+end_src
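Since the single positional argument may be either a bare topic ID or a URL, the tool has to normalize it to an ID at some point. One way this could be done is sketched below. It assumes the URL shape `https://www.ninisite.com/discussion/topic/<digits>/` used in this README's examples; the helper name `parse_topic_id` is hypothetical and the real parsing in `main.py` may differ.

```python
import re

# Assumed URL shape, taken from the example URL in this README:
#   https://www.ninisite.com/discussion/topic/<digits>/
_TOPIC_URL_RE = re.compile(r"/discussion/topic/(\d+)")


def parse_topic_id(arg: str) -> int:
    """Return the numeric topic ID from a bare ID or a topic URL.

    Hypothetical helper for illustration; not the tool's actual function.
    """
    arg = arg.strip()
    if arg.isdecimal():  # bare topic ID such as "11473285"
        return int(arg)
    match = _TOPIC_URL_RE.search(arg)
    if match is None:
        raise ValueError(f"cannot find a topic ID in {arg!r}")
    return int(match.group(1))


if __name__ == "__main__":
    print(parse_topic_id("11473285"))
    print(parse_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))
```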
**Examples**
1. **Scrape by Topic ID (Default Org-mode output)**
This command will scrape the discussion for topic ID `11473285` and save it to an automatically generated file named `ninisite_11473285.org`.
#+begin_src sh
python get_the_nini/main.py 11473285
#+end_src
2. **Scrape using a full URL**
#+begin_src sh
python get_the_nini/main.py "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src
3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
python get_the_nini/main.py 11473285 -o output.md
#+end_src
4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
python get_the_nini/main.py 11473285 --format json -o - > ninisite_11473285.json
#+end_src
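The interplay between `-o` and `--format` in the examples above can be sketched as a small lookup. This is an illustration under assumptions, not the tool's implementation: the README only states that the format can be inferred from the extension or given explicitly, so the precedence (explicit value wins) and the format names `org` and `markdown` are guesses. Only `json` appears verbatim in the examples, and Org-mode is the documented default.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping; the extensions match the example files in this README,
# but the format names "org" and "markdown" are guesses.
_EXTENSION_TO_FORMAT = {".org": "org", ".md": "markdown", ".json": "json"}


def infer_format(
    output_path: Optional[str],
    explicit: Optional[str] = None,
    default: str = "org",
) -> str:
    """Pick an output format (hypothetical helper).

    Assumed precedence: explicit --format value, then the file
    extension of the -o path, then the Org-mode default.
    """
    if explicit:
        return explicit
    if output_path and output_path != "-":  # "-" means stdout: no extension to inspect
        suffix = Path(output_path).suffix.lower()
        return _EXTENSION_TO_FORMAT.get(suffix, default)
    return default


if __name__ == "__main__":
    print(infer_format("output.md"))
    print(infer_format("-", explicit="json"))
    print(infer_format(None))
```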
* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.
**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]
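Streaming output of this kind generally means writing each page's text as soon as it has been converted, rather than building the whole document in memory first. The sketch below shows that idea in isolation. `stream_pages` is a hypothetical name, and the per-page strings stand in for whatever Org-mode text the real tool produces.

```python
import io
from typing import Iterable, TextIO


def stream_pages(pages: Iterable[str], out: TextIO) -> int:
    """Write each already-formatted page to `out` as soon as it is available.

    Returns the number of pages written. Hypothetical helper for illustration.
    """
    count = 0
    for page_text in pages:
        out.write(page_text)
        if not page_text.endswith("\n"):
            out.write("\n")  # keep pages separated by a line break
        out.flush()  # make the partial output visible, e.g. to `tail -f`
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    written = stream_pages(iter(["* Page 1\n", "* Page 2"]), buffer)
    print(written, "pages written")
    print(buffer.getvalue(), end="")
```

Passing a generator as `pages` means only one page is held in memory at a time, which is what makes this approach suitable for large topics.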
**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]
**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
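For programmatic analysis, a reasonable first step is to inspect the top-level structure of the JSON output before relying on any field names. The sketch below deliberately assumes no schema. The keys in the demo string (`title`, `posts`) are invented for illustration and may not match the real output, so check the example file for the actual layout.

```python
import json
from typing import Any, List


def top_level_summary(text: str) -> List[str]:
    """Describe the top-level structure of a JSON document without assuming a schema."""
    data: Any = json.loads(text)
    if isinstance(data, dict):
        return [f"{key}: {type(value).__name__}" for key, value in data.items()]
    return [f"<root>: {type(data).__name__}"]


if __name__ == "__main__":
    # Invented sample; the real output's keys may differ.
    sample = '{"title": "example topic", "posts": []}'
    for line in top_level_summary(sample):
        print(line)
```

To inspect a real file, read it with `Path("ninisite_11473285.json").read_text(encoding="utf-8")` (after `from pathlib import Path`) and pass the result to `top_level_summary`.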
File details
Details for the file get_the_nini-0.1.1.tar.gz.
File metadata
- Download URL: get_the_nini-0.1.1.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.10.6 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18dedb77fb83a5cdf37cb598c389ae44bc49130e04d9947fde27c07b42ead4a6 |
| MD5 | e17c5f8bac1f6fcaeacd4d6b5494bc9e |
| BLAKE2b-256 | 07c343fad9d17d6dc67ed510c98d4cff90b54cb2fd767f4b5502d09b23e4daac |
File details
Details for the file get_the_nini-0.1.1-py3-none-any.whl.
File metadata
- Download URL: get_the_nini-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.4 CPython/3.10.6 Darwin/23.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b32d1f38001193b29ba9e7f0d3be6e964c7f6d12495c4afa3e6ac01ce73c55d8 |
| MD5 | 55530878a10b07a043f15a6084f80cbf |
| BLAKE2b-256 | 2d07b28604ad3af30cc6de20581b2c592e88371adbbf3767d4987796bd3f8677 |