Download posts and threads from forums, news aggregators, mail archives
forum-dl
Forum-dl is a downloader (scraper) for forums, mailing lists, and news aggregators. It can be used to crawl, extract, and archive individual threads and entire boards into a variety of output formats.
Installation
Clone the repository:
git clone https://github.com/mikwielgus/forum-dl
Then, in the same directory, install the repository directly:
pip install -e forum-dl
Quick start
Download a Simple Machines forum thread in JSONL format:
forum-dl "https://www.simplemachines.org/community/index.php?topic=584230.0"
Download an entire PhpBB forum board in JSONL format and write it to stdout (-o -):
forum-dl -o - "https://www.phpbb.com/community/viewforum.php?f=696"
Note: due to current architecture limitations, forum-dl shallowly scans the entire forum hierarchy before downloading the board. This will be fixed in a future release.
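Since JSONL stores one JSON object per line, the output can be inspected with a few lines of Python. The sketch below is not part of forum-dl; `summarize_jsonl` is a hypothetical helper, and because the field names of forum-dl's records are not listed here, it only tallies whichever top-level keys it encounters.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count records and how often each top-level key appears.

    `lines` is any iterable of strings, e.g. an open file or sys.stdin.
    No particular schema is assumed.
    """
    key_counts = Counter()
    records = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        obj = json.loads(line)
        records += 1
        if isinstance(obj, dict):
            key_counts.update(obj.keys())
    return records, key_counts
```

When the output is written to stdout with -o -, it could be piped into a script that calls `summarize_jsonl(sys.stdin)` and prints the result.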
Download Hacker News top stories and write them to a Maildir directory hn:
forum-dl --textify --content-as-title -f maildir -o hn "https://news.ycombinator.com/news"
- --textify converts HTML to plaintext (useful for text-only mail clients),
- --content-as-title puts the beginning of each message's content in its title (useful for mail clients that don't display content in index view),
- -f maildir changes the output format to maildir,
- -o hn changes the output directory name to hn.
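A Maildir directory such as hn can be read back with Python's standard `mailbox` module, independently of any mail client. The following is a minimal sketch; `list_subjects` is a hypothetical helper, and it assumes only that the directory follows the standard Maildir layout.

```python
import mailbox


def list_subjects(maildir_path):
    """Return the Subject header of every message in a Maildir directory."""
    # create=False: raise an error instead of silently creating an empty mailbox
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return [str(message.get("Subject", "")) for message in box]
    finally:
        box.close()
```

For the command above, `list_subjects("hn")` would return one title per downloaded message.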
What is supported
Forum software
- Discourse
- Hacker News
- Hyperkitty
- Hypermail
- Invision Power Board
- PhpBB
- Pipermail
- Proboards
- Simple Machines Forum
- vBulletin
- Xenforo
Output formats
- Babyl
- JSONL
- Maildir
- Mbox
- MH
- MMDF
Usage
forum-dl [--help] [--version] [--list-extractors] [--list-output-formats] [--user-agent USER_AGENT] [-q] [-v] [-g] [-o FILE]
[-f FORMAT] [--no-boards] [--no-threads] [--no-posts] [--textify] [--content-as-title]
General Options:
--help Show this help message and exit
--version Print program version and exit
--list-extractors List all supported extractors and exit
--list-output-formats List all supported output formats and exit
--user-agent USER_AGENT User-Agent request header
Output Options:
-q, --quiet Activate quiet mode
-v, --verbose Print various debugging information
-g, --get-urls Print URLs instead of downloading
-o FILE, --output FILE Output all results concatenated to FILE, or stdout if FILE is -
-f FORMAT, --output-format FORMAT Output format
--no-boards Do not write board objects
--no-threads Do not write thread objects
--no-posts Do not write post objects
--textify Lossily convert HTML content to plaintext
--content-as-title Write 98 initial characters of content in title field of each post
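To illustrate what the last two options describe, here is a rough sketch of a lossy HTML-to-text conversion followed by truncation to 98 characters. This is not forum-dl's implementation: `textify` and `content_as_title` are hypothetical helpers built on the standard `html.parser` module, and the real tool may treat whitespace, links, and block elements differently.

```python
from html.parser import HTMLParser

TITLE_LENGTH = 98  # character count stated in the --content-as-title help text


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop all tags (a lossy conversion)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def textify(html):
    """Hypothetical helper: strip tags and collapse runs of whitespace."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    return " ".join("".join(extractor.parts).split())


def content_as_title(content, length=TITLE_LENGTH):
    """Hypothetical helper: use the first `length` characters as a title."""
    return content[:length]
```

Combining the two, `content_as_title(textify(html))` yields a plain-text title of at most 98 characters.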
File details
Details for the file forum_dl-0.1.0.tar.gz.
File metadata
- Download URL: forum_dl-0.1.0.tar.gz
- Upload date:
- Size: 31.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e562cabaae7c98c1434bac3a622b1cb4f2f58e46065374520fb04c446cacf9e |
| MD5 | 46bad9f8bf420d7b9ba5df5a7279a6f6 |
| BLAKE2b-256 | 42fe78fabe84bd0b8f139f0550c4f46af2def869b134fa9ada906c01accd4d2b |
File details
Details for the file forum_dl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: forum_dl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 45.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7eb14178085f8e98f93882a12a7a65e904b20e327902e8052816cbf2ee88bc35 |
| MD5 | d4256756ce8b6f9260feb06832291956 |
| BLAKE2b-256 | 9e2279263c9fb88b2134eb5f03d959d2dabdbb55fe07ba3e0104cf0b18b0c17c |
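A downloaded file can be checked against the digests listed above using Python's `hashlib`. The sketch below uses hypothetical helper names and assumes that BLAKE2b-256 means BLAKE2b with a 32-byte digest; the local file path is whatever the file was saved as.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Compute SHA256, MD5, and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: "BLAKE2b-256" is BLAKE2b with a 32-byte (256-bit) digest
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches(path, algorithm, expected_hex):
    """Return True if the file's digest equals the published hex string."""
    return file_digests(path)[algorithm] == expected_hex.strip().lower()
```

For example, `matches("forum_dl-0.1.0.tar.gz", "SHA256", "<digest from the table>")` should return True for an intact download.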