Download posts and threads from forums, news aggregators, mail archives

forum-dl

Forum-dl is a downloader (scraper) for forums, mailing lists, and news aggregators. It can be used to crawl, extract, and archive individual threads and entire boards into a variety of output formats.

Installation

Clone the repository:

git clone https://github.com/mikwielgus/forum-dl

Then, from the directory that contains the clone, install it in editable mode:

pip install -e forum-dl

Quick start

Download a Simple Machines forum thread in JSONL format:

forum-dl "https://www.simplemachines.org/community/index.php?topic=584230.0"

Download an entire phpBB forum board in JSONL format and write it to stdout (-o -):

forum-dl -o - "https://www.phpbb.com/community/viewforum.php?f=696"

(Due to current architecture limitations, forum-dl shallowly scans the entire forum hierarchy before downloading the board. This will be fixed in a future release.)
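
Because -o - streams JSONL to stdout, the output can be piped into another program. The following is a minimal sketch of such a consumer, assuming only that every non-empty line is one JSON object. The field names forum-dl emits are not documented here, so the function (count_keys, a hypothetical name) makes no assumption about them and simply tallies which top-level keys occur.

```python
import json
from collections import Counter
from typing import Iterable


def count_keys(lines: Iterable[str]) -> Counter:
    """Tally the top-level keys seen across JSONL records.

    Blank lines are skipped; lines that decode to something other
    than a JSON object are ignored.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Made-up records purely for demonstration; the keys forum-dl
    # actually writes may be different.
    demo = ['{"x": 1, "y": 2}', "", '{"x": 3}']
    for key, n in count_keys(demo).most_common():
        print(f"{key}\t{n}")
```

To run it against real output, pass sys.stdin to count_keys and pipe the forum-dl command above into the script.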

Download Hacker News top stories and write them to a Maildir directory hn:

forum-dl --textify --content-as-title -f maildir -o hn "https://news.ycombinator.com/news"
  • --textify converts HTML to plaintext (useful for text-only mail clients),
  • --content-as-title puts the beginning of each message's content in its title (useful for mail clients that don't display content in index view),
  • -f maildir changes the output format to maildir,
  • -o hn changes the output directory name to hn.
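
The resulting hn directory can be read by any Maildir-aware tool, including the mailbox module in Python's standard library. The sketch below lists the Subject header of each message. It assumes that the post title ends up in Subject, which is not confirmed here; list_subjects is a hypothetical helper, not part of forum-dl.

```python
import mailbox


def list_subjects(maildir_path: str) -> list[str]:
    """Return the Subject header of every message in a Maildir."""
    # create=False makes a missing directory raise an error
    # instead of silently creating an empty mailbox.
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        # Assumption: forum-dl stores the post title in Subject.
        return [str(msg.get("Subject", "")) for msg in box]
    finally:
        box.close()


if __name__ == "__main__":
    try:
        for subject in list_subjects("hn"):
            print(subject)
    except mailbox.NoSuchMailboxError:
        print("No Maildir named 'hn' in the current directory")
```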

What is supported

Forum software

  • Discourse
  • Hacker News
  • HyperKitty
  • Hypermail
  • Invision Power Board
  • phpBB
  • Pipermail
  • ProBoards
  • Simple Machines Forum
  • vBulletin
  • XenForo

Output formats

  • Babyl
  • JSONL
  • Maildir
  • Mbox
  • MH
  • MMDF

Usage

forum-dl [--help] [--version] [--list-extractors] [--list-output-formats] [--user-agent USER_AGENT] [-q] [-v] [-g] [-o FILE]
         [-f FORMAT] [--no-boards] [--no-threads] [--no-posts] [--textify] [--content-as-title]

General Options:

  --help                Show this help message and exit
  --version             Print program version and exit
  --list-extractors     List all supported extractors and exit
  --list-output-formats
                        List all supported output formats and exit
  --user-agent USER_AGENT
                        User-Agent request header

Output Options:

  -q, --quiet           Activate quiet mode
  -v, --verbose         Print various debugging information
  -g, --get-urls        Print URLs instead of downloading
  -o FILE, --output FILE
                        Output all results concatenated to FILE, or stdout if FILE is -
  -f FORMAT, --output-format FORMAT
                        Output format
  --no-boards           Do not write board objects
  --no-threads          Do not write thread objects
  --no-posts            Do not write post objects
  --textify             Lossily convert HTML content to plaintext
  --content-as-title    Write 98 initial characters of content in title field of each post
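
To make --textify and --content-as-title more concrete, here is a rough sketch of what a lossy HTML-to-plaintext conversion and a 98-character title might look like. It uses html.parser from the standard library and is not forum-dl's actual implementation. The names textify and content_as_title are hypothetical, and whether the title is cut from the raw or the textified content is an assumption.

```python
from html.parser import HTMLParser

TITLE_LENGTH = 98  # taken from the --content-as-title help text


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard all tags (naive and lossy)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def textify(html: str) -> str:
    """Strip tags, decode entities, and collapse whitespace runs."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def content_as_title(content: str, length: int = TITLE_LENGTH) -> str:
    """Use the first `length` characters of the content as a title."""
    return content[:length]


if __name__ == "__main__":
    body = "<p>Hello <b>world</b> &amp; all</p>"
    # Assumption: the title is derived from the textified content.
    print(content_as_title(textify(body)))
```

This sketch drops links, lists, and paragraph breaks entirely, which is one sense in which such a conversion is lossy.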
