Skip to main content

Scrape posts and threads from forums, news aggregators, mail archives

Project description

forum-dl

Forum-dl is a scraper and archiver for forums (including Discourse, PhpBB, SMF), mailing lists, and news aggregators (list). It can be used to extract and archive all posts from individual threads and entire boards into JSONL, Mbox, Maildir, WARC (complete list).

The project is currently in alpha stage. Please do not hesitate to file bug reports and feature requests.

image

Installation

You can install stable Forum-dl from PIP or directly from the repository.

PIP

Install the latest stable version from PIP:

pip install forum-dl

Repository

Clone the repository and install the development branch in editable mode:

git clone https://github.com/mikwielgus/forum-dl && pip install -e forum-dl

Usage examples

Download a Simple Machines Forum thread in JSONL format:

forum-dl "https://www.simplemachines.org/community/index.php?topic=584230.0"

Download a PhpBB subboard into JSONL format, write to stdout (-o -) and record a WARC file in phpbb.warc:

forum-dl -o - --warc-output phpbb.warc "https://www.phpbb.com/community/viewforum.php?f=696"

(due to current architectural limitations, forum-dl will scan the first page of each board in the entire forum before downloading the target board. This will be fixed in future releases)

Download Hacker News top stories and write them to a Maildir directory hn:

forum-dl --textify --content-as-title -f maildir -o hn "https://news.ycombinator.com/news"
  • --textify converts HTML to plaintext (useful for text-only mail clients),
  • --content-as-title puts the beginning of each message's content in its title (useful for mail clients that don't display content in index view),
  • -f maildir changes the output format to maildir,
  • -o hn changes the output directory name to hn.

What is supported

Forum software

  • Discourse
  • Hacker News
  • Hyperkitty
  • Hypermail
  • Invision Power Board
  • PhpBB
  • Pipermail
  • Proboards
  • Simple Machines Forum
  • vBulletin
  • Xenforo

Output formats

  • Babyl
  • JSONL
  • Maildir
  • Mbox
  • MH
  • MMDF
  • WARC

Usage

forum-dl [--help] [--version] [--list-extractors] [--list-output-formats] [--timeout SECONDS] [-R N] [--retry-sleep SECONDS]
         [--retry-sleep-multiplier K] [--user-agent UA] [-q] [-v] [-g] [-o OUTFILE] [-f FORMAT] [--warc-output FILE]
         [--files-output DIR] [--boards | --no-boards] [--threads | --no-threads] [--posts | --no-posts]
         [--files | --no-files] [--outside-files | --no-outside-files] [--textify] [--content-as-title]
         [--author-as-addr-spec]

General Options:

  --help                Show this help message and exit
  --version             Print program version and exit
  --list-extractors     List all supported extractors and exit
  --list-output-formats
                        List all supported output formats and exit

Session Options:

  --timeout SECONDS     HTTP connection timeout
  -R N, --retries N     Maximum number of retries for failed HTTP requests or -1 to retry infinitely (default: 4)
  --retry-sleep SECONDS
                        Time to sleep between retries, in seconds (default: 1)
  --retry-sleep-multiplier K
                        A constant by which sleep time is multiplied on each retry (default: 2)
  --user-agent UA       User-Agent request header

Output Options:

  -q, --quiet           Activate quiet mode
  -v, --verbose         Print various debugging information
  -g, --get-urls        Print URLs instead of downloading
  -o OUTFILE, --output OUTFILE
                        Output all results concatenated to OUTFILE, or stdout if OUTFILE is - (default: -)
  -f FORMAT, --output-format FORMAT
                        Output format. Use --list-output-formats for a list of possible arguments
  --warc-output FILE    Record HTTP requests, store them in FILE in WARC format
  --files-output DIR    Store files in DIR instead of OUTFILE
  --boards, --no-boards
                        Write board objects (default: True, --no-boards to negate)
  --threads, --no-threads
                        Write thread objects (default: True, --no-threads to negate)
  --posts, --no-posts   Write post objects (default: True, --no-posts to negate)
  --files, --no-files   Write embedded files (--no-files to negate)
  --outside-files, --no-outside-files
                        Write embedded files outside post content. Auto-enabled by --warc-output and -f warc (default: False, --no-
                        outside-files to negate)
  --textify             Lossily convert HTML content to plaintext
  --content-as-title    Write 98 initial characters of content in title field of each post
  --author-as-addr-spec
                        Append author and domain as an addr-spec in the From header

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

forum_dl-0.3.0.tar.gz (39.3 kB view details)

Uploaded Source

Built Distribution

forum_dl-0.3.0-py3-none-any.whl (52.9 kB view details)

Uploaded Python 3

File details

Details for the file forum_dl-0.3.0.tar.gz.

File metadata

  • Download URL: forum_dl-0.3.0.tar.gz
  • Upload date:
  • Size: 39.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for forum_dl-0.3.0.tar.gz
Algorithm Hash digest
SHA256 6269c87ab07b494fb8887234a1f58461a8376b90cb27d35123a31d836fe40237
MD5 ee657cce320a598d41322c4e35c88d4e
BLAKE2b-256 753f3c0b4e7fa7354bd004778966eea2e98fcdcfb74dbe953128f8df9d71aaba

See more details on using hashes here.

File details

Details for the file forum_dl-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: forum_dl-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 52.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for forum_dl-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c476ade4741c7da62a09f84b47cf00fc79d0860713729c29da5e7d11ab31c92b
MD5 a1ee677013b493c4e54392830cba351d
BLAKE2b-256 ab6fca668a192449838d74c0d8c9e3e7fdbd4b7bf341a7519683cf662ed7a2c1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page