
Utility for batch downloading selected pages from MediaWiki sites as printable PDFs.


mwpdfify

Batch download multiple pages from MediaWiki sites (all pages, or the pages of one category) as printable PDFs.

Install / Run

pip install mwpdfify

...or clone repo and pip install .

...or directly download and run src/mwpdfify.py

There are two PDF rendering backends to choose from: pdfkit (installed as a dependency by default) or weasyprint. Use pip install -r requirements.txt to install both, or install the one you want yourself. If you use pdfkit, remember to also install wkhtmltopdf on your system.
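
As a rough illustration of how the two backends differ, the sketch below renders a single URL with either library. `backend_name` and `render_pdf` are hypothetical helpers, not mwpdfify's own functions; the library calls used (`pdfkit.from_url` and `weasyprint.HTML(url=...).write_pdf(...)`) are the entry points those packages provide.

```python
def backend_name(use_weasyprint: bool) -> str:
    """Map the -w/--use-weasyprint flag to a backend name (hypothetical helper)."""
    return "weasyprint" if use_weasyprint else "pdfkit"


def render_pdf(url: str, out_path: str, use_weasyprint: bool = False) -> None:
    """Render one page to a PDF file with the chosen backend.

    Imports are deferred so that only the selected backend has to be installed.
    """
    if backend_name(use_weasyprint) == "weasyprint":
        from weasyprint import HTML  # requires the weasyprint package
        HTML(url=url).write_pdf(out_path)
    else:
        import pdfkit  # wrapper library; needs the wkhtmltopdf binary installed
        pdfkit.from_url(url, out_path)
```

Deferring the imports means a missing backend only fails when it is actually selected.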

Usage

  1. Get the address of the root of your wiki, where its api.php and index.php reside. Typically this is the site's root (/). For Wikipedia it is /w/; tell me if there are other exceptions ;)
  2. (optional) If you want only a specific category, get its title (in the form of Category:XXX)
  3. Run the script, e.g.:
    • mwpdfify https://lycoris-recoil.fandom.com - Download all pages (as in Special:AllPages) from Lycoris Recoil Fandom Wiki as PDF
    • mwpdfify wiki.archlinux.org -c Category:Installation_process - Download all pages under Category:Installation_process from ArchWiki as PDF
    • mwpdfify https://en.wikipedia.org/w/ -c Category:Guangzhou_Metro_stations -l 10 -t 4 - Download all pages under Category:Guangzhou_Metro_stations (except subcategories) from Wikipedia, with 4 download threads and a limit of 10 results per query
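
The steps above can be sketched in Python as follows. This is a minimal illustration, not mwpdfify's actual code: `normalize_root`, `endpoints` and `list_query` are hypothetical names, and defaulting to HTTPS when no scheme is given is an assumption. The query parameters themselves (`list=allpages`, `list=categorymembers`, `aplimit`, `cmtitle`, `cmtype`, `cmlimit`, and the value `max`) belong to the standard MediaWiki Action API.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize_root(url: str) -> str:
    """Return the wiki root with a scheme and exactly one trailing slash."""
    if "://" not in url:
        url = "https://" + url  # assumption: default to HTTPS when no scheme is given
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/"
    return f"{parts.scheme}://{parts.netloc}{path}"


def endpoints(url: str) -> tuple:
    """Return the (api.php, index.php) URLs located under the wiki root."""
    root = normalize_root(url)
    return root + "api.php", root + "index.php"


def list_query(category: Optional[str] = None, limit: int = 0) -> dict:
    """Build MediaWiki API parameters for listing pages.

    A limit of 0 is mapped to "max" (assumed to match the -l option's default).
    """
    lim = "max" if limit == 0 else str(limit)
    params = {"action": "query", "format": "json"}
    if category is None:
        # Equivalent of Special:AllPages
        params.update({"list": "allpages", "aplimit": lim})
    else:
        params.update({
            "list": "categorymembers",
            "cmtitle": category,
            "cmtype": "page",  # pages only; subcategories and files are skipped
            "cmlimit": lim,
        })
    return params


print(endpoints("https://en.wikipedia.org/w/"))
print(list_query("Category:Guangzhou_Metro_stations", limit=10))
```

The API returns results in batches, so a real client would also pass the `continue` values from each response back into the next request until none are returned.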

The downloaded PDFs should be available in a folder named after the site's domain, inside the current directory.
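
One way to picture the output layout is the sketch below, which derives a per-page file path from the site URL and the page title. `output_path` is a hypothetical helper, and the file-name sanitizing rule is an assumption; the tool's real naming scheme may differ. The URL is assumed to include its scheme.

```python
import re
from pathlib import Path
from urllib.parse import urlsplit


def output_path(site_url: str, title: str, base: Path = Path(".")) -> Path:
    """Return <base>/<domain>/<safe title>.pdf for a page (hypothetical layout)."""
    domain = urlsplit(site_url).netloc
    # Replace characters that are unsafe in file names on common platforms.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)
    return base / domain / f"{safe_title}.pdf"


print(output_path("https://wiki.archlinux.org", "Installation guide"))
```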

See below for other parameters:

usage: mwpdfify [-h] [-c CATEGORY] [-p] [-t THREADS] [-l LIMIT] [-w] url

positional arguments:
  url                   site root of destination site

options:
  -h, --help            show this help message and exit
  -c CATEGORY, --category CATEGORY
                        Download only a specified category
  -p, --no-printable    Force normal instead of printable version of pages
  -t THREADS, --threads THREADS
                        Number of download threads, defaults to 8
  -l LIMIT, --limit LIMIT
                        Limit of JSON info returned at once, defaults to maximum
                        (0)
  -w, --use-weasyprint  Use weasyprint as PDF rendering backend
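
For readers who want to script around the tool, the help text above maps onto an argparse parser roughly like the one below. This is a reconstruction from the help output only, so mwpdfify's real parser may differ in details. `build_parser` and `download_all` are hypothetical names, and using a thread pool for `-t` is an assumption about how the download threads could be organised.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="mwpdfify")
    parser.add_argument("url", help="site root of destination site")
    parser.add_argument("-c", "--category",
                        help="Download only a specified category")
    parser.add_argument("-p", "--no-printable", action="store_true",
                        help="Force normal instead of printable version of pages")
    parser.add_argument("-t", "--threads", type=int, default=8,
                        help="Number of download threads, defaults to 8")
    parser.add_argument("-l", "--limit", type=int, default=0,
                        help="Limit of JSON info returned at once, "
                             "defaults to maximum (0)")
    parser.add_argument("-w", "--use-weasyprint", action="store_true",
                        help="Use weasyprint as PDF rendering backend")
    return parser


def download_all(titles, worker, threads=8):
    """Apply worker to every title using a pool of download threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map keeps results in the same order as the input titles.
        return list(pool.map(worker, titles))


args = build_parser().parse_args(
    ["https://en.wikipedia.org/w/", "-c", "Category:Guangzhou_Metro_stations",
     "-l", "10", "-t", "4"]
)
print(args.category, args.limit, args.threads)
```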

Known issues

  • &printable=yes is deprecated in recent versions of MediaWiki (and no substitute API is provided), so there may be layout issues with certain wikis, especially Fandom wikis, which also contain ads.
  • Recursively downloading pages from the subcategories of a category is not currently supported.
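
For context on the first issue, printable=yes is a query parameter appended to a page's index.php URL. The sketch below shows how such a URL could be built, with and without the parameter. `page_url` is a hypothetical helper, and whether mwpdfify builds its URLs exactly this way is an assumption.

```python
from urllib.parse import urlencode


def page_url(index_php: str, title: str, printable: bool = True) -> str:
    """Build the URL of a wiki page, optionally requesting the printable view."""
    params = {"title": title}
    if printable:
        params["printable"] = "yes"  # deprecated in recent MediaWiki versions
    return f"{index_php}?{urlencode(params)}"


print(page_url("https://wiki.archlinux.org/index.php", "Main_Page"))
print(page_url("https://wiki.archlinux.org/index.php", "Main_Page", printable=False))
```

Passing `printable=False` corresponds to the normal page view that the -p option forces.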

Changelog

  • v1.1.2 (2022/09/30):
    • Set pdfkit as a required dependency
  • v1.1 (2022/09/04):
    • Changed address handling logic
    • Bug fixes
  • v1.0 (2022/09/03):
    • Initial release

License

LGPLv3
