Utility for batch downloading (certain) pages from MediaWiki sites as printable PDFs.

mwpdfify

Batch download pages from MediaWiki sites (all pages, or only the pages of one category) as printable PDFs.

Install / Run

pip install mwpdfify

...or clone repo and pip install .

...or directly download and run src/mwpdfify.py

There are two PDF rendering backends to choose from: pdfkit (installed as a dependency by default) or weasyprint. Use pip install -r requirements.txt to install both, or install the one you want yourself. If you use pdfkit, remember to also install wkhtmltopdf on your system.
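To check which backend is usable before a long batch run, something like the following may help. It is only a sketch: the function name available_backends is made up for this example and is not part of mwpdfify, and looking for wkhtmltopdf on PATH is a heuristic (pdfkit can also be pointed at a binary elsewhere).

```python
import importlib.util
import shutil


def available_backends():
    """Return the names of the PDF backends that look usable on this machine."""
    found = []
    # pdfkit is only a wrapper: it also needs the wkhtmltopdf binary.
    if importlib.util.find_spec("pdfkit") and shutil.which("wkhtmltopdf"):
        found.append("pdfkit")
    # weasyprint renders by itself, so the Python package is enough here.
    if importlib.util.find_spec("weasyprint"):
        found.append("weasyprint")
    return found


if __name__ == "__main__":
    print(available_backends())
```

If only weasyprint shows up in the list, pass -w when running mwpdfify.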

Usage

  1. Get the address of the root of your wiki, where its api.php and index.php reside. Typically it's identical to the site's root (/). For Wikipedia it's at /w/; tell me if there are other exceptions ;)
  2. (optional) If you want only a specific category, get its title (in the form of Category:XXX)
  3. Run the script, e.g.:
    • mwpdfify https://lycoris-recoil.fandom.com - Download all pages (as in Special:AllPages) from Lycoris Recoil Fandom Wiki as PDF
    • mwpdfify wiki.archlinux.org -c Category:Installation_process - Download all pages under Category:Installation_process from ArchWiki as PDF
    • mwpdfify https://en.wikipedia.org/w/ -c Category:Guangzhou_Metro_stations -l 10 -t 4 - Download all pages under Category:Guangzhou_Metro_stations (except subcategories) from Wikipedia, with 4 download threads and a one-time query limit of 10
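To see why the wiki root matters, here is a rough sketch of how the root, the category and the limit could map onto a MediaWiki API query. This is not the actual code of mwpdfify: api_endpoint and list_query are names invented for this example, and defaulting to https when no scheme is given is an assumption. The query parameters themselves (list=allpages with aplimit, list=categorymembers with cmtitle and cmlimit) are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode, urlsplit


def api_endpoint(root):
    """Turn a wiki root such as 'wiki.archlinux.org' into its api.php URL."""
    # Assumption: an address without a scheme is meant to be https.
    if "://" not in root:
        root = "https://" + root
    parts = urlsplit(root)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{parts.scheme}://{parts.netloc}{path}api.php"


def list_query(root, category=None, limit="max"):
    """Build the URL that asks the API for one batch of page titles."""
    params = {"action": "query", "format": "json"}
    if category:
        # Only the members of one category.
        params.update(list="categorymembers", cmtitle=category, cmlimit=limit)
    else:
        # Every page, as listed on Special:AllPages.
        params.update(list="allpages", aplimit=limit)
    return api_endpoint(root) + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_query("https://lycoris-recoil.fandom.com"))
    print(list_query("https://en.wikipedia.org/w/",
                     "Category:Guangzhou_Metro_stations", limit=10))
```

Each URL returns one batch only; larger wikis need the continue values from the response to fetch the following batches.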

The downloaded PDFs should be available in a folder named after the site's domain name in the current directory.
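If you need to locate that folder from another script, the domain can be derived from the same address you passed on the command line. This is a guess at the naming, not the tool's actual code: output_dir is a hypothetical helper, and the exact folder name mwpdfify produces may differ (for example in how it treats ports).

```python
from pathlib import Path
from urllib.parse import urlsplit


def output_dir(url, base="."):
    """Guess the folder in which the PDFs for the given wiki address end up."""
    # Assumption: an address without a scheme is meant to be https.
    if "://" not in url:
        url = "https://" + url
    # hostname drops the path, so '/w/' on Wikipedia is not part of the name.
    return Path(base) / urlsplit(url).hostname


if __name__ == "__main__":
    print(output_dir("https://en.wikipedia.org/w/"))
```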

See below for other parameters:

usage: mwpdfify [-h] [-c CATEGORY] [-p] [-t THREADS] [-l LIMIT] [-w] url

positional arguments:
  url                   site root of destination site

options:
  -h, --help            show this help message and exit
  -c CATEGORY, --category CATEGORY
                        Download only a specified category
  -p, --no-printable    Force normal instead of printable version of pages
  -t THREADS, --threads THREADS
                        Number of download threads, defaults to 8
  -l LIMIT, --limit LIMIT
                        Limit of JSON info returned at once, defaults to maximum
                        (0)
  -w, --use-weasyprint  Use weasyprint as PDF rendering backend
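
For anyone who wants to wrap or script the tool, the help text above corresponds to an argparse setup roughly like the one below. It is reconstructed from the help output only, so take it as an approximation: build_parser is a name chosen for this example, and the types (int for threads and limit) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the same options as the help text above."""
    parser = argparse.ArgumentParser(prog="mwpdfify")
    parser.add_argument("url", help="site root of destination site")
    parser.add_argument("-c", "--category",
                        help="Download only a specified category")
    parser.add_argument("-p", "--no-printable", action="store_true",
                        help="Force normal instead of printable version of pages")
    # Assumption: threads and limit are parsed as integers.
    parser.add_argument("-t", "--threads", type=int, default=8,
                        help="Number of download threads, defaults to 8")
    parser.add_argument("-l", "--limit", type=int, default=0,
                        help="Limit of JSON info returned at once, "
                             "defaults to maximum (0)")
    parser.add_argument("-w", "--use-weasyprint", action="store_true",
                        help="Use weasyprint as PDF rendering backend")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing the Wikipedia example from the Usage section with this parser gives threads=4, limit=10 and the category title, with both flags left off.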

Known issues

  • &printable=yes is deprecated in recent versions of MediaWiki (and no substitute API is provided), so there might be layout issues with certain wikis, especially Fandom wikis, as they also contain ads.
  • Recursively downloading pages from subcategories of a category is currently not supported.
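
To check the first issue on a particular wiki, it can help to open both versions of one page in a browser and compare them. The sketch below builds the two URLs by hand; page_url is a hypothetical helper, and the assumption is that pages are fetched through index.php?title=... with &printable=yes appended, which is how the parameter normally works in MediaWiki but may not match the tool's code exactly.

```python
from urllib.parse import urlencode


def page_url(root, title, printable=True):
    """Build the index.php URL of a page, optionally in its printable version."""
    # Assumption: an address without a scheme is meant to be https.
    if "://" not in root:
        root = "https://" + root
    if not root.endswith("/"):
        root += "/"
    params = {"title": title}
    if printable:
        # Deprecated in recent MediaWiki versions, see the first issue above.
        params["printable"] = "yes"
    return root + "index.php?" + urlencode(params)


if __name__ == "__main__":
    print(page_url("https://en.wikipedia.org/w/", "Guangzhou_Metro"))
    print(page_url("https://en.wikipedia.org/w/", "Guangzhou_Metro",
                   printable=False))
```

If the printable version looks worse than the normal one, running mwpdfify with -p may give better results.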

Changelog

  • v1.1.2 (2022/09/30):
    • Set pdfkit as required dependency
  • v1.1 (2022/09/04):
    • Changed address handling logic
    • Bug fixes
  • v1.0 (2022/09/03):
    • Initial release

License

LGPLv3
