
DokuWiki Dumper

A tool for archiving DokuWiki.

We recommend running dokuWikiDumper on a modern filesystem such as ext4 or btrfs. NTFS is not recommended because it forbids many special characters in filenames.

Requirements

dokuWikiDumper

  • Python 3.8+ (developed on py3.10)
  • beautifulsoup4
  • requests
  • lxml

dokuWikiUploader

Uploads a wiki dump to the Internet Archive. Run dokuWikiUploader -h for help.

  • internetarchive
  • 7z (7z command)

Install dokuWikiDumper

dokuWikiUploader is included in dokuWikiDumper.

Install dokuWikiDumper with pip (recommended)

https://pypi.org/project/dokuwikidumper/

pip3 install dokuWikiDumper

Install dokuWikiDumper with Poetry (for developers)

  • Install Poetry

    pip3 install poetry
    
  • Install dokuWikiDumper

    git clone https://github.com/saveweb/dokuwiki-dumper
    cd dokuwiki-dumper
    poetry install
    rm -rf dist/
    poetry build
    pip install --force-reinstall dist/dokuWikiDumper*.whl
    

Usage

usage: dokuWikiDumper [-h] [--content] [--media] [--html] [--pdf] [--current-only] [--skip-to SKIP_TO] [--path PATH] [--no-resume]
                      [--threads THREADS] [--insecure] [--ignore-errors] [--ignore-action-disabled-edit] [--username USERNAME]
                      [--password PASSWORD] [--cookies COOKIES] [--auto]
                      url

dokuWikiDumper

positional arguments:
  url                   URL of the dokuWiki

options:
  -h, --help            show this help message and exit
  --content             Dump content
  --media               Dump media
  --html                Dump HTML
  --pdf                 Dump PDF [default: false] (Only available on some wikis with the PDF export plugin) (Only dumps the latest PDF revision)
  --current-only        Dump latest revision, no history [default: false] (only for HTML at the moment)
  --skip-to SKIP_TO     !DEV! Skip to title number [default: 0]
  --path PATH           Specify dump directory [default: <site>-<date>]
  --no-resume           Do not resume a previous dump [default: resume]
  --threads THREADS     Number of sub threads to use [default: 1], not recommended to set > 5
  --insecure            Disable SSL certificate verification
  --ignore-errors       !DANGEROUS! Ignore errors in the sub threads. This may cause incomplete dumps.
  --ignore-action-disabled-edit
                        Some sites disable the edit action for anonymous users and some core pages. This option ignores that error and
                        the "textarea not found" error, but you may only get a partial dump. (only works with --content)
  --username USERNAME   login: username
  --password PASSWORD   login: password
  --cookies COOKIES     cookies file
  --auto                dump: content+media+html, threads=5, ignore-action-disabled-edit

For most cases, you can use --auto to dump the site.

dokuWikiDumper https://example.com/wiki/ --auto

which is equivalent to

dokuWikiDumper https://example.com/wiki/ --content --media --html --threads 5 --ignore-action-disabled-edit

We highly recommend logging in with --username and --password (or using --cookies), because some sites prevent anonymous users from accessing certain pages or viewing the raw wikitext.

--cookies accepts a Netscape-format cookies file; you can use the cookies.txt extension to export cookies from Firefox. It also accepts a JSON cookies file created by Cookie Quick Manager.
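To illustrate what a Netscape-format cookies file looks like, here is a small sketch that writes one and parses it with the Python standard library. dokuWikiDumper reads the file itself; the cookie name and domain below are made up for the example.

```python
# Illustration only: the Netscape cookie format and how the Python
# standard library parses it. dokuWikiDumper handles this internally.
import http.cookiejar
import os
import tempfile

# Tab-separated fields: domain, domain_specified, path, secure, expires, name, value
netscape = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t2147483647\tDokuWiki\tsessionid123\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(netscape)
    path = f.name

jar = http.cookiejar.MozillaCookieJar(path)
jar.load(ignore_discard=True, ignore_expires=True)
for cookie in jar:
    print(cookie.name, cookie.domain)

os.remove(path)
```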

Dump structure

Directory or File     Description
attic/                Old revisions of pages (wikitext).
dumpMeta/             (dokuWikiDumper only) Metadata of the dump.
dumpMeta/check.html   The ?do=check page of the wiki.
dumpMeta/config.json  The dump's configuration.
dumpMeta/favicon.ico  Favicon of the site.
dumpMeta/files.txt    List of filenames.
dumpMeta/index.html   Homepage of the wiki.
dumpMeta/info.json    Information about the wiki.
dumpMeta/titles.txt   List of page titles.
html/                 (dokuWikiDumper only) HTML of the pages.
media/                Media files.
meta/                 Metadata of the pages.
pages/                Latest page content (wikitext).
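With the uncompressed naming scheme ($conf['compression'] = '0', described below), each file in attic/ is named <id>.<rev_id>.txt, where <rev_id> is a Unix timestamp. A small sketch of grouping those revision files by page id — the filenames here are invented for illustration:

```python
# Sketch: group revision files in attic/ by page id, assuming the
# uncompressed naming scheme <id>.<rev_id>.txt. With 'gz' or 'bz2'
# compression there is an extra .gz/.bz2 suffix to strip first.
from collections import defaultdict

def group_revisions(filenames):
    revisions = defaultdict(list)
    for name in filenames:
        stem, _, ext = name.rpartition(".")
        if ext != "txt":
            continue
        # The page id itself may contain dots, so split off only
        # the last dotted component as the revision timestamp.
        page_id, _, rev_id = stem.rpartition(".")
        revisions[page_id].append(int(rev_id))
    return {page: sorted(revs) for page, revs in revisions.items()}

files = ["start.1596987142.txt", "start.1700000000.txt", "wiki.syntax.1596987142.txt"]
print(group_revisions(files))
# {'start': [1596987142, 1700000000], 'wiki.syntax': [1596987142]}
```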

Available Backups/Dumps

I made some backups for testing, you can check out the list: https://github.com/orgs/saveweb/projects/4.

If you dumped a DokuWiki and want to share it, please feel free to open an issue, I will add it to the list.

How to import dump to DokuWiki

If you need to import a dump back into DokuWiki, add the following configuration to local.php:

$conf['fnencode'] = 'utf-8'; // DokuWiki default: 'safe' (URL encode)
# 'safe'  => Non-ASCII characters are escaped in %xx form.
# 'utf-8' => Non-ASCII characters are preserved as UTF-8 characters.

$conf['compression'] = '0'; // DokuWiki default: 'gz'.
# 'gz'  => attic/<id>.<rev_id>.txt.gz
# 'bz2' => attic/<id>.<rev_id>.txt.bz2
# '0'   => attic/<id>.<rev_id>.txt
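The difference between the two fnencode settings can be shown with a quick approximation: 'safe' escapes non-ASCII bytes as %xx (roughly PHP's urlencode), while 'utf-8' keeps the raw characters. This sketch uses Python's urllib.parse as a stand-in, not DokuWiki's actual utf8_encodeFN() implementation:

```python
# Approximate illustration of 'safe' vs 'utf-8' filename encoding.
# Not DokuWiki's real utf8_encodeFN(); urllib.parse.quote behaves
# similarly for non-ASCII characters.
from urllib.parse import quote, unquote

page_id = "café"
safe_name = quote(page_id)  # %xx-escaped, as under fnencode 'safe'
utf8_name = page_id         # raw UTF-8, as under fnencode 'utf-8'

print(safe_name)                          # caf%C3%A9
print(unquote(safe_name) == utf8_name)    # True
```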

  • Import the pages dir if you only need the latest revision of each page.
  • Import the meta dir if you need the changelog of each page.
  • Import the attic and meta dirs if you need the content of old revisions.
  • Import the media dir if you need the media files.

The dumpMeta and html dirs are used only by dokuWikiDumper; you can ignore them.
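The import steps above amount to copying the chosen directories into DokuWiki's data directory. A minimal sketch, where both paths are assumptions for illustration:

```python
# Hypothetical sketch: copy the dump directories you need into a
# DokuWiki data directory. Both paths below are made-up examples.
import shutil
from pathlib import Path

dump = Path("example.com-20240101")    # hypothetical dump directory
data = Path("/var/www/dokuwiki/data")  # hypothetical DokuWiki data dir

for subdir in ("pages", "meta", "attic", "media"):
    src = dump / subdir
    if src.is_dir():
        # dirs_exist_ok merges into an existing DokuWiki data tree
        shutil.copytree(src, data / subdir, dirs_exist_ok=True)
```

Remember to clear DokuWiki's index/cache (or let it rebuild) after copying, and make sure the fnencode and compression settings above match how the dump was written.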

Information

Other tools

  • MediaWiki Scraper (aka wikiteam3), a tool for archiving MediaWiki, forked from WikiTeam and rewritten in Python 3.
  • WikiTeam, a tool for archiving MediaWiki, written in Python 2.

License

GPLv3

Contributors

This tool is based on an unmerged PR (8 years ago!) of WikiTeam: DokuWiki dump alpha by @PiRSquared17.

I (@yzqzss) rewrote the code in Python 3, added some features, and fixed some bugs.

Download files

Source distribution: dokuwikidumper-0.1.7.tar.gz (36.4 kB)
Built distribution: dokuwikidumper-0.1.7-py3-none-any.whl (41.3 kB)
