Parser and expander for Wikipedia, Wiktionary etc. dump files, with Lua execution support

wikitextprocessor

This is currently work in progress and expected to be released in October-November 2020. Until then, feel free to experiment, but the code has not yet been fully tested and may be broken on some days. Most things should already work.

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:

  • Parsing WikiMedia dumps, including built-in support for processing pages in parallel
  • Wikitext syntax parser that converts the whole page into a parse tree
  • Extracting template definitions and Scribunto Lua module definitions from dump files
  • Expanding selected templates or all templates, and code for heuristically identifying templates that need to be expanded before parsing is reasonably possible (e.g., templates that emit table start and end tags)
  • Processing and expanding Wikitext parser functions
  • Processing, executing, and expanding Scribunto Lua modules (they are very widely used in Wiktionary, for example for generating IPA strings for many languages)
  • Controlled expansion of parts of pages, for applications that parse the overall page structure first and then expand templates only within certain sections of the page
  • Capturing information from template arguments while expanding them, as template arguments often contain useful information not available in the expanded content.

This module is primarily intended as a building block for other packages that process Wiktionary or Wikipedia data, particularly for data extraction. You will need to write code to use this.

For pre-existing extraction modules that use this package, please see:

  • Wiktextract for extracting rich machine-readable dictionaries from Wiktionary.

Getting started

Installing

The best way to install this package is from PyPI:

pip3 install wikitextprocessor

Alternatively, you may install the master branch from GitHub:

git clone https://github.com/tatuylonen/wikitextprocessor
cd wikitextprocessor
pip3 install -e .

Running tests

This package includes tests written using the unittest framework. They can be run using, for example, nose, which can be installed using pip3 install nose.

To run the tests, use the following command in the top-level directory:

nosetests

Obtaining WikiMedia dump files

This package is primarily intended for processing Wiktionary and Wikipedia dump files (though you can also use it to process individual pages or files that are in Wikitext format). To download WikiMedia dump files, go to the dump download page at https://dumps.wikimedia.org. We recommend using the pages-articles.xml.bz2 files (for Wiktionary, this is about 17GB as of October 2020).
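
For example, a recent English Wiktionary dump can typically be fetched with a command such as:

wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2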

Expected performance

This can generally process a few pages per second per processor core, including expanding all templates and Lua macros and parsing the full page. On a multi-core machine, this can generally process a few dozen pages per second, depending on the speed and number of cores.

API documentation

Usage example:

   from wikitextprocessor import Wtp
   ctx = Wtp()

   def page_handler(model, title, text):
       if model != "wikitext" or title.startswith("Template:"):
           return None
       tree = ctx.parse(text, pre_expand=True)
       # ... process the parse tree here; individual nodes can be
       # expanded with, e.g., value = ctx.expand_node(node)

   ctx.process("enwiktionary-20200901-pages-articles.xml.bz2", page_handler)


class Wtp(object):

    __init__(self, quiet=False, num_threads=None)

    process(path, page_handler)
      - parses dump file, calls page_handler(model, title, text) for each page
        (in parallel using multiprocessing) and returns list of results
      - model is "wikitext" for normal pages and templates, "Scribunto"
        for Lua macros; other values are also possible
      - page_handler may be called in a separate process and cannot update
        external variables; the only way it can communicate out is through
        its return value (except if num_threads=1, in which case it is run
        in the parent process)
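
      - Example: a minimal sketch that counts how many pages of each content
        model the dump contains, collecting the counts through the handler
        return values (the dump file name is illustrative):

            from collections import Counter
            from wikitextprocessor import Wtp

            ctx = Wtp()

            def page_handler(model, title, text):
                # May run in a worker process; communicate results only
                # through the return value
                return model

            results = ctx.process("enwiktionary-20200901-pages-articles.xml.bz2",
                                  page_handler)
            print(Counter(results))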

    parse(text, pre_expand=False, expand_all=False)
      - parses the text as Wikitext, returning a parse tree.  If pre_expand
        is True, first expands those templates that affect the overall
        Wikitext syntax.  If expand_all is True, then expands all templates
        and Lua macros before parsing.  start_page() must be called before
        this.
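
      - Example: a sketch of parsing a single page outside process(); note
        the explicit start_page() call (the page title and text are made up):

            from wikitextprocessor import Wtp

            ctx = Wtp()
            ctx.start_page("Example page")
            tree = ctx.parse("== Heading ==\n\nSome '''bold''' text.")
            # tree is the root node of the parse tree for the page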

    expand(text, pre_only=False, template_fn=None,
           templates_to_expand=None,
           expand_parserfns=True, expand_invoke=True)
      - expands templates, parser functions, and Lua macros from
        the text.  start_page() must be called before this.
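
      - Example: a sketch that expands a parser function call; the expected
        output assumes that the #if parser function behaves as in MediaWiki:

            from wikitextprocessor import Wtp

            ctx = Wtp()
            ctx.start_page("Test page")
            # Expected to print "nonempty" if #if behaves as in MediaWiki
            print(ctx.expand("{{#if:x|nonempty|empty}}"))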

    expand_node(node, template_fn=None, templates_to_expand=None,
                expand_parserfns=True, expand_invoke=True)
      - expands the wikitext covered by the given node in a parse tree
        returned by parse()
      - XXX this function has not yet been implemented

    start_page(title)
      - this must be called to start processing a new page
      - automatically called by process() during the second pass, before
        calling the page handler
      - no need to call this when processing pages via process(), but this
        must be called if processing pages obtained otherwise

    add_page(model, title, text)
      - Adds a new page for interpretation (it could define a template or a
        Lua module, or it could be a normal wikitext page).  Pages are saved
        in a temporary file for use during expansion.
      - This is exposed primarily for testing or for processing single pages
        without reading the whole dump file.
      - This is automatically called by process(), so there is normally no
        need to call this explicitly.

    analyze_templates()
      - Analyzes which templates should be expanded before parsing a page
        (e.g., because they may produce syntactic elements, such as table
        starts or table rows).
      - This is automatically called by process(), so there is normally no
        need to call this explicitly.  However, if templates are added by
        calling add_page() manually, then this should be called after adding
        the last template.
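
      - Example: a sketch of processing a single page with a manually added
        template that emits a table start, so that pre_expand parsing can
        recognize the table (the template and page contents are made up):

            from wikitextprocessor import Wtp

            ctx = Wtp()
            # Template that emits a table start; it must be expanded before
            # parsing for the table structure to be recognized
            ctx.add_page("wikitext", "Template:table-start",
                         '{| class="wikitable"')
            ctx.analyze_templates()
            ctx.start_page("Test page")
            tree = ctx.parse("{{table-start}}\n|-\n| cell\n|}", pre_expand=True)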


