WikiTextParser

A simple, purely Python, WikiText parsing tool for MediaWiki.

The purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, etc. found in wikitext.

WikiTextParser currently only supports Python 3.3+

Installation

Use pip:

pip install wikitextparser

Usage

Here is a short demo of some of the functionalities:

>>> import wikitextparser as wtp

WikiTextParser can detect sections, parser functions, templates, wikilinks, external links, arguments, tables, and HTML comments in your wikitext:

>>> wt = wtp.parse("""
== h2 ==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value1{{text|value2}}}}

[[A|B]]""")
>>>
>>> wt.templates
[Template('{{text|value2}}'), Template('{{text|value1{{text|value2}}}}')]
>>> wt.templates[1].arguments
[Argument("|value1{{text|value2}}")]
>>> wt.templates[1].arguments[0].value = 'value3'
>>> print(wt)

== h2 ==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value3}}

[[A|B]]

It provides easy-to-use properties so you can get or set names or values of templates, arguments, wikilinks, etc.:

>>> wt.wikilinks
[WikiLink("[[A|B]]")]
>>> wt.wikilinks[0].target = 'Z'
>>> wt.wikilinks[0].text = 'X'
>>> wt.wikilinks[0]
WikiLink('[[Z|X]]')
>>>
>>> from pprint import pprint
>>> pprint(wt.sections)
[Section('\n'),
 Section('== h2 ==\nt2\n\n=== h3 ===\nt3\n\n'),
 Section('=== h3 ===\nt3\n\n'),
 Section('== h22 ==\nt22\n\n{{text|value3}}\n\n[[Z|X]]')]
>>>
>>> wt.sections[1].title = 'newtitle'
>>> print(wt)

==newtitle==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value3}}

[[Z|X]]
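
The parser functions and external links mentioned above are exposed through similar properties. The following is a minimal sketch, assuming the parser_functions and external_links properties (and their url and text attributes) behave as shown; the exact reprs may differ between versions:

>>> wt2 = wtp.parse('{{#expr:2+2}} [https://www.mediawiki.org MediaWiki]')
>>> wt2.parser_functions
[ParserFunction('{{#expr:2+2}}')]
>>> wt2.external_links[0].url
'https://www.mediawiki.org'
>>> wt2.external_links[0].text
'MediaWiki'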

Templates have a pprint method that pretty-prints them:

>>> p = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t2, t1 = p.templates
>>> print(t2.pprint())
{{t2
    | e = e
    | f = f
}}
>>> print(t1.pprint())
{{t1
    | b = b
    | c = c
    | d = {{t2
        | e = e
        | f = f
    }}
}}

If you are dealing with [[Category:Pages using duplicate arguments in template calls]], there are two functions that may be helpful:

>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')
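
Arguments can also be read or set by name. This is a minimal sketch, assuming get_arg and set_arg methods are available on Template objects (check the API of your installed version):

>>> t = wtp.Template('{{t|a=a|b=b}}')
>>> t.get_arg('b').value
'b'
>>> t.set_arg('c', 'c')
>>> t
Template('{{t|a=a|b=b|c=c}}')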

Extracting cell values of a table is easy:

>>> p = wtp.parse("""{|
|  Orange    ||   Apple   ||   more
|-
|   Bread    ||   Pie     ||   more
|-
|   Butter   || Ice cream ||  and more
|}""")
>>> pprint(p.tables[0].getdata())
[['Orange', 'Apple', 'more'],
 ['Bread', 'Pie', 'more'],
 ['Butter', 'Ice cream', 'and more']]

By default, values are rearranged according to colspan and rowspan attributes:

>>> t = wtp.Table("""{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "2" | d || e
|-
|}""")
>>> t.getdata(span=True)
[['a', 'b', 'c'], ['d', 'd', 'e']]
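
Passing span=False is assumed to return the cells as they appear in the markup, without duplicating the spanned value:

>>> t.getdata(span=False)
[['a', 'b', 'c'], ['d', 'e']]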

Have a look at the test modules for more details and potential pitfalls.

Compared with mwparserfromhell

mwparserfromhell is a mature and widely used library with nearly the same purpose as wikitextparser. The main reason I created wikitextparser was that mwparserfromhell could not parse wikitext in certain situations that I needed it for. See mwparserfromhell's issues 40, 42, 88, and other related issues. In many of those situations, wikitextparser may be able to give you more acceptable results.

But if you need to

  • use Python 2

  • parse style tags like '''bold''' and ''italics'' (with some limitations, of course)

  • extract HTML tags or entities

then mwparserfromhell or maybe other libraries will be the way to go. Also note that wikitextparser is still under development and the API may change drastically in future versions.

Of course, wikitextparser has its own unique features, too: extracting wikitable data as Python lists, pretty-printing templates, and a few other advanced functions. Adding some of the features listed above is planned for the future.

I have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage, but in my limited experience wikitextparser performs decently, even though some critical parts of mwparserfromhell (the tokenizer) are written in C. I guess wikitextparser should be able to compete and even have some performance benefits in many situations. Note that wikitextparser does not try to create a complete parse tree; instead, it figures things out as the user requests them. However, if you are working with online data, any difference is usually negligible, as the main bottleneck will be network latency.
