A simple, pure-Python WikiText parsing tool.
WikiTextParser
A simple-to-use WikiText parsing library for MediaWiki.
Its purpose is to let users easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, etc. found in wikitext.
WikiTextParser currently supports only Python 3.3+.
Installation
Use pip install wikitextparser
Usage
Here is a short demo of some of its functionality:
>>> import wikitextparser as wtp
WikiTextParser can detect sections, parser functions, templates, wikilinks, external links, arguments, tables, and HTML comments in your wikitext:
>>> wt = wtp.parse("""
== h2 ==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value1{{text|value2}}}}

[[A|B]]""")
>>>
>>> wt.templates
[Template('{{text|value2}}'), Template('{{text|value1{{text|value2}}}}')]
>>> wt.templates[1].arguments
[Argument("|value1{{text|value2}}")]
>>> wt.templates[1].arguments[0].value = 'value3'
>>> print(wt)
== h2 ==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value3}}

[[A|B]]
It provides easy-to-use properties so you can get or set names or values of templates, arguments, wikilinks, etc.:
>>> wt.wikilinks
[WikiLink("[[A|B]]")]
>>> wt.wikilinks[0].target = 'Z'
>>> wt.wikilinks[0].text = 'X'
>>> wt.wikilinks[0]
WikiLink('[[Z|X]]')
>>>
>>> from pprint import pprint
>>> pprint(wt.sections)
[Section('\n'),
Section('== h2 ==\nt2\n\n=== h3 ===\nt3\n\n'),
Section('=== h3 ===\nt3\n\n'),
Section('== h22 ==\nt22\n\n{{text|value3}}\n\n[[Z|X]]')]
>>>
>>> wt.sections[1].title = 'newtitle'
>>> print(wt)
==newtitle==
t2

=== h3 ===
t3

== h22 ==
t22

{{text|value3}}

[[Z|X]]
Templates have a pprint method that returns a pretty-printed version of the template:
>>> p = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t2, t1 = p.templates
>>> print(t2.pprint())
{{t2
    | e = e
    | f = f
}}
>>> print(t1.pprint())
{{t1
    | b = b
    | c = c
    | d = {{t2
        | e = e
        | f = f
    }}
}}
If you are dealing with [[Category:Pages using duplicate arguments in template calls]], there are two methods that may be helpful:
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')
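The difference between the two methods can be modelled on a plain list of (name, value) pairs. The sketch below is not the library's implementation: the helper names remove_identical_duplicates and keep_last_of_duplicates are hypothetical, and the rules are simplified assumptions chosen only to reproduce the two results shown above.

```python
def remove_identical_duplicates(args):
    """Drop an argument only when an identical (name, value) pair appears later.

    Assumed model of the "safe" strategy: no distinct value is ever lost.
    """
    kept = []
    for index, pair in enumerate(args):
        if pair in args[index + 1:]:
            continue  # an exact copy follows, so this one is redundant
        kept.append(pair)
    return kept


def keep_last_of_duplicates(args):
    """Keep only the last argument for each name.

    Assumed model of the more aggressive strategy: earlier values are discarded.
    """
    kept = []
    for index, (name, value) in enumerate(args):
        later_names = [later_name for later_name, _ in args[index + 1:]]
        if name in later_names:
            continue  # a later argument with the same name takes its place
        kept.append((name, value))
    return kept


# The arguments of {{t|a=a|a=b|a=a}} as (name, value) pairs.
args = [('a', 'a'), ('a', 'b'), ('a', 'a')]
safe_result = remove_identical_duplicates(args)
last_result = keep_last_of_duplicates(args)
print(safe_result)
print(last_result)
```

In this model the first strategy only removes exact repeats, while the second can change which values survive, so it is worth reviewing the result before saving a page.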
Extracting cell values of a table is easy:
>>> p = wtp.parse("""{|
| Orange || Apple || more
|-
| Bread || Pie || more
|-
| Butter || Ice cream || and more
|}""")
>>> pprint(p.tables[0].data())
[['Orange', 'Apple', 'more'],
['Bread', 'Pie', 'more'],
['Butter', 'Ice cream', 'and more']]
Values are rearranged according to colspan and rowspan attributes (by default):
>>> t = wtp.Table("""{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "2" | d || e
|-
|}""")
>>> t.data(span=True)
[['a', 'b', 'c'], ['d', 'd', 'e']]
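To make the rearrangement concrete, here is a minimal stand-alone sketch of how spanned cells can be expanded into a grid. It does not use wikitextparser and is not taken from its source: the function name expand_spans and the (value, colspan, rowspan) input format are assumptions made for illustration.

```python
def expand_spans(rows):
    """Repeat each cell value over every grid position its spans cover.

    rows is a list of rows; each row is a list of
    (value, colspan, rowspan) tuples in source order.
    """
    grid = {}  # maps (row_index, column_index) to a cell value
    for row_index, row in enumerate(rows):
        column = 0
        for value, colspan, rowspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (row_index, column) in grid:
                column += 1
            for row_offset in range(rowspan):
                for column_offset in range(colspan):
                    grid[(row_index + row_offset, column + column_offset)] = value
            column += colspan
    if not grid:
        return []
    row_count = max(r for r, _ in grid) + 1
    result = []
    for r in range(row_count):
        columns = sorted(c for (rr, c) in grid if rr == r)
        result.append([grid[(r, c)] for c in columns])
    return result


# The table from the example above: the "d" header spans two columns.
colspan_rows = [
    [('a', 1, 1), ('b', 1, 1), ('c', 1, 1)],
    [('d', 2, 1), ('e', 1, 1)],
]
colspan_grid = expand_spans(colspan_rows)

# A rowspan pushes the next row's first cell one column to the right.
rowspan_rows = [
    [('x', 1, 2), ('y', 1, 1)],
    [('z', 1, 1)],
]
rowspan_grid = expand_spans(rowspan_rows)
print(colspan_grid)
print(rowspan_grid)
```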
By calling the cells method of a Table, you can access table cells as Cell objects which provide methods for getting or setting each cell’s attributes and values individually.
>>> cell = t.cells(row=1, column=1)
>>> cell.attrs
{'colspan': '2'}
>>> cell.set('colspan', '3')
>>> print(t.string)
{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "3" | d || e
|-
|}
Cell (and Tag) instances have get, set, has, and delete methods to access their attributes.
Have a look at the test modules for more examples and probable pitfalls.
Compared with mwparserfromhell
mwparserfromhell is a mature and widely used library with nearly the same purposes as wikitextparser. The main reason I created wikitextparser was that mwparserfromhell could not parse wikitext in certain situations that I needed it for. See mwparserfromhell’s issues 40, 42, 88, and other related issues. In many of those situations wikitextparser may be able to give you more acceptable results.
But if you need to:
use Python 2, or
parse style tags like '''bold''' and ''italics'' (with some limitations, of course),
then mwparserfromhell or maybe other libraries will be the way to go. Also note that wikitextparser is still under development and the API may change drastically in future versions.
Adding some of the features above is planned for the future…
Of course, wikitextparser has its own unique features too: access to the individual cells of each table, pretty-printing of templates, and a few other advanced functions.
I have not rigorously compared the two libraries in terms of performance (execution time and memory usage), but in my limited experience wikitextparser performs decently, even though some critical parts of mwparserfromhell (the tokenizer) are written in C. I guess wikitextparser should be able to compete, and may even have some performance benefits in many situations. Note that wikitextparser does not try to create a complete parse tree; instead, it figures things out as the user requests them. However, if you are working with online data, any difference is usually negligible, as the main bottleneck will be network latency.
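The idea of working things out only on request can be illustrated with a small stand-alone sketch. This is not how wikitextparser is implemented: the class LazyWikiText, its scan_count counter, and the regex-with-masking approach are all assumptions made for this example. Nothing is scanned when the object is created, and the template scan runs once, on first access.

```python
import re

# Matches a template that contains no other braces, i.e. an innermost one.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')


class LazyWikiText:
    """Hypothetical wrapper that scans for templates only on first access."""

    def __init__(self, string):
        self.string = string
        self.scan_count = 0           # how many times the text was scanned
        self._template_spans = None   # filled in lazily

    def _scan_templates(self):
        self.scan_count += 1
        masked = self.string
        spans = []
        while True:
            matches = list(INNERMOST_TEMPLATE.finditer(masked))
            if not matches:
                return spans
            for match in matches:
                start, end = match.span()
                spans.append((start, end))
                # Replace the match with same-length filler so that offsets
                # stay valid and the enclosing template becomes innermost.
                masked = masked[:start] + '_' * (end - start) + masked[end:]

    @property
    def templates(self):
        if self._template_spans is None:
            self._template_spans = self._scan_templates()
        return [self.string[start:end] for start, end in self._template_spans]


text = LazyWikiText('{{text|value1{{text|value2}}}}')
scans_before_access = text.scan_count
found = text.templates
found_again = text.templates  # served from the cached spans
print(found)
```

In this sketch, inner templates are reported before the ones that enclose them, which happens to match the order of wt.templates in the demo above.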
Download files
Built Distribution
Hashes for wikitextparser-0.10.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 2498e871041dade4101bde95184848d9dc3e5fa539a672061e4c1e9040cf1e11
MD5 | e13136fcd68219fc0a98c028e587dd29
BLAKE2b-256 | f991ab3f1051c150f90cc7e2e74a4ea52209882e76c04cc64ecfcdc2c6b9b57e