wikitextparser

A simple parsing tool for MediaWiki's wikitext markup.

WikiTextParser

A simple-to-use WikiText parsing library for MediaWiki.

The purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitext.

Installation

Python 3.4+ is required
pip install 'setuptools>=36.2.1'
pip install wikitextparser

Usage

>>> import wikitextparser as wtp

WikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.

You may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).

Templates

>>> parsed = wtp.parse("{{text|value1{{text|value2}}}}")
>>> parsed.templates
[Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]
>>> parsed.templates[0].arguments
[Argument("|value1{{text|value2}}")]
>>> parsed.templates[0].arguments[0].value = 'value3'
>>> print(parsed)
{{text|value3}}

The pformat method returns a pretty-print formatted string for templates:

>>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t1, t2 = parsed.templates
>>> print(t2.pformat())
{{t2
    | e = e
    | f = f
}}
>>> print(t1.pformat())
{{t1
    | b = b
    | c = c
    | d = {{t2
        | e = e
        | f = f
    }}
}}

The Template.rm_dup_args_safe and Template.rm_first_of_dup_args methods can be used to clean up pages that use duplicate arguments in template calls:

>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')
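
The difference between the two methods can be illustrated on plain (name, value) pairs. The sketch below is inferred only from the doctest output above and is not the library's implementation; the helper names drop_exact_repeats and keep_last_per_name are hypothetical.

```python
def drop_exact_repeats(args):
    """Drop an earlier argument only when a later one repeats both its name and value."""
    seen = set()
    kept = []
    for pair in reversed(args):
        if pair not in seen:
            seen.add(pair)
            kept.append(pair)
    kept.reverse()
    return kept


def keep_last_per_name(args):
    """Keep only the last argument for each name, whatever the earlier values were."""
    seen = set()
    kept = []
    for name, value in reversed(args):
        if name not in seen:
            seen.add(name)
            kept.append((name, value))
    kept.reverse()
    return kept


args = [('a', 'a'), ('a', 'b'), ('a', 'a')]  # mirrors {{t|a=a|a=b|a=a}}
safe_result = drop_exact_repeats(args)
last_result = keep_last_per_name(args)
print(safe_result)
print(last_result)
```

The first helper never discards a value that is not repeated later, which is why it corresponds to the "safe" variant; the second one may discard differing values.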

Template parameters:

>>> param = wtp.parse('{{{a|b}}}').parameters[0]
>>> param.name
'a'
>>> param.default
'b'
>>> param.default = 'c'
>>> param
Parameter('{{{a|c}}}')
>>> param.append_default('d')
>>> param
Parameter('{{{a|{{{d|c}}}}}}')

WikiLinks

>>> parsed = wtp.parse('text [[A|B]] text')
>>> wl = parsed.wikilinks[0]
>>> wl
WikiLink('[[A|B]]')
>>> wl.target = 'Z'
>>> wl.text = 'X'
>>> parsed
WikiText('text [[Z|X]] text')

Sections

>>> parsed = wtp.parse("""
... == h2 ==
... t2
... === h3 ===
... t3
... === h3 ===
... t3
... == h22 ==
... t22
... {{text|value3}}
... [[Z|X]]
... """)
>>> parsed.sections
[Section('\n'),
 Section('== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('== h22 ==\nt22\n{{text|value3}}\n[[Z|X]]\n')]
>>> parsed.sections[1].title = 'newtitle'
>>> print(parsed)

==newtitle==
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
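
The nesting seen in the sections output (a level-2 section containing its level-3 subsections) can be sketched with the standard library alone. This is a simplified illustration under the assumption that headings are plain lines of the form == title ==; it is not how wikitextparser finds sections, and split_sections is a hypothetical helper.

```python
import re

# A heading line: 1-6 '=' signs, a title, then the same run of '=' signs.
HEADING = re.compile(r'^(={1,6})[^\n]+?\1[ \t]*$', re.MULTILINE)


def split_sections(text):
    """Return the lead section followed by one string per heading."""
    heads = [(m.start(), len(m.group(1))) for m in HEADING.finditer(text)]
    lead_end = heads[0][0] if heads else len(text)
    sections = [text[:lead_end]]
    for i, (start, level) in enumerate(heads):
        end = len(text)
        # A section ends at the next heading with the same or fewer '=' signs.
        for next_start, next_level in heads[i + 1:]:
            if next_level <= level:
                end = next_start
                break
        sections.append(text[start:end])
    return sections


text = '\n== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n== h22 ==\nt22\n'
sections = split_sections(text)
for section in sections:
    print(repr(section))
```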

Tables

Extracting cell values of a table:

>>> p = wtp.parse("""{|
... |  Orange    ||   Apple   ||   more
... |-
... |   Bread    ||   Pie     ||   more
... |-
... |   Butter   || Ice cream ||  and more
... |}""")
>>> p.tables[0].data()
[['Orange', 'Apple', 'more'],
 ['Bread', 'Pie', 'more'],
 ['Butter', 'Ice cream', 'and more']]

By default, values are arranged according to colspan and rowspan attributes:

>>> t = wtp.Table("""{| class="wikitable sortable"
... |-
... ! a !! b !! c
... |-
... !colspan = "2" | d || e
... |-
... |}""")
>>> t.data()
[['a', 'b', 'c'], ['d', 'd', 'e']]
>>> t.data(span=False)
[['a', 'b', 'c'], ['d', 'e']]
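
The effect of the span argument can be pictured as repeating a cell's value once per column it covers. The following is a minimal sketch of that idea for colspan only, using hand-written (value, colspan) pairs as assumed input; expand_colspans is a hypothetical helper and not part of wikitextparser.

```python
def expand_colspans(rows):
    """Repeat each cell value `colspan` times so every row has one value per column."""
    expanded = []
    for row in rows:
        values = []
        for value, colspan in row:
            values.extend([value] * colspan)
        expanded.append(values)
    return expanded


# The table from the example above, written as (value, colspan) pairs.
rows = [
    [('a', 1), ('b', 1), ('c', 1)],
    [('d', 2), ('e', 1)],
]
spanned = expand_colspans(rows)
unspanned = [[value for value, _ in row] for row in rows]
print(spanned)
print(unspanned)
```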

Calling the cells method of a Table returns table cells as Cell objects. Cell objects provide methods for getting or setting each cell’s attributes or values individually:

>>> cell = t.cells(row=1, column=1)
>>> cell.attrs
{'colspan': '2'}
>>> cell.set('colspan', '3')
>>> print(t)
{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "3" | d || e
|-
|}

HTML attributes of Table, Cell, and Tag objects are accessible via get_attr, set_attr, has_attr, and del_attr methods.

Lists

The lists method provides access to lists within the wikitext.

>>> parsed = wtp.parse(
...     'text\n'
...     '* list item a\n'
...     '* list item b\n'
...     '** sub-list of b\n'
...     '* list item c\n'
...     '** sub-list of b\n'
...     'text'
... )
>>> wikilist = parsed.lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']

The sublists method can be used to get all sub-lists of the current list or just sub-lists of specific items:

>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']

It also has an optional pattern argument that works similarly to that of lists, except that the current list pattern will be automatically added to it as a prefix:

>>> wikilist = wtp.WikiList('#a\n#b\n##ba\n#*bb\n#:bc\n#c', r'\#')
>>> wikilist.sublists()
[WikiList('##ba\n'), WikiList('#*bb\n'), WikiList('#:bc\n')]
>>> wikilist.sublists(pattern=r'\*')
[WikiList('#*bb\n')]

Convert one type of list to another using the convert method. Specifying the starting pattern of the desired lists makes them easier to find and improves performance:

>>> wl = wtp.WikiList(
...     ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2',
...     pattern=r':\*'
... )
>>> print(wl)
:*A1
:*#B1
:*#B2
:*:continuing A1
:*A2
>>> wl.convert('#')
>>> print(wl)
#A1
##B1
##B2
#:continuing A1
#A2
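
What the conversion does to each line can be sketched as a plain prefix swap: every line that starts with the old list prefix gets the new one, and the rest of the line, including any deeper markers, is left alone. This is only an illustration of the visible result above; convert_prefix is a hypothetical helper, not the library's convert method.

```python
def convert_prefix(wikitext, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix at the start of every matching line."""
    converted = []
    for line in wikitext.split('\n'):
        if line.startswith(old_prefix):
            line = new_prefix + line[len(old_prefix):]
        converted.append(line)
    return '\n'.join(converted)


source = ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2'
result = convert_prefix(source, ':*', '#')
print(result)
```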

Miscellaneous

The parent and ancestors methods can be used to access a node’s parent or ancestors, respectively:

>>> template_d = wtp.parse("{{a|{{b|{{c|{{d}}}}}}}}").templates[3]
>>> template_d.ancestors()
[Template('{{c|{{d}}}}'),
 Template('{{b|{{c|{{d}}}}}}'),
 Template('{{a|{{b|{{c|{{d}}}}}}}}')]
>>> template_d.parent()
Template('{{c|{{d}}}}')
>>> _.parent()
Template('{{b|{{c|{{d}}}}}}')
>>> _.parent()
Template('{{a|{{b|{{c|{{d}}}}}}}}')
>>> _.parent()  # Returns None

Use the optional type_ argument if looking for ancestors of a specific type:

>>> parsed = wtp.parse('{{a|{{b{{c<!---->}}}}}}'.replace('{{a|', '{{a|{{#if:', 1) + '}}')
>>> comment = parsed.comments[0]
>>> comment.ancestors(type_='ParserFunction')
[ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]
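
Conceptually, the ancestors of a node are the nodes whose spans enclose it, ordered from the closest outwards. The sketch below shows that idea with a naive brace matcher; it assumes only simple {{...}} templates (no {{{parameters}}}, comments, or nowiki), and both template_spans and ancestors are hypothetical helpers rather than the library's code.

```python
def template_spans(text):
    """Return (start, end) spans of balanced {{...}} pairs, in closing order."""
    spans, stack, i = [], [], 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            stack.append(i)
            i += 2
        elif pair == '}}' and stack:
            spans.append((stack.pop(), i + 2))
            i += 2
        else:
            i += 1
    return spans


def ancestors(span, spans):
    """Return the spans that enclose `span`, smallest (closest) first."""
    start, end = span
    enclosing = [s for s in spans if s != span and s[0] <= start and end <= s[1]]
    return sorted(enclosing, key=lambda s: s[1] - s[0])


text = '{{a|{{b|{{c|{{d}}}}}}}}'
spans = template_spans(text)
d_span = spans[0]  # the innermost template closes first
ancestor_texts = [text[s:e] for s, e in ancestors(d_span, spans)]
print(ancestor_texts)
```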

Compared with mwparserfromhell

mwparserfromhell is a mature and widely used library with nearly the same purposes as wikitextparser. The main reason leading me to create wikitextparser was that mwparserfromhell could not parse wikitext in certain situations that I needed it for. See mwparserfromhell’s issues 40, 42, 88, and other related issues. In many of those situations wikitextparser may be able to give you more acceptable results.

But if you need to

use Python 2
parse style tags like '''bold''' and ''italics'' (with some limitations, of course)
extract HTML entities

then mwparserfromhell or maybe other libraries will be the way to go. Also note that wikitextparser is still under heavy development and the API may change drastically in future versions.

Of course, wikitextparser has its own unique features, too: providing access to individual cells of each table, pretty-printing templates, and a few other advanced functions.

The tokenizer in mwparserfromhell is written in C. Tokenization in wikitextparser is mostly done using the regex library, which is also written in C. I have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. In my limited experience, wikitextparser has decent performance, should be able to compete, and may even have slight performance benefits in many situations. However, if you are working with online data, any difference is usually negligible, as the main bottleneck will be network latency.

If you have had a chance to compare these libraries in terms of performance, please share your experience by opening an issue on GitHub.
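
For anyone who wants to try such a comparison, a small timing harness along these lines may be a starting point. It is a sketch only: best_time is a hypothetical helper, stand_in_parser is a placeholder so the snippet runs without either library, and it measures execution time but not memory usage. To compare the real libraries, pass wikitextparser.parse and mwparserfromhell.parse instead of the placeholder.

```python
import timeit


def best_time(parse_func, text, number=100, repeat=3):
    """Return the best observed per-call time, in seconds, of parse_func(text)."""
    timings = timeit.repeat(lambda: parse_func(text), number=number, repeat=repeat)
    return min(timings) / number


def stand_in_parser(text):
    """Placeholder for a real parse function; it only counts '{{' openers."""
    return text.count('{{')


sample = '{{a|{{b|c}}}} [[link]] text\n' * 50
seconds = best_time(stand_in_parser, sample)
print('best per-call time: {:.9f} s'.format(seconds))
```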

Known issues and limitations

Syntax elements produced by a template transclusion cannot be detected by offline parsers.
Localized namespace names are unknown, so for example [[File:…]] links are treated as normal wikilinks. mwparserfromhell has a similar issue, see #87 and #136. As a workaround, Pywikibot can be used for determining the namespace.
Linktrails are language dependent and are not supported. They are also not supported by mwparserfromhell. However, given the trail pattern and knowing that wikilink.span[1] is the ending position of a wikilink, it should be trivial to compute a WikiLink’s linktrail.
Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'
While MediaWiki recognizes only a finite number of tags and they are extension-dependent, the tags method returns anything that looks like an HTML tag. A configuration option might be added in the future to address this issue.
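
The linktrail computation mentioned in the list above could look roughly like this. The sketch uses only the standard library: the trail pattern [a-z]* is an assumption suited to English wikis, linktrail is a hypothetical helper, and the end position is found with str.index as a stand-in for the wikilink.span[1] value the library would provide.

```python
import re

# Assumed trail pattern for an English wiki; other languages use different ones.
TRAIL = re.compile(r'[a-z]*')


def linktrail(text, link_end):
    """Return the run of trail characters that starts right after a wikilink."""
    return TRAIL.match(text, link_end).group()


text = 'Several [[apple]]s and [[pear]] trees'
first_end = text.index(']]') + 2              # stand-in for wikilink.span[1]
second_end = text.index(']]', first_end) + 2  # end of the second wikilink
first_trail = linktrail(text, first_end)
second_trail = linktrail(text, second_end)
print(repr(first_trail), repr(second_trail))
```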

Release history

0.56.4

May 14, 2025

0.56.3

Oct 18, 2024

0.56.2

Aug 3, 2024

0.56.1

Jul 19, 2024

0.56.0

Jun 28, 2024

0.55.14

Jun 28, 2024

0.55.13

Apr 19, 2024

0.55.12

Apr 12, 2024

0.55.11

Apr 9, 2024

0.55.10

Mar 19, 2024

0.55.9

Mar 11, 2024

0.55.8

Jan 15, 2024

0.55.7

Dec 21, 2023

0.55.6

Nov 25, 2023

0.55.5

Nov 8, 2023

0.55.4

Nov 7, 2023

0.55.3

Nov 7, 2023

0.55.2

Nov 5, 2023

0.55.1

Nov 5, 2023

0.55.0

Nov 5, 2023

0.54.1

Nov 3, 2023

0.54.1.dev0 pre-release

Nov 3, 2023

0.54.0

Aug 14, 2023

0.53.0

Jul 7, 2023

0.52.1

May 19, 2023

0.52.0

May 19, 2023

0.51.2

Apr 21, 2023

0.51.1

Oct 14, 2022

0.51.0

Sep 16, 2022

0.50.2

Sep 10, 2022

0.50.1

Aug 29, 2022

0.50.0

Aug 29, 2022

0.49.4

Jul 28, 2022

0.49.3

Jul 4, 2022

0.49.2

May 20, 2022

0.49.1

Apr 11, 2022

0.49.0

Apr 11, 2022

0.48.3

Apr 8, 2022

0.48.2

Mar 9, 2022

0.48.1

Mar 5, 2022

0.48.0

Dec 31, 2021

0.47.10.dev2 pre-release

Dec 31, 2021

0.47.10.dev1 pre-release

Dec 31, 2021

0.47.10.dev0 pre-release

Dec 31, 2021

0.47.9

Nov 26, 2021

0.47.8

Nov 19, 2021

0.47.7

Nov 11, 2021

0.47.6

Nov 5, 2021

0.47.5

Jun 3, 2021

0.47.4

Mar 21, 2021

0.47.3

Feb 14, 2021

0.47.2

Feb 13, 2021

0.47.1

Feb 5, 2021

0.47.0

Nov 28, 2020

0.46.0

Oct 14, 2020

0.45.3

Oct 11, 2020

0.45.2

Sep 30, 2020

0.45.1

Sep 25, 2020

0.45.0

Sep 14, 2020

0.44.1

Sep 10, 2020

0.44.0

Aug 28, 2020

0.43.2

Aug 19, 2020

0.43.1

Aug 14, 2020

0.43.0

Aug 14, 2020

0.42.3

Aug 12, 2020

0.42.2

Aug 4, 2020

0.42.1

Jul 19, 2020

0.42.0

Jul 19, 2020

0.41.0

Jul 12, 2020

0.40.0

Jul 12, 2020

0.39.0

Jul 12, 2020

0.38.2

Jul 9, 2020

0.38.1

Jul 9, 2020

0.38.0

Jul 7, 2020

0.37.12

Jul 2, 2020

0.37.11

Jul 1, 2020

0.37.10

Jul 1, 2020

0.37.9

Jul 1, 2020

0.37.8

Jul 1, 2020

0.37.7

Jun 30, 2020

0.37.6

Jun 30, 2020

0.37.5

Jun 29, 2020

0.37.4

Jun 29, 2020

0.37.3

Jun 26, 2020

0.37.2

Jun 21, 2020

0.37.1

Jun 11, 2020

0.37.0

Jun 6, 2020

0.37.0.dev1 pre-release

Jun 5, 2020

0.36.1

May 18, 2020

0.35.2

May 18, 2020

0.35.1

May 18, 2020

0.35.0

May 2, 2020

0.34.0

Mar 9, 2020

0.33.0

Mar 9, 2020

0.32.0

Feb 26, 2020

0.31.0

Feb 25, 2020

0.30.0

Feb 18, 2020

0.29.2

Feb 16, 2020

0.29.1

Jan 31, 2020

0.29.0

Jan 31, 2020

0.28.1

Nov 7, 2019

0.28.0

Aug 7, 2019

0.27.0

Aug 5, 2019

0.26.1

Jun 8, 2019

0.26.0

May 6, 2019

0.25.1

May 5, 2019

0.25.1.dev0 pre-release

May 5, 2019

0.25.0

May 5, 2019

0.24.4

May 3, 2019

0.24.3

Apr 14, 2019

0.24.2

Apr 14, 2019

0.24.1

Apr 2, 2019

0.24.0

Mar 25, 2019

0.23.0

Mar 20, 2019

0.22.1

Feb 1, 2019

0.22.1.dev0 pre-release

Aug 31, 2018

This version

0.22.0

Aug 31, 2018

0.21.5

May 29, 2018

0.21.4

Apr 2, 2018

0.21.3

Mar 30, 2018

0.21.2

Mar 8, 2018

0.21.2.dev0 pre-release

Mar 7, 2018

0.21.0

Mar 6, 2018

0.20.0

Feb 10, 2018

0.19.0

Feb 3, 2018

0.18.0

Jan 30, 2018

0.18.0.dev0 pre-release

Jan 30, 2018

0.17.4

Jan 26, 2018

0.17.3

Dec 31, 2017

0.17.3.dev0 pre-release

Dec 31, 2017

0.17.1

Jul 19, 2017

0.17.0

Jul 19, 2017

0.16.1

Jul 11, 2017

0.16.0

Jul 8, 2017

0.15.2

Jul 8, 2017

0.15.1

Jun 4, 2017

0.15.0

May 20, 2017

0.14.3

Feb 18, 2017

0.14.3.dev1 pre-release

Feb 13, 2017

0.14.1

Feb 9, 2017

0.14.0

Feb 7, 2017

0.13.6

Jan 28, 2017

0.13.5

Jan 9, 2017

0.13.4

Jan 4, 2017

0.13.2

Dec 27, 2016

0.13.1

Dec 27, 2016

0.13.0

Dec 26, 2016

0.13.0.dev1 pre-release

Dec 26, 2016

0.12.0

Dec 15, 2016

0.11.1.dev4 pre-release

Dec 10, 2016

0.11.1.dev3 pre-release

Dec 10, 2016

0.11.1.dev2 pre-release

Dec 10, 2016

0.11.1.dev1 pre-release

Dec 8, 2016

0.11.0

Dec 5, 2016

0.10.2

Nov 13, 2016

0.10.1

Nov 10, 2016

0.10.0

Nov 10, 2016

0.10.0.dev2 pre-release

Nov 10, 2016

0.10.0.dev1 pre-release

Nov 9, 2016

0.9.1

Nov 2, 2016

0.9.0

Oct 26, 2016

0.9.0.dev1 pre-release

Oct 26, 2016

0.8.8.dev1 pre-release

Oct 26, 2016

0.8.6.dev1 pre-release

Oct 24, 2016

0.8.5.dev1 pre-release

Oct 24, 2016

0.8.3

Sep 27, 2016

0.8.3dev pre-release

Sep 27, 2016

0.8.2

Sep 24, 2016

0.8.1

Sep 17, 2016

0.8.0

Aug 11, 2016

0.7.9

Jul 28, 2016

0.7.8

Jul 22, 2016

0.7.7

Jul 20, 2016

0.7.6

May 30, 2016

0.7.5

Apr 11, 2016

0.7.4

Mar 3, 2016

0.7.3

Feb 24, 2016

0.7.2

Feb 24, 2016

0.7.1

Feb 21, 2016

0.7.0

Feb 15, 2016

0.6.9

Nov 18, 2015

0.6.8

Nov 9, 2015

0.6.7

Nov 8, 2015

0.6.6

Nov 8, 2015

0.6.5

Nov 6, 2015

0.6.4

Nov 6, 2015

0.6.3

Oct 30, 2015

0.6.2

Oct 30, 2015

0.6.1

Oct 21, 2015

0.6.0

Oct 20, 2015

0.5.9

Oct 20, 2015

0.5.8

Oct 20, 2015

0.5.7

Oct 15, 2015

0.5.6.dev0 pre-release

Oct 15, 2015

0.5.5

Sep 30, 2015

0.5.3

Jun 24, 2015

0.5.2

Jun 24, 2015

0.5.1

Jun 24, 2015

0.4.8

Jun 20, 2015

0.4.7

Jun 19, 2015

0.4.6

Jun 16, 2015

0.4.5

May 20, 2015

0.4.4

May 18, 2015

0.4.3

May 17, 2015

0.4.2

May 2, 2015

0.4.1

May 1, 2015

0.4

Apr 26, 2015

0.3.1

Apr 21, 2015

0.2

Apr 16, 2015

0.1.3

Apr 10, 2015

Download files

Source Distribution

wikitextparser-0.22.0.tar.gz (79.5 kB)

Uploaded Aug 31, 2018 Source

Built Distribution

wikitextparser-0.22.0-py3-none-any.whl (75.1 kB)

Uploaded Aug 31, 2018 Python 3

File details

Details for the file wikitextparser-0.22.0.tar.gz.

File metadata

Download URL: wikitextparser-0.22.0.tar.gz
Upload date: Aug 31, 2018
Size: 79.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.7.0

File hashes

Hashes for wikitextparser-0.22.0.tar.gz
Algorithm	Hash digest
SHA256	`4c529e109f1878497b26b766a5ac72a9e5ed9ccd5179e6216ae5076ecd7115a8`
MD5	`491d1f22bee4f32f3067561a1b0f13e5`
BLAKE2b-256	`d849518c2ce08564947e407d73698b6f515cab0db59f85747c4625aca4e16660`

File details

Details for the file wikitextparser-0.22.0-py3-none-any.whl.

File metadata

Download URL: wikitextparser-0.22.0-py3-none-any.whl
Upload date: Aug 31, 2018
Size: 75.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.7.0

File hashes

Hashes for wikitextparser-0.22.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fc8a1645ac0a264969341bca474b525ff77e67b044bcfa8c5d9a4a96bb53afa0`
MD5	`f051f41b96c423c21424ed878f781d98`
BLAKE2b-256	`d1467b8327588463080f225f6a070e356f97afd457577d4637321f532a2fd457`
