Skip to main content

A parser for wikipedia citation templates

Project description

wikiciteparser Python tests PyPI

This Python library wraps Wikipedia's citation processing code (written in Lua) to parse citation templates. For instance, there are many different ways to specify the authors of a citation: this codes maps all of them to the same representation.

This is the underlying code of our "OAbot" (GitHub).

Distributed under the MIT license.

Builds on these Python libraries:

  • lupa
  • mwparserfromhell
  • mwxml
  • mwtypes

To install lupa from pypi.org, you will need libpython and liblua (libpython-dev and liblua5.2-dev on Debian-base Linux systems).

Install this package with pip (eg, in a virtual environment):

pip install wikiciteparser

Examples

Let's parse the reference section of this article:

import mwparserfromhell
from wikiciteparser.parser import parse_citation_template

mwtext = """
===Articles===
 * {{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | last2=Moser | first2=L. | title=Inverse and Complementary Sequences of Natural Numbers| doi=10.2307/2308078 | mr=0062777  | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=61 | issue=7 | pages=454–458 | year=1954 | jstor=2308078 | publisher=The American Mathematical Monthly, Vol. 61, No. 7}}
 * {{Citation | last1=Lambek | first1=J. | author1-link=Joachim Lambek | title=The Mathematics of Sentence Structure | year=1958 | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=65 | pages=154–170 | doi=10.2307/2310058 | issue=3 | publisher=The American Mathematical Monthly, Vol. 65, No. 3 | jstor=1480361}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Bicommutators of nice injectives | doi=10.1016/0021-8693(72)90034-8 | mr=0301052  | year=1972 | journal=Journal of Algebra | issn=0021-8693 | volume=21 | pages=60–73}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Localization and completion | doi=10.1016/0022-4049(72)90011-4 | mr=0320047  | year=1972 | journal=Journal of Pure and Applied Algebra | issn=0022-4049 | volume=2 | pages=343–370 | issue=4}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=A mathematician looks at Latin conjugation | mr=589163  | year=1979 | journal=Theoretical Linguistics | issn=0301-4428 | volume=6 | issue=2 | pages=221–234 | doi=10.1515/thli.1979.6.1-3.221}}
"""

wikicode = mwparserfromhell.parse(mwtext)
for tpl in wikicode.filter_templates():
   parsed = parse_citation_template(tpl)
   print(parsed)

Here is what you get:

{'PublisherName': 'The American Mathematical Monthly, Vol. 61, No. 7', 'Title': 'Inverse and Complementary Sequences of Natural Numbers', 'ID_list': {'DOI': '10.2307/2308078', 'ISSN': '0002-9890', 'MR': '0062777', 'JSTOR': '2308078'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}, {'last': 'Moser', 'first': 'L.'}], 'Date': '1954', 'Pages': '454-458'}
{'PublisherName': 'The American Mathematical Monthly, Vol. 65, No. 3', 'Title': 'The Mathematics of Sentence Structure', 'ID_list': {'DOI': '10.2307/2310058', 'ISSN': '0002-9890', 'JSTOR': '1480361'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'J.'}], 'Date': '1958', 'Pages': '154-170'}
{'Title': 'Bicommutators of nice injectives', 'ID_list': {'DOI': '10.1016/0021-8693(72)90034-8', 'ISSN': '0021-8693', 'MR': '0301052'}, 'Periodical': 'Journal of Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '60-73'}
{'Title': 'Localization and completion', 'ID_list': {'DOI': '10.1016/0022-4049(72)90011-4', 'ISSN': '0022-4049', 'MR': '0320047'}, 'Periodical': 'Journal of Pure and Applied Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '343-370'}
{'Title': 'A mathematician looks at Latin conjugation', 'ID_list': {'DOI': '10.1515/thli.1979.6.1-3.221', 'ISSN': '0301-4428', 'MR': '589163'}, 'Periodical': 'Theoretical Linguistics', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1979', 'Pages': '221-234'}

Bulk Processing

There is also a bulk processing mode that can work with bulk XML files from dumps.wikimedia.org.

With wikiciteparser installed (or in the current python path), run a UNIX shell command like:

python3 -m wikiciteparser.bulk enwiki-20210801-pages-articles1.xml-p1p41242.bz2 | pv -l | gzip > enwiki-20210801-pages-articles1.xml-p1p41242.citations.json.gz

GNU/parallel can be used to run process many files concurrently on a single computer.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikiciteparser-0.3.0.tar.gz (81.3 kB view hashes)

Uploaded source

Built Distribution

wikiciteparser-0.3.0-py2.py3-none-any.whl (81.1 kB view hashes)

Uploaded py2 py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page