Skip to main content

A parser for wikipedia citation templates

Project description

wikiciteparser Python tests PyPI

This Python library wraps Wikipedia's citation processing code (written in Lua) to parse citation templates. For instance, there are many different ways to specify the authors of a citation: this codes maps all of them to the same representation.

This is the underlying code of our "OAbot" (GitHub).

Distributed under the MIT license.

Builds on these Python libraries:

  • lupa
  • mwparserfromhell
  • mwxml
  • mwtypes

To install lupa from pypi.org, you will need libpython and liblua (libpython-dev and liblua5.2-dev on Debian-base Linux systems).

Install this package with pip (eg, in a virtual environment):

pip install wikiciteparser

Examples

Let's parse the reference section of this article:

import mwparserfromhell
from wikiciteparser.parser import parse_citation_template

mwtext = """
===Articles===
 * {{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | last2=Moser | first2=L. | title=Inverse and Complementary Sequences of Natural Numbers| doi=10.2307/2308078 | mr=0062777  | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=61 | issue=7 | pages=454–458 | year=1954 | jstor=2308078 | publisher=The American Mathematical Monthly, Vol. 61, No. 7}}
 * {{Citation | last1=Lambek | first1=J. | author1-link=Joachim Lambek | title=The Mathematics of Sentence Structure | year=1958 | journal=[[American Mathematical Monthly|The American Mathematical Monthly]] | issn=0002-9890 | volume=65 | pages=154–170 | doi=10.2307/2310058 | issue=3 | publisher=The American Mathematical Monthly, Vol. 65, No. 3 | jstor=1480361}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Bicommutators of nice injectives | doi=10.1016/0021-8693(72)90034-8 | mr=0301052  | year=1972 | journal=Journal of Algebra | issn=0021-8693 | volume=21 | pages=60–73}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=Localization and completion | doi=10.1016/0022-4049(72)90011-4 | mr=0320047  | year=1972 | journal=Journal of Pure and Applied Algebra | issn=0022-4049 | volume=2 | pages=343–370 | issue=4}}
 *{{Citation | last1=Lambek | first1=Joachim | author1-link=Joachim Lambek | title=A mathematician looks at Latin conjugation | mr=589163  | year=1979 | journal=Theoretical Linguistics | issn=0301-4428 | volume=6 | issue=2 | pages=221–234 | doi=10.1515/thli.1979.6.1-3.221}}
"""

wikicode = mwparserfromhell.parse(mwtext)
for tpl in wikicode.filter_templates():
   parsed = parse_citation_template(tpl)
   print(parsed)

Here is what you get:

{'PublisherName': 'The American Mathematical Monthly, Vol. 61, No. 7', 'Title': 'Inverse and Complementary Sequences of Natural Numbers', 'ID_list': {'DOI': '10.2307/2308078', 'ISSN': '0002-9890', 'MR': '0062777', 'JSTOR': '2308078'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}, {'last': 'Moser', 'first': 'L.'}], 'Date': '1954', 'Pages': '454-458'}
{'PublisherName': 'The American Mathematical Monthly, Vol. 65, No. 3', 'Title': 'The Mathematics of Sentence Structure', 'ID_list': {'DOI': '10.2307/2310058', 'ISSN': '0002-9890', 'JSTOR': '1480361'}, 'Periodical': 'The American Mathematical Monthly', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'J.'}], 'Date': '1958', 'Pages': '154-170'}
{'Title': 'Bicommutators of nice injectives', 'ID_list': {'DOI': '10.1016/0021-8693(72)90034-8', 'ISSN': '0021-8693', 'MR': '0301052'}, 'Periodical': 'Journal of Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '60-73'}
{'Title': 'Localization and completion', 'ID_list': {'DOI': '10.1016/0022-4049(72)90011-4', 'ISSN': '0022-4049', 'MR': '0320047'}, 'Periodical': 'Journal of Pure and Applied Algebra', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1972', 'Pages': '343-370'}
{'Title': 'A mathematician looks at Latin conjugation', 'ID_list': {'DOI': '10.1515/thli.1979.6.1-3.221', 'ISSN': '0301-4428', 'MR': '589163'}, 'Periodical': 'Theoretical Linguistics', 'Authors': [{'link': 'Joachim Lambek', 'last': 'Lambek', 'first': 'Joachim'}], 'Date': '1979', 'Pages': '221-234'}

Bulk Processing

There is also a bulk processing mode that can work with bulk XML files from dumps.wikimedia.org.

With wikiciteparser installed (or in the current python path), run a UNIX shell command like:

python3 -m wikiciteparser.bulk enwiki-20210801-pages-articles1.xml-p1p41242.bz2 | pv -l | gzip > enwiki-20210801-pages-articles1.xml-p1p41242.citations.json.gz

GNU/parallel can be used to run process many files concurrently on a single computer.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikiciteparser-0.3.0.tar.gz (81.3 kB view details)

Uploaded Source

Built Distribution

wikiciteparser-0.3.0-py2.py3-none-any.whl (81.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file wikiciteparser-0.3.0.tar.gz.

File metadata

  • Download URL: wikiciteparser-0.3.0.tar.gz
  • Upload date:
  • Size: 81.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for wikiciteparser-0.3.0.tar.gz
Algorithm Hash digest
SHA256 a240263ce716db60783d26b2f017772647b55f433239173640de4ae9c8b66f0f
MD5 35d5dc1cb7e3fe2b52b5e9d927988759
BLAKE2b-256 19c255d21c6cc65d5251dd7449e0b5c107c888b3681e7c957c0028653293ceaa

See more details on using hashes here.

File details

Details for the file wikiciteparser-0.3.0-py2.py3-none-any.whl.

File metadata

  • Download URL: wikiciteparser-0.3.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 81.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for wikiciteparser-0.3.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 da0b288b3d636e512039c3cbf93e3131df8163d68bf85c89a6accb4b19aa105c
MD5 5241cfbf76bad09ef4441297339d8500
BLAKE2b-256 1f0f97579aae13fd56fb7cf92fc01b5f7f54c84b8f4913f268056ec373686cfc

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page