A collection of scripts and utilities for extracting citations to academic literature from Wikipedia's XML database dumps.
Project description
This project contains a utility for extracting academic citation identifiers.
NOTE: mwcites requires Python 3 because one of its dependencies (Mediawiki-Utilities) does.
$ pip install mwcites
Usage
This package contains only one utility, mwcitations.
$ mwcitations extract enwiki-20150112-pages-meta-history*.xml*.bz2 > citations.tsv
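Once citations.tsv exists, it can be loaded with Python's standard csv module. The following is a minimal sketch, not part of mwcites: the field order is taken from the help text, while the presence of a header row and the read_citations helper name are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Field names in the order listed in the help text.
FIELDS = ["page_id", "page_title", "rev_id", "timestamp", "type", "id"]


def read_citations(lines, has_header=True):
    """Yield one dict per citation row from tab-separated lines.

    has_header=True assumes the file starts with a header row; pass
    False if the output turns out to have none.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    for row in reader:
        record = dict(zip(FIELDS, row))
        # The help text documents these two fields as integers.
        record["page_id"] = int(record["page_id"])
        record["rev_id"] = int(record["rev_id"])
        yield record


# In-memory sample built from the example values in the help text.
sample = io.StringIO(
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tdoi\t"
    "10.1183/09031936.00213411\n"
)
records = list(read_citations(sample))

# Count citations per identifier type.
counts = Counter(record["type"] for record in records)
print(counts)
```

For a real run, replace the in-memory sample with a file opened as open("citations.tsv", encoding="utf-8", newline="").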
Documentation
Documentation is provided by the utility's help output ($ mwcitations extract -h):
Extracts academic citations from articles from the history of Wikipedia
articles by processing a pages-meta-history XML dump and matching regular
expressions to revision content.
Currently supported identifiers include:
* PubMed
* DOI
* ISBN
* arXiv
Outputs a TSV file with the following fields:
* page_id: The identifier of the Wikipedia article (int), e.g. 1325125
* page_title: The title of the Wikipedia article (utf-8), e.g. Club cell
* rev_id: The Wikipedia revision where the citation was first added (int),
e.g. 282470030
* timestamp: The timestamp of the revision where the citation was first
added. (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
* type: The type of identifier, e.g. pmid, pmcid, doi, isbn or arxiv
* id: The id of the cited scholarly article (utf-8),
e.g 10.1183/09031936.00213411
Usage:
mwcites extract -h | --help
mwcites extract <dump_file>...
Options:
-h --help Shows this documentation
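The help text describes matching regular expressions against revision content and reporting the revision where each citation was first added. The sketch below illustrates that idea on plain Python data. It is not the mwcites implementation: the patterns are deliberately simplified, the function names are hypothetical, and the revision IDs and the PubMed ID in the sample are made up for illustration.

```python
import re

# Simplified, illustrative patterns; NOT the expressions mwcites uses.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s|\]}<>]+"),
    "pmid": re.compile(r"\bpmid\s*=\s*(\d+)", re.IGNORECASE),
    "arxiv": re.compile(r"\barxiv\s*=\s*(\d{4}\.\d{4,5})", re.IGNORECASE),
}


def extract_ids(text):
    """Return the set of (type, id) pairs found in one revision's text."""
    found = set()
    for id_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the capture group when the pattern defines one.
            value = match.group(1) if pattern.groups else match.group(0)
            found.add((id_type, value))
    return found


def first_appearances(revisions):
    """Map each (type, id) to the first (rev_id, timestamp) containing it.

    revisions must be (rev_id, timestamp, text) tuples in chronological order.
    """
    seen = {}
    for rev_id, timestamp, text in revisions:
        for key in extract_ids(text):
            if key not in seen:
                seen[key] = (rev_id, timestamp)
    return seen


# Hypothetical revision history of a single page.
revisions = [
    (1, "2009-04-07T00:00:00Z", "Club cells are found in the bronchioles."),
    (2, "2009-04-08T01:52:20Z",
     "{{cite journal |doi=10.1183/09031936.00213411 |pmid=12345}}"),
    (3, "2009-04-09T00:00:00Z",
     "{{cite journal |doi=10.1183/09031936.00213411 |pmid=12345}} More text."),
]
result = first_appearances(revisions)
for (id_type, value), (rev_id, timestamp) in sorted(result.items()):
    print(id_type, value, rev_id, timestamp)
```

Keeping a dictionary of identifiers already seen means a citation that persists across later revisions is reported only once, at the revision where it first appeared.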
Download files
Source Distributions
- mwcites-0.2.0.zip (17.7 kB)
- mwcites-0.2.0.tar.gz (10.5 kB)
File details
Details for the file mwcites-0.2.0.zip.
File metadata
- Download URL: mwcites-0.2.0.zip
- Upload date:
- Size: 17.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7670124a4ab55b856022949f0046a05d38a12d14a6f156e1a654747987485e2e |
| MD5 | 93e15cb66654777667dd497edb91ca4f |
| BLAKE2b-256 | ecd5e9df07872b866e44a7ab90e0c6fd472e531a5107393e6a1233a58c6017a1 |
File details
Details for the file mwcites-0.2.0.tar.gz.
File metadata
- Download URL: mwcites-0.2.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8229377609e2d9ebcd3d8dc3ba8f8283a05029f91062a16c6b715dbd6bd7a536 |
| MD5 | 035947b31aaaf640e12828e90a18b964 |
| BLAKE2b-256 | 28c89bb13c1198d47aa59209b5f3551ad3ddd02eaa7de4c9d44fe94e4f634e11 |