# mediawiki-dump
[![Build Status](https://travis-ci.org/macbre/mediawiki-dump.svg?branch=master)](https://travis-ci.org/macbre/mediawiki-dump)

```
pip install mediawiki_dump
```

[Python3 package](https://pypi.org/project/mediawiki_dump/) for working with [MediaWiki XML content dumps](https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)).

Both Wikipedia (bz2-compressed) and Wikia (7zip-compressed) content dumps are supported.

## Dependencies

To read 7zip archives (used by Wikia's XML dumps), you need to install [`libarchive`](http://libarchive.org/). On Debian or Ubuntu:

```
sudo apt install libarchive-dev
```

## API

### Tokenizer

Allows you to clean up the wikitext:

```python
>>> from mediawiki_dump.tokenizer import clean
>>> clean('[[Foo|bar]] is a link')
'bar is a link'
```

And then tokenize the text:

```python
>>> from mediawiki_dump.tokenizer import tokenize
>>> tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
```

### Dump reader

Fetch and parse dumps (using a local file cache):

```python
>>> from mediawiki_dump.dumps import WikipediaDump
>>> from mediawiki_dump.reader import DumpReader

>>> dump = WikipediaDump('fo')
>>> pages = DumpReader().read(dump)

>>> [title for _, _, title, *rest in pages][:10]
['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']
```

The `read` method yields the following per-revision information: `namespace`, `page_id`, `title`, `content`, `revision_id`, `timestamp` and `contributor` (`None` for anonymous edits).
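
For example, here is a minimal sketch (reusing the `fo` dump from above) that consumes those fields directly:

```python
from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')

# unpack each revision tuple in the field order listed above
for namespace, page_id, title, content, revision_id, timestamp, contributor in DumpReader().read(dump):
    if contributor is None:  # anonymous edit
        print(f'{timestamp} {title} (anonymous)')
```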

By using the `DumpReaderArticles` class, you can read article pages only:

```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikipediaDump('fo')
pages = DumpReaderArticles().read(dump)

print([title for _, _, title, *rest in pages][:25])
```

Will give you:

```
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from <https://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2>...
INFO:WikipediaDump:HTTP 200 (14105 kB fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
```
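
Since each entry carries the raw wikitext in its `content` field, the reader combines naturally with the tokenizer shown earlier. A minimal sketch building a token stream from article pages (applying `clean` explicitly before `tokenize`):

```python
from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles
from mediawiki_dump.tokenizer import clean, tokenize

dump = WikipediaDump('fo')
pages = DumpReaderArticles().read(dump)

# content is the fourth field of each per-revision tuple
corpus = (tokenize(clean(content)) for _, _, _, content, *rest in pages)
```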

## Reading Wikia's dumps

```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('plnordycka')
pages = DumpReaderArticles().read(dump)

print([title for _, _, title, *rest in pages][:25])
```

Will give you:

```
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from <https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z>...
INFO:WikiaDump:HTTP 200 (129 kB fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']
```

## Fetching full history

Pass `full_history=True` to the `BaseDump` constructor to fetch the XML content dump with full history:

```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
pages = DumpReaderArticles().read(dump)

print('\n'.join(
    f'{timestamp} {title} ({author})'
    for _, _, title, _, _, timestamp, author in pages
))
```

Will give you:

```
INFO:DumpReaderArticles:Parsing completed, entries found: 384
2016-10-12 19:51:06+00:00 Macbre Wiki (Default)
2016-10-12 19:51:05+00:00 Macbre Wiki (Wikia)
2016-11-04 10:33:20+00:00 Macbre Wiki (Macbre)
2016-11-04 10:37:17+00:00 Macbre Wiki (FandomBot)
2017-01-25 14:47:37+00:00 Macbre Wiki (FandomBot)
2017-04-10 11:20:25+00:00 Macbre Wiki (Ryba777)
2017-04-10 11:21:20+00:00 Macbre Wiki (Ryba777)
2018-03-07 12:51:12+00:00 Macbre Wiki (Macbre)
2016-10-12 19:51:05+00:00 Main Page (Wikia)
2016-11-08 10:15:33+00:00 FooBar (None)
2016-11-08 10:15:49+00:00 FooBar (None)
...
2018-06-05 11:45:44+00:00 YouTube tag (FANDOMbot)
2018-06-06 08:51:24+00:00 Maps (Macbre)
2018-06-07 08:17:13+00:00 Maps (Macbre)
2018-06-07 08:17:36+00:00 Maps (Macbre)
2018-07-24 14:52:20+00:00 Scary transclusion (Macbre)
2018-09-11 14:04:15+00:00 Lua (Macbre)
2018-09-11 14:14:24+00:00 Lua (Macbre)
2018-09-11 14:14:37+00:00 Lua (Macbre)
```
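
With full history available, each page may yield many revisions. As a quick sketch, revisions per page can be counted from the yielded tuples:

```python
from collections import Counter

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)
pages = DumpReaderArticles().read(dump)

# title is the third field of each per-revision tuple
revisions_per_page = Counter(title for _, _, title, *rest in pages)
print(revisions_per_page.most_common(5))
```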
