Wiki Dump Extractor
An efficient extractor for Wikipedia XML dump files that converts each page to structured JSON. Designed for memory-efficient processing of large Wikipedia dumps, it preserves article structure in its output and uses mwparserfromhell to parse wiki source.
Features
- Parallel Processing: Multi-core support for faster processing of large datasets
- Resumable Processing: Saves progress and can resume from interrupted extractions
- Structured Output: Preserves Wikipedia's hierarchical structure (sections, tables of contents, infoboxes)
- Memory Efficient: Uses streaming XML parsing to handle 10GB+ dump files without memory growth
- Reference Handling: Deduplicates and renumbers citation references
- Redirection Support: Properly handles Wikipedia redirect pages
Installation
pip install wiki-dump-extractor-json
pip install ujson # optional (for better performance)
Dependencies
- Python 3.10 or higher
- tqdm (for progress bars)
- mwparserfromhell (for parsing source)
- python-dateutil (for date parsing)
Usage
Command Line Interface
The package provides a command-line tool wiki-dump-extractor-json:
# Parse a Wikipedia XML dump file
wiki-dump-extractor-json -o output_dir pages.xml
# Run a benchmark test
wiki-dump-extractor-json --benchmark pages.xml
# Look up a specific article from extracted data
wiki-dump-extractor-json --lookup "title" output_dir
# "python -m" is also available
python -m wiki_dump_extractor_json ...
Python API
Parse Wikipedia XML Dumps
from wiki_dump_extractor_json import parse_xml_dump
# Stream parse XML dump file
with open('pages.xml', 'rb') as f:
    for page in parse_xml_dump(f):
        print(f"Page: {page.title}")
        print(f"Timestamp: {page.timestamp}")
        # Process page.source as needed
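Because parsing is streamed, iteration can stop as soon as a wanted page is found. A minimal sketch (the title 'Albert Einstein' is just an illustration):
from wiki_dump_extractor_json import parse_xml_dump

# Sketch: scan the stream for one title and stop early;
# only page.title and page.source from the example above are assumed.
wanted = 'Albert Einstein'
with open('pages.xml', 'rb') as f:
    source = next((page.source for page in parse_xml_dump(f) if page.title == wanted), None)
if source is not None:
    print(source[:200])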
Parse Individual Wikipedia Articles
from wiki_dump_extractor_json import parse_source
# Parse a single Wikipedia article source
source = """== Section Title ==
Some content with a reference<ref>Reference text</ref>.
Another paragraph."""
parsed = parse_source(source)
print(f"Leading text: {parsed['leading']}")
print(f"References: {parsed['references']}")
print(f"Subsections: {len(parsed['subSections'])}")
Lookup Extracted Articles
from wiki_dump_extractor_json import lookup_from_extracted
title = input('Input title: ')
# Look up a specific article from extracted data
article_data = lookup_from_extracted('output_dir', title)
print(f"Article title: {article_data['title']}")
print(f"Article sections: {len(article_data['subSections'])}")
Output Format
The parser extracts Wikipedia articles into the following JSON structure:
{
  "title": "wiki-dump-extractor-json",
  "timestamp": 1532158619.0,
  "leading": "wiki-dump-extractor-json is an efficient parser for Wikimedia XML dumps.\n\n",
  "infobox": null,
  "toc": [
    {
      "title": "Sub section",
      "sub": []
    }
  ],
  "subSections": [
    {
      "title": "Sub section",
      "leading": "== Sub section ==\n\nSub section part.",
      "subSections": []
    }
  ],
  "references": [],
  "redirectedTo": null
}
Fields
- title: The article title
- timestamp: Last edit timestamp as a Unix timestamp
- leading: The lead section content (before the first section)
- infobox: The infobox template content, if present
- toc: Table of contents hierarchy
- subSections: Array of article sections with hierarchical structure
- references: List of unique reference texts
- redirectedTo: Target title if this is a redirect page
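Since subSections nests recursively (each section carries its own subSections list), walking an article's outline is a short recursion. A sketch over the structure shown above:
def walk_sections(sections, depth=0):
    # Recursively print section titles, indented by nesting depth.
    for section in sections:
        print('  ' * depth + section['title'])
        walk_sections(section['subSections'], depth + 1)

# 'article' is any parsed article object with the structure above.
walk_sections(article['subSections'])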
Output Path Structure
Extracted data is organized as:
output_directory/
├── index.json # Progress tracking and article index
├── 00/
│ ├── 0.jsonl # JSON Lines files with articles
│ ├── 1.jsonl
│ └── ...
├── 01/
│ ├── 0.jsonl
│ └── ...
└── ...
Each .jsonl file contains one JSON object per line, each representing one Wikipedia article.
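Because the layout is plain JSON Lines, extracted articles can also be read back without the package itself. A minimal sketch that assumes only the directory layout shown above:
import json
from pathlib import Path

# Sketch: yield every extracted article by scanning the .jsonl files.
def iter_articles(output_dir):
    for path in sorted(Path(output_dir).glob('*/*.jsonl')):
        with open(path, encoding='utf-8') as f:
            for line in f:
                yield json.loads(line)

for article in iter_articles('output_dir'):
    print(article['title'])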
The format of index.json (for details, see wiki_dump_extractor_json/extractor.py):
{
  "pages": {
    "Template:Wikipedialang": [0, 0, 0],
    "wiki-dump-extractor-json": [0, 0, 1]
  },
  "_progress": [0, 0, 536, 2]
}
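Since the pages mapping is keyed by article title, index.json doubles as a quick existence check. A sketch that relies only on the structure shown above and leaves the integer triples opaque (their meaning is defined in extractor.py):
import json

# Sketch: check whether a title was extracted, using only the
# documented "pages" keys; the integer triples are left opaque here.
with open('output_dir/index.json', encoding='utf-8') as f:
    index = json.load(f)

print('wiki-dump-extractor-json' in index['pages'])  # True if extracted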