An efficient Wikipedia XML dump extractor which converts each page to JSON.

Wiki Dump Extractor

An efficient extractor for Wikipedia XML dump files that converts each page into structured JSON. It is designed for memory-efficient processing of large Wikipedia dumps, preserves each article's hierarchical structure in its output, and uses mwparserfromhell to parse the wikitext source.

Features

  • Parallel Processing: Multi-core support for faster processing of large datasets
  • Resumable Processing: Saves progress and can resume from interrupted extractions
  • Structured Output: Preserves Wikipedia's hierarchical structure (sections, tables of contents, infoboxes)
  • Memory Efficient: Uses streaming XML parsing to handle 10GB+ dump files without memory growth
  • Reference Handling: Deduplicates and renumbers citation references
  • Redirection Support: Properly handles Wikipedia redirect pages

Installation

pip install wiki-dump-extractor-json

Dependencies

  • Python 3.10 or higher
  • tqdm (for progress bars)
  • mwparserfromhell (for parsing the wikitext source)
  • python-dateutil (for date parsing)

Usage

Command Line Interface

The package provides a command-line tool wiki-dump-extractor-json:

# Parse a Wikipedia XML dump file
wiki-dump-extractor-json -o output_dir pages.xml

# Run a benchmark test
wiki-dump-extractor-json --benchmark pages.xml

# Look up a specific article from extracted data
wiki-dump-extractor-json --lookup "title" output_dir

# "python -m" is also available
python -m wiki_dump_extractor_json ...

Python API

Parse Wikipedia XML Dumps

from wiki_dump_extractor_json import parse_xml_dump

# Stream parse XML dump file
with open('pages.xml', 'rb') as f:
    for page in parse_xml_dump(f):
        print(f"Page: {page.title}")
        print(f"Timestamp: {page.timestamp}")
        # Process page.source as needed

Parse Individual Wikipedia Articles

from wiki_dump_extractor_json import parse_source

# Parse a single Wikipedia article source
source = """== Section Title ==
Some content with a reference<ref>Reference text</ref>.
Another paragraph."""

parsed = parse_source(source)
print(f"Leading text: {parsed['leading']}")
print(f"References: {parsed['references']}")
print(f"Subsections: {len(parsed['subSections'])}")

Lookup Extracted Articles

from wiki_dump_extractor_json import lookup_from_extracted

title = input('Input title: ')
# Look up a specific article from extracted data
article_data = lookup_from_extracted('output_dir', title)
print(f"Article title: {article_data['title']}")
print(f"Article sections: {len(article_data['subSections'])}")

Output Format

The parser extracts Wikipedia articles into the following JSON structure:

{
  "title": "wiki-dump-extractor-json",
  "timestamp": 1532158619.0,
  "leading": "wiki-dump-extractor-json is an efficient parser for Wikimedia XML dumps.\n\n",
  "infobox": null,
  "toc": [
    {
      "title": "Sub section",
      "sub": []
    }
  ],
  "subSections": [
    {
      "title": "Sub section",
      "leading": "== Sub section ==\n\nSub section part.",
      "subSections": []
    }
  ],
  "references": [],
  "redirectedTo": null
}

Fields

  • title: The article title
  • timestamp: Last edit timestamp as Unix timestamp
  • leading: The lead section content (the text before the first section heading)
  • infobox: The infobox template content if present
  • toc: Table of contents hierarchy
  • subSections: Array of article sections with hierarchical structure
  • references: List of unique reference texts
  • redirectedTo: Target title if this is a redirect page
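
A minimal sketch of consuming this structure follows; the field names are the ones listed above, while the article.json path and the iter_sections helper are only illustrative.

import json
from datetime import datetime, timezone

def iter_sections(sections, depth=0):
    # Walk the nested subSections hierarchy depth-first.
    for section in sections:
        yield depth, section['title']
        yield from iter_sections(section['subSections'], depth + 1)

with open('article.json', encoding='utf-8') as f:  # one extracted article saved as JSON
    article = json.load(f)

if article['redirectedTo'] is not None:
    print(f"Redirect to: {article['redirectedTo']}")
else:
    edited = datetime.fromtimestamp(article['timestamp'], tz=timezone.utc)
    print(f"{article['title']} (last edited {edited:%Y-%m-%d})")
    for depth, title in iter_sections(article['subSections']):
        print('  ' * depth + title)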

Output Path Structure

Extracted data is organized as:

output_directory/
├── index.json          # Progress tracking and article index
├── 00/
│   ├── 0.jsonl         # JSON Lines files with articles
│   ├── 1.jsonl
│   └── ...
├── 01/
│   ├── 0.jsonl
│   └── ...
└── ...

Each .jsonl file contains one JSON object per line, each representing one Wikipedia article.
index.json has the following format (for details, see wiki_dump_extractor_json/extractor.py):

{
    "pages": {"Template:Wikipedialang": [0, 0, 0],
              "wiki-dump-extractor-json": [0, 0, 1]},
    "_progress": [0, 0, 536, 2]
}
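
The .jsonl files can also be read without the package API. The following is a minimal sketch using only the standard library, assuming the directory layout shown above (output_dir is the directory passed via -o):

import json
from pathlib import Path

output_dir = Path('output_dir')  # the directory passed via -o

# Walk the numbered subdirectories (00/0.jsonl, 00/1.jsonl, ...) and read
# one JSON article per line.
for jsonl_path in sorted(output_dir.glob('*/*.jsonl')):
    with jsonl_path.open(encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            print(article['title'])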


