An efficient Wikipedia XML dump extractor which converts each page to JSON.

Project description

Wiki Dump Extractor

An efficient extractor for Wikipedia XML dump files that converts each page into structured JSON. It is designed for memory-efficient processing of large dumps, preserves the articles' hierarchical structure, and uses mwparserfromhell to parse the wiki source.

Features

  • Parallel Processing: Multi-core support for faster processing of large datasets
  • Resumable Processing: Saves progress and can resume from interrupted extractions
  • Structured Output: Preserves Wikipedia's hierarchical structure (sections, tables of contents, infoboxes)
  • Memory Efficient: Uses streaming XML parsing to handle 10GB+ dump files without memory growth
  • Reference Handling: Deduplicates and renumbers citation references
  • Redirection Support: Properly handles Wikipedia redirect pages

Installation

pip install wiki-dump-extractor-json
pip install ujson # optional (for better performance)

Dependencies

  • Python 3.10 or higher
  • tqdm (for progress bars)
  • mwparserfromhell (for parsing source)
  • python-dateutil (for date parsing)

Usage

Command Line Interface

The package provides a command-line tool wiki-dump-extractor-json:

# Parse a Wikipedia XML dump file
wiki-dump-extractor-json -o output_dir pages.xml

# Run a benchmark test
wiki-dump-extractor-json --benchmark pages.xml

# Look up a specific article from extracted data
wiki-dump-extractor-json --lookup "title" output_dir

# "python -m" is also available
python -m wiki_dump_extractor_json ...

Python API

Parse Wikipedia XML Dumps

from wiki_dump_extractor_json import parse_xml_dump

# Stream parse XML dump file
with open('pages.xml', 'rb') as f:
    for page in parse_xml_dump(f):
        print(f"Page: {page.title}")
        print(f"Timestamp: {page.timestamp}")
        # Process page.source as needed
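
Wikipedia dumps are usually distributed bz2-compressed (e.g. a file such as enwiki-latest-pages-articles.xml.bz2). Below is a minimal sketch of streaming such a file through the same API, assuming parse_xml_dump accepts any binary file-like object; the filename and the Talk-page filter are only illustrative.

import bz2

from wiki_dump_extractor_json import parse_xml_dump

# Decompress on the fly so the full dump never has to fit in memory
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    for page in parse_xml_dump(f):
        if page.title.startswith('Talk:'):
            continue  # hypothetical filter: skip talk pages
        print(page.title)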

Parse Individual Wikipedia Articles

from wiki_dump_extractor_json import parse_source

# Parse a single Wikipedia article source
source = """== Section Title ==
Some content with a reference<ref>Reference text</ref>.
Another paragraph."""

parsed = parse_source(source)
print(f"Leading text: {parsed['leading']}")
print(f"References: {parsed['references']}")
print(f"Subsections: {len(parsed['subSections'])}")

Lookup Extracted Articles

from wiki_dump_extractor_json import lookup_from_extracted

title = input('Input title: ')
# Look up a specific article from extracted data
article_data = lookup_from_extracted('output_dir', title)
print(f"Article title: {article_data['title']}")
print(f"Article sections: {len(article_data['subSections'])}")

Output Format

The parser extracts Wikipedia articles into the following JSON structure:

{
  "title": "wiki-dump-extractor-json",
  "timestamp": 1532158619.0,
  "leading": "wiki-dump-extractor-json is an efficient parser for Wikimedia XML dumps.\n\n",
  "infobox": null,
  "toc": [
    {
      "title": "Sub section",
      "sub": []
    }
  ],
  "subSections": [
    {
      "title": "Sub section",
      "leading": "== Sub section ==\n\nSub section part.",
      "subSections": []
    }
  ],
  "references": [],
  "redirectedTo": null
}

Fields

  • title: The article title
  • timestamp: Time of the last edit, as a Unix timestamp
  • leading: The lead section content (the text before the first section heading)
  • infobox: The infobox template content, if present
  • toc: The table-of-contents hierarchy
  • subSections: Array of article sections, nested hierarchically (see the traversal sketch after this list)
  • references: List of unique reference texts
  • redirectedTo: The target title if the page is a redirect
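
Since subSections nests recursively, consumers usually walk the tree. The sketch below flattens an article into (depth, title, text) tuples; it assumes only the JSON structure documented above, and 'Some title' is a placeholder:

from wiki_dump_extractor_json import lookup_from_extracted

def iter_sections(node, depth=0):
    """Yield (depth, title, leading text) for every nested section."""
    for section in node.get('subSections', []):
        yield depth, section['title'], section['leading']
        yield from iter_sections(section, depth + 1)  # recurse into child sections

article_data = lookup_from_extracted('output_dir', 'Some title')
for depth, title, _text in iter_sections(article_data):
    print('  ' * depth + title)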

Output Path Structure

Extracted data is organized as:

output_directory/
├── index.json          # Progress tracking and article index
├── 00/
│   ├── 0.jsonl         # JSON Lines files with articles
│   ├── 1.jsonl
│   └── ...
├── 01/
│   ├── 0.jsonl
│   └── ...
└── ...

Each .jsonl file contains one JSON object per line, each representing one Wikipedia article.
The format of index.json is shown below (for details, see wiki_dump_extractor_json/extractor.py):

{
    "pages": {"Template:Wikipedialang": [0, 0, 0],
              "wiki-dump-extractor-json": [0, 0, 1]},
    "_progress": [0, 0, 536, 2]
}
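
The .jsonl files can also be consumed without the library, for example to stream every extracted article into another pipeline. A minimal sketch relying only on the layout shown above (ujson can be swapped in for json if installed):

import json
from pathlib import Path

def iter_articles(output_dir):
    """Yield one article dict per line from every .jsonl file under output_dir."""
    for path in sorted(Path(output_dir).glob('*/*.jsonl')):
        with open(path, encoding='utf-8') as f:
            for line in f:
                yield json.loads(line)

for article in iter_articles('output_dir'):
    if article['redirectedTo'] is None:  # skip redirect stubs
        print(article['title'])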

Download files

Download the file for your platform.

Source Distribution

wiki_dump_extractor_json-1.0.5.tar.gz (10.6 kB)

Built Distribution

wiki_dump_extractor_json-1.0.5-py3-none-any.whl (10.2 kB)

File details

Details for the file wiki_dump_extractor_json-1.0.5.tar.gz.

File metadata

  • Download URL: wiki_dump_extractor_json-1.0.5.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for wiki_dump_extractor_json-1.0.5.tar.gz
  • SHA256: bb2d1ccb87c6caa5ef4a30a8ecf30e608c5fd47d633d6abaff596022ee7ab6d7
  • MD5: 10ec24ec05f31d70d4219485a7a12b96
  • BLAKE2b-256: 45f67e177978175a9660e552bef46e220a35139d5be400d11da5e43470208530

File details

Details for the file wiki_dump_extractor_json-1.0.5-py3-none-any.whl.

File hashes

Hashes for wiki_dump_extractor_json-1.0.5-py3-none-any.whl
  • SHA256: c584cdbf3d1ee01f13c0596d284bac489ff005248b926f39b025769910ec532f
  • MD5: 579dcc2eb54fc4259441031f0cb2ca22
  • BLAKE2b-256: 15bc9d78e5a5722c4b2035de2f5278e792c82782abbbfaf538ffd2f6aeb53373
