Wiki dump extractor
A Python library to extract and analyze pages from a wiki dump.
This library is used in particular by the Landnotes project to extract and analyze pages from the Wikipedia dump.
The project is hosted on GitHub and the HTML documentation is available here.
Scope
Make Wikipedia dumps easier to work with:
- Extract pages from a wiki dump
- Be easy to install and run
- Be fast (can iterate over 50,000 pages / second using Avro)
- Be memory efficient
- Allow for batch processing and parallel processing
Provide utilities for page analysis:
- Date parsing
- Section extraction
- Text cleaning
- and more.
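As an illustration of the kind of processing these utilities cover, here is a minimal text-cleaning sketch in plain Python and regular expressions. It does not use the library's own helpers (whose exact signatures are not shown here); it only demonstrates the general idea of stripping wiki markup:

```python
import re

def strip_wiki_links(text):
    """Replace [[target|label]] and [[target]] wiki links with their display text."""
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)

raw = "The [[Battle of Hastings|battle]] took place in [[1066]]."
print(strip_wiki_links(raw))  # The battle took place in 1066.
```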
Usage
To simply iterate over the pages in the dump:
from wiki_dump_extractor import WikiDumpExtractor
dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
for page in extractor.iter_pages(limit=1000):
    print(page.title)
To extract the pages in batches (here we save each batch to a separate CSV file):
import pandas
from wiki_dump_extractor import WikiDumpExtractor
dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
batches = extractor.iter_page_batches(batch_size=1000, limit=10)
for i, batch in enumerate(batches):
    df = pandas.DataFrame([page.to_dict() for page in batch])
    df.to_csv(f"batch_{i}.csv")
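If you prefer to stay dependency-free, the same batching-and-CSV pattern can be sketched with the standard csv module. The fake_page_batches generator below is only a stand-in for extractor.iter_page_batches, which yields Page objects rather than dicts:

```python
import csv

# Stand-in for extractor.iter_page_batches(): yields lists of page dicts.
def fake_page_batches(pages, batch_size):
    for i in range(0, len(pages), batch_size):
        yield pages[i : i + batch_size]

pages = [{"title": f"Page {n}", "text": "..."} for n in range(5)]

for i, batch in enumerate(fake_page_batches(pages, batch_size=2)):
    with open(f"batch_{i}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "text"])
        writer.writeheader()
        writer.writerows(batch)
```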
Converting the dump to Avro
There are many reasons to convert the dump to Avro. The original xml.bz2 dump is 22 GB but very slow to read (~250 pages per second). The uncompressed dump is 107 GB and relatively fast to read (this library uses lxml, which parses thousands of pages per second), but 50% of the pages in it are empty redirect pages.
The following code converts the dump to a 28 GB Avro dump that only contains the 12 million real pages, stores redirects in a fast LMDB database, and creates an index for quick page lookups. The operation takes ~40 minutes depending on your machine.
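The redirect database maps a redirect title to its target title, so looking up a page means following the chain until a real page is reached. A minimal sketch of that resolution logic, with a plain dict standing in for the LMDB store (the real library handles this internally):

```python
# Toy redirect table; the real library stores this mapping in LMDB.
redirects = {
    "NYC": "New York City",
    "The Big Apple": "NYC",  # redirects can chain
}

def resolve(title, redirects, max_hops=10):
    """Follow redirects until we reach a title that is a real page."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title

print(resolve("The Big Apple", redirects))  # New York City
```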
from wiki_dump_extractor import WikiXmlDumpExtractor
file_path = "enwiki-20250201-pages-articles-multistream.xml"
extractor = WikiXmlDumpExtractor(file_path=file_path)
ignored_fields = ["timestamp", "page_id", "revision_id", "redirect_title"]
extractor.extract_pages_to_avro(
    output_file="wiki_dump.avro",
    redirects_db_path="redirects.lmdb",  # LMDB database for fast redirect lookups
    ignored_fields=ignored_fields,
)
Then index the pages for fast lookups:
from wiki_dump_extractor import WikiAvroDumpExtractor
extractor = WikiAvroDumpExtractor(file_path="wiki_dump.avro")
extractor.index_pages(page_index_db="page_index.lmdb")
Later on, read the Avro file and use the redirects and index as follows (this reads the 12 million pages in ~3-4 minutes, depending on your machine):
from wiki_dump_extractor import WikiAvroDumpExtractor
# Create extractor
extractor = WikiAvroDumpExtractor(
    file_path="wiki_dump.avro",
    index_dir="page_index.lmdb",  # use the index for faster lookups
)
# Get pages with automatic redirect resolution
pages = extractor.get_page_batch_by_title(
    ["Page Title 1", "Page Title 2"]
)
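As a rough sanity check (assuming the ~50,000 pages-per-second Avro read rate quoted in the Scope section), the 3-4 minute figure lines up:

```python
pages = 12_000_000  # real pages kept in the Avro dump
rate = 50_000       # pages per second, from the Scope section
minutes = pages / rate / 60
print(minutes)  # 4.0
```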
Installation
pip install wiki-dump-extractor
Or from source in development mode:
pip install -e .
To use the LLM-specific module (mostly useful if you are working on a project like Landnotes), use:
pip install wiki-dump-extractor[llm]
Or locally:
pip install -e ".[llm]"
To install with test dependencies, use pip install -e ".[dev]", then run the tests with pytest from the root directory.
Requirements for running the LLM utils
# Add the Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk