A Python package for extracting and processing Wikipedia data


Wiki dump extractor

A Python library to extract and analyze pages from a wiki dump.

This library is used in particular in the Landnotes project to extract and analyze pages from the Wikipedia dump.

The project is hosted on GitHub, and the HTML documentation is available here.

Scope

Make Wikipedia dumps easier to work with:

  • Extract pages from a wiki dump
  • Be easy to install and run
  • Be fast (can iterate over 50,000 pages per second using Avro)
  • Be memory efficient
  • Allow for batch processing and parallel processing

Provide utilities for page analysis:

  • Date parsing
  • Section extraction
  • Text cleaning
  • and more.
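To give a sense of what these utilities do, here is an illustrative sketch in plain Python. The function names below are hypothetical, not the library's actual API; they only demonstrate the kind of markup cleaning and date parsing these helpers perform.

```python
import re
from datetime import datetime


def clean_wiki_text(text: str) -> str:
    """Strip common wiki markup from a page's source text (illustrative only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # remove {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)  # unwrap [[links]]
    text = re.sub(r"'{2,}", "", text)  # drop bold/italic quote markers
    return text.strip()


def parse_date(text: str) -> datetime:
    """Parse a date string in the 'day Month year' style common in wikitext."""
    return datetime.strptime(text, "%d %B %Y")


cleaned = clean_wiki_text("'''Ada Lovelace''' was born in [[London|London, England]].")
print(cleaned)  # Ada Lovelace was born in London, England.
print(parse_date("10 December 1815").year)  # 1815
```

The library's own helpers handle many more cases (nested templates, references, tables); see the HTML documentation for the real API.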

Usage

To simply iterate over the pages in the dump:

from wiki_dump_extractor import WikiDumpExtractor

dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
for page in extractor.iter_pages(limit=1000):
    print(page.title)

To extract pages in batches (here we save each batch to a separate CSV file):

import pandas

from wiki_dump_extractor import WikiDumpExtractor

dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
batches = extractor.iter_page_batches(batch_size=1000, limit=10)
for i, batch in enumerate(batches):
    df = pandas.DataFrame([page.to_dict() for page in batch])
    df.to_csv(f"batch_{i}.csv")

Converting the dump to Avro

There are several reasons to convert the dump to Avro. The original xml.bz2 dump is 22 GB but very slow to read (~250 pages/second). The uncompressed XML dump is 107 GB and relatively fast to read (this library uses lxml, which parses thousands of pages per second), but roughly 50% of its pages are empty redirect pages.

The following code converts the dump to a 28 GB Avro dump that only contains the 12 million real pages, stores redirects in a fast LMDB database, and creates an index for quick page lookups. The operation takes ~40 minutes depending on your machine.

from wiki_dump_extractor import WikiXmlDumpExtractor

file_path = "enwiki-20250201-pages-articles-multistream.xml"
extractor = WikiXmlDumpExtractor(file_path=file_path)
ignored_fields = ["timestamp", "page_id", "revision_id", "redirect_title"]
extractor.extract_pages_to_avro(
    output_file="wiki_dump.avro",
    redirects_db_path="redirects.lmdb",  # LMDB database for fast redirect lookups
    ignored_fields=ignored_fields,
)
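To illustrate why a key-value store makes redirect resolution fast, here is a minimal sketch. The library uses LMDB; Python's stdlib `dbm` stands in below, and the key format (redirect title mapped to target title) is an assumption for illustration, not the library's internal layout.

```python
import dbm
import os
import tempfile

# Build a tiny stand-in redirect database (the real one is LMDB).
db_path = os.path.join(tempfile.mkdtemp(), "redirects.db")
with dbm.open(db_path, "c") as db:
    db[b"UK"] = b"United Kingdom"
    db[b"USA"] = b"United States"


def resolve(title: str) -> str:
    """Follow a redirect if one exists, else return the title unchanged."""
    key = title.encode()
    with dbm.open(db_path, "r") as db:
        return db[key].decode() if key in db else title


print(resolve("UK"))      # United Kingdom
print(resolve("Python"))  # Python (no redirect entry)
```

A disk-backed key-value store keeps lookups O(1)-ish without loading millions of redirect pairs into memory, which is the same trade-off the LMDB database makes here.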

Then index the pages for fast lookups:

from wiki_dump_extractor import WikiAvroDumpExtractor

extractor = WikiAvroDumpExtractor(file_path="wiki_dump.avro")
extractor.index_pages(page_index_db="page_index.lmdb")

Later on, read the Avro file and use redirects and index as follows (reads the 12 million pages in ~3-4 minutes depending on your machine):

from wiki_dump_extractor import WikiAvroDumpExtractor

# Create extractor
extractor = WikiAvroDumpExtractor(
    file_path="wiki_dump.avro",
    index_dir="page_index.lmdb"  # Use the index for faster lookups
)

# Get pages with automatic redirect resolution
pages = extractor.get_page_batch_by_title(
    ["Page Title 1", "Page Title 2"]
)

Installation

pip install wiki-dump-extractor

Or from the source in development mode:

pip install -e .

To use the LLM-specific module (mainly relevant if you are working on a project like Landnotes), use

pip install wiki-dump-extractor[llm]

Or locally:

pip install -e ".[llm]"

To install with the test dependencies, use pip install -e ".[dev]", then run the tests with pytest from the root directory.

Requirements for running the LLM utils

# Add the Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# Import the Google Cloud public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk

