Wiki dump extractor
A Python library to extract and analyze pages from a wiki dump.
This library is used in particular by the Landnotes project to extract and analyze pages from the Wikipedia dump.
The project is hosted on GitHub and the HTML documentation is available here.
Scope
Make Wikipedia dumps easier to work with:
- Extract pages from a wiki dump
- Be easy to install and run
- Be fast (can iterate over 50,000 pages / second using Avro)
- Be memory efficient
- Allow for batch processing and parallel processing
Provide utilities for page analysis:
- Date parsing
- Section extraction
- Text cleaning
- and more.
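As an illustration of the kind of processing these utilities cover, here is a minimal text-cleaning sketch in plain Python and regular expressions. It does not use the library's own helpers (whose exact signatures are not shown here); it only demonstrates the general idea of stripping wiki markup:

```python
import re

def strip_wiki_links(text):
    """Replace [[target|label]] and [[target]] wiki links with their display text."""
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)

raw = "The [[Battle of Hastings|battle]] took place in [[1066]]."
print(strip_wiki_links(raw))  # The battle took place in 1066.
```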
Usage
To simply iterate over the pages in the dump:
from wiki_dump_extractor import WikiDumpExtractor
dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
for page in extractor.iter_pages(limit=1000):
    print(page.title)
To extract the pages in batches (here we save each batch to a separate CSV file):
import pandas
from wiki_dump_extractor import WikiDumpExtractor
dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
batches = extractor.iter_page_batches(batch_size=1000, limit=10)
for i, batch in enumerate(batches):
    df = pandas.DataFrame([page.to_dict() for page in batch])
    df.to_csv(f"batch_{i}.csv")
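If you prefer to stay dependency-free, the same batching-and-CSV pattern can be sketched with the standard csv module. The fake_page_batches generator below is only a stand-in for extractor.iter_page_batches, which yields Page objects rather than dicts:

```python
import csv

# Stand-in for extractor.iter_page_batches(): yields lists of page dicts.
def fake_page_batches(pages, batch_size):
    for i in range(0, len(pages), batch_size):
        yield pages[i : i + batch_size]

pages = [{"title": f"Page {n}", "text": "..."} for n in range(5)]

for i, batch in enumerate(fake_page_batches(pages, batch_size=2)):
    with open(f"batch_{i}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "text"])
        writer.writeheader()
        writer.writerows(batch)
```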
Converting the dump to Avro
There are many reasons to convert the dump to Avro. The original xml.bz2 dump is 22 GB but very slow to read (~250 pages per second). The uncompressed dump is 107 GB and relatively fast to read (this library uses lxml, which parses thousands of pages per second), but 50% of the pages in it are empty redirect pages.
The following code converts the dump to a 28 GB Avro dump that only contains the 12 million real pages, stores redirects in a fast LMDB database, and creates an index for quick page lookups. The operation takes ~40 minutes depending on your machine.
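The redirect database maps a redirect title to its target title, so looking up a page means following the chain until a real page is reached. A minimal sketch of that resolution logic, with a plain dict standing in for the LMDB store (the real library handles this internally):

```python
# Toy redirect table; the real library stores this mapping in LMDB.
redirects = {
    "NYC": "New York City",
    "The Big Apple": "NYC",  # redirects can chain
}

def resolve(title, redirects, max_hops=10):
    """Follow redirects until we reach a title that is a real page."""
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title

print(resolve("The Big Apple", redirects))  # New York City
```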
from wiki_dump_extractor import WikiXmlDumpExtractor
file_path = "enwiki-20250201-pages-articles-multistream.xml"
extractor = WikiXmlDumpExtractor(file_path=file_path)
ignored_fields = ["timestamp", "page_id", "revision_id", "redirect_title"]
extractor.extract_pages_to_avro(
    output_file="wiki_dump.avro",
    redirects_db_path="redirects.lmdb",  # LMDB database for fast redirect lookups
    ignored_fields=ignored_fields,
)
Then index the pages for fast lookups:
from wiki_dump_extractor import WikiAvroDumpExtractor
extractor = WikiAvroDumpExtractor(file_path="wiki_dump.avro")
extractor.index_pages(page_index_db="page_index.lmdb")
Later on, read the Avro file and use the redirects and index as follows (this reads the 12 million pages in ~3-4 minutes, depending on your machine):
from wiki_dump_extractor import WikiAvroDumpExtractor
# Create extractor
extractor = WikiAvroDumpExtractor(
    file_path="wiki_dump.avro",
    index_dir="page_index.lmdb",  # use the index for faster lookups
)
# Get pages with automatic redirect resolution
pages = extractor.get_page_batch_by_title(
    ["Page Title 1", "Page Title 2"]
)
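As a rough sanity check (assuming the ~50,000 pages-per-second Avro read rate quoted in the Scope section), the 3-4 minute figure lines up:

```python
pages = 12_000_000  # real pages kept in the Avro dump
rate = 50_000       # pages per second, from the Scope section
minutes = pages / rate / 60
print(minutes)  # 4.0
```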
Installation
pip install wiki-dump-extractor
Or from source in development mode:
pip install -e .
To use the LLM-specific module (mostly useful if you are working on a project like Landnotes), use:
pip install wiki-dump-extractor[llm]
Or locally:
pip install -e ".[llm]"
To install with test dependencies, use pip install -e ".[dev]", then run the tests with pytest from the root directory.
Requirements for running the LLM utils
# Add the Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import the Google Cloud public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk