A Python library for processing Wikimedia XML dumps.
wikixml
Install
pip install wikixml --upgrade
Download Wiki Dumps
Visit: https://dumps.wikimedia.org/zhwiki/latest/
Download the latest wiki dump file through a proxy (drop the --proxy option if you do not need one):
curl -L --proxy http://127.0.0.1:11111 -o ~/repos/wikixml/data/zhwiki-latest-pages-meta-current.xml.bz2 https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-meta-current.xml.bz2
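For readers who prefer to script the download, the curl command above can be approximated with the Python standard library. This is a minimal sketch, not part of wikixml: the helper names `latest_dump_url` and `download` are hypothetical, and the URL pattern is simply generalized from the zhwiki link shown above.

```python
import shutil
import urllib.request
from pathlib import Path


def latest_dump_url(wiki="zhwiki", kind="pages-meta-current"):
    """Build the 'latest' dump URL, following the pattern of the link above."""
    name = f"{wiki}-latest-{kind}.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/latest/{name}"


def download(url, dest, proxy=None):
    """Stream `url` to `dest`, optionally through an HTTP proxy.

    `proxy` is a URL such as "http://127.0.0.1:11111"; pass None to use
    the system defaults.
    """
    handlers = []
    if proxy:
        # Route both http and https requests through the same proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)

    dest = Path(dest).expanduser()
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the multi-gigabyte dump never sits in memory.
    with opener.open(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


if __name__ == "__main__":
    url = latest_dump_url("zhwiki")
    print(url)
    # Uncomment to start the (large) download:
    # download(url, "~/repos/wikixml/data/zhwiki-latest-pages-meta-current.xml.bz2",
    #          proxy="http://127.0.0.1:11111")
```

The download call is left commented out because the dump is large; the URL builder can be checked on its own first.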
WikiXmlParser
Run example:
python example.py
See: example.py
from pathlib import Path

from wikixml import WikiXmlParser

if __name__ == "__main__":
    # Name of the dump file, expected under ./data next to this script
    wiki_xml_bz2 = "zhwiki-20241101-pages-meta-current.xml.bz2"
    file_path = Path(__file__).parent / "data" / wiki_xml_bz2
    parser = WikiXmlParser(file_path)
    # parser.preview_lines(5000)
    parser.preview_pages(max_pages=100)
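To give a feel for what streaming a compressed dump involves, here is a minimal standard-library sketch. It is not the implementation of WikiXmlParser, whose internals are not described here; the function names `iter_pages` and `local_name` are hypothetical, and the sketch assumes the usual MediaWiki export layout of `<page>` elements containing `<title>` and `<text>`.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source, max_pages=None):
    """Yield (title, text) for each <page> in a bz2-compressed dump.

    `source` is a file path or a binary file object holding bz2 data.
    """
    count = 0
    with bz2.open(source, "rb") as stream:
        # iterparse reads incrementally, so the dump is never fully loaded.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the page's children once it has been consumed
            count += 1
            if max_pages is not None and count >= max_pages:
                break


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dump file.
    sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>A</title><revision><text>alpha</text></revision></page>
  <page><title>B</title><revision><text>beta</text></revision></page>
</mediawiki>"""
    for title, text in iter_pages(io.BytesIO(bz2.compress(sample)), max_pages=1):
        print(title, len(text))
```

Passing a real path such as `data/zhwiki-latest-pages-meta-current.xml.bz2` instead of the in-memory sample should work the same way, since `bz2.open` accepts either.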
WikiPagesMongoWriter
Extract wiki pages from the XML dump and write them to MongoDB:
python -m wikixml.mongo -d zhwiki -f "../data/zhwiki-latest-pages-meta-current.xml.bz2"
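The general idea of loading pages into MongoDB can be sketched as batched inserts. This is an illustration only and not the code behind `wikixml.mongo`: the helpers `batched` and `write_pages` and the document fields `title` and `text` are assumptions. The `collection` argument is any object with an `insert_many` method, which is what a pymongo `Collection` provides.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_pages(pages, collection, batch_size=1000):
    """Insert (title, text) pairs into `collection` in batches.

    Returns the number of documents written. The document schema here
    is hypothetical, chosen only for the example.
    """
    total = 0
    for batch in batched(pages, batch_size):
        docs = [{"title": title, "text": text} for title, text in batch]
        collection.insert_many(docs)  # one round trip per batch
        total += len(docs)
    return total


if __name__ == "__main__":
    class FakeCollection:
        """Stand-in for a pymongo Collection so the sketch runs without a server."""

        def __init__(self):
            self.docs = []

        def insert_many(self, docs):
            self.docs.extend(docs)

    fake = FakeCollection()
    written = write_pages([("A", "alpha"), ("B", "beta"), ("C", "gamma")], fake, batch_size=2)
    print(written, len(fake.docs))
```

Batching keeps the number of round trips to the server low, which matters when a dump holds millions of pages. With pymongo installed, a real collection could be passed in place of `FakeCollection`.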