wikixml

A Python library for processing wiki XML dumps.

Install

pip install wikixml --upgrade

Download Wiki Dumps

Visit: https://dumps.wikimedia.org/zhwiki/latest/

Download the latest wiki dump file through a proxy:

curl -L --proxy http://127.0.0.1:11111 -o ~/repos/wikixml/data/zhwiki-latest-pages-meta-current.xml.bz2 https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-meta-current.xml.bz2
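The same download can be sketched in Python with only the standard library. This is a rough equivalent of the curl command above, not part of wikixml: `dump_url` and `download_dump` are hypothetical helper names, and the URL layout is inferred from the link above.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_url(wiki: str, suffix: str = "pages-meta-current.xml.bz2") -> str:
    """Build the URL of the latest dump file for a wiki such as 'zhwiki'."""
    return f"{DUMPS_ROOT}/{wiki}/latest/{wiki}-latest-{suffix}"


def download_dump(url: str, dest: Path, proxy: Optional[str] = None) -> Path:
    """Stream `url` to `dest`, optionally through an HTTP proxy."""
    handlers = []
    if proxy:
        # Route both http and https requests through the same proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener.open(url) as response, open(dest, "wb") as out:
        # Copy in chunks so the multi-gigabyte dump never sits in memory.
        shutil.copyfileobj(response, out)
    return dest
```

For example, `download_dump(dump_url("zhwiki"), Path("data/zhwiki-latest-pages-meta-current.xml.bz2"), proxy="http://127.0.0.1:11111")` would mirror the curl call, assuming a proxy is listening on that port.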

WikiXmlParser

Run example:

python example.py

See: example.py

from pathlib import Path

from wikixml import WikiXmlParser

if __name__ == "__main__":
    # The dump file is expected under ./data next to this script
    wiki_xml_bz2 = "zhwiki-20241101-pages-meta-current.xml.bz2"
    file_path = Path(__file__).parent / "data" / wiki_xml_bz2
    parser = WikiXmlParser(file_path)
    # parser.preview_lines(5000)
    # Preview up to 100 pages
    parser.preview_pages(max_pages=100)
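To give an idea of what previewing pages from a compressed dump involves, here is a minimal standard-library sketch that streams `<page>` elements out of a `.xml.bz2` file without decompressing it to disk. It is not the implementation of WikiXmlParser; `iter_pages` and `local_name` are hypothetical names, and only the `title` and `text` fields are collected.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path: str, max_pages: int = 100) -> Iterator[Dict[str, str]]:
    """Yield up to max_pages dicts with 'title' and 'text' from a .xml.bz2 dump."""
    count = 0
    with bz2.open(path, "rb") as stream:
        # iterparse reads incrementally, so the whole dump is never loaded.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            page = {"title": "", "text": ""}
            for node in elem.iter():
                name = local_name(node.tag)
                if name in page:
                    page[name] = node.text or ""
            yield page
            elem.clear()  # free the finished page's subtree
            count += 1
            if count >= max_pages:
                break
```

A loop such as `for page in iter_pages("data/zhwiki-latest-pages-meta-current.xml.bz2", max_pages=5): print(page["title"])` would print the first few titles.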

WikiPagesMongoWriter

Extract wiki pages from the XML dump and write them to MongoDB:

python -m wikixml.mongo -d zhwiki -f "../data/zhwiki-latest-pages-meta-current.xml.bz2"
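For readers who want to do the write step by hand, the sketch below shows one way to push page dicts into a MongoDB collection in batches. It is an illustration, not the code behind `wikixml.mongo`: `batched` and `write_pages` are hypothetical helpers, and `collection` is assumed to be any object with pymongo's `Collection.insert_many` method.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batched(items: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_pages(collection: Any, pages: Iterable[Dict[str, Any]], batch_size: int = 1000) -> int:
    """Insert pages in batches; return the number of documents sent."""
    written = 0
    for batch in batched(pages, batch_size):
        # ordered=False lets the server continue past a failing document.
        collection.insert_many(batch, ordered=False)
        written += len(batch)
    return written
```

With pymongo installed, the collection could come from something like `MongoClient("mongodb://localhost:27017")["zhwiki"]["pages"]`, where the collection name `pages` is an assumption. Batching keeps memory use flat and avoids one network round trip per page.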
