Read Wikipedia dumps

WPyDumps

WPyDumps is a Python module to work with dumps of Wikipedia.

It lets you parse and extract relevant information from dump files without decompressing them to disk.

It works with (at least) these dumps:

  • pages-meta-history….xml-….7z (“All pages with complete edit history”)
  • pages-meta-current.xml.bz2
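
As a rough illustration of what reading a compressed dump "without decompressing to disk" means, here is a minimal sketch using only the standard library `bz2` module. It is not WPyDumps code, and whether the library works this way internally is not stated here; `count_lines_streaming` and the tiny sample file are hypothetical stand-ins.

```python
import bz2
import os
import tempfile


def count_lines_streaming(path):
    """Count lines in a .bz2 file, decompressing it incrementally in memory."""
    count = 0
    # bz2.open returns a file-like object; no decompressed copy is written to disk.
    with bz2.open(path, "rt", encoding="utf-8") as reader:
        for _ in reader:
            count += 1
    return count


# Build a tiny compressed file to demonstrate (a stand-in for a real dump).
sample_xml = "<mediawiki>\n<page><title>A</title></page>\n</mediawiki>\n"
with tempfile.NamedTemporaryFile(suffix=".xml.bz2", delete=False) as tmp:
    tmp.write(bz2.compress(sample_xml.encode("utf-8")))
    sample_path = tmp.name

line_count = count_lines_streaming(sample_path)
os.remove(sample_path)
print(line_count)
```

A reader obtained this way is an ordinary file-like object, so it can be consumed line by line or handed to any streaming XML parser.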

Install

pip install wpydumps

Usage

The parser uses SAX to read the files as a stream. It takes a reader or a filename, plus a page callback function, then parses the file and calls that function once for each page.

Pages are represented as wpydumps.model.Page objects. They include the page's details as well as its revisions (wpydumps.model.Revision). Each revision holds a reference to its contributor (wpydumps.model.Contributor).

import wpydumps as p


def simple_page_callback(page):
    print(page.title)


# parse from a local archive
p.parse_pages_from_archive_filename("myfile.7z", simple_page_callback)

# parse from an uncompressed file
with open("myfile") as f:
    p.parse_pages_from_reader(f, simple_page_callback)
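
To make the streaming model more concrete, the sketch below shows how a SAX handler can call a callback once per page, using only the standard library `xml.sax` module. It is an illustration of the general technique, not the WPyDumps implementation: `TitleHandler` and `parse_titles` are hypothetical names, and the callback here receives only the title string rather than a Page object.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collects <title> text and calls page_callback when each <page> closes."""

    def __init__(self, page_callback):
        super().__init__()
        self.page_callback = page_callback
        self.in_title = False
        self.title_parts = []

    def startElement(self, name, attrs):
        if name == "page":
            self.title_parts = []  # start a fresh page
        elif name == "title":
            self.in_title = True

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self.in_title:
            self.title_parts.append(content)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
        elif name == "page":
            self.page_callback("".join(self.title_parts))


def parse_titles(reader, page_callback):
    """Stream the XML from reader, invoking page_callback for each page title."""
    xml.sax.parse(reader, TitleHandler(page_callback))


sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Paris</title></page>"
    b"<page><title>Lyon</title></page>"
    b"</mediawiki>"
)
titles = []
parse_titles(sample, titles.append)
print(titles)
```

Because the handler only keeps the current page's state, memory use stays roughly constant no matter how large the input is, which is the main reason to prefer SAX over building a full document tree for dumps of this size.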

Revisions always have integer text_length and diff_length attributes. You may drop the text content by passing keep_revisions_text=False to the parser.
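
Since the length attributes are always present, a callback can compute size statistics without touching the revision text. The sketch below is one way to do that; `summarize_lengths` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for real Page and Revision objects, and the idea that diff_length may be negative is an assumption (the code takes absolute values so it works either way).

```python
from types import SimpleNamespace


def summarize_lengths(page):
    """Return (title, largest text length, largest absolute diff) for a page."""
    largest_text = max((rev.text_length for rev in page.revisions), default=0)
    # Assumption: diff_length may be negative when text is removed.
    largest_change = max((abs(rev.diff_length) for rev in page.revisions), default=0)
    return page.title, largest_text, largest_change


# Stand-in objects exposing only the attributes used above.
page = SimpleNamespace(
    title="Example",
    revisions=[
        SimpleNamespace(text_length=120, diff_length=120),
        SimpleNamespace(text_length=90, diff_length=-30),
        SimpleNamespace(text_length=400, diff_length=310),
    ],
)
summary = summarize_lengths(page)
print(summary)
```

A function like this only reads title, revisions, text_length and diff_length, so it should keep working when the parser is asked to drop revision text.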

Examples

from wpydumps import parse_pages_from_archive_filename


def page_callback(page):
    pass  # do something with the page


# use the appropriate filename
parse_pages_from_archive_filename("frwiki-20190901-pages-meta-history1.xml-p3p1630.7z", page_callback)

Print all pages and their number of revisions

def page_callback(page):
    print(page.title, len(page.revisions))

Print all pages and their number of contributors

def page_callback(page):
    contributors = set()
    for rev in page.revisions:
        contributors.add(rev.contributor.username or rev.contributor.ip)

    print("%s: %d contributors" % (page.title, len(contributors)))
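
The callbacks above print per-page results. To accumulate a result across the whole dump, one option is to build the callback as a closure over a shared counter. This is a sketch under assumptions: `make_contributor_counter` and `fake_revision` are hypothetical helpers, and the `SimpleNamespace` objects stand in for real pages so the example runs without a dump file.

```python
from collections import Counter
from types import SimpleNamespace


def make_contributor_counter():
    """Return (callback, counts); the callback tallies revisions per contributor."""
    counts = Counter()

    def page_callback(page):
        for rev in page.revisions:
            # Same fallback as above: username if set, otherwise the IP.
            counts[rev.contributor.username or rev.contributor.ip] += 1

    return page_callback, counts


def fake_revision(username=None, ip=None):
    """Stand-in for a revision, exposing only the contributor fields used here."""
    return SimpleNamespace(contributor=SimpleNamespace(username=username, ip=ip))


pages = [
    SimpleNamespace(title="A", revisions=[fake_revision("alice"), fake_revision(ip="192.0.2.1")]),
    SimpleNamespace(title="B", revisions=[fake_revision("alice")]),
]
callback, counts = make_contributor_counter()
for page in pages:
    callback(page)
print(counts.most_common(1))
```

With a real dump, the returned callback would be passed to the parser in place of page_callback, and counts inspected once parsing finishes.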
