Work with Wikipedia dumps
WPyDumps is a Python module to work with dumps of Wikipedia.
It lets you parse dump files and extract relevant information from them without decompressing them on disk.
It works with (at least) these dumps:
pages-meta-history….xml-….7z (“All pages with complete edit history”)
This is quite experimental for now.
pip install wpydumps
The parser uses SAX to read the files as a stream. It takes a reader or a filename and a page callback function. It parses the file and calls that function with each page.
Pages are represented as wpydumps.model.Page objects. They include the pages’ details as well as their revisions (wpydumps.model.Revision). Each revision holds a reference to its contributor.
import wpydumps as p

def simple_page_callback(page):
    print(page.title)

# parse from a local archive
p.parse_pages_from_archive_filename("myfile.7z", simple_page_callback)

# parse from an uncompressed file
with open("myfile") as f:
    p.parse_pages_from_reader(f, simple_page_callback)
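To illustrate the streaming approach described above, here is a minimal sketch of a SAX handler that calls a callback once per page. It uses only the standard library and is not the actual wpydumps implementation: the names PageTitleHandler and parse_titles are hypothetical, the callback receives only the page title rather than a Page object, and the sample XML is a heavily simplified stand-in for a real dump.

```python
import io
import xml.sax


class PageTitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports the <title> of each <page> when the page ends."""

    def __init__(self, page_callback):
        super().__init__()
        self.page_callback = page_callback
        self._in_title = False
        self._title_parts = []

    def startElement(self, name, attrs):
        if name == "page":
            # A new page starts: reset the accumulated title.
            self._title_parts = []
        elif name == "title":
            self._in_title = True

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._in_title:
            self._title_parts.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
        elif name == "page":
            # The page is complete: hand it to the callback and move on.
            self.page_callback("".join(self._title_parts))


def parse_titles(reader, page_callback):
    """Stream the XML from `reader`, calling `page_callback` with each page title."""
    xml.sax.parse(reader, PageTitleHandler(page_callback))


# Simplified sample data, not a real dump.
SAMPLE = b"""<mediawiki>
  <page><title>Foo</title><revision><id>1</id></revision></page>
  <page><title>Bar</title><revision><id>2</id></revision></page>
</mediawiki>"""

titles = []
parse_titles(io.BytesIO(SAMPLE), titles.append)
print(titles)
```

Because the handler keeps only the state of the page being read, memory use does not grow with the size of the file, which is why a streaming parser suits large dumps.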
The text of each revision is dropped by default. You can disable this behavior by passing
keep_revisions_text=True to the parser function. Revisions always
from wpydumps import parse_pages_from_archive_filename

def page_callback(page):
    pass  # do something with the page

# use the appropriate filename
parse_pages_from_archive_filename(
    "frwiki-20190901-pages-meta-history1.xml-p3p1630.7z", page_callback)
Print all pages and their number of revisions
def page_callback(page):
    print(page.title, len(page.revisions))
Print all pages and their number of contributors
def page_callback(page):
    contributors = set()
    for rev in page.revisions:
        contributors.add(rev.contributor.username or rev.contributor.ip)
    print("%s: %d contributors" % (page.title, len(contributors)))