A package for traversing and downloading files from Wiki Data Dump mirrors.

Project description

wiki_data_dump

A library that assists in traversing and downloading files from the Wikimedia data dumps and their mirrors.

Purpose

To make the maintenance of large wiki datasets easier and more stable.

In addition, it aims to lighten the load on Wikimedia and its mirrors by requesting only the site's index and performing the inevitable searching and navigation of its contents offline.

A web crawler might make multiple requests just to find its file (and navigate with the notorious fragility of a web crawler), whereas wiki_data_dump caches the site's contents. This not only speeds up repeated use of the library, but also protects against accidentally flooding Wikimedia with requests, because site navigation never depends on additional requests.
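
For instance, once a single WikiDump instance has fetched and cached the index, repeated lookups against it can be resolved offline. A minimal sketch (the wiki names below are illustrative, and the caching behaviour is assumed to work as described above):

from wiki_data_dump import WikiDump

# One WikiDump instance fetches and caches the site index once.
wiki = WikiDump()

# Subsequent lookups are resolved against the cached index rather than
# through further navigation requests.
for name in ("enwiki", "dewiki", "frwiki"):
    print(name, len(wiki.get_wiki(name).jobs), "jobs")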

Installation

pip install wiki_data_dump

Usage

One could easily get all available job names for any given wiki with this short script:

from wiki_data_dump import WikiDump, Wiki

wiki = WikiDump()
en_wiki: Wiki = wiki.get_wiki('enwiki')

print(en_wiki.jobs.keys())

Or you could list the available files from the categorytables SQL job:

from wiki_data_dump import WikiDump, Job

wiki = WikiDump()
categories: Job = wiki.get_job("enwiki", "categorytables")

print(categories.files.keys())

A slightly less trivial example: querying for specific files when a job contains more files than we need.

For example, it's not uncommon for a job to contain partial data dumps, which makes it necessary to know the file paths of all parts. If you hard-code all the file names, finding the relevant files becomes increasingly difficult.

This is a solution that wiki_data_dump provides:

from wiki_data_dump import WikiDump, File
import re
from typing import List

wiki = WikiDump()

xml_stubs_dump_job = wiki["enwiki", "xmlstubsdump"]

# Select only the partial stub-meta-history files, e.g. stub-meta-history1.xml.gz.
stub_history_files: List[File] = xml_stubs_dump_job.get_files(
    re.compile(r"stub-meta-history[0-9]+\.xml\.gz$")
)

for file in stub_history_files:
    # download() starts the download in its own thread; join() waits for it to finish.
    wiki.download(file).join()

Download processes are threaded by default, and the call to WikiDump.download returns a reference to the thread it's running in.
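
Because each download runs in its own thread, you can also start several downloads before joining any of them, so the files transfer concurrently rather than one after another. A minimal sketch, reusing the xmlstubsdump example above and assuming nothing beyond the join() method already shown:

from wiki_data_dump import WikiDump, File
import re
from typing import List

wiki = WikiDump()
xml_stubs_dump_job = wiki["enwiki", "xmlstubsdump"]

stub_history_files: List[File] = xml_stubs_dump_job.get_files(
    re.compile(r"stub-meta-history[0-9]+\.xml\.gz$")
)

# Start every download first, then wait for all of them.
threads = [wiki.download(file) for file in stub_history_files]

for thread in threads:
    thread.join()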

The process is simple and readable:

  1. Get the job that contains the files desired.
  2. Filter the files to only contain those that you need.
  3. Download the files concurrently (or in parallel).

For more direction on how to use this library, see tests.py or the scripts in the examples directory.

Next steps

  • Automatic detection of which mirror has the fastest download speed at any given time.
  • Caching that invalidates only when a resource is actually out of date, rather than whenever the current date has passed the cache's creation date.
  • The ability to access Wikimedia downloads available in /other/.

Disclaimer

The author of this software is not affiliated, associated, authorized, endorsed by, or in any way officially connected with Wikimedia or any of its affiliates. This software is independently owned and created.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wiki_data_dump-0.1.1.tar.gz (11.5 kB)


File details

Details for the file wiki_data_dump-0.1.1.tar.gz.

File metadata

  • Download URL: wiki_data_dump-0.1.1.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.63.1 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for wiki_data_dump-0.1.1.tar.gz:

  • SHA256: c6dc02b6d33cff1b453accdd870d28aae705465e859648ab8ed5243f8fb6f01e
  • MD5: 1ced5f3b079f4ec82845bc6d7f629d5a
  • BLAKE2b-256: 32d0681f8927b26fba439db4e0dc3f9cd28628ee37be13b53f2a5ddd4784c702

