Utility for downloading and checking the status of Wikimedia dumps.

Project description

Kensho Wikimedia for Natural Language Processing - Dump Downloader

kwnlp_dump_downloader is a Python package to help you download raw Wikimedia dumps.

Quick Install (Requires Python >= 3.6)

pip install kwnlp-dump-downloader

Python Package

This Python package provides two main pieces of functionality: checking the status of a Wikipedia dump, and downloading specific parts of a Wikipedia dump (called jobs).

Checking Status

from kwnlp_dump_downloader import get_dump_status
wp_yyyymmdd = "20200920"
wds = get_dump_status(wp_yyyymmdd)
print(wds.report())
abstractsdump: done ✅
abstractsdumprecombine: done ✅
allpagetitlesdump: done ✅
articlesdump: done ✅
articlesdumprecombine: done ✅
articlesmultistreamdump: done ✅
articlesmultistreamdumprecombine: done ✅
babeltable: done ✅
categorylinkstable: done ✅
categorytable: done ✅
changetagdeftable: done ✅
changetagstable: done ✅
externallinkstable: done ✅
flaggedimagestable: done ✅
flaggedpageconfigtable: done ✅
flaggedpagependingtable: done ✅
flaggedpagestable: done ✅
flaggedrevspromotetable: done ✅
flaggedrevsstatisticstable: done ✅
flaggedrevstable: done ✅
flaggedrevstrackingtable: done ✅
flaggedtemplatestable: done ✅
geotagstable: done ✅
imagelinkstable: done ✅
imagetable: done ✅
iwlinkstable: done ✅
langlinkstable: done ✅
metacurrentdump: done ✅
metacurrentdumprecombine: done ✅
metahistory7zdump: done ✅
metahistorybz2dump: done ✅
namespaces: done ✅
pagelinkstable: done ✅
pagepropstable: done ✅
pagerestrictionstable: done ✅
pagetable: done ✅
pagetitlesdump: done ✅
protectedtitlestable: done ✅
redirecttable: done ✅
sitestable: done ✅
sitestatstable: done ✅
templatelinkstable: done ✅
userformergroupstable: done ✅
usergroupstable: done ✅
wbcentityusagetable: done ✅
xmlpagelogsdump: done ✅
xmlpagelogsdumprecombine: done ✅
xmlstubsdump: done ✅
xmlstubsdumprecombine: done ✅
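
Since the report is plain text, one way to act on it in a script is to parse the lines back into a mapping. The sketch below assumes every line has the "job: status" shape shown in the report above. parse_status_report and unfinished_jobs are hypothetical helpers, not part of kwnlp_dump_downloader, and the "waiting" status in the sample text is invented for illustration.

```python
def parse_status_report(report_text):
    """Map each job name to its status word, e.g. {"pagetable": "done"}.

    Assumes lines look like "jobname: status <optional emoji>".
    """
    statuses = {}
    for line in report_text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        job, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            statuses[job.strip()] = parts[0]
    return statuses


def unfinished_jobs(statuses):
    """Return the sorted names of jobs whose status is not "done"."""
    return sorted(job for job, status in statuses.items() if status != "done")


# Illustrative text only; in practice this would be wds.report().
sample_report = "\n".join(
    [
        "articlesdump: done ✅",
        "pagetable: done ✅",
        "xmlstubsdump: waiting",
    ]
)

statuses = parse_status_report(sample_report)
print(unfinished_jobs(statuses))
```

A check like this can gate a download so that it only starts once the jobs you need are reported as done.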

Downloading Jobs

from kwnlp_dump_downloader import download_jobs
wp_yyyymmdd = "20200920"
wd_yyyymmdd = "20200921"
data_path = "/path/where/data/should/live"
jobs_to_download = ["pagetable", "articlesdump"]
download_jobs(wp_yyyymmdd, wd_yyyymmdd, data_path=data_path, jobs_to_download=jobs_to_download)

Any of the jobs listed in the status report above can be passed in the jobs_to_download kwarg. In addition, there are two special job strings:

  • pageviewcomplete: used to download monthly pageviews (e.g. pageviews-20200901-user.bz2)
  • wikidata: used to download Wikidata json dumps (e.g. wikidata-20200921-all.json.bz2)
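
Regular and special job names can be mixed in one list. The sketch below checks names before any download starts. KNOWN_JOBS is a hand-copied subset of the status report above and check_job_names is a hypothetical helper, neither of which comes from the package; the download_jobs call is left commented out and simply mirrors the example above.

```python
# Subset of job names copied from the status report (extend as needed).
KNOWN_JOBS = {"articlesdump", "pagetable", "pagepropstable", "redirecttable"}

# The two special job strings described above.
SPECIAL_JOBS = {"pageviewcomplete", "wikidata"}


def check_job_names(jobs):
    """Return the jobs as a list, raising ValueError on unrecognized names."""
    unknown = [j for j in jobs if j not in KNOWN_JOBS and j not in SPECIAL_JOBS]
    if unknown:
        raise ValueError("unrecognized job names: " + ", ".join(unknown))
    return list(jobs)


jobs_to_download = check_job_names(["pagetable", "pageviewcomplete", "wikidata"])
print(jobs_to_download)

# With the package installed, the list can then be passed along:
# from kwnlp_dump_downloader import download_jobs
# download_jobs("20200920", "20200921", data_path="/path/where/data/should/live",
#               jobs_to_download=jobs_to_download)
```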

Command Line Interface

If you prefer to check status and download dumps from the command line, you can do that too. After installing this package with pip, you should have two new commands available:

Checking Status

usage: kwnlp-get-dump-status [-h] [--mirror_url MIRROR_URL] [--wiki WIKI] [--loglevel LOGLEVEL] wp_yyyymmdd

get Wikipedia dump status

positional arguments:
  wp_yyyymmdd           date string for Wikipedia dump (e.g. 20200920)

optional arguments:
  -h, --help            show this help message and exit
  --mirror_url MIRROR_URL
                        base URL for Wikimedia dumps (e.g. https://dumps.wikimedia.org)
  --wiki WIKI           selects which language wikipedia to use (e.g. enwiki)
  --loglevel LOGLEVEL   python logging level integer (e.g. 20)
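
The --loglevel option takes an integer rather than a level name. As a small illustration, the standard library logging module exposes the integer behind each name; loglevel_args is a hypothetical helper that builds the flag, not something the package provides.

```python
import logging


def loglevel_args(level_name):
    """Turn a level name such as "INFO" into ["--loglevel", "<integer>"]."""
    level = getattr(logging, level_name.upper())
    return ["--loglevel", str(level)]


# Show the flag for each standard logging level.
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(name, loglevel_args(name))
```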

Downloading Jobs

usage: kwnlp-download-jobs [-h] [--data_path DATA_PATH] [--mirror_url MIRROR_URL] [--wiki WIKI] [--jobs JOBS]
                           [--loglevel LOGLEVEL]
                           wp_yyyymmdd wd_yyyymmdd

download Wikimedia data

positional arguments:
  wp_yyyymmdd           date string for Wikipedia dump (e.g. 20200920)
  wd_yyyymmdd           date string for Wikidata dump (e.g. 20200921)

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        path to top level data directory (e.g. /data/wikimedia-ingestion)
  --mirror_url MIRROR_URL
                        base URL for Wikimedia dumps (e.g. https://dumps.wikimedia.org)
  --wiki WIKI           selects which language wikipedia to use (e.g. enwiki)
  --jobs JOBS           comma separated list of job names to download (e.g. pagecounts,pagetable)
  --loglevel LOGLEVEL   python logging level integer (e.g. 20)
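
To drive the command from a Python script, the argument list can be assembled from the usage text above. This is a minimal sketch: build_download_command is a hypothetical helper, and it assumes kwnlp-download-jobs is on the PATH if the command is actually run (the subprocess call is left commented out).

```python
import shlex


def build_download_command(wp_yyyymmdd, wd_yyyymmdd, data_path, jobs):
    """Assemble an argument list for kwnlp-download-jobs."""
    return [
        "kwnlp-download-jobs",
        "--data_path", data_path,
        "--jobs", ",".join(jobs),  # --jobs takes a comma separated list
        wp_yyyymmdd,
        wd_yyyymmdd,
    ]


cmd = build_download_command(
    "20200920",
    "20200921",
    "/data/wikimedia-ingestion",
    ["pagetable", "articlesdump"],
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))

# To execute it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the data path contains spaces.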

License

Licensed under the Apache 2.0 License. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2020-present Kensho Technologies, LLC. The present date is determined by the timestamp of the most recent commit in the repository.

Download files

Source Distribution

kwnlp_dump_downloader-0.1.0.tar.gz (8.7 kB)

Uploaded Source

Built Distribution

kwnlp_dump_downloader-0.1.0-py3-none-any.whl (13.1 kB)

Uploaded Python 3

File details

Details for the file kwnlp_dump_downloader-0.1.0.tar.gz.

File metadata

  • Download URL: kwnlp_dump_downloader-0.1.0.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.8.3

File hashes

Hashes for kwnlp_dump_downloader-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2cc70b798dda88594ccf0a26855739bcb405e97ecd69c0c7972bae637e334c6a
MD5 fbc71d33695f8acefaf9dd9b063d2355
BLAKE2b-256 a1df6a4f708399ebb16c8e5ad65c60c4a5813359057cc9a6258ba2925eaac488

File details

Details for the file kwnlp_dump_downloader-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: kwnlp_dump_downloader-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.8.3

File hashes

Hashes for kwnlp_dump_downloader-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 477f9258fa97529873e78cde7d1c4c39bfbde31802dd31a2f7888d0e9e350b2c
MD5 61c55b74b836267bb19bb18abf83caed
BLAKE2b-256 bb47fadd593e1b1cb10b7ba12282ece6ef128ac713a4798f92a9da1ccd28d991
