Utility for downloading and checking the status of Wikimedia dumps.
Kensho Wikimedia for Natural Language Processing - Dump Downloader
kwnlp_dump_downloader is a Python package to help you download raw Wikimedia dumps.
Quick Install (Requires Python >= 3.6)
pip install kwnlp-dump-downloader
Python Package
This Python package provides two main pieces of functionality: checking the status of a Wikipedia dump, and downloading specific parts of a Wikipedia dump (called jobs).
Checking Status
from kwnlp_dump_downloader import get_dump_status
wp_yyyymmdd = "20200920"
wds = get_dump_status(wp_yyyymmdd)
print(wds.report())
abstractsdump: done ✅
abstractsdumprecombine: done ✅
allpagetitlesdump: done ✅
articlesdump: done ✅
articlesdumprecombine: done ✅
articlesmultistreamdump: done ✅
articlesmultistreamdumprecombine: done ✅
babeltable: done ✅
categorylinkstable: done ✅
categorytable: done ✅
changetagdeftable: done ✅
changetagstable: done ✅
externallinkstable: done ✅
flaggedimagestable: done ✅
flaggedpageconfigtable: done ✅
flaggedpagependingtable: done ✅
flaggedpagestable: done ✅
flaggedrevspromotetable: done ✅
flaggedrevsstatisticstable: done ✅
flaggedrevstable: done ✅
flaggedrevstrackingtable: done ✅
flaggedtemplatestable: done ✅
geotagstable: done ✅
imagelinkstable: done ✅
imagetable: done ✅
iwlinkstable: done ✅
langlinkstable: done ✅
metacurrentdump: done ✅
metacurrentdumprecombine: done ✅
metahistory7zdump: done ✅
metahistorybz2dump: done ✅
namespaces: done ✅
pagelinkstable: done ✅
pagepropstable: done ✅
pagerestrictionstable: done ✅
pagetable: done ✅
pagetitlesdump: done ✅
protectedtitlestable: done ✅
redirecttable: done ✅
sitestable: done ✅
sitestatstable: done ✅
templatelinkstable: done ✅
userformergroupstable: done ✅
usergroupstable: done ✅
wbcentityusagetable: done ✅
xmlpagelogsdump: done ✅
xmlpagelogsdumprecombine: done ✅
xmlstubsdump: done ✅
xmlstubsdumprecombine: done ✅
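Because the report is plain text with one `job: status` pair per line, it can be turned into a dictionary to spot jobs that are not finished yet. The following is a minimal sketch that assumes the line format shown above. `parse_report` and `unfinished_jobs` are hypothetical helpers, not part of kwnlp_dump_downloader, and the "waiting" status in the sample is only an illustrative value.

```python
from typing import Dict, List


def parse_report(report: str) -> Dict[str, str]:
    """Map each job name to its status word, e.g. {"pagetable": "done"}."""
    statuses = {}
    for line in report.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        job, _, rest = line.partition(":")
        parts = rest.split()
        # The first token after the colon is the status; the emoji is ignored.
        statuses[job.strip()] = parts[0] if parts else ""
    return statuses


def unfinished_jobs(report: str) -> List[str]:
    """Return the sorted names of jobs whose status is not "done"."""
    return sorted(
        job for job, status in parse_report(report).items() if status != "done"
    )


# Sample text in the same shape as the report above ("waiting" is made up).
sample = "pagetable: done ✅\narticlesdump: done ✅\nredirecttable: waiting"
print(unfinished_jobs(sample))
```

In practice the string returned by `wds.report()` would be passed in place of `sample`.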
Downloading Jobs
from kwnlp_dump_downloader import download_jobs
wp_yyyymmdd = "20200920"
wd_yyyymmdd = "20200921"
data_path = "/path/where/data/should/live"
jobs_to_download = ["pagetable", "articlesdump"]
download_jobs(wp_yyyymmdd, wd_yyyymmdd, data_path=data_path, jobs_to_download=jobs_to_download)
Any of the jobs listed in the status report above can be specified in the jobs_to_download kwarg. In addition, there are two special job strings:
- pageviewcomplete: used to download monthly pageviews (e.g. pageviews-20200901-user.bz2)
- wikidata: used to download Wikidata json dumps (e.g. wikidata-20200921-all.json.bz2)
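To make the relationship between the two date strings and the example file names concrete, here is a small sketch that rebuilds those names. It is inferred only from the two examples above: the assumption is that the pageviews file is named after the first day of the month of wp_yyyymmdd and the Wikidata file after wd_yyyymmdd. `expected_special_filenames` is a hypothetical helper, not a function of this package.

```python
from datetime import datetime
from typing import Dict


def expected_special_filenames(wp_yyyymmdd: str, wd_yyyymmdd: str) -> Dict[str, str]:
    """Guess the file names for the two special jobs (assumed naming scheme)."""
    # strptime raises ValueError if a string cannot be parsed as a date.
    wp_date = datetime.strptime(wp_yyyymmdd, "%Y%m%d")
    wd_date = datetime.strptime(wd_yyyymmdd, "%Y%m%d")
    # Assumption: monthly pageviews are labeled with the first day of the month.
    month_start = wp_date.replace(day=1)
    return {
        "pageviewcomplete": f"pageviews-{month_start:%Y%m%d}-user.bz2",
        "wikidata": f"wikidata-{wd_date:%Y%m%d}-all.json.bz2",
    }


print(expected_special_filenames("20200920", "20200921"))
```

Parsing the strings with `datetime.strptime` also catches impossible dates such as a 13th month before any download is attempted.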
Command Line Interface
If you prefer to use the command line to check status and download dumps, you can do that too. After pip installing this package, you should have two new commands available: kwnlp-get-dump-status and kwnlp-download-jobs.
Checking Status
usage: kwnlp-get-dump-status [-h] [--mirror_url MIRROR_URL] [--wiki WIKI] [--loglevel LOGLEVEL] wp_yyyymmdd

get Wikipedia dump status

positional arguments:
  wp_yyyymmdd           date string for Wikipedia dump (e.g. 20200920)

optional arguments:
  -h, --help            show this help message and exit
  --mirror_url MIRROR_URL
                        base URL for Wikimedia dumps (e.g. https://dumps.wikimedia.org)
  --wiki WIKI           selects which language wikipedia to use (e.g. enwiki)
  --loglevel LOGLEVEL   python logging level integer (e.g. 20)
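The `--loglevel` option takes an integer rather than a name. As a quick reference, this sketch prints the integer values of the standard levels in Python's `logging` module; the example value 20 corresponds to `INFO`. The `LEVELS` name is introduced here only for illustration.

```python
import logging

# Standard library level names mapped to the integers --loglevel expects.
LEVELS = {
    name: getattr(logging, name)
    for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
}

for name, value in LEVELS.items():
    print(f"{name}: {value}")
```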
Downloading Jobs
usage: kwnlp-download-jobs [-h] [--data_path DATA_PATH] [--mirror_url MIRROR_URL] [--wiki WIKI] [--jobs JOBS]
                           [--loglevel LOGLEVEL]
                           wp_yyyymmdd wd_yyyymmdd

download Wikimedia data

positional arguments:
  wp_yyyymmdd           date string for Wikipedia dump (e.g. 20200920)
  wd_yyyymmdd           date string for Wikidata dump (e.g. 20200921)

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        path to top level data directory (e.g. /data/wikimedia-ingestion)
  --mirror_url MIRROR_URL
                        base URL for Wikimedia dumps (e.g. https://dumps.wikimedia.org)
  --wiki WIKI           selects which language wikipedia to use (e.g. enwiki)
  --jobs JOBS           comma separated list of job names to download (e.g. pagecounts,pagetable)
  --loglevel LOGLEVEL   python logging level integer (e.g. 20)
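When driving the command from a script, the main detail to get right is that `--jobs` takes a single comma separated string. The sketch below assembles an argument list using only the options shown in the help text above. `build_download_command` is a hypothetical helper, not part of the package, and the code only builds and prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def build_download_command(
    wp_yyyymmdd: str,
    wd_yyyymmdd: str,
    jobs: Optional[List[str]] = None,
    data_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for kwnlp-download-jobs."""
    cmd = ["kwnlp-download-jobs"]
    if data_path is not None:
        cmd += ["--data_path", data_path]
    if jobs:
        # --jobs expects one comma separated string, not repeated flags.
        cmd += ["--jobs", ",".join(jobs)]
    # The two positional date strings come last, Wikipedia first.
    cmd += [wp_yyyymmdd, wd_yyyymmdd]
    return cmd


cmd = build_download_command(
    "20200920",
    "20200921",
    jobs=["pagetable", "articlesdump"],
    data_path="/data/wikimedia-ingestion",
)
# Print a shell-safe version; subprocess.run(cmd, check=True) would execute it.
print(" ".join(shlex.quote(part) for part in cmd))
```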
License
Licensed under the Apache 2.0 License. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Copyright 2020-present Kensho Technologies, LLC. The present date is determined by the timestamp of the most recent commit in the repository.