Mediawiki history dumps scraper, a module that scrapes the site of "Mediawiki history dumps" and returns to you the available content.
mediawiki-history-dumps-scraper
This is the pip module of "Mediawiki history dumps scraper"; refer to the main branch for the project's general purpose.
What does the module do?
This pip module lets you retrieve, optionally with filters, the content available in the Mediawiki history dumps through a scraper. You can check which versions, languages and datasets are available, along with the download links, the file sizes and more.
How was it made?
This module was written in Python 3.9 and uses requests and regexps to scrape the content from the download site. The package manager is poetry, chosen over plain pip because, among other things, it has a command to publish directly to PyPI. The code is also linted with pylint.
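The regexp-based scraping mentioned above can be illustrated with a minimal sketch. It is not taken from the module's source: the `BASE_URL` value is inferred from the URLs in the example output of this README, and the `VERSION_LINK` pattern, the `parse_versions` helper and the sample HTML are hypothetical, assuming the download site serves a plain HTML directory index.

```python
import re

# Assumed root URL, inferred from the URLs in this README's example output.
BASE_URL = "https://dumps.wikimedia.org/other/mediawiki_history"

# Hypothetical pattern: a link whose target is a year-month directory.
VERSION_LINK = re.compile(r'<a href="(\d{4}-\d{2})/">')


def parse_versions(html):
    """Extract version names and URLs from a directory listing page."""
    return [
        {"version": name, "url": f"{BASE_URL}/{name}"}
        for name in VERSION_LINK.findall(html)
    ]


# Made-up sample of a directory index, used instead of a real HTTP request.
SAMPLE_HTML = """
<a href="../">../</a>
<a href="2021-05/">2021-05/</a>
<a href="2021-06/">2021-06/</a>
<a href="readme.html">readme.html</a>
"""

versions = parse_versions(SAMPLE_HTML)
print(versions)
```

In a real run the HTML would come from something like `requests.get(BASE_URL).text` rather than from a hard-coded string.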
How to use it?
Installation
With pip:
pip install mhdscraper
With pipenv:
pipenv install mhdscraper
With poetry:
poetry add mhdscraper
Examples
An example (print any of the variables to see the response):
import mhdscraper
from datetime import date
# The root url of the datasets site
wiki_url = mhdscraper.WIKI_URL
# Returns a list of versions, returning the version name and its url
versions = mhdscraper.fetch_versions()
# Returns a list of versions, returning the version name, its url and
# including all the available wikies (name and url)
versions_with_langs = mhdscraper.fetch_versions(wikies=True)
# Returns a list containing all the wikies of the latest version,
# returning name and url
wikies = mhdscraper.fetch_wikies('latest')
# Returns a list containing the wikies ending with 'wiki' of the
# latest version, returning name and url
wikies_ending_with_wiki = mhdscraper.fetch_wikies('latest', wikitype='wiki')
# Returns a list containing the wikies starting with 'it' of the latest version,
# returning name, url and the list of available dumps
wikies_with_dumps = mhdscraper.fetch_wikies('latest', lang='it', dumps=True)
# Returns a list containing all the dumps of 'itwiki' of the latest version,
# returning many pieces of information such as filename, start and end date
# of the content, size in bytes, url to download it...
dumps = mhdscraper.fetch_dumps('latest', 'itwiki')
# Returns a list containing all the dumps of 'itwiki' of the latest version,
# whose content is between 2004-01-01 and 2005-02-01
dumps_selected = mhdscraper.fetch_dumps('latest', 'itwiki', start=date(2004, 1, 1), end=date(2005, 2, 1))
The result of:
import mhdscraper
from datetime import date
result = mhdscraper.fetch_wikies('latest', lang='it', wikitype='wiki', dumps=True, start=date(2010, 1, 1), end=date(2012, 12, 31))
Would be (as of July 2021):
[
  {
    "dumps": [
      {
        "bytes": "691419132",
        "filename": "2021-06.itwiki.2010.tsv.bz2",
        "from": "2010-01-01",
        "lastUpdate": "2021-07-03T10:38:00",
        "time": "2010",
        "to": "2010-12-31",
        "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2010.tsv.bz2"
      },
      {
        "bytes": "706208269",
        "filename": "2021-06.itwiki.2011.tsv.bz2",
        "from": "2011-01-01",
        "lastUpdate": "2021-07-03T10:57:00",
        "time": "2011",
        "to": "2011-12-31",
        "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2011.tsv.bz2"
      },
      {
        "bytes": "747376403",
        "filename": "2021-06.itwiki.2012.tsv.bz2",
        "from": "2012-01-01",
        "lastUpdate": "2021-07-03T10:11:00",
        "time": "2012",
        "to": "2012-12-31",
        "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2012.tsv.bz2"
      }
    ],
    "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki",
    "wiki": "itwiki"
  }
]
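As a small follow-up, a result of this shape can be post-processed with plain Python, for instance to estimate the download size before fetching anything. The sketch below works on a trimmed copy of the result shown above; the `total_bytes` helper is hypothetical, and it converts sizes with `int()` because the example output shows them as strings.

```python
# Trimmed copy of the result shown above (only the fields used here).
result = [
    {
        "wiki": "itwiki",
        "dumps": [
            {"filename": "2021-06.itwiki.2010.tsv.bz2", "bytes": "691419132"},
            {"filename": "2021-06.itwiki.2011.tsv.bz2", "bytes": "706208269"},
            {"filename": "2021-06.itwiki.2012.tsv.bz2", "bytes": "747376403"},
        ],
    }
]


def total_bytes(wikies):
    """Sum the size of every dump of every wiki, accepting str or int sizes."""
    return sum(int(dump["bytes"]) for wiki in wikies for dump in wiki["dumps"])


size = total_bytes(result)
print(f"{size / 1024 ** 3:.2f} GiB to download")
```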
API
WIKI_URL
A constant containing the url of the root of the datasets site.
fetch_latest_version(*, wikies, lang, wikitype, dumps, start, end)
Fetches the latest version of the mediawiki history dumps.
The version is the year-month of the release of the dumps.
Keyword parameters:
- wikies (bool, default=False): Whether to fetch the wikies of the returned version.
- lang (str, default=None): If the wikies argument is True, the language of the wikies to return (a wiki name starts with the language).
- wikitype (str, default=None): If the wikies argument is True, the wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps (bool, default=False): Whether to fetch the dumps of each returned wiki.
- start (date, default=None): If the wikies and dumps arguments are True, retrieve only the dumps newer than this date.
- end (date, default=None): If the wikies and dumps arguments are True, retrieve only the dumps older than this date.
Returns a dict with:
- version (str): the version year-month
- url (str): the url of that version
- wikies: the fetched wikies, if the argument was set to True
If no version is found, None is returned.
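Because a version is a year-month string, picking the latest one from a list of versions can be done with a plain string comparison. The sketch below shows one way this could work; the `pick_latest` helper and the sample list are hypothetical, and this is not necessarily how the module resolves the latest version internally.

```python
def pick_latest(versions):
    """Return the dict with the most recent year-month version, or None."""
    if not versions:
        return None
    # Zero-padded 'YYYY-MM' strings sort chronologically as plain strings.
    return max(versions, key=lambda item: item["version"])


# Made-up sample in the shape described above (version and url).
sample = [
    {"version": "2021-04", "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-04"},
    {"version": "2021-06", "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06"},
    {"version": "2021-05", "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-05"},
]

latest = pick_latest(sample)
print(latest)
```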
fetch_versions(*, wikies, lang, wikitype, dumps, start, end)
Fetches the versions of the mediawiki history dumps.
The versions are the year-month of the release of the dumps.
Keyword parameters:
- wikies (bool, default=False): Whether to fetch the wikies of each returned version.
- lang (str, default=None): If the wikies argument is True, the language of the wikies to return (a wiki name starts with the language).
- wikitype (str, default=None): If the wikies argument is True, the wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps (bool, default=False): Whether to fetch the dumps of each returned wiki.
- start (date, default=None): If the wikies and dumps arguments are True, retrieve only the dumps newer than this date.
- end (date, default=None): If the wikies and dumps arguments are True, retrieve only the dumps older than this date.
Returns a list of dicts with:
- version (str): the version year-month
- url (str): the url of that version
- wikies: the fetched wikies, if the argument was set to True (see fetch_wikies for the result)
fetch_wikies(version, /, *, lang, wikitype, dumps, start, end)
Fetch the wikies of a version of the mediawiki history dumps
Parameters:
- version (str): The version whose wikies will be returned. If "latest" is passed, the latest version is retrieved.
Keyword parameters:
- lang (str, default=None): The language of the wikies to return (a wiki name starts with the language).
- wikitype (str, default=None): The wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps (bool, default=False): Whether to fetch the dumps of each returned wiki.
- start (date, default=None): If the dumps argument is True, retrieve only the dumps newer than this date
- end (date, default=None): If the dumps argument is True, retrieve only the dumps older than this date
Returns a list of dicts with:
- wiki (str): the wiki name
- url (str): the url of that wiki
- dumps (list): only if the dumps argument is True, the fetched dumps (see fetch_dumps for the result)
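The lang and wikitype filters described above amount to a prefix and a suffix match on the wiki name. A minimal sketch of that rule follows; the `filter_wikies` helper and the list of names are hypothetical and only illustrate the matching, not the module's actual code.

```python
def filter_wikies(names, lang=None, wikitype=None):
    """Keep the wiki names that start with `lang` and end with `wikitype`."""
    return [
        name
        for name in names
        if (lang is None or name.startswith(lang))
        and (wikitype is None or name.endswith(wikitype))
    ]


# Made-up sample of wiki names.
names = ["itwiki", "itwikibooks", "itwiktionary", "enwiki", "iswiki"]

italian = filter_wikies(names, lang="it")
plain_wikis = filter_wikies(names, wikitype="wiki")
italian_wiki = filter_wikies(names, lang="it", wikitype="wiki")
print(italian, plain_wikis, italian_wiki)
```

Note that a prefix match is loose: under this rule a short lang value matches every wiki whose name begins with those letters.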
fetch_dumps(version, wiki, /, *, start, end)
Fetch the dumps of a wiki of the mediawiki history dumps
Parameters:
- version (str): The version of the wiki
- wiki (str): The wiki whose dumps will be returned
Keyword parameters:
- start (date, default=None): Retrieve only the dumps newer than this date
- end (date, default=None): Retrieve only the dumps older than this date
Returns a list of dicts with:
- filename (str): the dump file name
- time (str): the time span of the data ('all-time', year or year-month)
- lastUpdate (datetime): the last update date
- bytes (int): the size in bytes of the file
- from (date): the start date of the data
- to (date): the end date of the data
- url (str): the url of the file
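The relation between the time field and the from/to dates, and the effect of the start and end filters, can be sketched as follows. Both helpers are hypothetical. The overlap rule in `in_range` is an assumption chosen to match the example in this README (a 2010-01-01 to 2012-12-31 range returning the 2010, 2011 and 2012 dumps), and treating 'all-time' as always matching is also an assumption.

```python
import calendar
from datetime import date


def time_to_range(time):
    """Convert 'YYYY' or 'YYYY-MM' to (first_day, last_day).

    'all-time' has no bounds and gives (None, None).
    """
    if time == "all-time":
        return None, None
    parts = [int(part) for part in time.split("-")]
    year = parts[0]
    if len(parts) == 1:
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def in_range(time, start=None, end=None):
    """Assumed rule: keep a dump whose period overlaps [start, end]."""
    first, last = time_to_range(time)
    if first is None:
        return True  # assumption: 'all-time' always overlaps
    if start is not None and last < start:
        return False
    if end is not None and first > end:
        return False
    return True


times = ["2009", "2010", "2011", "2012", "2013"]
kept = [t for t in times if in_range(t, date(2010, 1, 1), date(2012, 12, 31))]
print(kept)
```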