A package for requesting data from Wikipedia using the REST API.

These details have not been verified by PyPI

Project links

Reason this release was yanked:

This version is missing requirements.txt file and thus crashes on import

Project description

What is this?

Wikipls is a Python package meant to easily scrape data out of Wikipedia, using its REST API. This package is still in early development, but it has basic functionality all set.

Why does it exist?

The REST API for wikimedia, isn't the most intuitive and requires some learning. When writing code, it also requires setting up a few functions to make it more manageable and readable. So essentially I made these functions and packaged them so that you (and I) won't have to rewrite them every time. While I'm at it I made them more intuitive and easy to use without needing to figure out how this API even works.

Installation

To install use:
pip install py-wikipls

Then in your code add:
import wikipls

How to use

I haven't made any documentation page yet, so for now the below will have to do.
If anything is unclear don't hesitate to open an issue in Issues or bring it up in Discussions.
Updated for version: 0.0.1a7

Key

Many functions in this package require the name of the Wiki page you want to check in a URL-friendly format. The REST documentation refers to that as a the "key" of an article. For example:

The key of the article titled "Water" is: "Water"
The key of the article titled "Faded (Alan Walker song)" is: "Faded_(Alan_Walker_song)"
The key of the Article titled "Georgia (U.S. state)" is: "Georgia_(U.S._state)"

That key is what you enter in the name parameter of functions. The key is case-sensitive.

To get the key of an article you can:

Take a look at the url of the article.
The URL for "Faded" for example is "https://en.wikipedia.org/wiki/Faded_(Alan_Walker_song)". Notice it ends with "wiki/" followed by the key of the article.
Take the title of the article and replace all spaces with "_", it'll probably work just fine.
In the future there will be a function to get the key of a title.

GET functions

These functions can be used without needing to create an object.

Many of them require the key of an article as a string, and an optional a date (datetime.date) object to get results for a specific date.
Otherwise they require a revision ID (RevisionId) number.
Many functions are compatible with both options.

`get_summary(key: str, fmt: str = "text") -> str`

Returns a summary of the page.
fmt can be either "text" or "html"

>>> get_summary("Faded_(Alan_Walker_song)")[:120]
'"Faded" is a song by Norwegian record producer and DJ Alan Walker with vocals provided by Norwegian singer Iselin Solhei'

This examples returns the first 120 letters of the summary of the Faded page

`get_raw_text(key: str, id: RevisionId) -> str`

Returns the raw text of a page.
This function temporarily doesn't support date arguments. If you want the raw text of an old page, please use a revision ID

`get_html(key: str, old_id: RevisionId = None) -> str`

Returns the HTML of the page as a string. This can be later parsed using tools like BeautifulSoup.

If an old_id value is given then the HTML will be of an older version of the page, taken from the revision that has this ID.

>>> get_html("Faded_(Alan_Walker_song)")[:40]
'<!DOCTYPE html>\n<html prefix="dc: http:/'

This example returns the beginning of the html of the "Faded" page.

`get_key(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

Returns the key (URL-friendly name) of an old page.

`get_date(id: RevisionId, lang: str = consts.LANG) -> datetime.date`

Returns the date of a revision, given its ID.

`get_lint(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict`

See https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_lint__title___revision_.

`get_mobile_html(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

See https://en.wikipedia.org/api/rest_v1/#/Page%20content/getContentWithRevision-mobile-html.

`get_mobile_html_offline_resources(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

See https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_mobile_html_offline_resources__title___revision_.

`get_mobile_sections(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

See https://en.wikipedia.org/api/rest_v1/#/Mobile/getSectionsWithRevision.

`get_mobile_sections_lead(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

See https://en.wikipedia.org/api/rest_v1/#/Mobile/getSectionsLeadWithRevision.

`get_mobile_sections_remaining(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

See https://en.wikipedia.org/api/rest_v1/#/Mobile/getSectionsRemainingWithRevision.

`get_discussion(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict`

Returns info about the discussion tab of a page.

`get_related(key: str, lang: str = consts.LANG) -> tuple[dict]`

Returns articles that are related to the one given.

`get_views(name: str, date: str | datetime.date, lang: str = LANG) -> int`

Returns the number of times people visited an article on a given date.

The date given can be either a datetime.date object or a string formatted yyyymmdd (So March 31th 2024 will be "20240331").

>>> get_views("Faded_(Alan_Walker_song)", "20240331")
1144

The Faded page on Wikipedia was visited 1,144 on March 31st 2024.

`get_pdf(key: str) -> bytes`

Returns the PDF version of the page in byte-form.

>>> with open("faded_wiki.pdf", 'wb') as f:
f.write(get_pdf("Faded_(Alan_Walker_song)"))

This example imports the Faded page in PDF form as a new file named "faded_wiki.pdf".

`get_media_details(key: str, old_id: RevisionId = None) -> tuple[dict, ...]`

Returns all media present in the article, each media file represented as a JSON.

>>> get_media_details("Faded_(Alan_Walker_song)")[0]
{'title': 'File:Alan_Walker_-_Faded.png', 'leadImage': False, 'section_id': 0, 'type': 'image', 'showInGallery': True, 'srcset': [{'src': '//upload.wikimedia.org/wikipedia/en/thumb/d/da/Alan_Walker_-_Faded.png/220px-Alan_Walker_-_Faded.png', 'scale': '1x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '1.5x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '2x'}]}

This example returns the first media file in the Faded article, which is a PNG image.

`get_image(details: dict[str, ...]) -> bytes`

Retrives the actual byte-code of an image on a an article, using a JSON representing the image. You can get that JSON using get_media_details().

>>> get_image({'title': 'File:Alan_Walker_-_Faded.png', 'leadImage': False, 'section_id': 0, 'type': 'image', 'showInGallery': True, 'srcset': [{'src': '//upload.wikimedia.org/wikipedia/en/thumb/d/da/Alan_Walker_-_Faded.png/220px-Alan_Walker_-_Faded.png', 'scale': '1x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '1.5x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '2x'}]})
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01,\x00\x00\x01,\x08\x03\x00\x00\x00N\xa3~G\x00\x00\x03\x00PLTE\xff\xff\xff\x01\x01\x01\xfe\xfd\xfe'

This examples returns the first bytes of the image we got in the get_media_details() example.

`get_all_images(key: str, details: Iterable[dict[str, ...]], strict: bool = False) -> tuple[bytes]`

Returns all images of an article or a provided list of image-JSONs, in bytes form. You can only enter either a key or details value, no need for both.

Data functions

Data functions return a dictionary containing many details about the object. they can be used without needing to create an object.

These functions require either a key with an optional date or an ID.

`get_article_data(identifier: str | ArticleId, lang: str = consts.LANG) -> dict[str, ...]`

Returns details about an article in JSON form.
Identifier can be either the article's name or its ID.

`get_page_data(*args, lang: str = consts.LANG) -> dict[str, ...]`

Returns details and the content of a page (i.e specific version of an Article).

Arguments:

get_page_data(id: RevisionId): Get page by using ID of revision.
get_page_data(key: str): Get page by using its name. Since no date argument is given wikipls will return the most up-to-date page.
get_page_data(key: str, date: str | datetime.date): Get page by using its name and a date.

`get_revision_data(*args, lang: str = consts.LANG) -> dict[str, ...]`

Returns details about the latest revision to the page in JSON form.
If date is provided, returns the latest revision details as of that date.

Arguments:

get_revision_data(id: RevisionId): Get revision using its ID.
get_revision_data(key: str): Get revision by using its name. Since no date argument is given wikipls will return the latest revision.
get_revision_data(key: str, date: str | datetime.date): Get revision by using its name and a date.

Utility functions

Functions that make life easier

`to_timestamp(date: datetime.date | str) -> str`

Converts a datetime.date object or a string in format yyyy-mm-ddThh:mm:ssZ to a URL-friendly string format (yyyymmdd)

>>> date = datetime.date(2024, 3, 31)
>>> to_timestamp(date)
20240331

This example converts the date of March 31th 2024 to URL-friendly string form.

`from_timestamp(timestamp: str, out_fmt: type[datetime.date] | type[datetime.datetime] = datetime.date) -> datetime.date | datetime.datetime`

Converts a timestamp to a datetime.date or datetime.datetime object.
The timestamp is a string which is written in one of the following formats:

yyyymmdd
yyyy-mm-ddThh:mm:ssZ

`id_of_page(*args, lang: str = consts.LANG) -> RevisionId`

Returns an id of a page, given a key.
Date argument is optional: If date is provided, returns the ID of latest revision as of that date.

Arguments:

id_of_page(key: str, lang: str = consts.LANG): Get id using the page key. Since no date argument is given wikipls will return the ID of latest page.
id_of_page(key: str, date: str | datetime.date, lang: str = consts.LANG): Get id using the page key.

`key_of_page(id: ArticleId | RevisionId, lang=consts.LANG) -> str`

Returns the key of an article given its (or one of its revisions's) ID.

`response_for(url: str, params: dict = None) -> str`

For internal use mainly.

`json_response(url: str, params: dict = None) -> dict`

For interal use mainly

ID classes

It's very easy to confuse revision IDs for article IDs and vice versa.
To combat that IDs are only accepted as either of these:

ArticleId(int)
RevisionId(int)

These function just like a regular int. For example: To make a revision ID 314141, we do RevisionId(314141).

Wiki objects

If you intend on repeatedly getting info about some page, it is preferred that you make an object for that page.
This is for reasons of performance as well as readability and organization.

`wikipls.Article(key: str)`

An "Article" is a wikipedia article in all of its versions, revisions and languages.

Properties

.key (str): Article key (URL-friendly name).
.id (ArticleId): Article ID. Doesn't change across revisions.
.content_model (str): Type of wiki project this article is a part of (e.g. "wikitext", "wikionary").
.license (dict): Details about the copyright license of the article.
.html_url (str): URL to an html version of the current revision of the article.
.latest (dict): Details about the latest revision done to the article.
.details (dict[str, Any]): All the above properties in JSON form.\

.get_page(date: datetime.date, lang: str = "en") (wikipls.Page): Get a Page object of this article, from a specified date and in a specified translation.

Example properties

-- TODO

`wikipls.Page(*args, lang=consts.LANG, from_article: Article = None)`

A "Page" is a version of an article in a specific date and a specific language.

from_article is for pages that were created using wikipls.Article().get_page(). Don't worry about it:)

Arguments:

(key: str, date: datetime.date | str)
(page_id: RevisionId)

Properties

.from_article (str): Article object of origin, if there is one.
.title (str): Page title.
.key (str): The key of the page (URL-friendly name).).
.raw_text (str): Raw text of this page.
.article_id (ArticleId): ID of the article this page is derived from.
.revision_id (RevisionId): ID of the current revision of the article.
.date (datetime.date): The date of the page.
.lang (str): The language of the page as an ISO 639 code (e.g. "en" for English).
.content_model (str): Type of wiki project this page is a part of (e.g. "wikitext", "wikionary").
.license (dict): Details about the copyright license of the page.
.views (int): Number of vists this page has received during its specified date.
.html (str): Page HTML.
.summary (dict): Summary of the page. .summary["html"] is the HTML version and the .summary["text"] is the plain-text version.
.media (tuple[dict, ...]): All media files in the page represented as JSONs.
.as_pdf (bytes): The PDF version of the page in bytes-code.
.data (dict[str, Any]): General details about the page in JSON format.
.lint (?)
.mobile_html (str)
.mobile_html_offline_resources (str)
.mobile_sections (str)
.mobile_sections_lead (str)
.mobile_sections_remaining (str)
.discussion (dict): Info about the behind-the-scenes discussion of this page.
.related (tuple[dict]): Articles that are related to this one.
.article_details (dict): Details related to the article the page is derived from.
.page_details (dict): Details related to the page.\

.get_revision(): Get the relevant revision of this page.

Example properties

-- TODO

`Revision(*args, lang=consts.LANG)`

A revision is a change made to an article by the Wikipedia editors.
So if you scrumble an entire paragraph of the Moon article then submit it to the wiki, then that scrumble is a Revision. The scrumbled result is a Page, and so is the article before you scrumbled it.
All of this was presumebly done in the Moon Article.

Arguments:

Revision(key: str, date: datetime.date)
Revision(page_id: RevisionId)

Properties

.id (RevisionId): Revision ID.
.lang (str): Page language code.
.datetime (datetime.datetime): Time this revision was published on Wikipedia.
.content_model (str): Type of wiki project this page is a part of (e.g. "wikitext", "wikionary").
.license (dict): Details about the copyright license of the page.
.html_url (str): URL to an html version of the current revision of the article.
.comment (str): Comment of revision.
.user (dict): Data about the user who published the revision.
.size (int):
.is_minor (bool):
.delta (int):
.details (dict): Details related to the current revision of the page.\

.article_title (str): Article title.
.article_key (str): Article key (URL-friendly name).
.article_id (ArticleId): Article ID. Doesn't change across revisions.\

What does the name mean?

Wiki = Wikipedia
Pls = Please, because you make requests

Versions

This version of the package is written in Python. I plan to eventually make a copy of this one written in Rust (using PyO3 and maturin). Why Rust? It's an exercise for me, and it will be way faster and less error-prone

Plans

Support for more languages (Currently supports only English Wikipedia)
Dictionary
Citations

Bug reports

This package is in early development and I'm looking for community feedback on bugs.
If you encounter a problem, please report it in Issues.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.0.1a8 pre-release

Dec 3, 2024

This version

0.0.1a7 pre-release yanked

Nov 25, 2024

Reason this release was yanked:

This version is missing requirements.txt file and thus crashes on import

0.0.1a6 pre-release

Nov 19, 2024

0.0.1a5 pre-release

Nov 19, 2024

0.0.1a4 pre-release

Nov 18, 2024

0.0.1a3 pre-release

Nov 16, 2024

0.0.1a2 pre-release

Nov 16, 2024

0.0.1a1 pre-release

Nov 14, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py_wikipls-0.0.1a7.tar.gz (39.8 kB view details)

Uploaded Nov 25, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

py_wikipls-0.0.1a7-py3-none-any.whl (34.3 kB view details)

Uploaded Nov 25, 2024 Python 3

File details

Details for the file py_wikipls-0.0.1a7.tar.gz.

File metadata

Download URL: py_wikipls-0.0.1a7.tar.gz
Upload date: Nov 25, 2024
Size: 39.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for py_wikipls-0.0.1a7.tar.gz
Algorithm	Hash digest
SHA256	`ffd97466a6c6e7efb720cb5371cac341f3dad8720219f826315cad2a9047520e`
MD5	`458fe476de60099519e5178838f0fd62`
BLAKE2b-256	`c37d1faf346851cb624da912979f0040a8867cee8bdc0ca6a407690a01025496`

See more details on using hashes here.

File details

Details for the file py_wikipls-0.0.1a7-py3-none-any.whl.

File metadata

Download URL: py_wikipls-0.0.1a7-py3-none-any.whl
Upload date: Nov 25, 2024
Size: 34.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for py_wikipls-0.0.1a7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dfa3f57f0debdda339f8341afab80d11381f7a97c8a9dc48976db448035c562a`
MD5	`579f58dd116d5e4b21962f68e5184da5`
BLAKE2b-256	`09d8b0ed1cf3ffe485f5eee9675fe7b2ac34d86435a26122fb298c48b06fa10d`

See more details on using hashes here.

py-wikipls 0.0.1a7

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

What is this?

Why does it exist?

Installation

How to use

Key

GET functions

get_summary(key: str, fmt: str = "text") -> str

get_raw_text(key: str, id: RevisionId) -> str

get_html(key: str, old_id: RevisionId = None) -> str

get_key(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_date(id: RevisionId, lang: str = consts.LANG) -> datetime.date

get_lint(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict

get_mobile_html(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_mobile_html_offline_resources(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_mobile_sections(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_mobile_sections_lead(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_mobile_sections_remaining(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str

get_discussion(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict

get_related(key: str, lang: str = consts.LANG) -> tuple[dict]

get_views(name: str, date: str | datetime.date, lang: str = LANG) -> int

get_pdf(key: str) -> bytes

get_media_details(key: str, old_id: RevisionId = None) -> tuple[dict, ...]

get_image(details: dict[str, ...]) -> bytes

get_all_images(key: str, details: Iterable[dict[str, ...]], strict: bool = False) -> tuple[bytes]

Data functions

get_article_data(identifier: str | ArticleId, lang: str = consts.LANG) -> dict[str, ...]

get_page_data(*args, lang: str = consts.LANG) -> dict[str, ...]

get_revision_data(*args, lang: str = consts.LANG) -> dict[str, ...]

Utility functions

to_timestamp(date: datetime.date | str) -> str

from_timestamp(timestamp: str, out_fmt: type[datetime.date] | type[datetime.datetime] = datetime.date) -> datetime.date | datetime.datetime

id_of_page(*args, lang: str = consts.LANG) -> RevisionId

key_of_page(id: ArticleId | RevisionId, lang=consts.LANG) -> str

response_for(url: str, params: dict = None) -> str

json_response(url: str, params: dict = None) -> dict

ID classes

Wiki objects

wikipls.Article(key: str)

Properties

Example properties

wikipls.Page(*args, lang=consts.LANG, from_article: Article = None)

Properties

Example properties

Revision(*args, lang=consts.LANG)

Properties

What does the name mean?

Versions

Plans

Bug reports

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`get_summary(key: str, fmt: str = "text") -> str`

`get_raw_text(key: str, id: RevisionId) -> str`

`get_html(key: str, old_id: RevisionId = None) -> str`

`get_key(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_date(id: RevisionId, lang: str = consts.LANG) -> datetime.date`

`get_lint(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict`

`get_mobile_html(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_mobile_html_offline_resources(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_mobile_sections(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_mobile_sections_lead(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_mobile_sections_remaining(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> str`

`get_discussion(key: str, old_id: RevisionId = None, lang: str = consts.LANG) -> dict`

`get_related(key: str, lang: str = consts.LANG) -> tuple[dict]`

`get_views(name: str, date: str | datetime.date, lang: str = LANG) -> int`

`get_pdf(key: str) -> bytes`

`get_media_details(key: str, old_id: RevisionId = None) -> tuple[dict, ...]`

`get_image(details: dict[str, ...]) -> bytes`

`get_all_images(key: str, details: Iterable[dict[str, ...]], strict: bool = False) -> tuple[bytes]`

`get_article_data(identifier: str | ArticleId, lang: str = consts.LANG) -> dict[str, ...]`

`get_page_data(*args, lang: str = consts.LANG) -> dict[str, ...]`

`get_revision_data(*args, lang: str = consts.LANG) -> dict[str, ...]`

`to_timestamp(date: datetime.date | str) -> str`

`from_timestamp(timestamp: str, out_fmt: type[datetime.date] | type[datetime.datetime] = datetime.date) -> datetime.date | datetime.datetime`

`id_of_page(*args, lang: str = consts.LANG) -> RevisionId`

`key_of_page(id: ArticleId | RevisionId, lang=consts.LANG) -> str`

`response_for(url: str, params: dict = None) -> str`

`json_response(url: str, params: dict = None) -> dict`

`wikipls.Article(key: str)`

`wikipls.Page(*args, lang=consts.LANG, from_article: Article = None)`

`Revision(*args, lang=consts.LANG)`