
A package for requesting data from Wikipedia using the REST API.

Project description

What is this?

Wikipls is a Python package for easily scraping data from Wikipedia (and perhaps other Wikimedia projects in the future) using its REST API. The package is still in early development, but its basic functionality is all in place.

Why does it exist?

The REST API for Wikimedia isn't the most intuitive and requires some learning. When writing code, it also requires setting up a few helper functions to make it more manageable and readable. So essentially I wrote those functions and packaged them so that you (and I) won't have to rewrite them every time. While I was at it, I made them more intuitive and easier to use, so you don't need to figure out how the API even works.

Installation

To install, use:
pip install py-wikipls

Then in your code add:
import wikipls

How to use

I haven't made any documentation page yet, so for now the below will have to do.
If anything is unclear, don't hesitate to open an issue in Issues.
Updated for version: 0.0.1a6

Key

Many functions in this package require the name of the wiki page you want to check, in a URL-friendly format. The REST documentation refers to that as the "key" of an article. For example:

  • The key of the article titled "Water" is: "Water"
  • The key of the article titled "Faded (Alan Walker song)" is: "Faded_(Alan_Walker_song)"
  • The key of the article titled "Georgia (U.S. state)" is: "Georgia_(U.S._state)"

That key is what you enter in the name parameter of functions. The key is case-sensitive.

To get the key of an article you can:

  1. Take a look at the URL of the article.
    The URL for "Faded", for example, is "https://en.wikipedia.org/wiki/Faded_(Alan_Walker_song)". Notice it ends with "wiki/" followed by the key of the article.
  2. Take the title of the article and replace all spaces with "_"; this will usually work just fine (see the example after this list).
  3. In the future there will be a function to get the key of a title.
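
For example, following option 2 above, a quick (approximate) way to build a key from a title in plain Python:

>>> "Faded (Alan Walker song)".replace(" ", "_")
'Faded_(Alan_Walker_song)'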

Direct Functions

These functions can be used without needing to create an object. In general they all require the URL-friendly name of an article as a string.

get_views(name: str, date: str | datetime.date, lang: str = LANG) -> int

Returns the number of times people visited an article on a given date.

The date can be either a datetime.date object or a string formatted yyyymmdd (so March 31st, 2024 is "20240331").

>>> get_views("Faded_(Alan_Walker_song)", "20240331")
1144

The Faded page on Wikipedia was visited 1,144 times on March 31st, 2024.

get_html(name: str) -> str

Returns the HTML of the page as a string, which can later be parsed with tools like BeautifulSoup.

>>> get_html("Faded_(Alan_Walker_song)")[:40]
'<!DOCTYPE html>\n<html prefix="dc: http:/'

This example returns the beginning of the HTML of the "Faded" page.

get_summary(name: str) -> str

Returns a summary of the page.

>>> get_summary("Faded_(Alan_Walker_song)")[:120]
'"Faded" is a song by Norwegian record producer and DJ Alan Walker with vocals provided by Norwegian singer Iselin Solhei'

This example returns the first 120 characters of the summary of the Faded page.

get_media_details(name: str) -> tuple[dict, ...]

Returns all media present in the article, each media file represented as a JSON.

>>> get_media_details("Faded_(Alan_Walker_song)")[0]
{'title': 'File:Alan_Walker_-_Faded.png', 'leadImage': False, 'section_id': 0, 'type': 'image', 'showInGallery': True, 'srcset': [{'src': '//upload.wikimedia.org/wikipedia/en/thumb/d/da/Alan_Walker_-_Faded.png/220px-Alan_Walker_-_Faded.png', 'scale': '1x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '1.5x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '2x'}]}

This example returns the first media file in the Faded article, which is a PNG image.

get_image(details: dict[str, ...]) -> bytes

Retrieves the actual bytes of an image in an article, using a JSON representing the image. You can get that JSON using get_media_details().

>>> get_image({'title': 'File:Alan_Walker_-_Faded.png', 'leadImage': False, 'section_id': 0, 'type': 'image', 'showInGallery': True, 'srcset': [{'src': '//upload.wikimedia.org/wikipedia/en/thumb/d/da/Alan_Walker_-_Faded.png/220px-Alan_Walker_-_Faded.png', 'scale': '1x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '1.5x'}, {'src': '//upload.wikimedia.org/wikipedia/en/d/da/Alan_Walker_-_Faded.png', 'scale': '2x'}]})
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01,\x00\x00\x01,\x08\x03\x00\x00\x00N\xa3~G\x00\x00\x03\x00PLTE\xff\xff\xff\x01\x01\x01\xfe\xfd\xfe'

This example returns the first bytes of the image we got in the get_media_details() example.

get_all_images(input: str | Iterable[dict[str, ...]], strict: bool = True) -> tuple[bytes]

Returns all images of an article or a provided list of image-JSONs, in bytes form.
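
A minimal usage sketch that saves every image of the Faded article to disk (the file names are arbitrary, and the number of images depends on the article):

>>> images = get_all_images("Faded_(Alan_Walker_song)")
>>> for i, img in enumerate(images):
...     with open(f"faded_media_{i}", 'wb') as f:
...         f.write(img)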

get_pdf(name: str) -> bytes

Returns the PDF version of the page in byte-form.

>>> with open("faded_wiki.pdf", 'wb') as f:
...     f.write(get_pdf("Faded_(Alan_Walker_song)"))

This example saves the Faded page in PDF form to a new file named "faded_wiki.pdf".

get_page_data(name: str, date: str | datetime.date) -> dict

Returns details about the latest revision to the page in JSON form.
If date is provided, returns the latest revision details as of that date.
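
A minimal usage sketch (the exact keys of the returned dict follow the REST API's revision response and aren't reproduced here; both date forms accepted by the signature are shown):

>>> import datetime
>>> data = get_page_data("Faded_(Alan_Walker_song)", "20240331")
>>> data_by_date = get_page_data("Faded_(Alan_Walker_song)", datetime.date(2024, 3, 31))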

get_article_data(identifier: str | int, lang: str = LANG) -> dict[str, ...]

Returns details about an article in JSON form.
Identifier can be either the article's name or its ID.
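
A minimal sketch passing either form of identifier (the numeric ID below is a placeholder, not the article's real ID):

>>> by_key = get_article_data("Faded_(Alan_Walker_song)")   # by key
>>> by_id = get_article_data(12345, lang="en")              # by article ID (placeholder value)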

to_timestamp(date: datetime.date) -> str

Converts a datetime.date object or a string in the format yyyy-mm-ddThh:mm:ssZ to a URL-friendly string format (yyyymmdd).

>>> date = datetime.date(2024, 3, 31)
>>> to_timestamp(date)
'20240331'

This example converts the date of March 31st, 2024 to URL-friendly string form.

from_timestamp(timestamp: str) -> datetime.date

Converts a timestamp to a datetime.date object.
The timestamp is a string which is written in one of the following formats:

  • yyyymmdd
  • yyyy-mm-ddThh:mm:ssZ
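
Assuming the conversion behaves as documented, both formats map to the same date object (outputs shown as Python reprs):

>>> from_timestamp("20240331")
datetime.date(2024, 3, 31)
>>> from_timestamp("2024-03-31T00:00:00Z")
datetime.date(2024, 3, 31)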

id_of_page(name: str, date: str | datetime.date, lang: str = LANG) -> int

Returns the ID of a page, given its name.
The date argument is optional: if provided, returns the ID of the latest revision as of that date.

name_of_page(id: int, lang=LANG) -> str

Returns the title (not key!) of an article given its ID.
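
A round-trip sketch combining the two ID helpers (assuming the date argument of id_of_page can be omitted as described; the intermediate ID is whatever the live wiki returns):

>>> page_id = id_of_page("Faded_(Alan_Walker_song)")
>>> name_of_page(page_id)
'Faded (Alan Walker song)'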

Class objects

If you intend to repeatedly get info about some page, it's preferable to create an object for that page.
This helps performance as well as readability and organization.

wikipls.Article(name: str)

An "Article" is a wikipedia article in all of its versions, revisions and languages.

Properties

.name (str): Article title.
.key (str): Article key (URL-friendly name).
.id (int): Article ID. Doesn't change across revisions.
.content_model (str): Type of wiki project this article is a part of (e.g. "wikitext", "wiktionary").
.license (dict): Details about the copyright license of the article.
.latest (dict): Details about the latest revision done to the article.
.html_url (str): URL to an html version of the current revision of the article.
.details (dict[str, Any]): All the above properties in JSON form.

.get_page(date: datetime.date, lang: str = "en") (wikipls.Page): Get a Page object of this article, from a specified date and in a specified translation.
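
A minimal sketch of creating an Article and reading a few of its properties (property values aren't reproduced here; the datetime import is only needed for .get_page()):

>>> import datetime
>>> import wikipls
>>> faded = wikipls.Article("Faded_(Alan_Walker_song)")
>>> key, article_id = faded.key, faded.id                           # basic metadata
>>> page = faded.get_page(datetime.date(2024, 3, 31), lang="en")    # dated Page of this article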

Example properties

-- TODO

wikipls.Page(article: Article, date: datetime.date)

A "Page" is a version of an article in a specific date and a specific language, a.k.a a "revision".

Properties

.name (str): Page title.
.key (str): The key of the page (URL-friendly name).
.article_id (int): ID of the article this page is derived from.
.revision_id (int): ID of the current revision of the article.
.date (datetime.date): The date of the page.
.lang (str): The language of the page as an ISO 639 code (e.g. "en" for English).
.content_model (str): Type of wiki project this page is a part of (e.g. "wikitext", "wiktionary").
.license (dict): Details about the copyright license of the page.
.views (int): Number of visits this page has received during its specified date.
.html (str): Page HTML.
.summary (str): Summary of the page.
.media (tuple[dict, ...]): All media files in the page represented as JSONs.
.as_pdf (bytes): The PDF version of the page in byte form.
.data (dict[str, Any]): General details about the page in JSON format.
.article_details (dict): Details related to the article the page is derived from.
.page_details (dict): Details related to the current revision of the page.
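
A minimal sketch of working with a Page (values depend on the live wiki; the date mirrors the earlier view-count example):

>>> import datetime
>>> import wikipls
>>> article = wikipls.Article("Faded_(Alan_Walker_song)")
>>> page = wikipls.Page(article, datetime.date(2024, 3, 31))
>>> views = page.views            # view count on that date
>>> intro = page.summary[:120]    # first 120 characters of the summary
>>> with open("faded_wiki.pdf", 'wb') as f:
...     f.write(page.as_pdf)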

Example properties

-- TODO

What does the name mean?

Wiki = Wikipedia
Pls = Please, because you make requests

Versions

This version of the package is written in Python. I plan to eventually make a copy of this one written in Rust (using PyO3 and maturin). Why Rust? It's an exercise for me, and it will be way faster and less error-prone.

Plans

  • Support for more languages (Currently supports only English Wikipedia)
  • Dictionary
  • Citations

Bug reports

This package is in early development and I'm looking for community feedback on bugs.
If you encounter a problem, please report it in Issues.
