
Use MediaWiki Wiki page content as read-only database

Project description

wiki_as_base-py

[MVP] Use MediaWiki wiki page content as a read-only database. Python library implementation. See https://github.com/fititnt/openstreetmap-serverless-functions/tree/main/function/wiki-as-base

GitHub · PyPI: wiki_as_base

Installing

pip install wiki_as_base --upgrade

Usage

Environment variables

Customize these for your needs. They are shared between the command line and the library.

export WIKI_API='https://wiki.openstreetmap.org/w/api.php'
export WIKI_NS='osmwiki'

Command line

wiki_as_base --help

## Use remote storage (defined in WIKI_API)
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'

# The output is JSON-LD. Feel free to further filter the data
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq .data[1]

## Example of parsing Wiki markup directly instead of using WIKI_API. Output is JSON-LD
cat tests/data/multiple.wiki.txt | wiki_as_base --input-stdin

## Output a zip file instead of JSON-LD. --verbose also adds wikiasbase.jsonld to the file
cat tests/data/chatbot-por.wiki.txt | wiki_as_base --input-stdin --verbose --output-zip-file tests/temp/chatbot-por.zip

## Use a different wiki with an ad-hoc change of the env vars WIKI_API and WIKI_NS
WIKI_NS=wikidatawiki \
  WIKI_API=https://www.wikidata.org/w/api.php \
  wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'
More examples for other wikis:
# For suggestions of RDF namespaces, see https://dumps.wikimedia.org/backup-index.html
WIKI_NS=specieswiki \
  WIKI_API=https://species.wikimedia.org/w/api.php \
  wiki_as_base --titles 'Paubrasilia_echinata'

# @TODO implement support for the MediaWiki version used by wikis like this one
WIKI_NS=smwwiki \
  WIKI_API=https://www.semantic-mediawiki.org/w/api.php \
  wiki_as_base --titles 'Help:Using_SPARQL_and_RDF_stores'

Use of permanent IDs for pages, the MediaWiki pageids

If the pages are already known upfront (such as in automation), then using the numeric pageid is a better choice.

# "--pageids '295916'" is equivalent to "--titles 'User:EmericusPetro/sandbox/Wiki-as-base'"
wiki_as_base --pageids '295916'

However, if for some reason you need to strictly enforce not just an exact page but an exact version of one or more pages, and getting the latest version is not essential, then you can use revids:

# "--revids '2460131'" is an older version of --pageids '295916' and
# "--titles 'User:EmericusPetro/sandbox/Wiki-as-base'"
wiki_as_base --revids '2460131'

Request multiple pages at once, either by pageid or titles

Each MediaWiki API may have different limits for batch requests; however, even unauthenticated users often have decent limits (e.g. 50 pages).

Some wikis may allow very high limits for authenticated accounts (500 pages); however, the current version does not implement authenticated requests.

## All the following commands are equivalent for the default WIKI_API

wiki_as_base --input-autodetect '295916|296167'
wiki_as_base --input-autodetect 'User:EmericusPetro/sandbox/Wiki-as-base|User:EmericusPetro/sandbox/Wiki-as-base/data-validation'
wiki_as_base --pageids '295916|296167'
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base|User:EmericusPetro/sandbox/Wiki-as-base/data-validation'

Trivia: since this library and CLI fetch content directly from the MediaWiki API and parse Wikitext (not rendered HTML), requesting several pages this way causes much less server load than requesting big pages with a high number of template calls 😉.

Advanced filter with jq

When working with the JSON-LD output, you can use jq ("a lightweight and flexible command-line JSON processor", see https://stedolan.github.io/jq/) to filter the data.

## Filter tables
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq '.data[] | select(.["@type"] == "wtxt:Table")'

## Filter Templates
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq '.data[] | select(.["@type"] == "wtxt:Template")'
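
jq can also summarize the output. The following is a generic jq recipe (it assumes only the data array and the @type field already shown in the examples above) that counts how many resources of each type were extracted:

## Count extracted resources by @type
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' | jq '[.data[]["@type"]] | group_by(.) | map({(.[0]): length}) | add'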

Save JSON-LD extracted as files

TODO: explain the implemented feature
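
Until this section is written, one plain-shell option is to redirect the JSON-LD that the CLI prints to stdout into a file. This is ordinary shell redirection, not a dedicated feature of the tool, and the file name below is only an example:

## Save the stdout JSON-LD to a local file
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > wiki-as-base.jsonld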

Library

WARNING: as of 2023-12-05, while the command line is less likely to change, the internal calls of this library, names of functions, etc. are not guaranteed to be stable.

You can import it as a pip package; however, pin the exact version, especially if it is an unattended deployment (e.g. GitHub Actions, etc.).

# requirements.txt
wiki_as_base==0.5.5
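
For an ad-hoc install, the same pin can be given directly to pip (0.5.5 is the version of this release page):

pip install wiki_as_base==0.5.5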

Basic use

import json
from wiki_as_base import WikitextAsData

# Fetch two pages at once; pageids vs titles are autodetected
wtxt = WikitextAsData().set_pages_autodetect("295916|296167")
wtxt_jsonld = wtxt.output_jsonld()

print(f'Total: {len(wtxt_jsonld["data"])}')

# Print the table data of every extracted table
for resource in wtxt_jsonld["data"]:
    if resource["@type"] == "wtxt:Table":
        print("table found!")
        print(resource["wtxt:tableData"])

print("Pretty print full JSON output")

print(json.dumps(wtxt.output_jsonld(), ensure_ascii=False, indent=2))
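
As a follow-up, the same JSON-LD structure can be filtered in Python instead of jq. This is a minimal sketch assuming only the calls already shown (WikitextAsData, set_pages_autodetect, output_jsonld) and the wtxt:Template type from the CLI examples; the output file name is only an example:

import json
from wiki_as_base import WikitextAsData

wtxt = WikitextAsData().set_pages_autodetect("295916|296167")
wtxt_jsonld = wtxt.output_jsonld()

# Collect only the templates, mirroring the jq '.data[] | select(...)' filter
templates = [
    resource
    for resource in wtxt_jsonld["data"]
    if resource["@type"] == "wtxt:Template"
]

print(f"Templates found: {len(templates)}")

# Save them to a local file
with open("templates.json", "w") as out:
    json.dump(templates, out, ensure_ascii=False, indent=2)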

The Specification

The temporary docs page is at https://fititnt.github.io/wiki_as_base-py/

Disclaimer / Trivia

wiki_as_base allows data extraction (not as complete as a dedicated data store) from MediaWiki markup text, directly via the wiki's API or direct input, without the need to install server extensions.

Check also wikimedia/Wikibase, a full server version (which inspired the name).

License

Public domain

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wiki_as_base-0.5.5.tar.gz (18.9 kB)

Uploaded Source

Built Distribution

wiki_as_base-0.5.5-py3-none-any.whl (17.1 kB)

Uploaded Python 3

File details

Details for the file wiki_as_base-0.5.5.tar.gz.

File metadata

  • Download URL: wiki_as_base-0.5.5.tar.gz
  • Upload date:
  • Size: 18.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.8.2 pkginfo/1.7.0 requests/2.28.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.10

File hashes

Hashes for wiki_as_base-0.5.5.tar.gz

  • SHA256: 8e30372025fe40432518c8178a90124e13fa73be11bd34b5445c7add26ea65fc
  • MD5: 4794dc3fbf645e8d011404e6a3a10fb0
  • BLAKE2b-256: f5729bc26f51dcc12a7385edc49dccd66084c1dc314ee78a54f6182af1f0f85e

See more details on using hashes here.

File details

Details for the file wiki_as_base-0.5.5-py3-none-any.whl.

File metadata

  • Download URL: wiki_as_base-0.5.5-py3-none-any.whl
  • Upload date:
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.8.2 pkginfo/1.7.0 requests/2.28.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.10

File hashes

Hashes for wiki_as_base-0.5.5-py3-none-any.whl

  • SHA256: d4fa356c71d411239d6b03d1117461111d25a713b7cdde65bb88efe3d8cf0da5
  • MD5: 925b1096cc4762b4f64b9b4a6c7144af
  • BLAKE2b-256: 661a6bb15a522b560e0055827fb1c665a8b074d35144b905d59fd8a824a52fd4

See more details on using hashes here.
