WikiPrism
Parse wiki pages and create dictionaries, fast, with Python
Introduction
WikiPrism is a Python library designed to:
- Parse a wiki XML file to extract its pages, contained within <page> tags
- Parse each page to extract terms, which are added to a custom Dictionary instance
The above tasks can be combined into a sophisticated but user-friendly automated process called an extraction pipeline, configurable via a dedicated descriptor object.
WikiPrism focuses on speed, a goal achieved through careful design choices - including the transparent use of parallelism via both multithreading and multiprocessing - while orchestrating several elements provided by the Eos library, without sacrificing architectural simplicity.
This guide provides an overview of the library.
Installation
To install WikiPrism, just run:
pip install info.gianlucacosta.wikiprism
or, if you are using Poetry:
poetry add info.gianlucacosta.wikiprism
Then, you'll be able to access the info.gianlucacosta.wikiprism package and its subpackages.
Extracting pages from a wiki source
To perform blazing-fast parsing, WikiPrism relies on customized SAX parsing; more precisely, it accepts a wiki file of whatever structure one prefers, as long as its pages comply with the following schema:
<page>
...
<title>The page title</title>
...
<text>The page content</text>
...
</page>
In other words, as long as the XML file contains - anywhere - one or more <page> tags having the described structure, WikiPrism will be able to detect them.
In practice
Parsing can be performed via Python's standard functions within the xml.sax namespace - especially:
- parse() - to parse an XML file
- parseString() - to parse an XML string
Both functions take a ContentHandler and, optionally, an ErrorHandler - and that's precisely why WikiPrism provides two dedicated classes:
- WikiContentHandler
- WikiErrorHandler
In particular, WikiContentHandler's constructor expects:
- a callback, which receives a Page object - containing just the title and text attributes - as soon as the SAX parser finds a valid page
- a ContinuationProvider, i.e. a () -> bool function that is called periodically: should it return False, parsing ends by raising a WikiSaxCanceledException
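As an illustration of how such a handler plugs into xml.sax, here is a minimal, self-contained sketch using only the standard library; it mimics the behavior described above but is not WikiPrism's actual implementation - MiniWikiHandler and the Page dataclass below are hypothetical names:

```python
# Illustrative sketch (NOT WikiPrism's actual code): a minimal SAX
# handler that collects <page> elements and invokes a callback with
# a Page-like object, checking a continuation function along the way.
import xml.sax
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    text: str


class MiniWikiHandler(xml.sax.ContentHandler):
    def __init__(self, on_page, keep_going=lambda: True):
        super().__init__()
        self._on_page = on_page        # callback receiving a Page
        self._keep_going = keep_going  # () -> bool continuation check
        self._buffer = None
        self._title = ""
        self._text = ""

    def startElement(self, name, attrs):
        if name == "page":
            self._title = ""
            self._text = ""
        elif name in ("title", "text"):
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._title = "".join(self._buffer)
            self._buffer = None
        elif name == "text":
            self._text = "".join(self._buffer)
            self._buffer = None
        elif name == "page":
            if not self._keep_going():
                # WikiPrism raises WikiSaxCanceledException instead
                raise RuntimeError("Parsing canceled")
            self._on_page(Page(self._title, self._text))


pages = []
source = "<wiki><page><title>Alpha</title><text>First</text></page></wiki>"
xml.sax.parseString(source.encode(), MiniWikiHandler(pages.append))
```

The surrounding <wiki> element is arbitrary: only the <page> structure matters, matching the schema shown earlier.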
For more details, please consult the docstrings.
Creating dictionaries
Python has dictionaries - intended as hash maps - but WikiPrism introduces Dictionary[TTerm], an abstract class acting as a generic container of terms; terms can be language elements such as nouns, verbs, and conjunctions, or anything else fitting one's linguistic model - the exact purpose of each dictionary is up to the developer.
Dictionary[TTerm] is essentially a generic repository for terms, via its two main abstract methods:
- add_term() - adds a term to the dictionary, actually writing to the storage technology and possibly preventing duplication
- execute_command() - runs a command string on the (arbitrary) underlying storage and returns a Result[DictionaryView] - that is, à la Rust, either a DictionaryView object (a table-like DTO with both headers and data rows) or an Exception
Dictionary also has other abstract methods to implement, but it never makes assumptions about its internal data storage; consequently, for convenience, WikiPrism also provides concrete subclasses:
- InMemoryDictionary[TTerm] - adds terms to a Python set, but cannot perform commands
- SqliteDictionary[TTerm] - backed by a SQLite database, passing commands to the related SQL interpreter
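To make the repository idea concrete, here is a stdlib-only sketch in the spirit of InMemoryDictionary; MiniInMemoryDictionary is a hypothetical name, not the library's actual class:

```python
# Illustrative sketch (hypothetical, NOT WikiPrism's actual class):
# a minimal in-memory dictionary of terms - add_term() stores into a
# Python set (which prevents duplicates), and commands are unsupported,
# mirroring the InMemoryDictionary description above.
from typing import Generic, Hashable, Set, TypeVar

TTerm = TypeVar("TTerm", bound=Hashable)


class MiniInMemoryDictionary(Generic[TTerm]):
    def __init__(self) -> None:
        self._terms: Set[TTerm] = set()  # the set de-duplicates terms

    def add_term(self, term: TTerm) -> None:
        self._terms.add(term)

    def execute_command(self, command: str):
        # Like InMemoryDictionary, this backend cannot run commands
        raise NotImplementedError("Commands require a storage backend")

    def __len__(self) -> int:
        return len(self._terms)


dictionary: MiniInMemoryDictionary[str] = MiniInMemoryDictionary()
dictionary.add_term("hola")
dictionary.add_term("adiós")
dictionary.add_term("hola")  # duplicate - silently ignored by the set
```

A SQLite-backed variant would instead persist each term with an INSERT statement and forward execute_command() strings to the SQL interpreter, as SqliteDictionary does.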
For further details, please consult the docstrings.
The extraction pipeline
To combine wiki page extraction and dictionary creation into a performant, automated and easy-to-use process, WikiPrism defines extraction pipelines.
Running an extraction pipeline basically boils down to:
- creating a custom subclass of PipelineStrategy[TTerm] - or of its SQLite-oriented subclass, SqlitePipelineStrategy[TTerm]
- invoking the run_extraction_pipeline() function, which only expects an instance of the strategy
run_extraction_pipeline() executes the pipeline in a separate thread (plus its own subthreads and a process pool); therefore, it returns a PipelineHandle - an object with the following methods:
- join() - waits for the pipeline's completion, supporting the same parameters as Thread's join() method
- request_cancel() - stops the pipeline in a clean way, as soon as possible
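The handle pattern described above can be sketched with the standard library alone; MiniPipelineHandle below is a hypothetical illustration, not WikiPrism's implementation:

```python
# Illustrative sketch (hypothetical, NOT WikiPrism's implementation):
# run work in a background thread and return a handle exposing
# join() and request_cancel(), using cooperative cancelation.
import threading


class MiniPipelineHandle:
    def __init__(self, work):
        self._cancel_event = threading.Event()
        self._thread = threading.Thread(
            target=work, args=(self._cancel_event,)
        )
        self._thread.start()

    def join(self, timeout=None):
        # Same parameters as Thread's join()
        self._thread.join(timeout)

    def request_cancel(self):
        # Cooperative cancelation: the worker polls the event
        self._cancel_event.set()


results = []


def work(cancel_event):
    for item in range(1000):
        if cancel_event.is_set():
            return  # stop cleanly, as soon as possible
        results.append(item)


handle = MiniPipelineHandle(work)
handle.join()
```

Cancelation is cooperative: the worker must periodically check the event - which is exactly the role the ContinuationProvider plays during SAX parsing.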
For more details, please consult the docstrings, the tests, and possibly the whole open source Cervantes project.
Hashes for info.gianlucacosta.wikiprism-1.0.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | abdc2c6792bfc7fc37d32c27b000c59b17ca8096ebf0af99c7863067fa28ae8b
MD5 | 4b9cf37c2a0d642977ccf1d9a380bdd8
BLAKE2b-256 | e8f09786dc47377ec7f2009cb5476b1cffc491675f05d838b22434408e8a8b79

Hashes for info.gianlucacosta.wikiprism-1.0.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | f8e774d93a95e6837c90dfc0f5f86453ac65e72b3770dd596e9fc7d0d4c9dab9
MD5 | f133110f527aeef91d733817eed95b7c
BLAKE2b-256 | cbded90c7d9aa952e8eef8c9f896bf7537a666d180c47ca07ffe14e35e1b9e57