WikiPrism
Parse wiki pages and create dictionaries, fast, with Python
Introduction
WikiPrism is a Python library designed to:
- Parse a wiki XML file to extract its pages, contained within <page> tags
- Parse each page to extract terms, which are added to a custom Dictionary instance
The above tasks can be combined into a sophisticated but user-friendly automated process called an extraction pipeline, configurable via a dedicated descriptor object.
WikiPrism focuses on speed, a goal achieved through careful design choices - including the transparent use of parallelism via both multithreading and multiprocessing - while orchestrating several elements provided by the Eos library, without sacrificing architectural simplicity.
This guide provides an overview of the library.
Installation
To install WikiPrism, just run:
pip install info.gianlucacosta.wikiprism
or, if you are using Poetry:
poetry add info.gianlucacosta.wikiprism
Then, you'll be able to access the info.gianlucacosta.wikiprism package and its subpackages.
Extracting pages from a wiki source
To perform blazing-fast parsing, WikiPrism relies on customized SAX parsing; more precisely, it accepts a wiki file of whatever structure one prefers, as long as its pages comply with the following schema:
<page>
...
<title>The page title</title>
...
<text>The page content</text>
...
</page>
In other words, as long as the XML file contains - anywhere - one or more <page> tags having the described structure, WikiPrism will be able to detect them.
In practice
Parsing can be performed via Python's standard functions within the xml.sax namespace - especially:
- parse() - to parse an XML file
- parseString() - to parse an XML string
Both functions take a ContentHandler and, optionally, an ErrorHandler - and that's precisely why WikiPrism provides two dedicated classes:
- WikiContentHandler
- WikiErrorHandler
In particular, WikiContentHandler's constructor expects:
- a callback, which receives a Page object - containing just the title and text attributes - as soon as the SAX parser finds a valid page
- a ContinuationProvider, i.e. a () -> bool function that is called periodically: should it return False, parsing ends by raising a WikiSaxCanceledException
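As an illustration of how such a handler plugs into xml.sax, here is a minimal, self-contained sketch using only the standard library; it mimics the behavior described above but is not WikiPrism's actual implementation - MiniWikiHandler and the Page dataclass below are hypothetical names:

```python
# Illustrative sketch (NOT WikiPrism's actual code): a minimal SAX
# handler that collects <page> elements and invokes a callback with
# a Page-like object, checking a continuation function along the way.
import xml.sax
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    text: str


class MiniWikiHandler(xml.sax.ContentHandler):
    def __init__(self, on_page, keep_going=lambda: True):
        super().__init__()
        self._on_page = on_page        # callback receiving a Page
        self._keep_going = keep_going  # () -> bool continuation check
        self._buffer = None
        self._title = ""
        self._text = ""

    def startElement(self, name, attrs):
        if name == "page":
            self._title = ""
            self._text = ""
        elif name in ("title", "text"):
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._title = "".join(self._buffer)
            self._buffer = None
        elif name == "text":
            self._text = "".join(self._buffer)
            self._buffer = None
        elif name == "page":
            if not self._keep_going():
                # WikiPrism raises WikiSaxCanceledException instead
                raise RuntimeError("Parsing canceled")
            self._on_page(Page(self._title, self._text))


pages = []
source = "<wiki><page><title>Alpha</title><text>First</text></page></wiki>"
xml.sax.parseString(source.encode(), MiniWikiHandler(pages.append))
```

The surrounding <wiki> element is arbitrary: only the <page> structure matters, matching the schema shown earlier.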
For more details, please consult the docstrings.
Creating dictionaries
Python has dictionaries - intended as hash maps - but WikiPrism introduces Dictionary[TTerm], an abstract class acting as a generic container of terms; terms can be language elements such as nouns, verbs, and conjunctions, or anything else fitting one's linguistic model - the exact purpose of each dictionary is up to the developer.
Dictionary[TTerm] is essentially a generic repository for terms, via its two main abstract methods:
- add_term() - adds a term to the dictionary, actually writing to the storage technology and possibly preventing duplication
- execute_command() - runs a command string on the (arbitrary) underlying storage and returns a Result[DictionaryView] - that is, à la Rust, either a DictionaryView object (a table-like DTO with both headers and data rows) or an Exception
Dictionary also has other abstract methods to implement, but it never makes assumptions about its internal data storage; consequently, for convenience, WikiPrism also provides concrete subclasses:
- InMemoryDictionary[TTerm] - adds terms to a Python set, but cannot perform commands
- SqliteDictionary[TTerm] - backed by a SQLite database, passing commands to the related SQL interpreter
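To make the repository idea concrete, here is a stdlib-only sketch in the spirit of InMemoryDictionary; MiniInMemoryDictionary is a hypothetical name, not the library's actual class:

```python
# Illustrative sketch (hypothetical, NOT WikiPrism's actual class):
# a minimal in-memory dictionary of terms - add_term() stores into a
# Python set (which prevents duplicates), and commands are unsupported,
# mirroring the InMemoryDictionary description above.
from typing import Generic, Hashable, Set, TypeVar

TTerm = TypeVar("TTerm", bound=Hashable)


class MiniInMemoryDictionary(Generic[TTerm]):
    def __init__(self) -> None:
        self._terms: Set[TTerm] = set()  # the set de-duplicates terms

    def add_term(self, term: TTerm) -> None:
        self._terms.add(term)

    def execute_command(self, command: str):
        # Like InMemoryDictionary, this backend cannot run commands
        raise NotImplementedError("Commands require a storage backend")

    def __len__(self) -> int:
        return len(self._terms)


dictionary: MiniInMemoryDictionary[str] = MiniInMemoryDictionary()
dictionary.add_term("hola")
dictionary.add_term("adiós")
dictionary.add_term("hola")  # duplicate - silently ignored by the set
```

A SQLite-backed variant would instead persist each term with an INSERT statement and forward execute_command() strings to the SQL interpreter, as SqliteDictionary does.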
For further details, please consult the docstrings.
The extraction pipeline
To combine wiki page extraction and dictionary creation into a performant, automated and easy-to-use process, WikiPrism defines extraction pipelines.
Running an extraction pipeline basically boils down to:
- creating a custom subclass of PipelineStrategy[TTerm] - or of its SQLite-oriented subclass, SqlitePipelineStrategy[TTerm]
- invoking the run_extraction_pipeline() function, which only expects an instance of the strategy
run_extraction_pipeline() executes the pipeline in a separate thread (plus its own subthreads and a process pool); therefore, it returns a PipelineHandle - an object with the following methods:
- join() - waits for the pipeline's completion, supporting the same parameters as Thread's join() method
- request_cancel() - stops the pipeline in a clean way, as soon as possible
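The handle pattern described above can be sketched with the standard library alone; MiniPipelineHandle below is a hypothetical illustration, not WikiPrism's implementation:

```python
# Illustrative sketch (hypothetical, NOT WikiPrism's implementation):
# run work in a background thread and return a handle exposing
# join() and request_cancel(), using cooperative cancelation.
import threading


class MiniPipelineHandle:
    def __init__(self, work):
        self._cancel_event = threading.Event()
        self._thread = threading.Thread(
            target=work, args=(self._cancel_event,)
        )
        self._thread.start()

    def join(self, timeout=None):
        # Same parameters as Thread's join()
        self._thread.join(timeout)

    def request_cancel(self):
        # Cooperative cancelation: the worker polls the event
        self._cancel_event.set()


results = []


def work(cancel_event):
    for item in range(1000):
        if cancel_event.is_set():
            return  # stop cleanly, as soon as possible
        results.append(item)


handle = MiniPipelineHandle(work)
handle.join()
```

Cancelation is cooperative: the worker must periodically check the event - which is exactly the role the ContinuationProvider plays during SAX parsing.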
For more details, please consult the docstrings, the tests, and possibly the whole open source Cervantes project.
Hashes for info.gianlucacosta.wikiprism-1.0.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | abdc2c6792bfc7fc37d32c27b000c59b17ca8096ebf0af99c7863067fa28ae8b
MD5 | 4b9cf37c2a0d642977ccf1d9a380bdd8
BLAKE2b-256 | e8f09786dc47377ec7f2009cb5476b1cffc491675f05d838b22434408e8a8b79

Hashes for info.gianlucacosta.wikiprism-1.0.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | f8e774d93a95e6837c90dfc0f5f86453ac65e72b3770dd596e9fc7d0d4c9dab9
MD5 | f133110f527aeef91d733817eed95b7c
BLAKE2b-256 | cbded90c7d9aa952e8eef8c9f896bf7537a666d180c47ca07ffe14e35e1b9e57