Library providing batch upload & monitoring for (DANE) processing environments
Introduction
Python library for creating "processing workflows" that use DANE environments. In a nutshell, each DANE environment offers, depending on its setup, an API for some kind of multimedia processing, e.g.:
- Automatic Speech Recognition
- Named Entity Extraction
- Computer Vision algorithms
- Any kind of Machine Learning algorithm
This Python library is, however, not limited to DANE: it can also be used to hook up any API that generates certain output data from certain input data.
Architecture
The following image illustrates the dane-workflows architecture:
The following sections describe the concepts illustrated in the image in more detail.
Definition of a workflow
A workflow is able to iteratively (as sketched in code after this list):

- obtain input/source data from a `DataProvider`
- send it to a `ProcessingEnvironment` (e.g. a DANE environment)
- wait for the processing environment to complete its work
- obtain the results from the processing environment
- pass the results to an `Exporter`, which typically reconciles the processed data with the source data
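In code, one iteration of this loop could look roughly like the sketch below. This is purely illustrative: the method names (`fetch_next_batch`, `submit_batch`, `wait_until_done`, `fetch_results`, `export_results`) are assumptions chosen for readability, not the actual dane-workflows API.

```python
# Illustrative sketch of a single workflow iteration; all method names
# are assumptions for readability, not the actual dane-workflows API.
def run_iteration(data_provider, processing_env, exporter) -> bool:
    batch = data_provider.fetch_next_batch()      # 1. obtain input/source data
    if batch is None:                             # provider is exhausted
        return False
    job = processing_env.submit_batch(batch)      # 2. send it for processing
    processing_env.wait_until_done(job)           # 3. wait for completion
    results = processing_env.fetch_results(job)   # 4. obtain the results
    exporter.export_results(batch, results)       # 5. reconcile with source data
    return True                                   # more iterations may follow
```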
As mentioned in the definition of a workflow, this Python library works with the following components/concepts:
TaskScheduler
The main process that handles all the steps described in the Definition of a workflow.
StatusHandler
Keeps track of the workflow status, ensuring recovery after crashes. By default the status is persisted to a SQLite database file, using the `SQLiteStatusHandler`, but other implementations can be made by subclassing `StatusHandler`.
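For illustration, a subclass that keeps all status in memory (e.g. for unit tests) might look like the sketch below. Note that the import path, the constructor contract and the overridden method names are all assumptions; check the `StatusHandler` base class for the actual abstract methods.

```python
from dane_workflows.status import StatusHandler  # import path is an assumption

class InMemoryStatusHandler(StatusHandler):
    """Hypothetical handler that keeps workflow status in memory only,
    e.g. for unit tests where persistence across crashes is not needed."""

    def __init__(self, config):
        super().__init__(config)  # constructor contract is an assumption
        self._status_rows = []    # volatile storage instead of SQLite

    # NOTE: these method names are illustrative assumptions, not the
    # real StatusHandler interface.
    def persist(self, status_rows):
        self._status_rows.extend(status_rows)

    def recover(self):
        return list(self._status_rows)
```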
StatusMonitor
Note: this component is currently being implemented and is not yet available.

Runs on top of the `StatusHandler` database and visualises the overall progress of a workflow in a human-readable manner (e.g. showing the percentage of successfully and unsuccessfully processed items).
DataProvider
Iteratively called by the `TaskScheduler` to obtain a new batch of source data. No default implementations are available (yet), since there are many possible ways one would want to supply data to a system. Simply subclass from `DataProvider` to have full control over your input flow.
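As an illustration, a provider that feeds a workflow from a CSV file could look like the sketch below. The import path, the constructor contract, the config key and the method name are assumptions; consult the `DataProvider` base class for the real interface.

```python
import csv

from dane_workflows.data_provider import DataProvider  # import path is an assumption

class CSVDataProvider(DataProvider):
    """Hypothetical provider that reads source items from a CSV file."""

    def __init__(self, config):
        super().__init__(config)  # constructor contract is an assumption
        with open(config["CSV_PATH"], newline="") as f:  # config key is an assumption
            self._rows = list(csv.DictReader(f))
        self._offset = 0

    # NOTE: the method name is an illustrative assumption.
    def get_next_batch(self, batch_size: int = 50):
        batch = self._rows[self._offset : self._offset + batch_size]
        self._offset += len(batch)
        return batch or None  # None signals the input is exhausted
```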
DataProcessingEnvironment
Iteratively called by the `TaskScheduler` to submit batches of data to an (external) processing environment. Also takes care of obtaining the output of finished processes from such an environment.

This library contains a full implementation, `DANEEnvironment`, for interacting with DANE environments, but other environments/APIs can be supported by subclassing from `ProcessingEnvironment`.
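To give an idea of what such a subclass involves, the sketch below wraps a generic HTTP API. Everything here is an assumption made for illustration: the import path, the method names, the `config` attribute, the response fields, and the `/process` and `/results` endpoints.

```python
import requests  # third-party HTTP client: pip install requests

from dane_workflows.data_processing import DataProcessingEnvironment  # path is an assumption

class HTTPProcessingEnvironment(DataProcessingEnvironment):
    """Hypothetical wrapper around a generic HTTP processing API."""

    # NOTE: method names, config keys, endpoints and response fields
    # are illustrative assumptions.
    def submit_batch(self, batch):
        resp = requests.post(f"{self.config['API_URL']}/process", json=batch)
        resp.raise_for_status()
        return resp.json()["job_id"]  # hypothetical response field

    def fetch_results(self, job_id):
        resp = requests.get(f"{self.config['API_URL']}/results/{job_id}")
        resp.raise_for_status()
        return resp.json()["results"]  # hypothetical response field
```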
Exporter
Called by the `TaskScheduler` with output data from a processing environment. No default implementation is available (yet), since this is typically the most use-case-sensitive part of any workflow, meaning you should decide what to do with the output data (by subclassing `Exporter`).
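A minimal exporter could simply append results to a JSON-lines file, as in the hypothetical sketch below (the import path, the `config` attribute, the config key and the method name are assumptions):

```python
import json

from dane_workflows.exporter import Exporter  # import path is an assumption

class JSONLinesExporter(Exporter):
    """Hypothetical exporter that appends each result as one JSON line."""

    # NOTE: the method name and config key are illustrative assumptions.
    def export_results(self, results):
        with open(self.config["OUTPUT_FILE"], "a") as f:
            for result in results:
                f.write(json.dumps(result) + "\n")
```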
Getting started
Prerequisites
- Python >= 3.8, <= 3.10
- Poetry
Installation
Install via pypi.org, using e.g.:

```sh
pip install dane-workflows
```
Local development

Run `poetry install`. After it completes, run:

```sh
poetry shell
```
To check that the contents of this repository work well, run:

```sh
./scripts/check-project.sh
```
TODO finalise
Usage
After installing dane-workflows in your local environment, you can run an example workflow with:
```sh
python main.py
```
This example script uses `config-example.yml` to configure and run a workflow using the following implementations (see the wiring sketch below the list):
- DataProvider: `ExampleDataProvider` (with two dummy input documents)
- DataProcessingEnvironment: `ExampleDataProcessingEnvironment` (mocks a processing environment)
- StatusHandler: `SQLiteStatusHandler` (writes output to `./proc_stats/all_stats.db`)
- Exporter: `ExampleExporter` (does nothing with the results)
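Wiring this up yourself could look roughly like the hedged sketch below; the import path and the `TaskScheduler` constructor signature are assumptions, so treat `main.py` and `config-example.yml` in the repository as the authoritative example.

```python
# Hypothetical wiring of a workflow; the import path and the TaskScheduler
# constructor signature are assumptions, see main.py in the repository.
from dane_workflows.task_scheduler import TaskScheduler

if __name__ == "__main__":
    # The config file names which DataProvider, DataProcessingEnvironment,
    # StatusHandler and Exporter implementations to use.
    scheduler = TaskScheduler(config_file="config-example.yml")
    scheduler.run()
```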
To set up a workflow for your own purposes, consider the following:
What data do I want to process?
We've provided the `ExampleDataProvider` to easily feed a workflow with a couple of files (via `config.yml`). This is mostly for testing out your workflow.
Most likely you'll need to implement your own `DataProvider` by subclassing it. This way you can e.g. load your input data from a database, a spreadsheet or whatever else you need.
Which processing environment will I use?
Since this project was developed to interface (at least) with running DANE environments, we've provided `DANEEnvironment` as a default implementation of `DataProcessingEnvironment`.

In case you'd like to call any other tool for processing your data, you're required to implement a subclass of `DataProcessingEnvironment`.
What will I do with the output of the processing environment?
After your `DataProcessingEnvironment` has processed a batch of items from your `DataProvider`, the `TaskScheduler` hands over the output data to your subclass of `Exporter`.

Since this is the most use-case-dependent part of any workflow, we do not provide any useful default implementation.
Note: `ExampleExporter` is only used as a placeholder for tests or dry runs.
Roadmap
- Implement more advanced recovery
- Add example workflows (and refer to them in the README)
- Finalise initial README
- Add Python docstrings
See the open issues for a full list of proposed features, known issues and user questions.
License
Distributed under the MIT License. See `LICENSE.txt` for more information.
Contact
Use the issue tracker for any questions concerning this repository.
Project Link: https://github.com/beeldengeluid/dane-workflows
Codemeta.json requirements: https://github.com/CLARIAH/clariah-plus/blob/main/requirements/software-metadata-requirements.md