Dorieh Data Engineering Platform
Dorieh Data Platform for population and environmental health
Detailed documentation: Dorieh Documentation
Dorieh overview
Dorieh Data Platform is intended for the development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen Common Workflow Language (CWL).
We have tested deployment with the following CWL implementations:
- Toil
- The CWL reference implementation, primarily through the cwlref-runner package
- CWL-Airflow, which provides an Airflow graphical user interface (GUI) for running workflows
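To make the role of the workflow language concrete, here is a minimal sketch of a CWL tool description and the command lines that would run it with two of the runners listed above. It is not taken from Dorieh: the `echo` tool, the file name `echo_tool.cwl`, and the helper names are hypothetical. The sketch relies on the fact that CWL documents may be serialized as JSON, and it only builds the command lines without executing them.

```python
import json
import tempfile
from pathlib import Path

# A hypothetical one-step CWL tool (not part of Dorieh) that wraps `echo`.
# CWL documents are usually YAML, but JSON is accepted as well, so the
# standard-library `json` module is enough to write one.
ECHO_TOOL = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": {},
}


def write_tool(directory: Path) -> Path:
    """Write the tool description into `directory` and return its path."""
    path = directory / "echo_tool.cwl"
    path.write_text(json.dumps(ECHO_TOOL, indent=2))
    return path


def runner_command(runner: str, tool: Path, message: str) -> list[str]:
    """Build (but do not execute) the command line for a CWL runner."""
    return [runner, str(tool), "--message", message]


with tempfile.TemporaryDirectory() as tmp:
    tool_path = write_tool(Path(tmp))
    # `cwltool` is the reference runner; `toil-cwl-runner` ships with Toil.
    for runner in ("cwltool", "toil-cwl-runner"):
        print(" ".join(runner_command(runner, tool_path, "hello")))
```

The same tool description is passed unchanged to each runner, which is the main reason for describing workflows in a runner-independent language.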
The data produced by the data processing workflows is ultimately stored in CSV files, a PostgreSQL database, or Parquet files. Dorieh also supports storing results in FST and HDF5 files.
Some of the included data processing workflows use the “Extract, Load, Transform” (ELT) paradigm rather than the more traditional “Extract, Transform, Load” (ETL) paradigm. This means that these workflows perform calculations, translations, filtering, cleansing, de-duplication, validation, and data analysis or summarization inside a DBMS, using the DBMS's internal tools.
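The difference is easiest to see in a small example. The following is a minimal sketch of the ELT pattern, assuming an in-memory SQLite database as a stand-in for PostgreSQL so that it runs with the Python standard library alone. The table names, column names, and sample rows are hypothetical, and the code is not Dorieh's implementation. Raw rows are loaded unchanged; de-duplication, filtering, and summarization then happen inside the database.

```python
import sqlite3

# Hypothetical raw records: (station, day, pm25). They contain one exact
# duplicate and one row with a missing measurement.
RAW_ROWS = [
    ("A", "2020-01-01", 10.0),
    ("A", "2020-01-01", 10.0),  # duplicate
    ("A", "2020-01-02", 14.0),
    ("B", "2020-01-01", None),  # invalid: missing value
    ("B", "2020-01-02", 8.0),
]


def run_elt(rows):
    """Load raw rows as-is, then clean and summarize them in SQL."""
    conn = sqlite3.connect(":memory:")  # assumption: stand-in for PostgreSQL
    try:
        # Extract + Load: no cleaning happens on the Python side.
        conn.execute("CREATE TABLE raw (station TEXT, day TEXT, pm25 REAL)")
        conn.executemany("INSERT INTO raw VALUES (?, ?, ?)", rows)

        # Transform: de-duplicate and filter inside the database.
        conn.execute(
            """
            CREATE TABLE clean AS
            SELECT DISTINCT station, day, pm25
            FROM raw
            WHERE pm25 IS NOT NULL
            """
        )

        # Summarize: also inside the database.
        cursor = conn.execute(
            """
            SELECT station, COUNT(*), AVG(pm25)
            FROM clean
            GROUP BY station
            ORDER BY station
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = run_elt(RAW_ROWS)
print(summary)
```

In an ETL variant, the duplicate and the invalid row would be dropped in Python before the `INSERT`; here the raw table keeps them, so the cleaning rules can be changed and rerun without extracting the data again.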
The data platform supports tools written in widely used languages such as Python, C/C++, Java, R, and PL/pgSQL.
Setting up
Python Virtual Environment
Install Toil:
pip install "toil[cwl,aws]"
Install Dorieh (stable version):
pip install dorieh
If you prefer to install the latest version from GitHub:
pip install git+https://github.com/NSAPH-Data-Platform/dorieh
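After either command, a quick check can confirm that the package is visible to the current interpreter. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `dorieh` is the one used in the `pip` commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("dorieh")
if version is None:
    print("dorieh is not installed in this environment")
else:
    print(f"dorieh {version} is installed")
```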
If FST support is desired, an R runtime must be installed and the R_HOME environment variable must be set. One of the simplest ways to install R is to use the Conda package manager. Once R is set up, install Dorieh with either of the following commands:
pip install "dorieh[FST]"
pip install "git+https://github.com/NSAPH-Data-Platform/dorieh[FST]"
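Because the FST extra depends on a working R runtime, it can help to verify the prerequisite before running `pip`. The sketch below is an assumption about what such a check might look like and is not part of Dorieh: it only tests that `R_HOME` is set and points to an existing directory. The function name `check_r_home` is hypothetical.

```python
import os
from typing import Mapping, Optional


def check_r_home(env: Mapping[str, str]) -> Optional[str]:
    """Return a description of the problem, or None if R_HOME looks usable."""
    r_home = env.get("R_HOME")
    if not r_home:
        return "R_HOME is not set"
    if not os.path.isdir(r_home):
        return f"R_HOME points to a missing directory: {r_home}"
    return None


problem = check_r_home(os.environ)
print(problem or "R_HOME is set and points to an existing directory")
```

Passing the environment as an argument keeps the check easy to exercise with a plain dictionary instead of the real process environment.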
Docker Container
To build your own Dorieh Docker image, see the docker directory.
A prebuilt Docker image with Dorieh is also provided:
docker pull forome/dorieh
Built-in Workflows
For examples of data processing workflows, see the included data processing workflows.
Download files
File details
Details for the file dorieh-0.4.0.tar.gz.
File metadata
- Download URL: dorieh-0.4.0.tar.gz
- Upload date:
- Size: 14.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56a54bd2764f01370a3c4cacf18720eec31c1b1d42c72e98987aeb5313a71965 |
| MD5 | 552e01380f1d234b64200ba675bcbd0e |
| BLAKE2b-256 | 6a100b5f5e13511e169cadce12b27637bc11e733eaec6aefc4262f4709d93e98 |
File details
Details for the file dorieh-0.4.0-py3-none-any.whl.
File metadata
- Download URL: dorieh-0.4.0-py3-none-any.whl
- Upload date:
- Size: 14.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8dd4c81bc3b7c10c10fa59d4c1e3b651f7bddb35d5d380a73f07da1f612c8eb4 |
| MD5 | c841965ffefd5427b620720993931277 |
| BLAKE2b-256 | f700ae42d808980da6e3d2050400ce7ca4409609071c81b47b3fb55c7edae671 |