
Dorieh Data Engineering Platform


Dorieh Data Platform for population and environmental health

Detailed documentation: Dorieh Documentation

Dorieh overview

Dorieh Data Platform is intended for the development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen Common Workflow Language (CWL).

We have tested deployment with the following CWL implementations:

The data produced by the data processing workflows is eventually stored in CSV files, a PostgreSQL DBMS, or Parquet files. Dorieh also supports storing results in FST and HDF5 files.

Some of the included data processing workflows use the “Extract, Load, Transform” (ELT) paradigm rather than the more traditional “Extract, Transform, Load” (ETL). This means that these workflows perform calculations, translations, filtering, cleansing, de-duplication, validation, and data analysis or summarization inside a DBMS, using the DBMS's internal tools.
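The ELT idea can be illustrated with a minimal sketch. It is not taken from Dorieh: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table names, column names, and sample records are all hypothetical. The point is only that the raw data is loaded unchanged and every cleaning step then runs as SQL inside the database.

```python
import sqlite3

# Hypothetical raw records: (station id, date, PM2.5 reading as text).
RAW_ROWS = [
    ("A1", "2020-01-01", "12.5"),
    ("A1", "2020-01-01", "12.5"),   # exact duplicate
    ("A1", "2020-01-02", ""),       # missing value
    ("B2", "2020-01-01", "8.0"),
    ("B2", "2020-01-02", "-1"),     # physically invalid reading
]


def run_elt(rows):
    """Load raw rows unchanged, then clean and summarize them with SQL."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
    try:
        # Extract + Load: store the data as-is, with no cleaning in Python.
        conn.execute("CREATE TABLE raw_pm25 (station TEXT, day TEXT, value TEXT)")
        conn.executemany("INSERT INTO raw_pm25 VALUES (?, ?, ?)", rows)

        # Transform: de-duplicate, validate, and cast inside the database.
        conn.execute(
            """
            CREATE TABLE clean_pm25 AS
            SELECT DISTINCT station, day, CAST(value AS REAL) AS pm25
            FROM raw_pm25
            WHERE value <> '' AND CAST(value AS REAL) >= 0
            """
        )

        # Summarize inside the database as well.
        return conn.execute(
            "SELECT station, COUNT(*), AVG(pm25) FROM clean_pm25 "
            "GROUP BY station ORDER BY station"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for station, n_rows, mean_pm25 in run_elt(RAW_ROWS):
        print(station, n_rows, mean_pm25)
```

In an ETL pipeline the same filtering and de-duplication would happen in application code before the load; here the application only issues SQL statements and the database does the work.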

The data platform supports tools written in widely used languages such as Python, C/C++, Java, R, and PL/pgSQL.

Setting up

Python Virtual Environment

Install Toil:

pip install "toil[cwl,aws]"

Install Dorieh (stable version):

pip install dorieh

If you prefer to install the latest version from GitHub:

pip install git+https://github.com/NSAPH-Data-Platform/dorieh

If FST support is desired, the R runtime must be installed and the R_HOME environment variable must be set. One of the simplest ways of installing R is to use the Conda package manager. Once R is set up, install Dorieh with either of the following commands:

pip install "dorieh[FST]"

pip install "git+https://github.com/NSAPH-Data-Platform/dorieh[FST]"
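Before installing the FST extra, it can help to confirm that R_HOME is actually set in the environment that will run pip. The following is a small preflight sketch, not part of Dorieh; the function name `r_home_problem` is hypothetical, and the check only verifies that the variable is set and points to an existing directory, not that the directory contains a working R installation.

```python
import os
from pathlib import Path


def r_home_problem(env=None):
    """Return a description of what is wrong with R_HOME, or None if it looks usable."""
    env = os.environ if env is None else env
    r_home = env.get("R_HOME")
    if not r_home:
        return "R_HOME is not set"
    if not Path(r_home).is_dir():
        return f"R_HOME points to a missing directory: {r_home}"
    return None


if __name__ == "__main__":
    problem = r_home_problem()
    print(problem or "R_HOME looks usable")
```

If R is on your PATH, running `R RHOME` prints the directory that R itself reports as its home, which is usually the value to assign to R_HOME.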

Docker Container

To build your own Dorieh Docker image, see the docker directory.

A prebuilt Docker image with Dorieh is provided:

docker pull forome/dorieh

Built-in Workflows

For examples of data processing workflows, see the included data processing workflows.
