A runtime system for NMDC data management and orchestration
How It Fits In
- nmdc-metadata tracks issues related to NMDC metadata, which may necessitate work across multiple repos.
- nmdc-schema houses the LinkML schema specification, as well as generated artifacts (e.g., JSON Schema).
- nmdc-server houses code specific to the data portal -- its database, back-end API, and front-end application.
- workflow_documentation references workflow code, spread across several repositories, that takes source data and produces computed data.
- This repo (nmdc-runtime)
  - houses code that takes source data and computed data, and transforms it to broadly accommodate downstream applications such as the data portal
  - manages execution of the above (i.e., lightweight data transformations) and also of computationally- and data-intensive workflows performed at other sites, ensuring that claimed jobs have access to needed configuration and data resources.
Data exports
The NMDC metadata as of 2021-07 is available here:
https://drs.microbiomedata.org/ga4gh/drs/v1/objects/y3ax-8bq3-60
The link returns a GA4GH DRS API bundle object record, with the NMDC metadata collections (study_set, biosample_set, etc.) as contents, each a DRS API blob object.
For example, the blob for the study_set collection export, named "study_set.json.bz2", is listed with DRS API ID "jh4z-z81d-76". Thus, it is retrievable via
https://drs.microbiomedata.org/ga4gh/drs/v1/objects/jh4z-z81d-76
The returned blob object record lists https://portal.nersc.gov/project/m3408/meta/mongoexports/2021-07/study_set.json.bz2 as the url for an access method.
The 2021-07 exports are currently all accessible at https://portal.nersc.gov/project/m3408/meta/mongoexports/2021-07/, but the DRS API indirection allows these links to change in the future (for mirroring via other URLs, etc.). The DRS API links are therefore the links you should share.
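The lookup described above can be sketched in Python as follows. This is a minimal sketch, not code from nmdc-runtime: the helper names are hypothetical, and the field names (`contents`, `access_methods`, `access_url`) are assumed to follow the GA4GH DRS v1 object schema rather than being verified against the live server.

```python
import json
import urllib.request

# Base URL taken from the DRS links above.
DRS_BASE = "https://drs.microbiomedata.org/ga4gh/drs/v1/objects"


def fetch_drs_object(object_id, base=DRS_BASE):
    """Fetch and parse the DRS object record for `object_id` (network call)."""
    with urllib.request.urlopen(f"{base}/{object_id}") as response:
        return json.load(response)


def content_ids(record):
    """Map each entry name in a bundle record's contents to its DRS ID.

    Assumes each contents entry carries "name" and (optionally) "id".
    """
    return {item["name"]: item.get("id") for item in record.get("contents", [])}


def access_urls(record):
    """Return the URLs listed under a blob record's access methods."""
    urls = []
    for method in record.get("access_methods", []):
        url = (method.get("access_url") or {}).get("url")
        if url:
            urls.append(url)
    return urls
```

With network access, `content_ids(fetch_drs_object("y3ax-8bq3-60"))` would be expected to include an entry for "study_set.json.bz2", and `access_urls(fetch_drs_object("jh4z-z81d-76"))` would be expected to list its current download URL. Resolving through the DRS ID each time, rather than caching the NERSC URL, is what keeps clients working if the files move.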
Overview
The runtime features:
- Dagster orchestration:
  - dagit - a web UI to monitor and manage the running system.
  - dagster-daemon - a service that triggers pipeline runs based on time or external state.
  - PostgreSQL database - for storing run history, event logs, and scheduler state.
  - workspace code
    - Code to run is loaded into a Dagster workspace. This code is loaded from one or more Dagster repositories. Each Dagster repository may be run with a different Python virtual environment if need be, and may be loaded from a local Python file or pip-installed from an external source. In our case, each Dagster repository is simply loaded from a Python file local to the nmdc-runtime GitHub repository, and all code is run in the same Python environment.
    - A Dagster repository consists of solids and pipelines, and optionally schedules and sensors.
      - solids represent individual units of computation
      - pipelines are built up from solids
      - schedules trigger recurring pipeline runs based on time
      - sensors trigger pipeline runs based on external state
    - Each pipeline can declare dependencies on any runtime resources or additional configuration. There are TerminusDB and MongoDB resources defined, as well as preset configuration definitions for both "dev" and "prod" modes. The presets tell Dagster to look to a set of known environment variables to load resource configurations, depending on the mode.
- A TerminusDB database supporting revision control of schema-validated data.
- A MongoDB database supporting write-once, high-throughput internal data storage by the nmdc-runtime FastAPI instance.
- A FastAPI service to interface with the orchestrator and database, as a hub for data management and workflow automation.
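To illustrate the idea of mode-dependent configuration read from environment variables, here is a plain-Python sketch that does not use the Dagster API. Everything in it is an assumption made for illustration: the function name, the configuration keys, and the environment variable names are hypothetical and are not the ones the nmdc-runtime presets actually use.

```python
import os

# Hypothetical mapping: for each mode, which environment variable supplies
# each resource setting. The real presets define their own variable names.
ENV_VARS_BY_MODE = {
    "dev": {"mongo_host": "DEV_MONGO_HOST", "terminus_url": "DEV_TERMINUS_URL"},
    "prod": {"mongo_host": "PROD_MONGO_HOST", "terminus_url": "PROD_TERMINUS_URL"},
}


def resource_config(mode, environ=None):
    """Build the resource configuration for `mode` from environment variables."""
    environ = os.environ if environ is None else environ
    if mode not in ENV_VARS_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    names = ENV_VARS_BY_MODE[mode]
    # Fail early, naming every variable that is not set.
    missing = sorted(var for var in names.values() if var not in environ)
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: environ[var] for key, var in names.items()}
```

Passing `environ` explicitly makes the lookup easy to exercise in tests without touching the real process environment; calling `resource_config("dev")` with no second argument reads from `os.environ`.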
Local Development
Ensure Docker (and Docker Compose) are installed.
# optional: copy .env.dev to .env (gitignore'd) and set those vars
make up-dev
Docker Compose is used to start local TerminusDB, MongoDB, and PostgreSQL (used by Dagster to log information) instances, as well as a Dagster web server (dagit) and daemon (dagster-daemon).
The Dagit web server is viewable at http://localhost:3000/.
The TerminusDB browser console is viewable at http://localhost:6364/.
The FastAPI service is viewable at http://localhost:8000/ -- e.g., rendered documentation at http://localhost:8000/redoc/.
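After `make up-dev`, a quick way to confirm that the three web endpoints above are answering is a small probe script. This is a hypothetical helper, not part of nmdc-runtime; it assumes the default ports listed above and uses only the Python standard library.

```python
import urllib.error
import urllib.request

# Local endpoints as listed above (default ports assumed).
LOCAL_SERVICES = {
    "dagit": "http://localhost:3000/",
    "terminusdb": "http://localhost:6364/",
    "fastapi": "http://localhost:8000/",
}


def is_up(url, timeout=2.0):
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return False


def report(services=LOCAL_SERVICES, probe=is_up):
    """Return a mapping of service name to whether its URL answered."""
    return {name: probe(url) for name, url in services.items()}
```

Calling `report()` returns a dictionary such as `{"dagit": True, ...}`; a `False` entry suggests the corresponding container is not yet running or is mapped to a different port.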
Local Testing
Tests can be found in tests/ and are run with the following command:
pytest tests
As you create Dagster solids and pipelines, add tests in tests/ to check that your code behaves as desired and does not break over time.
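As a minimal sketch of what such a test file might look like, the example below assumes the logic of a solid is factored into a plain function so it can be tested without a Dagster execution context. The function `count_by_type` and the file name `tests/test_count_by_type.py` are hypothetical and do not exist in the repository.

```python
# tests/test_count_by_type.py (hypothetical file name)


def count_by_type(docs):
    """Hypothetical transformation that a solid might wrap: tally docs by "type"."""
    counts = {}
    for doc in docs:
        counts[doc["type"]] = counts.get(doc["type"], 0) + 1
    return counts


def test_count_by_type_counts_each_type():
    docs = [{"type": "study"}, {"type": "biosample"}, {"type": "biosample"}]
    assert count_by_type(docs) == {"study": 1, "biosample": 2}


def test_count_by_type_handles_empty_input():
    assert count_by_type([]) == {}
```

In a real test module the function under test would be imported from the package rather than defined inline. pytest collects any function whose name starts with `test_`, so `pytest tests` would pick up both cases.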