
Crux-Odin Library

Open Data Integration Nomenclature (ODIN) is Crux’s standard for declarative data delivery. ODIN provides a nomenclature for delivery that incentivizes industry-standard GitOps practices. ODIN specs are inherently abstracted from their underlying control planes and workflow frameworks, but work with the Crux External Data Platform.

Installing Crux-Odin

Install the Crux-Odin library from PyPI with pip in any Python environment you wish: a venv, pipenv, or poetry environment, or at the system level. The installation doesn't differ from any other Python package. Run pip install crux-odin, or pip install crux-odin==&lt;version&gt; to pin a specific version, and you're good to go.

Using Crux-Odin

Crux-Odin features include:

  1. It specifies a standard YAML data format for data delivery. You specify the metadata, the pipelines, and the steps that run in those pipelines in YAML. The YAML format is versioned: each later version adds more statements, and versions are backward compatible, so a later version supports all the statements of an earlier one. See below for the versions and what they contain.
  2. It contains routines for validating the YAML, making sure the fields are set correctly and the structure is correct. The syntax specifications for these versions are contained in a file called workflow_crd.yaml, which holds JSON Schema definitions for each YAML version. (You can override the path to this file with the WORKFLOW_CRD environment variable.)
  3. It contains a routine create_workflow() that converts the YAML specification into a first-class Python Workflow object you can manipulate, since in a programming language you generally want to deal in first-class objects.
  4. YAML files can exist in a tree: a child YAML file points to its parent with the parent: field. When processing these YAML files, we first merge them from the bottom up to the top. This library contains routines for reading and merging these YAML files, and for locating a child's parents in a file system hierarchy.
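The bottom-up merge in item 4 can be pictured as a recursive dictionary overlay. The helper below is an illustrative sketch, not the library's actual merge routine (see YAMLFileClosures and dict_utils.py for those):

```python
def deep_merge(parent: dict, child: dict) -> dict:
    """Recursively overlay a child spec onto its parent.

    Child scalars and lists replace the parent's values; nested
    dicts are merged key by key, so a child only needs to restate
    the settings it overrides.
    """
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A child overriding one nested setting while inheriting the rest.
parent_spec = {"global": {"global": {"encoding": "ascii", "timedelta": {"days": -1}}}}
child_spec = {"id": "sample_id", "global": {"global": {"encoding": "utf-8"}}}
merged_spec = deep_merge(parent_spec, child_spec)
```

With a deeper tree, the same overlay is applied repeatedly from the leaves toward the root.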

Changing Your Code

from crux_odin.dict_utils import yaml_file_to_dict
from crux_odin.dataclass import create_workflow

workflow = create_workflow(yaml_file_to_dict("file.yaml"))  # The Workflow version is taken from the YAML file

Validating YAML

from crux_odin.validate_yaml import validate_yaml

validate_yaml('file.yaml')

See YAMLFileClosures for routines that merge parent and child YAML files, and dict_utils.py for routines that merge dictionaries.

Crux-Odin YAML Versions

V1.0.0 - Crux's proprietary PDK framework

Some of the information stored in the YAML file:

  • ID (airflow specific)
  • Connection info + extraction info
  • Normalizer spec
  • Schema history + schema validations
  • Context / Environment Variables
Example:

id: sample_id
run_uber_step: true

global:
  global:
    encoding: ascii
    timedelta:
      days: -1
    schema_def:
      na_values: [ "", " " ]
    crux_api_conf: ${SAMPLE_ID_API}
    endpoint: ${API_HOST}
  extract:
    action_class: pipeline.crux_pdk.actions.extract.extractor.ShortCircuitExtractor
    connection_lib: pipeline.custom_libs.sample.connector
    fetch_method: fetch_directory
    remote_path: /pub/sparx/
    connection:
      type: SAMPLE_ID_CONNECTOR
      conf: ${CRUX_SPARTA_SFTP}
      zendesk_conf:
        wait_time: 60
        payload:
          organization_id: 123123123123
          role: end-user
          ticket_restriction: organization
          skip_verify_email: true

pipelines:
  - id: sample_id
    global:
      global:
        supplier_implied_date_regex: active_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
        provenance_file_patterns:
          origin_patterns:
            - active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
          return_patterns:
            - active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
    steps:
      - id: extract
        category: short_circuit
        conf:
          file_patterns:
            - active_users_{FD_YYYY}{FD_MM}{FD_DD}\.csv

Note: the outside global is inherited by the pipelines, while the 'inside' global is inherited by the steps. The IDs have to match: extract above matches - id: extract.
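The two-level inheritance can be modeled (illustratively; this is not the library's code) with dictionary unpacking, where the more specific level wins on key collisions:

```python
# Settings from the outside global, inherited by every pipeline.
outer_global = {"encoding": "ascii", "endpoint": "${API_HOST}"}
# Settings from a pipeline's 'inside' global, inherited by its steps.
inner_global = {"supplier_implied_date_regex": r"active_users_(?P<YYYY>\d{4})"}
# The step's own conf is the most specific level.
step_conf = {"file_patterns": [r"active_users_{FD_YYYY}{FD_MM}{FD_DD}\.csv"]}

# Later dicts override earlier ones on key collisions.
effective_step_settings = {**outer_global, **inner_global, **step_conf}
```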

V1.1.0 - True Declarative Dataset

This is the first version of the spec that replaces the .py DAG files with full declarative syntax in YAML.

Newly supported capabilities

  • Schedule
...:
  dag:
    dag_catchup: false            # schedule catch-up runs from the start date to the current date
    dag_start_date: '2023-03-12'  # when the DAG starts running
    enable_delivery_cache: false  # required for DAG files
    max_active_runs: 10           # max active runs in the Airflow system for this DAG
    owner: CruxInformatics        # owner of the DAG
    priority_weight: 1            # the priority given to this DAG relative to others
    schedule_interval: '@once'    # DAG schedule; when the DAG is triggered after the start date
    queue: kubernetes             # the worker pool to run the jobs
                                  # (Cloud Composer options: default and kubernetes; AF 1.9: default, ongoing, and history)
    tags:                         # optional; enables tag search/filtering, only available in Cloud Composer
    - spm
    - delivery-dispatch
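As a sketch of the kind of checks the JSON Schema in workflow_crd.yaml performs on this block (the function below is hypothetical, not the library's validator):

```python
from datetime import date

def check_dag_settings(dag: dict) -> list:
    """Illustrative sanity checks for the v1.1.0 'dag' block."""
    errors = []
    if not isinstance(dag.get("dag_catchup"), bool):
        errors.append("dag_catchup must be a boolean")
    try:
        # dag_start_date must be an ISO date like '2023-03-12'.
        date.fromisoformat(dag.get("dag_start_date", ""))
    except ValueError:
        errors.append("dag_start_date must be YYYY-MM-DD")
    if not isinstance(dag.get("max_active_runs"), int):
        errors.append("max_active_runs must be an integer")
    return errors
```

A check like this would catch, for example, max_active_runs accidentally written as a float.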

V1.2.0 - Dataset, Data Product, and Organization Identifiers

Version 1.2 brings Dataset ID (a grouping of all data delivered, or failed, together), Data Product ID (a catalog-oriented collection of Datasets), and Org ID (an organizational grouping of Datasets). These fields are opinionated toward Crux's control plane, but we find these concepts are widely used and necessary for most control-plane implementations of ODIN; we will be reviewing generic versions of them in the future. The dataset_id and data_product_id will be validated to make sure they belong to the org_id. crux_api_conf is not deprecated in this version (yet). data_product_id is optional, while dataset_id and org_id are required.

Newly supported capabilities

  • Dataset ID
  • Data Product ID
  • Org ID
...:
  metadata:
      dataset_id: 'Ds012345'          # a grouping of transactional data. 1:1 with ODIN spec
      data_product_id: 'Dp012345'     # a collection of Datasets for cataloging and productization
      org_id: 'Or012345'              # organizational identifier for a control plane
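A minimal format check for these identifiers might look like the following. The prefix patterns are inferred from the sample values above and are an assumption; the real validation (including that a dataset_id belongs to its org_id) happens in the control plane:

```python
import re

# Hypothetical prefix patterns, inferred from the sample IDs above.
ID_PATTERNS = {
    "dataset_id": re.compile(r"^Ds[0-9A-Za-z]+$"),
    "data_product_id": re.compile(r"^Dp[0-9A-Za-z]+$"),
    "org_id": re.compile(r"^Or[0-9A-Za-z]+$"),
}

def check_metadata_ids(metadata: dict) -> bool:
    """Return True when the required IDs are present and every present
    identifier matches its prefix pattern."""
    if "dataset_id" not in metadata or "org_id" not in metadata:
        return False  # data_product_id is optional; these two are not
    return all(ID_PATTERNS[key].match(value)
               for key, value in metadata.items() if key in ID_PATTERNS)
```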

V1.3.0 - Vendor Declarations, Declared & Observed Schemas

Version 1.3 adds support for users to track schemas that are declared in vendor documentation, as well as observed from profiling the data. These schemas are only advisory, as the configured schema is what is primarily used in control plane implementations. These new fields are made optional, and the vendor-declared schema is defined and validated according to the requirements needed for ODIN to support hydration of the Crux Catalog. Frame description is also added as an optional field to the configured schema.

Newly supported capabilities

  • declared_schema_def
  • observed_schema_def
  • vendor_doc
  • frame_description
...:
pipelines:
  - id:                           
    vendor_doc:                   # URI (optional)
    global:
      global:
          schema_def:                   # Already exists
            ...
          declared_schema_def:          # declared or curated schema (optional)
            vendor_table_name:
            vendor_table_description:
            vendor_schedule:
            fields:
            - name:
              data_type:
              configured_data_type:   # must exist in schema_def and have same type
              configured_name:        # must exist in schema_def and have same name
              column_number:
              is_primary_key:
              page_number:
              vendor_description:
            - name: ...
          observed_schema_def:          # observed from profiling the data
            fields:
            - data_type:
              name:
              configured_data_type:   # must exist in schema_def and have same type
              configured_name:        # must exist in schema_def and have same name
            - name: ...
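The configured_name / configured_data_type constraints commented above can be sketched as a cross-check. This assumes schema_def carries a fields list of name/data_type pairs, which is a simplification of the real layout:

```python
def check_declared_schema(schema_def: dict, declared: dict) -> list:
    """Illustrative cross-check: each declared field's configured_name must
    exist in schema_def, with a matching configured_data_type."""
    configured = {f["name"]: f["data_type"] for f in schema_def.get("fields", [])}
    problems = []
    for field in declared.get("fields", []):
        name = field.get("configured_name")
        if name not in configured:
            problems.append(f"{name!r} not in schema_def")
        elif field.get("configured_data_type") != configured[name]:
            problems.append(f"{name!r} type mismatch")
    return problems
```

The same check applies to observed_schema_def, whose fields carry the same configured_name and configured_data_type keys.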

V1.4.0 Availability Deadlines

Allows the Control Plane to hydrate delivery deadlines. This gives users visibility into upstream data availability issues by specifying a cadence at which new data is expected.

Newly supported capabilities

  • Deadline
availability_deadlines:
- deadline_minute: '30' # The minute to run the check
  deadline_hour: '8' # The hour to evaluate the deadline
  deadline_day_of_month: '*' # Used for monthly and longer frequencies
  deadline_month: '*' # Used for yearly frequency
  deadline_day_of_week: '1' # supports cron pattern day range and *W (for weekdays)
  deadline_year: '*' # Must be *
  file_frequency: 'weekly' # One of "intraday", "daily", "weekly", "bi-weekly", "monthly", "semi-annual", "yearly"
  timezone: 'UTC'
- deadline_minute: '30' # Supports multiple deadlines 
  deadline_hour: '8'
  deadline_day_of_month: '*'
  deadline_month: '*'
  deadline_day_of_week: '5'
  deadline_year: '*'
  file_frequency: 'weekly'
  timezone: 'UTC'
pipelines:
...
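The constraints noted in the comments above (the allowed file_frequency values and the fixed deadline_year) can be sketched as a small check; this is illustrative, not the library's validator:

```python
# The file_frequency values listed in the v1.4.0 spec comments.
ALLOWED_FREQUENCIES = {"intraday", "daily", "weekly", "bi-weekly",
                       "monthly", "semi-annual", "yearly"}

def check_deadline(deadline: dict) -> list:
    """Illustrative checks mirroring the commented constraints."""
    errors = []
    if deadline.get("file_frequency") not in ALLOWED_FREQUENCIES:
        errors.append("unknown file_frequency")
    if deadline.get("deadline_year") != "*":
        errors.append("deadline_year must be '*'")
    return errors
```

Since availability_deadlines is a list, each entry would be checked independently.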

V1.5.0 - Destinations

This allows a list of Destinations selected from the domain model to be used by Delivery Dispatch.

Newly supported capabilities

  • Destinations
destinations:
  - id: AQxxxxxxxxxx
    name: Customer FTP site
...

V1.6.0 - Require crux_api_conf at the OUTER level

We used to allow the crux_api_conf declaration to exist at the step level or under the conf keyword of any step. If it was declared there, we moved it up to the global level when we read in the YAML and created a Workflow object (selecting the first one we found). We no longer do this: crux_api_conf must now exist at the outer level of the YAML file.

Example

id: sample_id
...
crux_api_conf: ${SAMPLE_ID_API}
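A sketch of how a reader of the spec could flag the now-disallowed nested declarations (illustrative only; the library's own validation enforces this rule):

```python
def find_nested_crux_api_conf(spec: dict) -> list:
    """Walk a parsed YAML spec and report every crux_api_conf declared
    anywhere below the top level, which v1.6.0 rejects."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "crux_api_conf" and path:  # path is empty at the top level
                    hits.append("/".join(path + [key]))
                walk(value, path + [key])
        elif isinstance(node, list):
            for i, item in enumerate(node):
                walk(item, path + [str(i)])

    walk(spec, [])
    return hits
```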

Roadmap

This roadmap outlines the incremental modeling capabilities that we plan to support in ODIN, but is not a commitment.

V1.X.0 Notifications

This allows for notification channels.

Thanks to all the contributors!

