
Crux-Odin Library

Open Data Integration Nomenclature (ODIN) is Crux’s standard for declarative data delivery. ODIN provides a nomenclature for delivery that incentivizes industry-standard GitOps practices. ODIN specs are inherently abstracted from their underlying control planes and workflow frameworks, but work with the Crux External Data Platform.

Installing Crux-Odin

Install the Crux-Odin library from PyPI with pip in any Python environment you wish: a venv, pipenv, or poetry environment, or at the system level. The installation doesn't differ from any other Python package. Run pip install crux-odin, or pip install crux-odin==&lt;version&gt; to pin a specific version, and you're good to go.

Using Crux-Odin

Crux-Odin features include:

  1. It specifies a standard YAML data format for data delivery. You specify the metadata, the pipelines, and the steps that run in those pipelines in YAML. The YAML format is versioned: each later version adds more statements, and versions are backward compatible, so a later version supports all the statements of an earlier one. See below for the versions and what they contain.
  2. It contains routines for validating the YAML, making sure the fields are set correctly and the structure is correct. The syntax specifications for these versions are contained in a file called workflow_crd.yaml, which holds JSON Schema definitions for each YAML version. (You can override the path to this file with the WORKFLOW_CRD environment variable.)
  3. It contains a routine create_workflow() that converts the YAML specification into a first-class Python Workflow object you can manipulate, since in a programming language you generally want to deal in first-class objects.
  4. YAML files can exist in a tree: a child YAML file points to its parent with the parent: field. When processing these YAML files, we first merge them from the bottom up to the top. This library contains routines for reading and merging these YAML files, and for locating a child's parents in a file system hierarchy.
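The bottom-up merge in item 4 can be pictured as a recursive dictionary overlay. The helper below is an illustrative sketch, not the library's actual merge routine (see YAMLFileClosures and dict_utils.py for those):

```python
def deep_merge(parent: dict, child: dict) -> dict:
    """Recursively overlay a child spec onto its parent.

    Child scalars and lists replace the parent's values; nested
    dicts are merged key by key, so a child only needs to restate
    the settings it overrides.
    """
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A child overriding one nested setting while inheriting the rest.
parent_spec = {"global": {"global": {"encoding": "ascii", "timedelta": {"days": -1}}}}
child_spec = {"id": "sample_id", "global": {"global": {"encoding": "utf-8"}}}
merged_spec = deep_merge(parent_spec, child_spec)
```

With a deeper tree, the same overlay is applied repeatedly from the leaves toward the root.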

Changing Your Code

from crux_odin.dict_utils import yaml_file_to_dict
from crux_odin.dataclass import create_workflow

workflow = create_workflow(yaml_file_to_dict("file.yaml"))  # The Workflow version is taken from the YAML file

Validating YAML

from crux_odin.validate_yaml import validate_yaml

validate_yaml('file.yaml')

See YAMLFileClosures for routines that merge parent and child YAML files, and dict_utils.py for routines that merge dictionaries.

Crux-Odin YAML Versions

V1.0.0 - Crux's proprietary PDK framework

Some of the information stored in the YAML file:

  • ID (airflow specific)
  • Connection info + extraction info
  • Normalizer spec
  • Schema history + schema validations
  • Context / Environment Variables
Example:

id: sample_id
run_uber_step: true

global:
  global:
    encoding: ascii
    timedelta:
      days: -1
    schema_def:
      na_values: [ "", " " ]
    crux_api_conf: ${SAMPLE_ID_API}
    endpoint: ${API_HOST}
  extract:
    action_class: pipeline.crux_pdk.actions.extract.extractor.ShortCircuitExtractor
    connection_lib: pipeline.custom_libs.sample.connector
    fetch_method: fetch_directory
    remote_path: /pub/sparx/
    connection:
      type: SAMPLE_ID_CONNECTOR
      conf: ${CRUX_SPARTA_SFTP}
      zendesk_conf:
        wait_time: 60
        payload:
          organization_id: 123123123123
          role: end-user
          ticket_restriction: organization
          skip_verify_email: true

pipelines:
  - id: sample_id
    global:
      global:
        supplier_implied_date_regex: active_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
        provenance_file_patterns:
          origin_patterns:
            - active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
          return_patterns:
            - active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
    steps:
      - id: extract
        category: short_circuit
        conf:
          file_patterns:
            - active_users_{FD_YYYY}{FD_MM}{FD_DD}\.csv

Note: the outside global is inherited by the pipelines, while the 'inside' global is inherited by the steps. The IDs have to match: extract above matches - id: extract.
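The two-level inheritance can be modeled (illustratively; this is not the library's code) with dictionary unpacking, where the more specific level wins on key collisions:

```python
# Settings from the outside global, inherited by every pipeline.
outer_global = {"encoding": "ascii", "endpoint": "${API_HOST}"}
# Settings from a pipeline's 'inside' global, inherited by its steps.
inner_global = {"supplier_implied_date_regex": r"active_users_(?P<YYYY>\d{4})"}
# The step's own conf is the most specific level.
step_conf = {"file_patterns": [r"active_users_{FD_YYYY}{FD_MM}{FD_DD}\.csv"]}

# Later dicts override earlier ones on key collisions.
effective_step_settings = {**outer_global, **inner_global, **step_conf}
```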

V1.1.0 - True Declarative Dataset

This is the first version of the spec that replaces the .py DAG files with full declarative syntax in YAML.

Newly supported capabilities

  • Schedule
...:
  dag:
    dag_catchup: false            # schedule catch-up runs from the start date to the current date
    dag_start_date: '2023-03-12'  # when the DAG starts running
    enable_delivery_cache: false  # required for DAG files
    max_active_runs: 10           # max active runs in the Airflow system for this DAG
    owner: CruxInformatics        # owner of the DAG
    priority_weight: 1            # the priority given to this DAG relative to others
    schedule_interval: '@once'    # DAG schedule; when the DAG is triggered after the start date
    queue: kubernetes             # the worker pool to run the jobs
                                  # (Cloud Composer options: default and kubernetes; AF 1.9: default, ongoing, and history)
    tags:                         # optional; enables tag search/filtering, only available in Cloud Composer
    - spm
    - delivery-dispatch
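As a sketch of the kind of checks the JSON Schema in workflow_crd.yaml performs on this block (the function below is hypothetical, not the library's validator):

```python
from datetime import date

def check_dag_settings(dag: dict) -> list:
    """Illustrative sanity checks for the v1.1.0 'dag' block."""
    errors = []
    if not isinstance(dag.get("dag_catchup"), bool):
        errors.append("dag_catchup must be a boolean")
    try:
        # dag_start_date must be an ISO date like '2023-03-12'.
        date.fromisoformat(dag.get("dag_start_date", ""))
    except ValueError:
        errors.append("dag_start_date must be YYYY-MM-DD")
    if not isinstance(dag.get("max_active_runs"), int):
        errors.append("max_active_runs must be an integer")
    return errors
```

A check like this would catch, for example, max_active_runs accidentally written as a float.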

V1.2.0 - Dataset, Data Product, and Organization Identifiers

Version 1.2 brings Dataset ID (a grouping of all data delivered, or failed, together), Data Product ID (a catalog-oriented collection of Datasets), and Org ID (an organizational grouping of Datasets). These fields are opinionated toward Crux's control plane, but we find these concepts are widely used and necessary for most control-plane implementations of ODIN; we will be reviewing generic versions of them in the future. The dataset_id and data_product_id will be validated to make sure they belong to the org_id. crux_api_conf is not deprecated in this version (yet). data_product_id is optional, while dataset_id and org_id are required.

Newly supported capabilities

  • Dataset ID
  • Data Product ID
  • Org ID
...:
  metadata:
      dataset_id: 'Ds012345'          # a grouping of transactional data. 1:1 with ODIN spec
      data_product_id: 'Dp012345'     # a collection of Datasets for cataloging and productization
      org_id: 'Or012345'              # organizational identifier for a control plane
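A minimal format check for these identifiers might look like the following. The prefix patterns are inferred from the sample values above and are an assumption; the real validation (including that a dataset_id belongs to its org_id) happens in the control plane:

```python
import re

# Hypothetical prefix patterns, inferred from the sample IDs above.
ID_PATTERNS = {
    "dataset_id": re.compile(r"^Ds[0-9A-Za-z]+$"),
    "data_product_id": re.compile(r"^Dp[0-9A-Za-z]+$"),
    "org_id": re.compile(r"^Or[0-9A-Za-z]+$"),
}

def check_metadata_ids(metadata: dict) -> bool:
    """Return True when the required IDs are present and every present
    identifier matches its prefix pattern."""
    if "dataset_id" not in metadata or "org_id" not in metadata:
        return False  # data_product_id is optional; these two are not
    return all(ID_PATTERNS[key].match(value)
               for key, value in metadata.items() if key in ID_PATTERNS)
```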

V1.3.0 - Vendor Declarations, Declared & Observed Schemas

Version 1.3 adds support for users to track schemas that are declared in vendor documentation, as well as observed from profiling the data. These schemas are only advisory, as the configured schema is what is primarily used in control plane implementations. These new fields are made optional, and the vendor-declared schema is defined and validated according to the requirements needed for ODIN to support hydration of the Crux Catalog. Frame description is also added as an optional field to the configured schema.

Newly supported capabilities

  • declared_schema_def
  • observed_schema_def
  • vendor_doc
  • frame_description
...:
pipelines:
  - id:                           
    vendor_doc:                   # URI (optional)
    global:
      global:
          schema_def:                   # Already exists
            ...
          declared_schema_def:          # declared or curated schema (optional)
            vendor_table_name:
            vendor_table_description:
            vendor_schedule:
            fields:
            - name:
              data_type:
              configured_data_type:   # must exist in schema_def and have same type
              configured_name:        # must exist in schema_def and have same name
              column_number:
              is_primary_key:
              page_number:
              vendor_description:
            - name: ...
          observed_schema_def:          # observed from profiling the data
            fields:
            - data_type:
              name:
              configured_data_type:   # must exist in schema_def and have same type
              configured_name:        # must exist in schema_def and have same name
            - name: ...
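The configured_name / configured_data_type constraints commented above can be sketched as a cross-check. This assumes schema_def carries a fields list of name/data_type pairs, which is a simplification of the real layout:

```python
def check_declared_schema(schema_def: dict, declared: dict) -> list:
    """Illustrative cross-check: each declared field's configured_name must
    exist in schema_def, with a matching configured_data_type."""
    configured = {f["name"]: f["data_type"] for f in schema_def.get("fields", [])}
    problems = []
    for field in declared.get("fields", []):
        name = field.get("configured_name")
        if name not in configured:
            problems.append(f"{name!r} not in schema_def")
        elif field.get("configured_data_type") != configured[name]:
            problems.append(f"{name!r} type mismatch")
    return problems
```

The same check applies to observed_schema_def, whose fields carry the same configured_name and configured_data_type keys.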

V1.4.0 Availability Deadlines

Allows the Control Plane to hydrate delivery deadlines. This gives users visibility into upstream data availability issues by specifying a cadence at which new data is expected.

Newly supported capabilities

  • Deadline
availability_deadlines:
- deadline_minute: '30' # The minute to run the check
  deadline_hour: '8' # The hour to evaluate the deadline
  deadline_day_of_month: '*' # Used for monthly and longer frequencies
  deadline_month: '*' # Used for yearly frequency
  deadline_day_of_week: '1' # supports cron pattern day range and *W (for weekdays)
  deadline_year: '*' # Must be *
  file_frequency: 'weekly' # One of "intraday", "daily", "weekly", "bi-weekly", "monthly", "semi-annual", "yearly"
  timezone: 'UTC'
- deadline_minute: '30' # Supports multiple deadlines 
  deadline_hour: '8'
  deadline_day_of_month: '*'
  deadline_month: '*'
  deadline_day_of_week: '5'
  deadline_year: '*'
  file_frequency: 'weekly'
  timezone: 'UTC'
pipelines:
...
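The constraints noted in the comments above (the allowed file_frequency values and the fixed deadline_year) can be sketched as a small check; this is illustrative, not the library's validator:

```python
# The file_frequency values listed in the v1.4.0 spec comments.
ALLOWED_FREQUENCIES = {"intraday", "daily", "weekly", "bi-weekly",
                       "monthly", "semi-annual", "yearly"}

def check_deadline(deadline: dict) -> list:
    """Illustrative checks mirroring the commented constraints."""
    errors = []
    if deadline.get("file_frequency") not in ALLOWED_FREQUENCIES:
        errors.append("unknown file_frequency")
    if deadline.get("deadline_year") != "*":
        errors.append("deadline_year must be '*'")
    return errors
```

Since availability_deadlines is a list, each entry would be checked independently.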

V1.5.0 - Destinations

This allows a list of Destinations selected from the domain model to be used by Delivery Dispatch.

Newly supported capabilities

  • Destinations
destinations:
  - id: AQxxxxxxxxxx
    name: Customer FTP site
...

V1.6.0 - Require crux_api_conf at the OUTER level

We used to allow the crux_api_conf declaration to exist at the step level or under the conf keyword of any step. If it was declared there, we moved it up to the global level when we read in the YAML and created a Workflow object (selecting the first one we found). We no longer do this: crux_api_conf must now exist at the outer level of the YAML file.

Example

id: sample_id
...
crux_api_conf: ${SAMPLE_ID_API}
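A sketch of how a reader of the spec could flag the now-disallowed nested declarations (illustrative only; the library's own validation enforces this rule):

```python
def find_nested_crux_api_conf(spec: dict) -> list:
    """Walk a parsed YAML spec and report every crux_api_conf declared
    anywhere below the top level, which v1.6.0 rejects."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "crux_api_conf" and path:  # path is empty at the top level
                    hits.append("/".join(path + [key]))
                walk(value, path + [key])
        elif isinstance(node, list):
            for i, item in enumerate(node):
                walk(item, path + [str(i)])

    walk(spec, [])
    return hits
```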

Roadmap

This roadmap outlines the incremental modeling capabilities that we plan to support in ODIN, but is not a commitment.

V1.X.0 Notifications

This allows for notification channels.

Thanks to all the contributors!

