Open Data Integration Nomenclature (ODIN) is Crux’s standard for declarative data delivery
Project description
Crux-Odin Library
Open Data Integration Nomenclature (ODIN) is Crux’s standard for declarative data delivery. ODIN provides a nomenclature for delivery that incentivizes industry-standard GitOps practices. ODIN specs are inherently abstracted from their underlying control planes and workflow frameworks, but work with the Crux External Data Platform.
Installing Crux-Odin
You install the Crux-Odin library via PyPI and pip in any Python environment you wish. You can install
it in a venv environment, a pipenv environment, or a poetry environment. You can also install it
at the system level if you wish. The installation doesn't vary from any other Python package. Just do
a pip install crux-odin or pip install crux-odin==<version> and you're good to go.
Using Crux-Odin
Crux-Odin features include:
- It specifies a YAML standard data format for data delivery. You specify the metadata and the pipelines and steps that need to run in these pipelines in YAML. The YAML file is versioned and the later versions add more statements to the YAML file in each version. Also, versions are backward compatible so a later version supports all the statements in an earlier version. See below for the different versions and what they contain.
- It contains routines for validating the YAML and making sure the fields are set correctly and the structure is
correct. These syntax specification for these versions are contained in a file called
workflow_crd.yamlwhich contains JSON Schema specifications of the syntax of the different YAML versions. (You can override the path to this file with theWORKFLOW_CRDenvironment variable). - It contains a routine
create_workflow()that allows you to convert the YAML specification into an internal Python first classWorkflowobject that you can manipulate. In programming languages, you generally want to deal in first class objects. - YAML files can exist in a tree. Children in a YAML file can point to their parent with the
parent:field. When processing or using these YAML files, we first have to merge files YAML files from the bottom up to the top. This library contains routines for reading in these YAML files and merging them. It also contains routines for locating the parents for children in a file system hierarchy.
Changing Your Code
from crux_odin.dict_utils import yaml_file_to_dict
from crux_odin.dataclass import create_workflow
workflow = create_workflow(yaml_file_to_dict("file.yaml")) # Version of Workflow gotten from YAML file
Validating YAML
from crux_odin.validate_yaml import validate_yaml
validate_yaml('file.yaml')
See YAMLFileClosures for routines for merging parent and child YAML files and dict_utils.py for routines
for merge dictionaries.
Crux-Odin YAML Versions
V1.0.0 - Crux's proprietary PDK framework
Supported information:
Some of the information stored in the YAML file is
- ID (airflow specific)
- Connection info + extraction info
- Normalizer spec
- Schema history + schema validations
- Context / Environment Variables
id: sample_id
run_uber_step: true
global:
global:
encoding: ascii
timedelta:
days: -1
schema_def:
na_values: [ "", " " ]
crux_api_conf: ${SAMPLE_ID_API}
endpoint: ${API_HOST}
extract:
action_class: pipeline.crux_pdk.actions.extract.extractor.ShortCircuitExtractor
connection_lib: pipeline.custom_libs.sample.connector
fetch_method: fetch_directory
remote_path: /pub/sparx/
connection:
type: SAMPLE_ID_CONNECTOR
conf: ${CRUX_SPARTA_SFTP}
zendesk_conf:
wait_time: 60
payload:
organization_id: 123123123123
role: end-user
ticket_restriction: organization
skip_verify_email: true
pipelines:
- id: sample_id
global:
global:
supplier_implied_date_regex: active_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
provenance_file_patterns:
origin_patterns:
- active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
return_patterns:
- active_ts_users_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
steps:
- id: extract
category: short_circuit
conf:
file_patterns:
- active_users_{FD_YYYY}{FD_MM}{FD_DD}\.csv
Note: the outside global is inherited by the pipelines, the 'inside' global is inherited by the steps.
The IDs have to match, extract above matches - id: extract
V1.1.0 - True Declarative Dataset
This is the first version of the spec that replaces the .py DAG files with full declarative syntax in YAML.
Newly supported capabilities
- Schedule
...:
dag:
dag_catchup: false # (schedule catch up runs to current date starting from start date)
dag_start_date: '2023-03-12' # (when the dag start running)
enable_delivery_cache: false # (required for dag files)
max_active_runs: 10. # (max active run in airflow system for this dag)
owner: CruxInformatics # (owner of the dag)
priority_weight: 1 # (the order in which this dag is given priority compared to others)
schedule_interval: '@once' # (dag schedule, when dag will be triggered after dag start date)
queue: kubernetes # (the worker pool to run the jobs,
# on cloud composer options: default and kubernetes, AF 1.9: default, ongoing and history)
tags: # (enable the tag searches/filtering on optional and only available in cloud composer)
- spm
- delivery-dispatch
V1.2.0 - Dataset, Data Product, and Organization Identifiers
Version 1.2 brings Dataset ID, a grouping of all data delivered or failed together, Data Product ID, a catalog-oriented collection of Datasets, and Org ID, a useful organizational grouping of Datasets. These fields are opinionated to Crux's control plane, but we find these concepts are widely used and necessary for most Control Plane implementations of ODIN. We will be reviewing generic versions of this in the future. The dataset_id and data_product_id will be validated to make sure they are in the org_id. crux_api_conf is not deprecated in this version (yet).
data_product_id is optional while dataset_id and org_id are required.
Newly supported capabilities
- Dataset ID
- Data Product ID
- Org ID
...:
metadata:
dataset_id: 'Ds012345' # a grouping of transactional data. 1:1 with ODIN spec
data_product_id: 'Dp012345' # a collection of Datasets for cataloging and productization
org_id: 'Or012345' # organizational identifier for a control plane
V1.3.0 - Vendor Declarations, Declared & Observed Schemas
Version 1.3 adds support for users to track schemas that are declared in vendor documentation, as well as observed from profiling the data. These schemas are only advisory, as the configured schema is what is primarily used in control plane implementations. These new fields are made optional, and the vendor-declared schema is defined and validated according to the requirements needed for ODIN to support hydration of the Crux Catalog. Frame description is also added as an optional field to the configured schema.
Newly supported capabilities
- declared_schema_def
- observed_schema_def
- vendor_doc
- frame_description
...:
pipelines:
- id:
vendor_doc: # URI (optional)
global:
global:
schema_def: # Already exists
...
declared_schema_def: # declared or curated schema (optional)
vendor_table_name:
vendor_table_description:
vendor_schedule:
fields:
- name:
data_type:
configured_data_type: # must exist in schema_def and have same type
configured_name: # must exist in schema_def and have same name
column_number:
is_primary_key:
page_number:
vendor_description:
- name: ...
observed_schema_def: # observed from profiling the data
fields:
- data_type:
name:
configured_data_type: # must exist in schema_def and have same type
configured_name: # must exist in schema_def and have same name
- name: ...
V1.4.0 Availability Deadlines
Allows the Control Plane the ability to apply to hydrate delivery deadlines. This provides users visibility into upstream data availability issues by specifying a cadence for expected new data.
Newly supported capabilities
- Deadline
availability_deadlines:
- deadline_minute: '30' # The minute to run the check
deadline_hour: '8' # The hour to evaluate the
deadline_day_of_month: '*' # Used for monthly and longer frequencies
deadline_month: '*' # Used for yearly frequency
deadline_day_of_week: '1' # supports cron pattern day range and *W (for weekdays)
deadline_year: '*' # Must be *
file_frequency: 'weekly' # One of "intraday", "daily", "weekly", "bi-weekly", "monthly", "semi-annual", "yearly"
timezone: 'UTC'
- deadline_minute: '30' # Supports multiple deadlines
deadline_hour: '8'
deadline_day_of_month: '*'
deadline_month: '*'
deadline_day_of_week: '5'
deadline_year: '*'
file_frequency: 'weekly'
timezone: 'UTC'
piplines:
...
V1.5.0 - Destinations
This allows a list of Destinations selected from the domain model to be used by Delivery Dispatch.
Newly supported capabilities
- Destinations
destinations:
- id: AQxxxxxxxxxx
name: Customer FTP site
...
V1.6.0 - Require crux_api_conf at the OUTER level
We used to allow the crux_api_conf declaration to exist at the step level or
under the conf keyword of any step. If it was declared there, then we would move
it up to the global level when we read in the YAML and created a Workflow object
(we'd select the first one we found). We now longer do this and require that the
crux_api_conf exist at the outer level of the YAML file.
Example
id: sample_id
...
crux_api_conf: ${SAMPLE_ID_API}
Roadmap
This roadmap outlines the incremental modeling capabilities that we plan to support in ODIN, but is not a commitment.
V1.X.0 Notifications
This allows for notification channels.
Thanks to all the contributors:
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file crux_odin-1.11.5.tar.gz.
File metadata
- Download URL: crux_odin-1.11.5.tar.gz
- Upload date:
- Size: 36.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a9c91b99ebe6caf703cde76ed1179ac69d5412f5a61c3682fc50f667fd3dd9f4
|
|
| MD5 |
77933c0e681edb1b8fefcaffa711be96
|
|
| BLAKE2b-256 |
d1f85cc2e3d11cfb9a74cb8755f7fdd14a47fa796326626f023b7ebfa9232aee
|
File details
Details for the file crux_odin-1.11.5-py3-none-any.whl.
File metadata
- Download URL: crux_odin-1.11.5-py3-none-any.whl
- Upload date:
- Size: 35.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.12.3 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0d4c6e67332055e4f2fac564593610bb02e63ed570a714658107015a1a05a67b
|
|
| MD5 |
4afdc2acfaa436dfb68534a2e26c550c
|
|
| BLAKE2b-256 |
d0aada248b1c876c9654d5fe49a0e5710b24629bb4c571275a8799cb42307997
|