# projectoneflow-framework: a wrapper for projectoneflow ingestion patterns

## Project description

The projectoneflow-framework package is a wrapper over the projectoneflow package, which provides all data ingestion patterns.
## Quick Start

To test locally, please follow the steps below.

### Install

The package is published on the PyPI package manager. To install it, run:

```shell
pip install projectoneflow-framework
```
## Basic Idea

Before getting started, let's understand the concepts involved in this project.

This framework introduces the concept of a contract: any data project can be defined as a contract, and a contract consists of assets/objects that the involved data stakeholders create.

A contract specification is needed to define those assets. Currently, the framework supports one contract type, called the project contract.

The project contract specification defines datasets, pipelines, transform, and deploy as its assets.
Let's talk about each asset in detail:

- **Datasets**: holds data objects such as database schemas, tables, and views.
- **Pipelines**: holds data pipeline objects, pipeline tasks, and pipeline transformation logic.
- **Transform**: holds modules/re-usable logic shared across pipelines.
- **Deploy**: holds the configuration describing where the assets defined above are deployed.
The sections below contain instructions to get started.
## To Get Started

This package comes with the `oframework` command. For a quick reference of all sub-commands, run `oframework -h`.

To get started with the framework, set up the project contract using the command below:

```shell
oframework blueprint generate contract -f <TARGET_FOLDER_PATH> -c <CONTRACT_NAME>
```

The above command automatically generates the contract specification in the provided folder. Once you have run it, the following folder structure is created under the specified `<TARGET_FOLDER_PATH>`:

```
<CONTRACT_NAME>/
<CONTRACT_NAME>/pipeline/
<CONTRACT_NAME>/dataset/
<CONTRACT_NAME>/transform/
<CONTRACT_NAME>/<CONTRACT_NAME>.json
```

As shown in the folder structure above, the command creates the folder `<CONTRACT_NAME>` under `<TARGET_FOLDER_PATH>`, where the `pipeline`, `dataset`, and `transform` folders correspond to the project contract assets. A project contract specification is created by default, but the command accepts a contract type via the option `--contract_type <CONTRACT_TYPE>`. The created `<CONTRACT_NAME>.json` follows this template:

```json
{
  "name": "<CONTRACT_NAME>",
  "description": "",
  "dataset": ["datasets"],
  "pipelines": ["pipelines"],
  "transform": ["transform"],
  "deploy": {"databricks": null}
}
```
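For illustration, a filled-in contract for a hypothetical project called `sales_ingest` might look like the following (the name and description are invented for this example; the remaining fields follow the generated template):

```json
{
  "name": "sales_ingest",
  "description": "Ingests daily sales data into the lakehouse",
  "dataset": ["datasets"],
  "pipelines": ["pipelines"],
  "transform": ["transform"],
  "deploy": {"databricks": null}
}
```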
- To get started with the pipeline specification, use the same blueprint command with a different sub-command:

```shell
oframework blueprint generate pipeline -f <TARGET_FOLDER_PATH>
```

The above command generates the pipeline specification in the pipeline folder, where a pipeline JSON template is created following your answers. `<TARGET_FOLDER_PATH>` specifies where the generated template files are written; if not specified, they are saved to the current directory.
- To get started with the dataset specification, follow the same steps as above, but for a dataset you also need to specify the data object type:

```shell
oframework blueprint generate dataset -n <DATAOBJECT_NAME> -t {schema,table,view} -f <TARGET_FOLDER_PATH>
```

The above command generates the data object specification in the specified `<TARGET_FOLDER_PATH>` folder, creating the template file according to the data object type specified.
## Advanced Commands

- To deploy the project contract, use the command below. It requires the target deployment environment credentials, specified either in the project contract deploy specification or as global environment variables. Please refer to the [environment section](#environment-variables).

```shell
oframework deploy -f <PROJECT_FOLDER> --environment {local,dev,test,uat,prod}
```

Deployment and destruction use Terraform as the default deployment provider, but other providers may be supported in the future based on availability.
- To destroy the project contract, use the same form of command as above. Below is an example:

```shell
oframework destroy -f <PROJECT_FOLDER> --environment {local,dev,test,uat,prod}
```

Destroying resources works the same as deployment, but the Terraform state file must be the same one produced at deployment; otherwise errors or unintended redeployment/destruction can occur. State management works the same as Terraform state management, and a backend_config can be passed just as with Terraform commands.
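As an illustration, Terraform backend configurations are key/value settings whose keys depend on the chosen backend; assuming an S3 backend (the bucket, key, and region values below are invented for this example), the JSON form referenced by `OF_TF_BACKEND_CONFIG` could look like:

```json
{
  "bucket": "my-tfstate-bucket",
  "key": "contracts/sales_ingest/terraform.tfstate",
  "region": "us-east-1"
}
```

Whether the framework accepts exactly this shape is an assumption here; consult the Terraform documentation for the backend type you use.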
- To quickly test a pipeline, use the command below:

```shell
oframework run -f <PROJECT_FOLDER> --environment {local,dev,test,uat,prod} -p <PIPELINE_NAME> -t <CLUSTER_ID>
```

Pipeline deployment and execution differ by environment. In the local environment, the pipeline runs in a separate process and logs to a temporary log file. In any other environment, the pipeline is deployed to the target environment, run there, and destroyed after the run completes.

If you are running in an environment other than local (which runs on Databricks), providing `<CLUSTER_ID>` can speed up the pipeline run by reusing an existing cluster instead of creating a new one.
## Environment Variables

The framework's primary interface is the command line. Executing a project specification requires some configuration that depends on the environment and the purpose, such as deployment, running, or security. For this, a set of global environment variables is defined that developers/projects can use.

Below is the list of global environment variables:
- `OF_TF_DATABRICKS_CLIENT_ID`: Databricks client ID used by the Terraform deployment for authentication.
- `OF_TF_DATABRICKS_CLIENT_SECRET`: Databricks client secret used by the Terraform deployment for authentication.
- `OF_TF_DATABRICKS_ACCESS_TOKEN`: Databricks access token used by the Terraform deployment for authentication.
- `OF_TF_DATABRICKS_WORKSPACE`: Databricks workspace URL where resources are deployed.
- `OF_TF_DATABRICKS_DEPLOY_CLUSTER_ID`: Databricks cluster ID used when running pipelines; it replaces the provided cluster ID.
- `OF_TF_BACKEND_CONFIG`: Terraform backend configuration, specified as JSON.
- `OF_TF_DATABRICKS_CATALOG`: Databricks catalog where all dataset objects are deployed.
- `OF_DATABRICKS_ARTIFACTS_PATH`: Databricks artifacts path where the project package is stored.
- `OF_PRESETTING_NAME_PREFIX`: contract pre-setting name prefix; if specified, all resource names are prefixed with it.
- `OF_PRESETTING_NAME_SUFFIX`: contract pre-setting name suffix; if specified, all resource names are suffixed with it.
- `OF_PIPELINE_TAGS`: contract pre-setting pipeline tags; if specified, the tags are added to pipeline resources.
- `OF_PIPELINE_TASK_CHECKPOINT_LOCATION_PREFIX`: streaming pipeline processing requires a checkpoint location; this setting provides the prefix for the checkpoint location of all tasks.
- `OF_PIPELINE_METADATA_LOCATION`: metadata location where pipeline state is stored.
- `OF_PRESETTING_TAGS`: contract pre-setting tags in JSON form; if specified, all resources are tagged with them.
- `OF_DATABRICKS_SECRET_SCOPE`: specifies the environment-specific Databricks secret scope.
- `OF_DEPLOY_CORE_PACKAGE_PATH`: the library/path that projectoneflow-framework uses to run the pipeline. By default projectoneflow is used, but if you want an untested version of projectoneflow, you can specify it here.
- `OF_DEPLOY_CORE_PACKAGE_TYPE`: the package type; by default this should be set to `whl`. If you specify `OF_DEPLOY_CORE_PACKAGE_PATH`, this variable also needs to be defined.
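As a sketch, the variables above can be exported in the shell before invoking `oframework deploy` or `oframework run`. All values below are placeholders invented for this example, not real workspace URLs or credentials:

```shell
# Placeholder values -- substitute your own workspace and credentials.
export OF_TF_DATABRICKS_WORKSPACE="https://adb-1234567890123456.7.azuredatabricks.net"
export OF_TF_DATABRICKS_CLIENT_ID="my-service-principal-id"
export OF_TF_DATABRICKS_CLIENT_SECRET="my-service-principal-secret"
export OF_TF_DATABRICKS_CATALOG="dev_catalog"
# Prefix every deployed resource name with the environment
export OF_PRESETTING_NAME_PREFIX="dev_"
# Terraform backend configuration passed as a JSON string
export OF_TF_BACKEND_CONFIG='{"bucket":"my-tfstate-bucket","key":"terraform.tfstate","region":"us-east-1"}'
```

After exporting these, the deploy and run commands in the sections above pick the values up from the environment.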
## File details

Details for the source distribution file `projectoneflow_framework-1.0.0.tar.gz`:

- Size: 59.5 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 59b34f88a54b92c36fca0f23d04c023ed335cf90216d37bad6f2ea96c8cf6ede |
| MD5 | 5e0af32a240c361a070844c6f12caa9a |
| BLAKE2b-256 | 4626bc85e9c71a4ea03d204fd9b860fe897483bd70924bffbd168bc60fb7cdf6 |