Pype
Author: Sam Britton
Version: 0.0.67
Date of creation: 26/02/2024
This repo contains the code for defining Python-based pipelines!
Installation
To install, simply run `pip install git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git`
If you run `pip freeze` to generate a `requirements.txt`, this package will appear as `pype=={version}`. For that file to work with `pip install -r requirements.txt`, make sure to replace the `pype=={version}` line with `git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git`
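For example, the edit to `requirements.txt` looks like this (version number illustrative, taken from this release):

```
# Before (as written by pip freeze) - not installable from PyPI:
pype==0.0.67

# After - installable with pip install -r requirements.txt:
git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git
```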
For databricks notebook:
A few steps here:

1. Create a GitHub personal access token.
2. Set up a Databricks secret scope and add a secret whose value is the personal access token you just created.
3. Add this code in a cell at the top of your notebook:
```python
git_token = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")
```

where you need to replace `<scope-name>` and `<secret-name>` with the Databricks secret scope name and secret name that you just created.
Then in another cell add this code:
```
%pip install git+https://$git_token@github.com/Fathom-Financial-Consulting-Ltd/Pype.git
```
and that's it! You should have pype installed on your cluster!
Summary
The main file in this package is pipeline.py, which contains the Job and Pipeline classes. These are the classes that define the pipelines: a Pipeline comprises an arbitrary number of Job(s)/Pipeline(s).
Job
A Job is an object that takes a name, a function, and either or both of a list of inputs and a list of dependencies upon instantiation. The dependencies are a list of other Job(s)/Pipeline(s). If dependencies are present, the output of each dependency is passed to the job's function as its first arguments, in the order the dependencies appear in the list. You must therefore make sure the dependency order matches the order of the function's arguments, otherwise you will likely encounter an error.
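A minimal sketch of the idea described above (this is illustrative only, not pype's actual implementation; the keyword names `name`, `function`, `inputs`, and `dependencies` follow the description, but a `run` method and an `output` attribute are assumptions):

```python
class Job:
    """Toy Job: wires dependency outputs into the function as its first arguments."""

    def __init__(self, name, function, inputs=None, dependencies=None):
        self.name = name
        self.function = function
        self.inputs = inputs or []
        self.dependencies = dependencies or []
        self.output = None

    def run(self):
        # Dependency outputs come first, in list order, followed by this job's own inputs.
        dep_outputs = [dep.output for dep in self.dependencies]
        self.output = self.function(*dep_outputs, *self.inputs)
        return self.output


# 'double' depends on 'load', so load's output becomes double's first argument.
load = Job("load", lambda x: x + 1, inputs=[1])
double = Job("double", lambda upstream: upstream * 2, dependencies=[load])
load.run()    # output: 2
double.run()  # output: 4
```

Note how the dependency list order is what determines argument order, which is why it must match the function signature.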
Pipeline
This is an object that takes a name and a list of Job(s)/Pipeline(s) upon instantiation. It contains a method to run them in sequence, in the order they are provided in the list. A pipeline is essentially a network with Jobs as nodes, where the job dependencies define the edges.
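Continuing the sketch above (again illustrative, not pype's actual API; the `run` method name and `output` attribute are assumptions):

```python
class Job:
    # Minimal stand-in: a named unit of work with optional dependencies (illustrative only).
    def __init__(self, name, function, inputs=None, dependencies=None):
        self.name, self.function = name, function
        self.inputs = inputs or []
        self.dependencies = dependencies or []
        self.output = None

    def run(self):
        self.output = self.function(*[d.output for d in self.dependencies], *self.inputs)


class Pipeline:
    """Toy Pipeline: runs its jobs in the order they were provided."""

    def __init__(self, name, jobs):
        self.name = name
        self.jobs = jobs

    def run(self):
        for job in self.jobs:
            job.run()


# A two-node network: 'total' has an edge from 'extract'.
extract = Job("extract", lambda: [1, 2, 3])
total = Job("total", sum, dependencies=[extract])
pipeline = Pipeline("demo", [extract, total])
pipeline.run()
print(total.output)  # 6
```

Because jobs run in list order, each job must appear after all of its dependencies.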
Other modules
There are several other modules in this package (so far, ingestion and cleaning). These contain useful functions that can be passed to Job objects as their functions.
Next steps
- Azure integration - some method to programmatically define these pipelines in Data Factory would be great
Hashes for fathompype-0.0.67-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5e4b526f8e00bf830a48eb499199dfb9937868390493bd1774ada0628d7b2860
MD5 | ecd6090ad0faed41744ef5d00ed72bbc
BLAKE2b-256 | c37512d522e3f90942e7e30b21b93deb274671d1af82c258e62da60482f48eae