
Pype

Author: Sam Britton Version: 0.0.67 Date of Creation: 26/02/2024

This repo contains the code for defining Python-based pipelines!

Installation

To install, simply run pip install git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git

If you run pip freeze to generate a requirements.txt, this package will appear as pype=={version}. For pip install -r requirements.txt to work, replace the pype=={version} line with git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git
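One way to do that replacement from the command line is with sed (a sketch; the exact pinned line depends on your environment):

```shell
# Swap the pinned pype line for the Git URL so the file installs cleanly
sed -i 's|^pype==.*|git+https://github.com/Fathom-Financial-Consulting-Ltd/Pype.git|' requirements.txt
```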

For databricks notebook:

There are a few steps here. First, create a GitHub personal access token. Then set up a Databricks secret scope and add a secret whose value is the token you just created. Once this is done, add this code in a cell at the top of your notebook:

git_token = dbutils.secrets.get(scope="<scope-name>", key="<secret-name>")

where you replace <scope-name> and <secret-name> with the Databricks secret scope name and secret name you just created. Then, in another cell, add this code:

%pip install git+https://$git_token@github.com/Fathom-Financial-Consulting-Ltd/Pype.git

and that's it! You should have pype installed on your cluster!

Summary

The main file in this package is pipeline.py, which contains the Job and Pipeline classes. These are the classes that define the pipelines: a Pipeline comprises an arbitrary number of Job(s)/Pipeline(s).

Job

A Job is an object that takes a name, a function, and either or both of a list of inputs and a list of dependencies upon instantiation. The dependencies are a list of other Job(s)/Pipeline(s). If dependencies are present, the output of each dependency is passed to the job's function as its first arguments, in the order the dependencies appear in the list. You must therefore configure your job to match the order of the function's arguments, otherwise you will likely encounter an error.
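The argument-ordering rule above can be sketched with a minimal stand-in class (this is illustrative only, not the package's actual implementation; the run method and attribute names are assumptions):

```python
# Minimal stand-in for a Job: dependency outputs are passed to the
# function first, in list order, followed by the job's own inputs.
class Job:
    def __init__(self, name, function, inputs=None, dependencies=None):
        self.name = name
        self.function = function
        self.inputs = inputs or []              # extra positional arguments
        self.dependencies = dependencies or []  # other Job objects
        self.output = None

    def run(self):
        # Collect each dependency's output in the order they were listed.
        dep_outputs = [dep.output for dep in self.dependencies]
        self.output = self.function(*dep_outputs, *self.inputs)
        return self.output

load = Job("load", lambda: [1, 2, 3])
total = Job("total", lambda data, scale: sum(data) * scale,
            inputs=[10], dependencies=[load])

load.run()
total.run()  # → 60: sum([1, 2, 3]) * 10
```

Note that the function signature `lambda data, scale: ...` matches the order "dependency outputs first, then inputs"; swapping `data` and `scale` would raise an error, as the text warns.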

Pipeline

This is an object that takes a name and a list of Job(s)/Pipeline(s) upon instantiation. It then contains a method to run them in sequence, in the order that they are provided in the list. A pipeline is essentially a network with Jobs as nodes where the job dependencies define the edges.
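A self-contained sketch of this run-in-sequence behaviour, again using stand-in classes rather than the package's actual code (names and methods are assumptions):

```python
# Stand-in Job, as described above: dependency outputs feed the function first.
class Job:
    def __init__(self, name, function, inputs=None, dependencies=None):
        self.name, self.function = name, function
        self.inputs = inputs or []
        self.dependencies = dependencies or []
        self.output = None

    def run(self):
        dep_outputs = [dep.output for dep in self.dependencies]
        self.output = self.function(*dep_outputs, *self.inputs)
        return self.output

# Stand-in Pipeline: runs its steps in the order they were provided,
# so each Job must appear after the Jobs it depends on.
class Pipeline:
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps  # Jobs (or nested Pipelines) in execution order

    def run(self):
        for step in self.steps:
            step.run()

extract = Job("extract", lambda: [3, 1, 2])
clean = Job("clean", sorted, dependencies=[extract])

pipeline = Pipeline("etl", [extract, clean])
pipeline.run()
clean.output  # → [1, 2, 3]
```

Because the steps run strictly in list order, the list must be a valid topological ordering of the dependency network described above.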

Other modules

There are several other modules in this package (so far, ingestion and cleaning). These contain useful functions that can be passed to Job objects.

Next steps

  • Azure integration - some method to programmatically define these pipelines in Data Factory would be great
