Create simple Pipelines with Python

🎯 simple_dag

Welcome to simple_dag! Here, we provide the easiest way to create a pipeline in an orchestration-agnostic manner. Just decorate your functions with our @transform decorator! 🎉

  • Free software: MIT license

🚀 Getting Started

DAG

git clone https://github.com/leokster/simple_dag.git
cd simple_dag
python3.10 -m venv venv
source venv/bin/activate
pip install simple_dag
venv/bin/dagit -f examples/dag.py

💡 The Main Ideas

What is a DAG? 🤔:

A DAG, or Directed Acyclic Graph, represents a set of functions (the nodes) and their dependencies (the edges). It allows us to execute many functions, which depend on each other, in a specific order.
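
To make the idea of "a specific order" concrete, here is a minimal sketch that uses Python's standard-library graphlib module (Python 3.9+), not simple_dag itself. The function names in the dictionary are hypothetical and only illustrate how dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a function name, each value is the set
# of functions whose results it depends on (the edges of the DAG).
dependencies = {
    "load_raw": set(),
    "clean": {"load_raw"},
    "filter_es": {"clean"},
    "summarize": {"clean"},
    "report": {"filter_es", "summarize"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Several orders can be valid (filter_es and summarize may come in either order), but load_raw always runs first and report always runs last. If the dependencies contained a cycle, graphlib would raise a CycleError, which is why the graph must be acyclic.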

Aren't there already many DAG libraries?:

Absolutely, but most of them are tightly coupled to specific orchestration frameworks and require a very specific way to define a DAG. This makes it challenging to switch between frameworks. Our library, however, is different! 🎈

What is the goal of this library?:

Our library aims to offer a simple and streamlined way to define a DAG in a framework-agnostic manner. This means you can switch between frameworks without having to rewrite your DAG. As of now, we support Dagster and direct execution. 🎯

What is a transform?:

In the context of a data pipeline, a transform is a function that takes some data as input and produces some new data as output. It's like the magic wand in your data pipeline. 🪄
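
Stripped of any framework, a transform can be as small as the sketch below. It uses plain Python lists and dictionaries rather than simple_dag, and both the function name keep_location and the sample rows are made up for illustration.

```python
def keep_location(rows, location):
    """Return only the rows whose company_location matches `location`."""
    return [row for row in rows if row["company_location"] == location]


# Illustrative input data (not taken from any real dataset).
rows = [
    {"job_title": "Data Scientist", "company_location": "ES"},
    {"job_title": "ML Engineer", "company_location": "US"},
]

# The transform takes data in and produces new data; the input is left unchanged.
filtered = keep_location(rows, "ES")
print(filtered)
```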

Show me some code! 👩‍💻:

Imagine a transformation that reads a CSV file, filters the data, and writes the result to a new CSV file. The @transform decorator marks a function as a transformation. PandasDFInput loads the data before the function runs, and PandasDFOutput writes the data the function produces. In the function, df is the loaded pandas dataframe and output is the PandasDFOutput object whose write_data method saves the result.

import os
from simple_dag import transform, PandasDFInput, PandasDFOutput

@transform(
        df=PandasDFInput(
                os.path.join("data/curated/ds_salaries_2023.csv"),
        ),
        output=PandasDFOutput(
                os.path.join("data/curated/ds_salaries_2023_ES.csv"),
        ),
)
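def create_2023_salaries_ES(df, output: PandasDFOutput):
    # Keep only the rows for companies located in Spain.
    df = df[df["company_location"] == "ES"]
    # Write the filtered dataframe through the output object.
    output.write_data(df, index=False)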

@transform:

This decorator indicates that a function is a transformation. It accepts Input and Output arguments. Please note, the Output arguments are passed directly to your function, while the Input arguments are processed by the Input class and then the resultant data is passed to your function.
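
The difference between how inputs and outputs reach your function can be illustrated with a toy decorator. This is only a sketch of the pattern described above, not simple_dag's actual implementation: ToyInput, ToyOutput, toy_transform, and the read_data method are hypothetical names invented for this example (only write_data mirrors a name from the library).

```python
import functools


class ToyInput:
    """Hypothetical input: holds a value and hands it over when read."""

    def __init__(self, value):
        self.value = value

    def read_data(self):
        return self.value


class ToyOutput:
    """Hypothetical output: records whatever is written to it."""

    def __init__(self):
        self.written = None

    def write_data(self, data):
        self.written = data


def toy_transform(**bindings):
    """Bind named inputs and outputs to a function's parameters."""

    def decorator(func):
        @functools.wraps(func)
        def run():
            kwargs = {}
            for name, binding in bindings.items():
                if isinstance(binding, ToyInput):
                    # Inputs are resolved first; the function gets the data.
                    kwargs[name] = binding.read_data()
                else:
                    # Outputs are passed through as objects.
                    kwargs[name] = binding
            return func(**kwargs)

        return run

    return decorator


sink = ToyOutput()


@toy_transform(numbers=ToyInput([1, 2, 3, 4]), output=sink)
def keep_even(numbers, output):
    output.write_data([n for n in numbers if n % 2 == 0])


keep_even()
print(sink.written)  # [2, 4]
```

Because the function only sees loaded data and an object with a write_data method, it does not need to know where the data comes from or where it goes, which is what keeps the function body independent of any particular orchestration framework.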

Input:

Inputs prepare the data for your function. Currently, we support the following inputs:

  • PandasDFInput: Reads a pandas dataframe from a CSV file. The function receives this data as a pandas dataframe.
  • BinaryInput: Reads a binary file. The function receives this data as a bytes object.
  • SparkDFInput: Reads a Spark dataframe from a parquet file (Experimental). The function receives this data as a Spark dataframe.

Output:

Outputs write the data after your function has processed it. The Output objects have a write_data method, which can be used in your function to write the data. Currently, we support the following outputs:

  • PandasDFOutput: Writes a pandas dataframe to a CSV file.
  • BinaryOutput: Writes a binary file.
  • SparkDFOutput: Writes a Spark dataframe to a parquet file (Experimental).
