datarails -- A Simple Framework for Dataframe ETL

-- VERSION 0.2.3 --

Datarails is a simple framework for organizing your in-memory, DataFrame-based ETL jobs. Whether you are using pandas, Spark, Glue, or anything else, this library gives you a simple way to structure your ETL jobs so that others don't have to debug your 300-line script by copying and pasting sections of it into a Jupyter notebook.

Steps are defined as classes and then passed to a step runner. All methods in the class that start with step_ will be run in the order they are defined. Each step has access to a DataBox object that can be used to store dataframes and access them by name in downstream steps.
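To make the "run in the order they are defined" rule concrete, here is a minimal sketch of how such discovery can work in plain Python. It is not the datarails implementation: `MiniStep`, `step_names`, and `run` are hypothetical names used only for illustration. The sketch relies on the fact that a class namespace preserves definition order, and it only looks at methods defined directly on the class (inherited steps are ignored).

```python
class MiniStep:
    """Hypothetical stand-in for a step base class (not datarails' own)."""

    @classmethod
    def step_names(cls) -> list:
        # vars(cls) is the class namespace, which keeps definition order.
        return [
            name
            for name, value in vars(cls).items()
            if name.startswith("step_") and callable(value)
        ]

    def run(self) -> list:
        executed = []
        for name in self.step_names():
            getattr(self, name)()  # call the bound method
            executed.append(name)
        return executed


class Transform(MiniStep):
    def step_b_first(self) -> None:
        pass

    def step_a_second(self) -> None:
        pass

    def helper(self) -> None:  # no step_ prefix, so it is never picked up
        pass


executed = Transform().run()
print(executed)  # ['step_b_first', 'step_a_second']
```

Note that the order is definition order, not alphabetical order, and that `helper` is skipped because it lacks the `step_` prefix.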

In addition to the DataBox object, each step has access to a DataRailsContext object that can be used to store and access variables that are not dataframes.
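The split between dataframes and other variables can be pictured with the following dependency-free sketch. It does not use the DataRailsContext API, which is not documented here: `frames` and `context` are hypothetical stand-ins, and a list of dicts plays the role of a dataframe.

```python
from types import SimpleNamespace

# Hypothetical stand-ins: one container for dataframes, one for everything else.
frames = {}                  # name -> dataframe-like object
context = SimpleNamespace()  # run date, thresholds, row counts, flags, ...


def step_configure() -> None:
    context.run_date = "2024-01-01"
    context.min_rows = 2


def step_load() -> None:
    # A list of dicts stands in for a dataframe in this sketch.
    frames["data"] = [{"value": 1}, {"value": 2}, {"value": 3}]


def step_validate() -> None:
    # Non-dataframe results go to the context, not to the dataframe store.
    context.row_count = len(frames["data"])
    if context.row_count < context.min_rows:
        raise ValueError("too few rows loaded")


for step in (step_configure, step_load, step_validate):
    step()

print(context.row_count)  # 3
```

Keeping scalars and settings out of the dataframe store means downstream steps can list or iterate over the stored dataframes without tripping over unrelated values.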

Basic Usage

import pandas as pd
from datarails.step import DataRailsStep
from datarails.runner import StepRunner


class LoadDataFromCSV(DataRailsStep):
    
    def step_load_csv(self) -> None:
        print('loading data from csv')
        df = pd.read_csv('data.csv')
        self.dbx.put_df('data', df) # dbx now has an attribute called data that is a dataframe


class TransformData(DataRailsStep):
    
    def step_add_new_column(self) -> None:
        print('adding new column')
        self.dbx.data['new_column'] = self.dbx.data['old_column'] * 2
    
    def step_drop_all_null_rows(self) -> None:
        print('dropping null rows')
        self.dbx.data = self.dbx.data.dropna()

    def step_rename_columns(self) -> None:  # the step_ prefix is required for the runner to pick this up
        print('renaming columns')
        self.dbx.data = self.dbx.data.rename(columns={'new_column': 'blue_column'})

        
class SaveData(DataRailsStep):
        
    def step_save_data(self) -> None:
        print('saving data')
        self.dbx.data.to_csv('new_data.csv')


steps = [
    LoadDataFromCSV,
    TransformData,
    SaveData
]        

runner = StepRunner(steps=steps)
runner.run()
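
The example above stores a dataframe with `put_df('data', df)` and then reads it back as `self.dbx.data`. One way that name-to-attribute behavior could be built is sketched below. This is an assumption about the mechanism, not the datarails DataBox source: `MiniDataBox` and `names` are hypothetical, and a plain dict stands in for a dataframe so the sketch has no dependencies.

```python
from typing import Any


class MiniDataBox:
    """Hypothetical stand-in; not the datarails DataBox implementation."""

    def put_df(self, name: str, df: Any) -> None:
        # Reject names that cannot be attributes or would shadow a method.
        if not name.isidentifier() or hasattr(type(self), name):
            raise ValueError(f"invalid dataframe name: {name!r}")
        setattr(self, name, df)

    def names(self) -> list:
        # Instance attributes are exactly the stored dataframes.
        return sorted(vars(self))


box = MiniDataBox()
box.put_df("data", {"old_column": [1, 2, 3]})

# Downstream code reads and reassigns the stored object by attribute name.
box.data["new_column"] = [v * 2 for v in box.data["old_column"]]

names = box.names()
print(names)                   # ['data']
print(box.data["new_column"])  # [2, 4, 6]
```

The guard in `put_df` is a design choice of this sketch: without it, storing a dataframe under a name such as `put_df` would silently replace the method on that instance.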
