datarails -- A Simple Framework for Dataframe ETL
-- VERSION 0.2.3 --
Datarails is a simple framework for organizing your in-memory, dataframe-based ETL jobs. It doesn't matter if you are using pandas, spark, glue, or anything else: this library serves as a simple way to structure your ETL jobs so that others don't come along and have to debug your 300-line script by copying and pasting sections of it into a Jupyter notebook.
Steps are defined as classes and then passed to a step runner. All methods in the class whose names start with step_ are run in the order they are defined.
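To make the "prefix plus definition order" rule concrete, here is a minimal plain-Python sketch of how such discovery could work. It is not datarails' actual implementation: `SketchStep`, `run_steps`, and `Transform` are hypothetical names invented for this illustration, and the sketch only looks at methods defined directly on each class.

```python
class SketchStep:
    """Hypothetical stand-in base class; this is not datarails' DataRailsStep."""


def run_steps(step_classes):
    """Instantiate each class and call its step_* methods in definition order."""
    executed = []
    for cls in step_classes:
        instance = cls()
        # vars(cls) is the class namespace, which preserves definition order.
        for name, member in vars(cls).items():
            if name.startswith("step_") and callable(member):
                getattr(instance, name)()
                executed.append(f"{cls.__name__}.{name}")
    return executed


class Transform(SketchStep):
    def step_b_first(self):
        pass

    def helper(self):
        # No step_ prefix, so the runner sketch above skips this method.
        pass

    def step_a_second(self):
        pass


print(run_steps([Transform]))
# ['Transform.step_b_first', 'Transform.step_a_second']
```

Note that the order follows where the methods appear in the class body, not their alphabetical order, and that helper methods without the prefix are left alone.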
Each step has access to a DataBox object that can be used to store dataframes and access them by name in downstream steps.
In addition to the DataBox object, each step has access to a DataRailsContext object that can be used to store and access variables that are not dataframes.
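The split between the two objects can be pictured with a small stand-in. This is a sketch under assumptions, not the library's code: `SketchBox` and its `names` helper are hypothetical, `put_df` only mirrors the method name used in the usage example in this README, a plain list of dicts stands in for a dataframe, and `SimpleNamespace` stands in for the context object.

```python
from types import SimpleNamespace


class SketchBox:
    """Hypothetical stand-in for a DataBox: stores frames and exposes them by name."""

    def __init__(self):
        self._names = []

    def put_df(self, name, df):
        # Store the frame as an attribute so later steps can read box.<name>.
        setattr(self, name, df)
        if name not in self._names:
            self._names.append(name)

    def names(self):
        # Hypothetical helper: list the stored frame names in insertion order.
        return list(self._names)


box = SketchBox()
context = SimpleNamespace()  # stand-in for a context holding non-dataframe values

rows = [{"old_column": 1}, {"old_column": 2}]  # stand-in for a dataframe
box.put_df("data", rows)
context.source_path = "data.csv"  # a plain variable, not a dataframe

print(box.names(), box.data is rows, context.source_path)
```

Keeping dataframes and ordinary variables in separate containers means a downstream step can tell at a glance which values are data and which are configuration.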
Basic Usage
import pandas as pd

from datarails.step import DataRailsStep
from datarails.runner import StepRunner


class LoadDataFromCSV(DataRailsStep):
    def step_load_csv(self) -> None:
        print('loading data from csv')
        df = pd.read_csv('data.csv')
        self.dbx.put_df('data', df)  # dbx now has an attribute called data that is a dataframe


class TransformData(DataRailsStep):
    def step_add_new_column(self) -> None:
        print('adding new column')
        self.dbx.data['new_column'] = self.dbx.data['old_column'] * 2

    def step_drop_all_null_rows(self) -> None:
        print('dropping null rows')
        self.dbx.data = self.dbx.data.dropna()

    # Note: this method has no step_ prefix, so by the rule described above
    # the runner will not call it. Rename it to step_rename_columns to include it.
    def rename_columns(self) -> None:
        print('renaming columns')
        self.dbx.data = self.dbx.data.rename(columns={'new_column': 'blue_column'})


class SaveData(DataRailsStep):
    def step_save_data(self) -> None:
        print('saving data')
        self.dbx.data.to_csv('new_data.csv')


# Steps are passed as classes and run in list order
steps = [
    LoadDataFromCSV,
    TransformData,
    SaveData
]

runner = StepRunner(steps=steps)
runner.run()
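The example above reads data.csv and expects an old_column column, so it needs an input file to run against. A minimal way to create one with the standard library is sketched below; the `write_sample_csv` helper, the `id` column, and all values are made up for illustration and are not part of datarails.

```python
import csv


def write_sample_csv(path="data.csv"):
    """Write a tiny, made-up input file with an old_column column."""
    rows = [
        {"id": "1", "old_column": "10"},
        {"id": "2", "old_column": "20"},
        {"id": "3", "old_column": ""},  # empty cell, so the null-dropping step has work to do
        {"id": "4", "old_column": "40"},
    ]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "old_column"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Calling `write_sample_csv()` with no arguments writes data.csv into the current working directory, which is where the usage example looks for it.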