Project description
datarails -- A Simple Framework for Dataframe ETL
-- VERSION 0.3.0 --
Official Documentation
The official documentation is hosted on GitHub Pages at jesse.maitland.github.io
Example Project
There is an example repo containing some simple ETL jobs, notebooks and commands for building documentation at jessemaitland/datarails-example
Datarails is a simple framework for organizing your in-memory dataframe based ETL jobs. It doesn't matter if you are using pandas, spark, glue or anything else; this library serves as a simple way to structure your ETL jobs so that others don't come along and have to debug your 300 line script by copy / pasting sections of it into a jupyter notebook.
Basic Usage
Steps are defined as classes and then passed to a step runner. All methods in the class that start with step_ will be run in the order they are defined.
Each step has access to a DataBox object that can be used to store dataframes and access them by name in downstream steps. In addition to the DataBox object, each step has access to a DataRailsContext object that can be used to store and access variables that are not dataframes.
import pandas as pd

from datarails.step import DataRailsStep
from datarails.runner import DataRailsStepRunner


class LoadDataFromCSV(DataRailsStep):

    def step_load_csv(self) -> None:
        print('loading data from csv')
        df = pd.read_csv('data.csv')
        self.dbx.put_df('data', df)  # dbx now has an attribute called data that is a dataframe


class TransformData(DataRailsStep):

    def step_add_new_column(self) -> None:
        print('adding new column')
        self.dbx.data['new_column'] = self.dbx.data['old_column'] * 2

    def step_drop_all_null_rows(self) -> None:
        print('dropping null rows')
        self.dbx.data = self.dbx.data.dropna()

    def step_rename_columns(self) -> None:  # the step_ prefix is required, otherwise the method is skipped
        print('renaming columns')
        self.dbx.data = self.dbx.data.rename(columns={'new_column': 'blue_column'})


class SaveData(DataRailsStep):

    def step_save_data(self) -> None:
        print('saving data')
        self.dbx.data.to_csv('new_data.csv')
# Gather your steps in a list of class definitions. The class instances will be created by the step runner
# while the job is being executed.
steps = [
    LoadDataFromCSV,
    TransformData,
    SaveData,
]

# Pass your steps to the step runner.
runner = DataRailsStepRunner(steps=steps)

# Run the job. The steps will be run in the order they appear in the list, and each method declared in a step
# will be executed in the order it is defined in the class.
if __name__ == '__main__':
    runner.run()
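The example above stores dataframes on the DataBox but never touches the DataRailsContext. As a rough sketch of how a non-dataframe value might be passed between steps, the snippet below (which builds on the imports above) assumes the context is exposed on each step as self.ctx with put / get style accessors; check the official documentation for the exact attribute and method names.

class StampLoadTime(DataRailsStep):

    def step_record_load_time(self) -> None:
        # Store a plain (non-dataframe) value on the context so a later step
        # can read it back. `self.ctx` and its put / get methods are
        # assumptions about the API, not confirmed names.
        self.ctx.put('loaded_at', pd.Timestamp.utcnow())


class SaveWithTimestamp(DataRailsStep):

    def step_save(self) -> None:
        # Read the value stored by the earlier step and use it to name the output file.
        loaded_at = self.ctx.get('loaded_at')
        self.dbx.data.to_csv(f'data_{loaded_at:%Y%m%d}.csv')

These hypothetical steps would simply be appended to the steps list before it is handed to the runner.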
Why Use DataRails?
datarails is intended to solve a few simple problems that I have encountered while working with small to medium sized ETL scripts in python. It is a very simple framework for ETL job execution and nothing more. It is not a replacement for airflow, luigi, dagster or any other workflow management tool, but rather can serve as the "entry point" for your workflow management tool to execute your ETL job.
1. Break your ETL Job into Smaller Steps
Quite often what happens with "small" or "medium" sized ETL jobs is that they are thrown together as a single script that does everything. This works ok until the first time your script throws an error. During the ETL process, an error is most likely caused by a problem with the data, and the single script approach makes it difficult to debug.
2. Step Through Your ETL Job
In the event you do encounter a problem with your ETL job, with datarails you can simply import your runner into a python shell or a jupyter notebook like this:

from my_etl_job import runner

runner.advance()  # run the next step and stop execution
This allows you to step through your job and inspect the data at each step. This is especially useful when you are working with a format such as json or csv as your source data that does not have a schema.
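As a rough sketch of what such an interactive session might look like, the lines below assume the runner exposes its DataBox as runner.dbx; that attribute name is an assumption, so consult the documentation for how to reach the intermediate dataframes.

from my_etl_job import runner

runner.advance()  # run LoadDataFromCSV, then stop
runner.advance()  # run TransformData, then stop

# Inspect the intermediate dataframe between steps. Accessing the DataBox
# as `runner.dbx` is an assumed attribute name.
print(runner.dbx.data.head())
print(runner.dbx.data.dtypes)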
3. You Probably Don't Have Big Data
In all likelihood you have some external process fetching data and dumping it into S3 (or some other cloud storage) on a daily basis. The files entering your landing zone are in the order of 20MB to 100MB and are stored in some horrible format like json, json lines or even csv. You need to transform this data into a format that is more useful and do a bit of clean up so that the data is available for other teams in your datalake, or data warehouse. You probably have 10s or even 100s of jobs that are similar to this. datarails is a perfect library for these types of jobs.
4. Documentation
How often do you get a request from someone non-technical in the company who asks a question about an ETL job that was written 6 months ago? You get a question like, "Hey, I noticed that the data in the blue_column is different than the data in the red_column. Can you tell me why that is?" You of course have no idea why that is, and you have to go back to the code to figure it out.
datarails forces you to break up your ETL job into smaller steps or methods. Since python provides many tools for building documentation from docstrings, you can easily incorporate documentation using standard python docstrings, which can be published in your CI job. The documentation can then be made available to business users or other technical users, which will save you time from having to answer questions about your ETL jobs.
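For example, standard docstrings on a step class and its methods are all that tools like pydoc or Sphinx need to generate readable documentation. This sketch only illustrates the convention, reusing the TransformData step from the example above:

class TransformData(DataRailsStep):
    """Clean and reshape the raw csv data.

    blue_column is derived from old_column (doubled, then renamed), not
    measured independently -- exactly the kind of answer business users
    ask for six months later.
    """

    def step_add_new_column(self) -> None:
        """Create new_column as old_column multiplied by 2."""
        self.dbx.data['new_column'] = self.dbx.data['old_column'] * 2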
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file datarails-0.3.0.tar.gz.
File metadata
- Download URL: datarails-0.3.0.tar.gz
- Upload date:
- Size: 8.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 64d81fe6154066f7ea66ef1ffa88b356de60956546c72a65bb432ebdd7e604d7
MD5 | 20f18391d472e410fc8bea47efac4d16
BLAKE2b-256 | 946e0075c11e5aa7d15a9671648ab705e01a4a1b8ab73a00a8a0bfbc8b867f56
File details
Details for the file datarails-0.3.0-py3-none-any.whl.
File metadata
- Download URL: datarails-0.3.0-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8b7b03448418c3e93de69a281896a001a42685f964a93e98e7deb116444b0ddc
MD5 | 113f0c79e6fce29327b5718b754e0746
BLAKE2b-256 | c182cd760389ee0d58cb8ecb846cedfa3de439e1711f3a16b3094a804207a8b0