Simple tool for pandas data transformation
data-steps
This project provides a minimal framework to organize data transformations in pandas.
It is intended to be used in both notebooks and code files.
The main idea is to provide a simple decorator syntax that is easy to maintain when data transformation steps are changed or added over the course of a project. A prime example is data cleaning, where some required cleaning steps only become apparent later in the project.
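The decorator idea can be illustrated with a short, dependency-free sketch. This is not the data-steps implementation: `StepRegistry`, `run`, and the step functions are hypothetical names, and plain lists of strings stand in for a DataFrame. It shows one way a bare decorator can register steps, and how redefining a function under the same name can swap a step without disturbing the rest of the pipeline.

```python
class StepRegistry:
    """Hypothetical registry; illustrates the pattern, not the library's code."""

    def __init__(self):
        self.funcs = {}  # step name -> function; dicts preserve insertion order

    def step(self, func):
        # Re-registering a name overwrites the function but keeps its position.
        self.funcs[func.__name__] = func
        return func

    def run(self, data):
        # Feed the output of each step into the next one.
        for func in self.funcs.values():
            data = func(data)
        return data


pipeline = StepRegistry()


@pipeline.step
def strip_whitespace(values):
    return [v.strip() for v in values]


@pipeline.step
def drop_empty(values):
    return [v for v in values if v]


first_result = pipeline.run([" a ", "", "B "])


# A cleaning need discovered later: redefine the step under the same name.
@pipeline.step
def strip_whitespace(values):
    return [v.strip().lower() for v in values]


second_result = pipeline.run([" a ", "", "B "])
print(first_result, second_result)
```

Keying the registry by function name is a design choice of this sketch: it makes re-running a notebook cell replace a step instead of appending a duplicate.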
Features
After wrapping a pandas DataFrame in a DataSteps instance, the following features are available:
- register data transformations with the instance's .step decorator
- get an overview of the registered steps with .steps
- inspect the original data, the fully transformed data, and any partially transformed data in between
- change parameters of registered steps
- interactively redefine or deactivate steps in Jupyter notebooks
- register steps that return secondary results, i.e. the main result is passed along the pipeline, whereas the secondary result is stored separately
- convert data-steps pipelines to strings that can be integrated more easily into a non-EDA code base
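A few of these features (changing step parameters, inspecting partially transformed data, and storing secondary results) can be sketched in a toy class. Everything here is an assumption made for illustration: `MiniSteps`, `secondary_step`, `run`, and the tuple-return convention are hypothetical and may differ from how data-steps works, and a list of dicts replaces the DataFrame so the example runs without pandas.

```python
import copy


class MiniSteps:
    """Toy stand-in (hypothetical); a list of dicts replaces the DataFrame."""

    def __init__(self, data):
        self.original = data
        self.funcs = []             # registered steps, in registration order
        self.kwargs = {}            # step name -> keyword arguments
        self.secondary = {}         # step name -> stored secondary result
        self.has_secondary = set()  # names of steps that return two values

    def step(self, func):
        self.funcs.append(func)
        self.kwargs[func.__name__] = {}
        return func

    def secondary_step(self, func):
        # Assumed convention: such a step returns (main_result, secondary_result).
        self.has_secondary.add(func.__name__)
        return self.step(func)

    def run(self, n_steps=None):
        """Apply the first n_steps steps (all when None) to a copy of the data."""
        result = copy.deepcopy(self.original)
        for func in self.funcs[:n_steps]:
            out = func(result, **self.kwargs[func.__name__])
            if func.__name__ in self.has_secondary:
                # Main result continues down the pipeline; the rest is stored.
                result, self.secondary[func.__name__] = out
            else:
                result = out
        return result


data = MiniSteps([{"x": 1}, {"x": None}, {"x": 3}])


@data.secondary_step
def drop_missing(records):
    kept = [r for r in records if r["x"] is not None]
    return kept, {"dropped": len(records) - len(kept)}


@data.step
def scale(records, factor=1):
    return [{"x": r["x"] * factor} for r in records]


# Change a parameter of an already registered step.
data.kwargs["scale"].update({"factor": 10})

after_first = data.run(1)  # only drop_missing applied
after_all = data.run()     # both steps applied
print(after_first, after_all, data.secondary)
```

Because `run` always starts from a copy of the original data, intermediate results can be inspected repeatedly without mutating the input.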
Usage Example
Wrap your data in an instance
from data_steps import DataSteps
data = DataSteps(my_pandas_df)
# register transformation steps
@data.step
def data_transformation(df):
    # transformation steps
    ...
    return transformed_df

@data.step
def transform_with_parameters(df, param1, param2=4):
    # transformation steps
    ...
    return transformed_df
#access original data
data.original
#set or update transformation parameters
data.update_step_kwargs('transform_with_parameters',{'param1':10})
#access data after all transformation steps
data.transformed
#get an overview of the registered steps
data.steps
#only execute some steps to help debugging transformations
data.partial_transform(0)
History
0.0.1 (2021-01-31)
- First release on PyPI.
0.1.0 (2021-02-11)
- Changed the step decorator to work in bare format, i.e. <instance>.step instead of <instance>.step()
0.2.0 (2021-05-02)
- Support for additional arguments in steps
0.3.0 (2021-05-30)
- Support for exporting a data-steps pipeline as a string
- Steps can now produce side results alongside the transformed data, for example summaries for diagnostics or plots of an intermediate result
0.3.1 (2021-05-30)
- bugfixes
0.3.2 (2021-05-30)
- bugfixes
0.3.3 (2021-08-01)
- Lowered the Python requirement to 3.7 to enable Google Colab support by default.
Possible extensions
No concrete plans at the moment, but feel free to open enhancement issues on GitHub.