Simple tool for pandas data transformation
data-steps
This project provides a minimal framework for organizing data transformations in pandas.
It is intended to be used in both notebooks and code files.
The main idea is to provide a simple decorator syntax that stays easy to maintain when data transformation steps are changed or added over the course of a project. A prime example is data cleaning, where some required cleaning steps only become apparent later in the project.
Features
After wrapping a pandas DataFrame in a DataSteps instance, the following features are available:
- register data transformations with the instance's .step decorator
- get an overview of the registered steps with .steps
- inspect the original data, the fully transformed data, and any partially transformed data in between
- change parameters of registered steps
- interactively redefine or deactivate steps in Jupyter notebooks
- register steps that return secondary results, i.e. the main result is passed along the pipeline, whereas the secondary result is stored separately
- convert data steps pipelines to strings that can more easily be integrated into a non-EDA code base
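The registration, redefinition, and partial-transformation features can be pictured with a small stand-in class. The sketch below is not the package's implementation: `MiniSteps`, its `run` method, the `upto` parameter, and the `kwargs` dictionary are hypothetical names chosen for illustration, and it works on plain Python lists so that it runs without pandas.

```python
import copy


class MiniSteps:
    """Hypothetical stand-in illustrating a decorator-based step registry."""

    def __init__(self, data):
        self.original = data  # untouched input data
        self.steps = []       # registered functions, in execution order
        self.kwargs = {}      # keyword arguments per step, keyed by function name

    def step(self, func):
        """Register func as a step; re-registering a name replaces it in place."""
        names = [f.__name__ for f in self.steps]
        if func.__name__ in names:
            self.steps[names.index(func.__name__)] = func
        else:
            self.steps.append(func)
        self.kwargs.setdefault(func.__name__, {})
        return func

    def run(self, upto=None):
        """Apply steps 0..upto (inclusive) to a copy of the original data.

        With upto=None, every registered step is applied.
        """
        data = copy.deepcopy(self.original)
        selected = self.steps if upto is None else self.steps[: upto + 1]
        for func in selected:
            data = func(data, **self.kwargs[func.__name__])
        return data


pipeline = MiniSteps([3, None, 1, 2])


@pipeline.step
def drop_missing(values):
    return [v for v in values if v is not None]


@pipeline.step
def scale(values, factor=1):
    return [v * factor for v in values]


# change a parameter of an already registered step
pipeline.kwargs["scale"] = {"factor": 10}

after_first = pipeline.run(upto=0)  # only drop_missing is applied
final = pipeline.run()              # both steps are applied
```

Because every run starts from a copy of the original data, a step can be redefined in a later notebook cell and the pipeline re-executed without reloading the input.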
Usage Example
Wrap your data in an instance
from data_steps import DataSteps
data = DataSteps(my_pandas_df)
# register transformation steps
@data.step
def data_transformation(df):
    # transformation steps
    ...
    return transformed_df

@data.step
def transform_with_parameters(df, param1, param2=4):
    # transformation steps
    ...
    return transformed_df

# access original data
data.original
# set or update transformation parameters
data.update_step_kwargs('transform_with_parameters', {'param1': 10})
# access data after all transformation steps
data.transformed
# get an overview of the registered steps
data.steps
# only execute some steps to help debug transformations
data.partial_transform(0)
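The usage example above does not cover steps with secondary results. One way such a pipeline could behave is sketched below. This is an illustration only, not the package's API: `run_with_side_results` is a hypothetical helper, and the convention that a step returns a `(data, secondary)` tuple is an assumption made for the sketch. Plain lists are used so that it runs without pandas.

```python
def run_with_side_results(data, steps):
    """Apply each step in order and collect secondary results by step name.

    Assumption for this sketch: a step returns either the transformed data,
    or a (transformed data, secondary result) tuple.
    """
    side_results = {}
    for func in steps:
        result = func(data)
        if isinstance(result, tuple):
            data, secondary = result
            side_results[func.__name__] = secondary  # stored separately
        else:
            data = result  # only the main result moves along the pipeline
    return data, side_results


def drop_negative(values):
    kept = [v for v in values if v >= 0]
    # secondary result: a small diagnostic summary of what was removed
    return kept, {"removed": len(values) - len(kept)}


def double(values):
    return [v * 2 for v in values]


transformed, diagnostics = run_with_side_results([1, -2, 3], [drop_negative, double])
```

Here `transformed` holds the main result after both steps, while `diagnostics` keeps the summary produced by `drop_negative` under that step's name.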
History
0.0.1 (2021-01-31)
- First release on PyPI.
0.1.0 (2021-02-11)
- Changed the step decorator to work in bare format, i.e. <instance>.step instead of <instance>.step()
0.2.0 (2021-05-02)
- Support for additional arguments in steps
0.3.0 (2021-05-30)
- Support for exporting a data steps pipeline as a string
- Enable steps to return side results alongside the transformed data. These could be summaries for diagnostics or plots of an intermediate result.
0.3.1 (2021-05-30)
- bugfixes
0.3.2 (2021-05-30)
- bugfixes
0.3.3 (2021-08-01)
- Lowered the Python requirement to 3.7 to enable Google Colab support by default.
Possible extensions
No concrete plans at the moment, but feel free to open enhancement issues on GitHub.