Simple tool for pandas data transformation

Project description

data-steps

This project provides a minimal framework to organize data transformations in pandas.

It is intended to be used in both notebooks and code files.

The main idea is to provide a simple decorator syntax that is easy to maintain when data transformation steps are changed or added over the course of a project. A prime example is data cleaning, where some required cleaning steps only become apparent later in the project.

Features

After wrapping a pandas DataFrame in a DataSteps instance, the following features are available:

  • register data transformations with the instance's .step decorator
  • get an overview of the registered steps with .steps
  • inspect the original data, the fully transformed data, and any partially transformed data in between
  • change parameters of registered steps
  • interactively redefine or deactivate steps in Jupyter notebooks
  • register steps that return secondary results, i.e. the main result is passed along the pipeline, whereas the secondary result is stored separately
  • convert data-steps pipelines to strings that can more easily be integrated into a non-EDA code base
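
To make the registration idea concrete, here is a minimal sketch of how a bare step decorator with partial execution could be built. It is not the data-steps source: the class name MiniSteps is hypothetical, it assumes that partial_transform takes an inclusive step index, and it assumes that redefining a step under the same name replaces it in place. To stay dependency-free it runs on a plain list of records; a pandas DataFrame could be passed through the same way.

```python
import copy


class MiniSteps:
    """Hypothetical stand-in for a step pipeline; not the data-steps source."""

    def __init__(self, data):
        self._original = data
        self._steps = []  # registered functions, in registration order

    def step(self, func):
        """Bare decorator: register func, replacing a step with the same name."""
        for i, existing in enumerate(self._steps):
            if existing.__name__ == func.__name__:
                self._steps[i] = func  # a redefinition keeps its position
                break
        else:
            self._steps.append(func)
        return func

    @property
    def steps(self):
        """Overview of registered steps as (index, name) pairs."""
        return [(i, f.__name__) for i, f in enumerate(self._steps)]

    @property
    def original(self):
        """A copy of the untouched input data."""
        return copy.deepcopy(self._original)

    def partial_transform(self, last_index):
        """Apply steps 0..last_index (inclusive, an assumption) to a copy."""
        result = copy.deepcopy(self._original)
        for func in self._steps[: last_index + 1]:
            result = func(result)
        return result

    @property
    def transformed(self):
        """The data after every registered step has run."""
        return self.partial_transform(len(self._steps) - 1)


rows = [{"name": " Ada ", "age": 36}, {"name": "Bob", "age": None}]
data = MiniSteps(rows)


@data.step
def strip_names(records):
    # First cleaning step: remove surrounding whitespace from names.
    return [{**r, "name": r["name"].strip()} for r in records]


@data.step
def drop_missing_age(records):
    # Second cleaning step, added later: drop records without an age.
    return [r for r in records if r["age"] is not None]


print(data.steps)
print(data.partial_transform(0))
print(data.transformed)
```

Because every access recomputes from a copy of the original data, adding or redefining a step later never requires re-running earlier notebook cells by hand. The trade-off is that each access repeats all the work, which may be slow for large inputs.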

Usage Example

Wrap your data in an instance

from data_steps import DataSteps

data = DataSteps(my_pandas_df)

# register transformation steps

@data.step
def data_transformation(df):
    # transformation steps
    ...
    return transformed_df

@data.step
def transform_with_parameters(df, param1, param2=4):
    # transformation steps
    ...
    return transformed_df

# access original data
data.original

# set or update transformation parameters
data.update_step_kwargs('transform_with_parameters', {'param1': 10})

# access data after all transformation steps
data.transformed

# get an overview of the registered steps
data.steps

# only execute some steps to help debugging transformations
data.partial_transform(0)
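
The update_step_kwargs call above suggests that keyword arguments are stored per step and applied when the pipeline runs. The following is a rough sketch of that idea under stated assumptions, not the library's implementation: the class ParamSteps and the step keep_above are hypothetical, and the sketch uses a plain list instead of a DataFrame so that it runs without dependencies.

```python
class ParamSteps:
    """Hypothetical sketch of per-step keyword storage; not the library's code."""

    def __init__(self, data):
        self.original = data
        self._steps = {}   # step name -> function (dicts keep insertion order)
        self._kwargs = {}  # step name -> keyword arguments used at call time

    def step(self, func):
        """Register func and keep any kwargs already stored under its name."""
        self._steps[func.__name__] = func
        self._kwargs.setdefault(func.__name__, {})
        return func

    def update_step_kwargs(self, name, kwargs):
        """Merge new keyword arguments into those stored for a step."""
        if name not in self._steps:
            raise KeyError(f"no step named {name!r}")
        self._kwargs[name].update(kwargs)

    @property
    def transformed(self):
        """Run every step in order, passing each its stored kwargs."""
        result = self.original
        for name, func in self._steps.items():
            result = func(result, **self._kwargs[name])
        return result


data = ParamSteps([1, 5, 12, 20])


@data.step
def keep_above(values, threshold, inclusive=False):
    # Returns a new list, so the original data is left untouched.
    if inclusive:
        return [v for v in values if v >= threshold]
    return [v for v in values if v > threshold]


data.update_step_kwargs("keep_above", {"threshold": 10})
first = data.transformed

data.update_step_kwargs("keep_above", {"threshold": 5, "inclusive": True})
second = data.transformed

print(first, second)
```

In this sketch a required parameter such as threshold must be supplied before the transformed data is accessed, otherwise the step call raises a TypeError. Whether data-steps behaves the same way is not stated in this description.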

History

0.0.1 (2021-01-31)

  • First release on PyPI.

0.1.0 (2021-02-11)

  • Changed step decorator to work in bare format, i.e. <instance>.step instead of <instance>.step()

0.2.0 (2021-05-02)

  • Support for additional arguments in steps

0.3.0 (2021-05-30)

  • Support for exporting a data-steps pipeline as a string
  • Enable steps to return side results alongside the transformed data. These could be summaries for diagnostics or plots of an intermediate result

0.3.1 (2021-05-30)

  • Bug fixes

0.3.2 (2021-05-30)

  • Bug fixes

0.3.3 (2021-08-01)

  • Downgraded Python requirements to 3.7 to enable Google Colab support by default.

Possible extensions

No concrete plans at the moment, but feel free to open enhancement issues on GitHub.

Download files

Source Distribution

data-steps-0.3.3.tar.gz (10.5 kB)

File details

Details for the file data-steps-0.3.3.tar.gz.

File metadata

  • Download URL: data-steps-0.3.3.tar.gz
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.7.9

File hashes

Hashes for data-steps-0.3.3.tar.gz
Algorithm Hash digest
SHA256 7394cba559ca0d72e10e2566ee23ed4c8435584d87c23b06f05277d01655ae9c
MD5 80ab91c3c76748e84a464b6ac0131245
BLAKE2b-256 8251f4957a71698f2e432823965515b87ad546c7544563bbbad9e485ab130f4a
