Skip to main content

Safe, reproducible data transformations with built-in auditing and validation

Project description

TransformPlan

TransformPlan: Auditable Data Transformation Pipelines

Python 3.10+ Coverage

Features

  • Declarative transformations: Build transformation pipelines using method chaining
  • Schema validation: Validate operations before execution with dry-run capability
  • Audit trails: Generate complete audit protocols with deterministic DataFrame hashing
  • Multi-backend support: Polars (default) and DuckDB backends with a pluggable Backend ABC
  • Serializable pipelines: Save and load transformation plans as JSON

Quick Example

from transformplan import TransformPlan, Col

# Build readable pipelines with 89 chainable operations
plan = (
    TransformPlan()
    # Standardize column names
    .col_rename(column="PatientID", new_name="patient_id")
    .col_rename(column="DOB", new_name="date_of_birth")
    .str_strip(column="patient_id")

    # Calculate derived values
    .dt_age_years(column="date_of_birth", new_column="age")
    .math_clamp(column="age", min_value=0, max_value=120)

    # Categorize patients age
    .map_discretize(column="age", bins=[18, 40, 65], labels=["young", "adult", "senior"], new_column="age_group")

    # Filter and clean
    .rows_filter(Col("age") >= 18)
    .rows_drop_nulls(columns=["patient_id", "age"])
    .col_drop(column="date_of_birth")
)

# Execute with schema validation — catch errors before they hit production
df_result, protocol = plan.process(df, validate=True)

# Serialize pipelines to JSON — version control your transformations
plan.to_json("patient_transform.json")

# Reload and reapply — reproducible results across environments
plan = TransformPlan.from_json("patient_transform.json")
df_result, protocol = plan.process(new_data)

Full Audit Trail — Every Step Tracked and Hashed

protocol.print(show_params=False)
======================================================================
TRANSFORM PROTOCOL
======================================================================
Input:  1000 rows × 5 cols  [a4f8b2c1]
Output: 847 rows × 5 cols   [e7d3f9a2]
Total time: 0.0247s
----------------------------------------------------------------------

#    Operation            Rows         Cols         Time       Hash
----------------------------------------------------------------------
0    input                1000         5            -          a4f8b2c1
1    col_rename           1000         5            0.0012s    b2e4a7f3
2    col_rename           1000         5            0.0008s    c9d1e5b8
3    str_strip            1000         5            0.0013s    c9d1e5b8        ○
4    dt_age_years         1000         6 (+1)       0.0041s    d4f2c8a1
5    math_clamp           1000         6            0.0015s    e1b7d3f9
6    map_discretize       1000         7 (+1)       0.0028s    f8a4c2e6
7    rows_filter          858 (-142)   7            0.0037s    a2e9f4b7
8    rows_drop_nulls      847 (-11)    7            0.0019s    b5c1d8e3
9    col_drop             847          6 (-1)       0.0006s    e7d3f9a2
======================================================================
○ = no effect (steps 3 did not change data)

DuckDB Backend

Run the same pipelines on DuckDB for SQL-based execution and native large-file handling:

import duckdb
from transformplan import TransformPlan, Col
from transformplan.backends.duckdb import DuckDBBackend

con = duckdb.connect()
rel = con.sql("SELECT * FROM 'patients.parquet'")

# Same plan — backend chosen at execution time
plan = (
    TransformPlan()
    .col_rename(column="PatientID", new_name="patient_id")
    .rows_filter(Col("age") >= 18)
    .math_round(column="score", decimals=2)
)

result, protocol = plan.process(rel, backend=DuckDBBackend(con))

Available Operations

Category Description Examples
col_ Column operations col_rename, col_drop, col_cast, col_add, col_select
math_ Arithmetic & scaling math_add, math_multiply, math_standardize, math_minmax, math_clamp
rows_ Row filtering & reshaping rows_filter, rows_drop_nulls, rows_sort, rows_unique, rows_pivot
str_ String operations str_lower, str_upper, str_strip, str_replace, str_split
dt_ Datetime operations dt_year, dt_month, dt_parse, dt_age_years, dt_diff_days
map_ Value mapping & encoding map_values, map_discretize, map_onehot, map_ordinal

Installation

pip install transformplan

Or with uv:

uv add transformplan

Development Setup

make install-dev   # Install with dev dependencies
make test          # Run the test suite
make lint          # Run ruff linting and pyright type checking
make format        # Fix import sorting and format code

License

MIT License - see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

transformplan-0.2.0.tar.gz (115.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

transformplan-0.2.0-py3-none-any.whl (72.4 kB view details)

Uploaded Python 3

File details

Details for the file transformplan-0.2.0.tar.gz.

File metadata

  • Download URL: transformplan-0.2.0.tar.gz
  • Upload date:
  • Size: 115.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for transformplan-0.2.0.tar.gz
Algorithm Hash digest
SHA256 08de25c75b5c9df89886a16680478f1082e8c8d1a6f5ce499c8e39bd14e62ff4
MD5 e99df2302f755b2638e49d771620bc18
BLAKE2b-256 e3654eb9cbd83f783ec35e5e8282aa2bea876cb275c0853cb0c21c2b921e3c8f

See more details on using hashes here.

File details

Details for the file transformplan-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: transformplan-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 72.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for transformplan-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 87851aecbb38c3c11bc4654c2cfeb06f112beb85f2a103d79205bec1e091897f
MD5 bd39939885c46e752fc60b687e218eea
BLAKE2b-256 5ab4c7cc4c9de7d3eca540c541b3a77f33d8152f350919db0515666c28de9023

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page