tdprepview

Python Package that creates Data Preparation Pipelines in Views written in Teradata-SQL.

Installation

  • pip install tdprepview

Features

  • Pipeline class that allows creating in-DB preprocessing pipelines
  • Several Preprocessor functions
  • API similar to sklearn.Pipeline
  • Pipeline can be saved and loaded as dict/json
  • Pipeline can be automatically created based on data types and distributions.

Preprocessors

See the History section below for the full list of available preprocessors.

Quickstart

1. Setting up the pipeline

import teradataml as tdml
database_credentials = {
    # ...
}
DbCon = tdml.create_context(**database_credentials)

myschema, table_train, table_score = "...", "...", "..."
DF_train = tdml.DataFrame(tdml.in_schema(myschema,table_train))
DF_score = tdml.DataFrame(tdml.in_schema(myschema,table_score))

from tdprepview import Pipeline
from tdprepview import (
    StandardScaler, IterativeImputer, PCA, DecisionTreeBinning )

steps = [
    ({"pattern":"^floatfeat_[1-5]$"}, [StandardScaler(), IterativeImputer()], {"suffix":"_imputed"}),
    ({"suffix":"_imputed"}, PCA(n_components=2), {"prefix":"mypca_"}),
    ({"prefix":"mypca_"}, DecisionTreeBinning(target_var="target_class", no_bins=3))
]

mypipeline = Pipeline(steps=steps)

2.a Transform DataFrames directly

This is suitable if you are working in a sandbox Jupyter notebook.

# fit() calculates all necessary statistics in DB
mypipeline.fit(DF=DF_train)
# transform generates SQL syntax based on the chosen Preprocessors and the statistics from fit()
DF_train_transf = mypipeline.transform(DF_train)
DF_score_transf = mypipeline.transform(DF_score)
#... go on with some modelling (e.g. sklearn, TD_DecisionForest,...) 
# and scoring (via BYOM, TD_DecisionForestPredict, ...)

2.b Inspect produced queries

If you want to check the SQL code that is generated by tdprepview:

mypipeline.fit(DF=DF_train)
query_train = mypipeline.transform(DF_train, return_type = "str")
print(query_train)

The output (note how columns that are not part of the pipeline are simply forwarded):

WITH preprocessing_steps AS
(
    SELECT
    row_id AS c_i_35,
    floatfeat_1 AS c_i_36,
    floatfeat_2 AS c_i_37,
    floatfeat_3 AS c_i_38,
    target_class AS c_i_39,
    ZEROIFNULL( (( c_i_36 ) - -0.2331302 ) / NULLIF( 1.747159 , 0) ) AS c_i_40,
    ZEROIFNULL( (( c_i_37 ) - 0.03895576 ) / NULLIF( 1.722347 , 0) ) AS c_i_41,
    ZEROIFNULL( (( c_i_38 ) - 0.2363556 ) / NULLIF( 2.808312 , 0) ) AS c_i_42,
    -1.118451e-08  +  (-0.07714142) * (COALESCE(c_i_41, 1.610355e-09))  +  (-0.1758817) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_43,
    4.261288e-09  +  (-0.0431946) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.6412595) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_44,
    -7.079888e-09  +  (-0.118112) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.624912) * (COALESCE(c_i_41, 1.610355e-09)) AS c_i_45,
    (0.2604757) * (c_i_43)  +  (-0.681657) * (c_i_44)  +  (-0.6837369) * (c_i_45) AS c_i_46,
    (-0.1047098) * (c_i_43)  +  (0.6840609) * (c_i_44)  +  (-0.7218702) * (c_i_45) AS c_i_47,
    CASE     WHEN c_i_46 < -2.0 THEN 0     WHEN c_i_46 < -1.236351 THEN 1     WHEN c_i_46 < -1.182989 THEN 2     ELSE 3 END AS c_i_48,
    CASE     WHEN c_i_47 < -2.0 THEN 0     WHEN c_i_47 < -0.3139175 THEN 1     WHEN c_i_47 < 0.2286314 THEN 2     ELSE 3 END AS c_i_49
    FROM
        <input_schema>.<input_table_view> t
)

SELECT
    c_i_35 AS row_id,
    c_i_48 AS mypca_pc_1,
    c_i_49 AS mypca_pc_2,
    c_i_39 AS target_class
FROM
preprocessing_steps t

If you want to inspect the pipeline as a chart:

mypipeline.plot_sankey()

Output: a Sankey chart visualising the pipeline DAG.

2.c Persist transformed data as a view for later use

This is suitable for, and compliant with, the Teradata ModelOps framework, where training.py and scoring.py are separate scripts.

In training.py:

view_train_transf = table_train+"_transf_v"
view_score_transf = table_score+"_transf_v"

mypipeline.fit(schema_name = myschema, 
               table_name = table_train)

# 3. transform: create views for pipelines
# DF_train_transf is already based on the newly created View
DF_train_transf = mypipeline.transform(
                        schema_name = myschema, 
                        table_name = table_train,
                        # this triggers the creation of a VIEW, thus the
                        # transformation pipeline is persisted
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_train_transf)

# 3.b create view for scoring table - no need for further inspection, thus no return
mypipeline.transform(   schema_name = myschema, 
                        table_name = table_score,
                        return_type = None,
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_score_transf)

# further steps:
# Model training with DF_train_transf, 
# e.g. local and use BYOM, or in-DB with TD_DecisionForest
# save model in DB

In scoring.py:

view_score_transf = table_score+"_transf_v"
# 1. get DataFrame based on View, transform is happening as per view definition
DF_score_transf = tdml.DataFrame(tdml.in_schema(myschema, view_score_transf))

# 2. Model Scoring with trained model 
# (e.g. PMMLPredict, TD_DecisionForestPredict,...)
# get pointer for model from DB + execute scoring in DB

# 3. Save Scored Data in DB

History

v0.1.0 (2023-02-15)

added

  • First release on PyPI.
  • Pipeline with fit and transform functions
  • Preprocessor Functions
    • Impute
    • ImputeText
    • TryCast
    • Scale
    • CutOff
    • FixedWidthBinning
    • ThresholdBinarizer
    • ListBinarizer
    • VariableWidthBinning
    • LabelEncoder
    • CustomTransformer
  • Notebooks for tests
  • Demo Notebook

v0.1.2 (2023-02-15)

fixed

  • added *.sql to MANIFEST.in such that SQL templates are also part of the distribution.

changed

  • HISTORY and README files converted from reStructuredText to Markdown

v0.1.3 (2023-02-16)

added

  • Quickstart in README file

v0.1.4 (2023-02-17)

added

  • DecisionTreeBinning as Preprocessing function

v1.0.2 (2023-03-06)

Major overhaul of tdprepview: it now supports transformations that change the schema of the input DataFrame, such as OneHotEncoding. The implementation is based on a directed acyclic graph (DAG).

added

  • plot_sankey() function for the Pipeline class, which plots the DAG as a Sankey chart.
  • Preprocessing functions
    • SimpleImputer (adapted from sklearn, based on Impute)
    • IterativeImputer (adapted from sklearn)
    • StandardScaler (adapted from sklearn, based on Scale)
    • MaxAbsScaler (adapted from sklearn, based on Scale)
    • MinMaxScaler (adapted from sklearn, based on Scale)
    • RobustScaler (adapted from sklearn, based on Scale)
    • Normalizer (adapted from sklearn)
    • QuantileTransformer (adapted from sklearn, based on VariableWidthBinning)
    • Binarizer (adapted from sklearn, based on ThresholdBinarizer)
    • PolynomialFeatures (adapted from sklearn)
    • OneHotEncoder (adapted from sklearn)
    • PCA (adapted from sklearn)

changed

  • Exactly one query is generated: a single WITH ... AS part that contains all preprocessing, followed by one final SELECT that returns all final columns with the correct column names.
  • the steps in the steps argument of Pipeline are more flexible. They are either tuples with two elements (input_column(s), preprocessor(s)) or tuples with three elements (input_column(s), preprocessor(s), options)
    • input_column(s): string, list of strings or NEW dictionary for dynamic filtering:
      • {'pattern':<regex pattern to search for column names>}
      • {'prefix':<string column names start with>}
      • {'suffix':<string column names end with>}
      • {'dtype_include':<list of tdtypes to include>}
      • {'dtype_exclude':<list of tdtypes to exclude>}
      • filter conditions can be combined
    • preprocessor(s): one or list of Preprocessors.
    • options: dict for renaming column names after applying preprocessor. Useful for filtering them in a later step
      • {'prefix':<adds string to beginning of column names>}
      • {'suffix':<adds string to end of column names>}

v1.1.0 (2024-03-01)

bugfixes, compatibility with newer teradataml versions, optional requirements, feature hashing

added

  • Preprocessing functions
    • SimpleHashEncoder (a text column can be hash encoded using in-DB HASHROW)
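As a rough illustration of what hash encoding does (Python's hashlib stands in for Teradata's in-DB HASHROW here; the function name and bucket count are made up):

```python
import hashlib

def hash_encode(value, n_buckets=1024):
    # Map a text value deterministically to an integer bucket.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Equal inputs always receive equal codes; no category mapping is stored.
codes = [hash_encode(city) for city in ["Berlin", "Vienna", "Berlin"]]
```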

changed

  • when a pipeline is crystallised in a view, the package checks the teradataml version and uses either tdml.get_context().execute(q) or tdml.execute_sql(q).
  • bugfix: IterativeImputer query generation
  • plotly & seaborn are now optional requirements. Only when plot_sankey() is called, it checks whether the packages are available.

v1.2.0 (2024-03-06)

minor bugfixes, Multi-Label Binarizer

added

  • Preprocessing functions
    • MultiLabelBinarizer (a text column with multiple different values separated by a delimiter can be encoded with multiple binary variables)
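A minimal pure-Python sketch of what multi-label binarization produces (illustrative only, not the package's internal code; names are made up):

```python
def multilabel_binarize(values, delimiter=","):
    """Illustrative multi-label binarization: one binary column per
    distinct label found in the delimited strings."""
    labels = sorted({lab for v in values for lab in v.split(delimiter) if lab})
    rows = [[1 if lab in v.split(delimiter) else 0 for lab in labels]
            for v in values]
    return labels, rows

labels, rows = multilabel_binarize(["red,blue", "blue", "green,red"])
# labels == ['blue', 'green', 'red']
# rows   == [[1, 0, 1], [1, 0, 0], [0, 1, 1]]
```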

v1.3.0 (2024-03-06)

column_exclude options for inputs in steps, Cast function

added

  • Preprocessing functions
    • Cast (analogous to SQL, useful to convert everything but target and key to float as last step)

v1.3.1 (2024-03-26)

added

  • Preprocessing functions
    • PowerTransformer (analogous to sklearn PowerTransformer)

v1.3.2 (2024-03-28)

Fitted Pipeline can now be persisted in a serialised file and reused later.

added

  • A Pipeline can now be serialized using
    • mydict = mypipeline.to_dict() or
    • mypipeline.to_json("mypipeline.json")
  • Pipelines can analogously be re-created from a serialized representation using
    • mypipeline = Pipeline.from_dict(mydict) or
    • mypipeline = Pipeline.from_json("mypipeline.json")

v1.4.0 (2024-04-08)

Introducing automatic pipeline creation based on heuristics, either via Pipeline.from_DataFrame(...) or via auto_code(...).

added

  • Pipeline can now be automatically created based on a tdml.DataFrame
    • mypipeline = Pipeline.from_DataFrame(DF, non_feature_cols=["rowid","target"], fit_pipeline=True)
    • It uses heuristics based on data types and distributions to decide which preprocessing functions make sense.
  • If you only want to see the code for the steps parameter, you can use
    • steps_str = auto_code(DF, non_feature_cols=["rowid","target"])
    • print(steps_str) # see steps and adjust if needed...
    • steps = eval(steps_str)
    • mypipeline = Pipeline(steps)

v1.5.0 (2024-09-19)

adding TargetEncoder

added

  • Preprocessing Function
    • TargetEncoder as in sklearn: each category is encoded based on a shrunk estimate of the average target value for observations belonging to the category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category.
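A minimal sketch of such a shrunk estimate, assuming a simple count-based smoothing constant (sklearn's actual scheme differs in detail, e.g. its variance-based "auto" blending):

```python
def target_encode(categories, targets, smooth=10.0):
    """Illustrative shrunk target encoding: blend each category's mean
    target with the global mean, weighted by the category count."""
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoding = {}
    for c in counts:
        n = counts[c]
        cat_mean = sums[c] / n
        # Rare categories are pulled more strongly toward the global mean.
        lam = n / (n + smooth)
        encoding[c] = lam * cat_mean + (1 - lam) * global_mean
    return encoding

enc = target_encode(["a", "a", "b"], [1, 1, 0], smooth=1.0)
# 'a' is pulled from its raw mean 1.0 toward the global mean 2/3
```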
