Python Package that creates Data Preparation Pipeline in Teradata-SQL in Views

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Programming Language
Topic
- Database

Project description

Logo

tdprepview

Python Package that creates Data Preparation Pipelines in Views written in Teradata-SQL.

Installation

pip install tdprepview

Features

Pipeline class that allows creating in-DB preprocessing pipelines
Several Preprocessor functions
API similar to sklearn.Pipeline
Pipeline can be saved and loaded as dict/json
Pipeline can be automatically created based on data types and distributions.

Preprocessors

Quickstart

1. setting up the pipeline

import teradataml as tdml
database_credentials = {
    # ...
}
DbCon = tdml.create_context(**database_credentials)

myschema, table_train, table_score = "...", "...", "..."
DF_train = tdml.DataFrame(tdml.in_schema(myschema,table_train))
DF_score = tdml.DataFrame(tdml.in_schema(myschema,table_score))

from tdprepview import Pipeline
from tdprepview import (
    StandardScaler, IterativeImputer, PCA, DecisionTreeBinning )

steps = [
    ({"pattern":"^floatfeat_[1-5]$"}, [StandardScaler(), IterativeImputer()], {"suffix":"_imputed"}),
    ({"suffix":"_imputed"}, PCA(n_components=2), {"prefix":"mypca_"}),
    ({"prefix":"mypca_"}, DecisionTreeBinning(target_var="target_class", no_bins=3))
]

mypipeline = Pipeline(steps=steps)

2.a transform DataFrames directly

this is suitable if you are working in a sandbox Jupyter Notebook

# fit() calculates all necessary statistics in DB
mypipeline.fit(DF=DF_train)
# transform generates SQL syntax based on the chosen Preprocessors and the statistics from fit()
DF_train_transf = mypipeline.transform(DF_train)
DF_score_transf = mypipeline.transform(DF_score)
#... go on with some modelling (e.g. sklearn, TD_DecisionForest,...) 
# and scoring (via BYOM, TD_DecisionForestPredict, ...)

2.b inspect produced queries

if you want to check the SQL code, that is generated py tdprepview

mypipeline.fit(DF=DF_train)
query_train = mypipeline.transform(DF_train, return_type = "str")
print(query_train)

the output: (Note how columns not part of the pipeline are simply forwarded)

WITH preprocessing_steps AS
(
    SELECT
    row_id AS c_i_35,
    floatfeat_1 AS c_i_36,
    floatfeat_2 AS c_i_37,
    floatfeat_3 AS c_i_38,
    target_class AS c_i_39,
    ZEROIFNULL( (( c_i_36 ) - -0.2331302 ) / NULLIF( 1.747159 , 0) ) AS c_i_40,
    ZEROIFNULL( (( c_i_37 ) - 0.03895576 ) / NULLIF( 1.722347 , 0) ) AS c_i_41,
    ZEROIFNULL( (( c_i_38 ) - 0.2363556 ) / NULLIF( 2.808312 , 0) ) AS c_i_42,
    -1.118451e-08  +  (-0.07714142) * (COALESCE(c_i_41, 1.610355e-09))  +  (-0.1758817) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_43,
    4.261288e-09  +  (-0.0431946) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.6412595) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_44,
    -7.079888e-09  +  (-0.118112) * (COALESCE(c_i_40, -1.045776e-08))  +  (0.624912) * (COALESCE(c_i_41, 1.610355e-09)) AS c_i_45,
    (0.2604757) * (c_i_43)  +  (-0.681657) * (c_i_44)  +  (-0.6837369) * (c_i_45) AS c_i_46,
    (-0.1047098) * (c_i_43)  +  (0.6840609) * (c_i_44)  +  (-0.7218702) * (c_i_45) AS c_i_47,
    CASE     WHEN c_i_46 < -2.0 THEN 0     WHEN c_i_46 < -1.236351 THEN 1     WHEN c_i_46 < -1.182989 THEN 2     ELSE 3 END AS c_i_48,
    CASE     WHEN c_i_47 < -2.0 THEN 0     WHEN c_i_47 < -0.3139175 THEN 1     WHEN c_i_47 < 0.2286314 THEN 2     ELSE 3 END AS c_i_49
    FROM
        <input_schema>.<input_table_view> t
)

SELECT
    c_i_35 AS row_id,
    c_i_48 AS mypca_pc_1,
    c_i_49 AS mypca_pc_2,
    c_i_39 AS target_class
FROM
preprocessing_steps t

if you want to inspect the pipeline as a chart

mypipeline.plot_sankey()

Output:

Sankey Chart

2.c persist transformed data as a view for later use

this is suitable and compliant with the Teradata ModelOps-Framework, where training.py and scoring.py are separate scripts.

in training.py:

view_train_transf = table_train+"_transf_v"
view_score_transf = table_score+"_transf_v"

mypipeline.fit(schema_name = myschema, 
               table_name = table_train)

# 3. transform: create views for pipelines
# DF_train_transf is already based on the newly created View
DF_train_transf = mypipeline.transform(
                        schema_name = myschema, 
                        table_name = table_train,
                        # this triggeres the creation of a VIEW, thus the transformation 
                        # pipeline is persited
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_train_transf)

# 3.b create view for scoring table - no need for further inspection, thus no return
mypipeline.transform(   schema_name = myschema, 
                        table_name = table_score,
                        return_type = None,
                        create_replace_view = True, 
                        output_schema_name = myschema, 
                        output_view_name= view_score_transf)

# further steps:
# Model training with DF_train_transf, 
# e.g. local and use BYOM, or in-DB with TD_DecisionForest
# save model in DB

in scoring.py:

view_score_transf = table_score+"_transf_v"
# 1. get DataFrame based on View, transform is happening as per view definition
DF_score_transf = tdml.DataFrame(tdml.in_schema(myschema, view_score_transf))

# 2. Model Scoring with trained model 
# (e.g. PMMLPredict, TD_DecisionForestPredict,...)
# get pointer for model from DB + execute scoring in DB

# 3. Save Scored Data in DB

History

v0.1.0 (2023-02-15)

added

First release on PyPI.
Pipeline with fit and transform functions
Preprocessor Functions
- Impute
- ImputeText
- TryCast
- Scale
- CutOff
- FixedWidthBinning
- ThresholdBinarizer
- ListBinarizer
- VariableWidthBinning
- LabelEncoder
- CustomTransformer
Notebooks for tests
Demo Notebook

v0.1.2 (2023-02-15)

fixed

added *.sql to MANIFEST.ln such that SQL templates are also part of the distribution.

changed

HISTORY and README file from rst to Markdown

v0.1.3 (2023-02-16)

added

Quickstart in README file

v0.1.4 (2023-02-17)

added

DecisionTreeBinning as Preprocessing function

v1.0.2 (2023-03-06)

Major Overhaul of tdprepview. It now supports transformations that change the schema of the Input DataFrame, like OneHotEncoding. This implementation is based on a directed acyclic graph (DAG).

added

plot_sankey() Function for Pipeliine class, which plots the DAG as sankey chart.
Preprocessing functions
- SimpleImputer (adapted from sklearn, based on Impute)
- IterativeImputer (adapted from sklearn)
- StandardScaler (adapted from sklearn, based on Scale)
- MaxAbsScaler (adapted from sklearn, based on Scale)
- MinMaxScaler (adapted from sklearn, based on Scale)
- RobustScaler (adapted from sklearn, based on Scale)
- Normalizer (adapted from sklearn)
- QuantileTransformer (adapted from sklearn, based on VariableWidthBinning)
- Binarizer (adapted from sklearn, based on ThresholdBinarizer)
- PolynomialFeatures (adapted from sklearn)
- OneHotEncoder (adapted from sklearn)
- PCA (adapted from sklearn)

changed

exactly one query is generated, with one WITH AS part, that contains all preprocessing, one final SELECT, that contains all final columns with the correct column names.
the steps in the steps argument of Pipeline are more flexible. They are either tuples with two elements (input_column(s), preprocessor(s)) or tuples with three elements (input_column(s), preprocessor(s), options)
- input_column(s): string, list of strings or NEW dictionary for dynamic filtering:
  - {'pattern':<regex pattern to search for column names>}
  - {'prefix':<string column names start with>}
  - {'suffix':<string column names end with>}
  - {'dtype_include':<list of tdtypes to include>}
  - {'dtype_exclude':<list of tdtypes to exclude>}
  - filter conditions can be combined
- preprocessor(s): one or list of Preprocessors.
- options: dict for renaming column names after applying preprocessor. Useful for filtering them in a later step
  - {'prefix':<adds string to beginning of column names>}
  - {'suffix':<adds string to end of column names>}

v1.1.0 (2024-03-01)

bugfixes, compatibility with newer teradataml version, optional requirements, feature hashing

added

Preprocessing functions
- SimpleHashEncoder (a text column can be hash encoded using in-DB HASHROW)

changed

when a pipeline is crystallised in a view, the package checks the teradataml version and uses either tdml.get_context().execute(q) or tdml.execute_sql(q).
bugfix: IterativeImputer query generation
plotly & seaborn are now optional requirements. Only when plot_sankey() is called, it checks whether the packages are available.

v1.2.0 (2024-03-06)

minor bugfixes, Multi-Label Binarizer

added

Preprocessing functions
- MultiLabelBinarizer (a text column with multiple different values separated by a delimiter can be encoded with multiple binary variables)

v1.3.0 (2024-03-06)

column_exclude options for inputs in steps, Cast function

added

Preprocessing functions
- Cast (analogous to SQL, useful to convert everything but target and key to float as last step)

v1.3.1 (2024-03-26)

added

Preprocessing functions
- PowerTransformer (analogous to sklearn PowerTransformer)

v1.3.2 (2024-03-28)

Fitted Pipeline can now be persisted in a serialised file and reused later.

added

Pipeline can now be serialized using the
- mydict = mypipeline.to_dict() or
- mypipeline.to_json("mypipeline.json") functions
and Pipelines can now analogously be re-created from a serialized representation using the
- mypipeline = Pipeline.from_dict(mydict) or
- mypipeline = Pipeline.from_json("mypipeline.json") or functions.

v1.4.0 (2024-04-08)

Introducing automatic Pipeline creation based on heuristics. Either via Pipeline.from_DataFrame(...) or via auto_code(...)

added

Pipeline can now automatically created based on a tdml.DataFrame
- mypipeline = Pipeline.from_DataFrame(DF, non_feature_cols=["rowid","target"], fit_pipeline=True)
- It'll use heuristics based on datatypes and distributions to decide which preprocessing function would make sense.
If you only want to see the code for the steps parameter, you can use
- steps_str = auto_code(DF, non_feature_cols=["rowid","target"])
- print(steps_str) # see steps and adjust if needed...
- steps = eval(steps_str)
- mypipeline = Pipeline(steps)

Project details

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Programming Language
Topic
- Database

Release history Release notifications | RSS feed

This version

1.4.1

Apr 23, 2024

1.4.0

Apr 9, 2024

1.3.2

Mar 27, 2024

1.3.1

Mar 26, 2024

1.3.0

Mar 6, 2024

1.2.0

Mar 5, 2024

1.1.4

Mar 4, 2024

1.0.2

Mar 6, 2023

0.1.3

Feb 16, 2023

0.1.2

Feb 15, 2023

0.1.0

Feb 15, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tdprepview-1.4.1.tar.gz (1.7 MB view hashes)

Uploaded Apr 23, 2024 Source

Built Distribution

tdprepview-1.4.1-py2.py3-none-any.whl (68.9 kB view hashes)

Uploaded Apr 23, 2024 Python 2 Python 3

Hashes for tdprepview-1.4.1.tar.gz

Hashes for tdprepview-1.4.1.tar.gz
Algorithm	Hash digest
SHA256	`5b9d403097d4ddeae107351ed7cd4aff8eeecb4082b7639a9665719729e90eda`
MD5	`ab9edab06ec53cfb5e41e912bde26b29`
BLAKE2b-256	`53b8da381b89cef3437ea4b77967bd8ea62c2d5255e8ce20e4b5afab7c1647db`

Hashes for tdprepview-1.4.1-py2.py3-none-any.whl

Hashes for tdprepview-1.4.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2d6589833d2fd8e8cedfb44a8b4a6e5c82557c06dfe7c9b98d49da4ca997c00`
MD5	`3dac829aa4e50d14eaae46615317eeee`
BLAKE2b-256	`2eac8cbaabbf085ba9ac0e9b31e62e132d0133b389fe3ec094cae15d0a958cc2`

tdprepview 1.4.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

tdprepview

Installation

Features

Quickstart

1. setting up the pipeline

2.a transform DataFrames directly

2.b inspect produced queries

2.c persist transformed data as a view for later use

History

v0.1.0 (2023-02-15)

added

v0.1.2 (2023-02-15)

fixed

changed

v0.1.3 (2023-02-16)

added

v0.1.4 (2023-02-17)

added

v1.0.2 (2023-03-06)

added

changed

v1.1.0 (2024-03-01)

added

changed

v1.2.0 (2024-03-06)

added

v1.3.0 (2024-03-06)

added

v1.3.1 (2024-03-26)

added

v1.3.2 (2024-03-28)

added

v1.4.0 (2024-04-08)

added

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution