tdprepview
A Python package that creates data preparation pipelines as views written in Teradata SQL.
Installation
pip install tdprepview
Features
- Pipeline class that allows creating in-DB preprocessing pipelines
- Several Preprocessor functions
- API similar to sklearn.Pipeline
- Pipeline can be saved and loaded as dict/json
- Pipeline can be automatically created based on data types and distributions.
Quickstart
1. setting up the pipeline
import teradataml as tdml
database_credentials = {
# ...
}
DbCon = tdml.create_context(**database_credentials)
myschema, table_train, table_score = "...", "...", "..."
DF_train = tdml.DataFrame(tdml.in_schema(myschema,table_train))
DF_score = tdml.DataFrame(tdml.in_schema(myschema,table_score))
from tdprepview import Pipeline
from tdprepview import (
StandardScaler, IterativeImputer, PCA, DecisionTreeBinning )
steps = [
({"pattern":"^floatfeat_[1-5]$"}, [StandardScaler(), IterativeImputer()], {"suffix":"_imputed"}),
({"suffix":"_imputed"}, PCA(n_components=2), {"prefix":"mypca_"}),
({"prefix":"mypca_"}, DecisionTreeBinning(target_var="target_class", no_bins=3))
]
mypipeline = Pipeline(steps=steps)
2.a transform DataFrames directly
This is suitable if you are working in a sandbox Jupyter notebook.
# fit() calculates all necessary statistics in DB
mypipeline.fit(DF=DF_train)
# transform generates SQL syntax based on the chosen Preprocessors and the statistics from fit()
DF_train_transf = mypipeline.transform(DF_train)
DF_score_transf = mypipeline.transform(DF_score)
#... go on with some modelling (e.g. sklearn, TD_DecisionForest,...)
# and scoring (via BYOM, TD_DecisionForestPredict, ...)
2.b inspect produced queries
If you want to inspect the SQL code that is generated by tdprepview:
mypipeline.fit(DF=DF_train)
query_train = mypipeline.transform(DF_train, return_type = "str")
print(query_train)
The output (note how columns that are not part of the pipeline are simply forwarded):
WITH preprocessing_steps AS
(
SELECT
row_id AS c_i_35,
floatfeat_1 AS c_i_36,
floatfeat_2 AS c_i_37,
floatfeat_3 AS c_i_38,
target_class AS c_i_39,
ZEROIFNULL( (( c_i_36 ) - -0.2331302 ) / NULLIF( 1.747159 , 0) ) AS c_i_40,
ZEROIFNULL( (( c_i_37 ) - 0.03895576 ) / NULLIF( 1.722347 , 0) ) AS c_i_41,
ZEROIFNULL( (( c_i_38 ) - 0.2363556 ) / NULLIF( 2.808312 , 0) ) AS c_i_42,
-1.118451e-08 + (-0.07714142) * (COALESCE(c_i_41, 1.610355e-09)) + (-0.1758817) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_43,
4.261288e-09 + (-0.0431946) * (COALESCE(c_i_40, -1.045776e-08)) + (0.6412595) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_44,
-7.079888e-09 + (-0.118112) * (COALESCE(c_i_40, -1.045776e-08)) + (0.624912) * (COALESCE(c_i_41, 1.610355e-09)) AS c_i_45,
(0.2604757) * (c_i_43) + (-0.681657) * (c_i_44) + (-0.6837369) * (c_i_45) AS c_i_46,
(-0.1047098) * (c_i_43) + (0.6840609) * (c_i_44) + (-0.7218702) * (c_i_45) AS c_i_47,
CASE WHEN c_i_46 < -2.0 THEN 0 WHEN c_i_46 < -1.236351 THEN 1 WHEN c_i_46 < -1.182989 THEN 2 ELSE 3 END AS c_i_48,
CASE WHEN c_i_47 < -2.0 THEN 0 WHEN c_i_47 < -0.3139175 THEN 1 WHEN c_i_47 < 0.2286314 THEN 2 ELSE 3 END AS c_i_49
FROM
<input_schema>.<input_table_view> t
)
SELECT
c_i_35 AS row_id,
c_i_48 AS mypca_pc_1,
c_i_49 AS mypca_pc_2,
c_i_39 AS target_class
FROM
preprocessing_steps t
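The scaled columns above (c_i_40 to c_i_42) all follow the same ZEROIFNULL( (x - mean) / NULLIF(std, 0) ) pattern that StandardScaler emits. A minimal sketch of that expression and its semantics (the helper names are illustrative, not part of tdprepview):

```python
def standard_scale_sql(col: str, mean: float, std: float) -> str:
    """Render the in-DB standard-scaling expression:
    NULLIF guards against a zero standard deviation, and ZEROIFNULL
    maps NULL inputs to 0 (i.e. to the column mean after scaling)."""
    return f"ZEROIFNULL( (( {col} ) - {mean} ) / NULLIF( {std} , 0) )"

def standard_scale_py(x, mean, std):
    """Python equivalent of the SQL expression above."""
    if std == 0:      # NULLIF(std, 0) -> NULL -> ZEROIFNULL -> 0
        return 0.0
    if x is None:     # NULL input -> NULL result -> ZEROIFNULL -> 0
        return 0.0
    return (x - mean) / std

print(standard_scale_sql("floatfeat_1", -0.2331302, 1.747159))
```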
If you want to inspect the pipeline as a chart:
mypipeline.plot_sankey()
Output: a Sankey chart of the pipeline DAG.
2.c persist transformed data as a view for later use
This is suitable for, and compliant with, the Teradata ModelOps framework, where training.py and scoring.py are separate scripts.
In training.py:
view_train_transf = table_train+"_transf_v"
view_score_transf = table_score+"_transf_v"
mypipeline.fit(schema_name = myschema,
table_name = table_train)
# 3. transform: create views for pipelines
# DF_train_transf is already based on the newly created View
DF_train_transf = mypipeline.transform(
schema_name = myschema,
table_name = table_train,
    # this triggers the creation of a VIEW, thus the transformation
    # pipeline is persisted
create_replace_view = True,
output_schema_name = myschema,
output_view_name= view_train_transf)
# 3.b create view for scoring table - no need for further inspection, thus no return
mypipeline.transform( schema_name = myschema,
table_name = table_score,
return_type = None,
create_replace_view = True,
output_schema_name = myschema,
output_view_name= view_score_transf)
# further steps:
# Model training with DF_train_transf,
# e.g. local and use BYOM, or in-DB with TD_DecisionForest
# save model in DB
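With create_replace_view=True, the generated query is persisted as a view. The DDL pattern can be sketched as follows (a simplification; the exact statement tdprepview emits may differ):

```python
def replace_view_ddl(schema: str, view: str, query: str) -> str:
    """Wrap a generated SELECT query in Teradata's REPLACE VIEW DDL,
    which creates the view or replaces it if it already exists."""
    return f'REPLACE VIEW "{schema}"."{view}" AS\n{query}'

print(replace_view_ddl("myschema", "table_train_transf_v",
                       "SELECT c_i_35 AS row_id FROM preprocessing_steps t"))
```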
In scoring.py:
view_score_transf = table_score+"_transf_v"
# 1. get DataFrame based on View, transform is happening as per view definition
DF_score_transf = tdml.DataFrame(tdml.in_schema(myschema, view_score_transf))
# 2. Model Scoring with trained model
# (e.g. PMMLPredict, TD_DecisionForestPredict,...)
# get pointer for model from DB + execute scoring in DB
# 3. Save Scored Data in DB
History
v0.1.0 (2023-02-15)
added
- First release on PyPI.
- Pipeline with fit and transform functions
- Preprocessor Functions
- Impute
- ImputeText
- TryCast
- Scale
- CutOff
- FixedWidthBinning
- ThresholdBinarizer
- ListBinarizer
- VariableWidthBinning
- LabelEncoder
- CustomTransformer
- Notebooks for tests
- Demo Notebook
v0.1.2 (2023-02-15)
fixed
- added *.sql to MANIFEST.in so that SQL templates are also part of the distribution.
changed
- HISTORY and README file from rst to Markdown
v0.1.3 (2023-02-16)
added
- Quickstart in README file
v0.1.4 (2023-02-17)
added
- DecisionTreeBinning as Preprocessing function
v1.0.2 (2023-03-06)
Major Overhaul of tdprepview. It now supports transformations that change the schema of the Input DataFrame, like OneHotEncoding. This implementation is based on a directed acyclic graph (DAG).
added
- plot_sankey() function for the Pipeline class, which plots the DAG as a Sankey chart.
- Preprocessing functions
- SimpleImputer (adapted from sklearn, based on Impute)
- IterativeImputer (adapted from sklearn)
- StandardScaler (adapted from sklearn, based on Scale)
- MaxAbsScaler (adapted from sklearn, based on Scale)
- MinMaxScaler (adapted from sklearn, based on Scale)
- RobustScaler (adapted from sklearn, based on Scale)
- Normalizer (adapted from sklearn)
- QuantileTransformer (adapted from sklearn, based on VariableWidthBinning)
- Binarizer (adapted from sklearn, based on ThresholdBinarizer)
- PolynomialFeatures (adapted from sklearn)
- OneHotEncoder (adapted from sklearn)
- PCA (adapted from sklearn)
changed
- exactly one query is generated: a single WITH clause that contains all preprocessing, followed by one final SELECT that returns all final columns with the correct column names.
- the steps in the steps argument of Pipeline are more flexible. They are either tuples with two elements (input_column(s), preprocessor(s)) or tuples with three elements (input_column(s), preprocessor(s), options)
- input_column(s): string, list of strings or NEW dictionary for dynamic filtering:
{'pattern':<regex pattern to search for column names>}
{'prefix':<string column names start with>}
{'suffix':<string column names end with>}
{'dtype_include':<list of tdtypes to include>}
{'dtype_exclude':<list of tdtypes to exclude>}
- filter conditions can be combined
- preprocessor(s): one or list of Preprocessors.
- options: dict for renaming column names after applying preprocessor. Useful for filtering them in a later step
{'prefix':<adds string to beginning of column names>}
{'suffix':<adds string to end of column names>}
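The filter semantics above can be sketched with a small helper (hypothetical, not the package's internal implementation) that resolves a filter dict against a column-to-dtype mapping, combining all conditions with AND:

```python
import re

def resolve_columns(columns: dict, filters: dict) -> list:
    """Select column names from {name: dtype} that satisfy ALL filters.
    Supported keys mirror the step options: pattern, prefix, suffix,
    dtype_include, dtype_exclude."""
    selected = []
    for name, dtype in columns.items():
        if "pattern" in filters and not re.search(filters["pattern"], name):
            continue
        if "prefix" in filters and not name.startswith(filters["prefix"]):
            continue
        if "suffix" in filters and not name.endswith(filters["suffix"]):
            continue
        if "dtype_include" in filters and dtype not in filters["dtype_include"]:
            continue
        if "dtype_exclude" in filters and dtype in filters["dtype_exclude"]:
            continue
        selected.append(name)
    return selected

cols = {"floatfeat_1": "FLOAT", "floatfeat_2": "FLOAT", "target_class": "INTEGER"}
resolve_columns(cols, {"pattern": r"^floatfeat_[1-5]$"})  # both float features
```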
v1.1.0 (2024-03-01)
bugfixes, compatibility with newer teradataml version, optional requirements, feature hashing
added
- Preprocessing functions
- SimpleHashEncoder (a text column can be hash encoded using in-DB HASHROW)
changed
- when a pipeline is crystallised in a view, the package checks the teradataml version and uses either tdml.get_context().execute(q) or tdml.execute_sql(q)
- bugfix: IterativeImputer query generation
- plotly & seaborn are now optional requirements. Only when plot_sankey() is called does the package check whether they are available.
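The execute dispatch can be sketched with feature detection (a simplification; tdprepview's actual check is based on the teradataml version):

```python
def run_sql(tdml, query: str):
    """Execute a query with whichever API this teradataml version offers.
    Newer teradataml releases expose tdml.execute_sql(); older ones only
    have tdml.get_context().execute()."""
    if hasattr(tdml, "execute_sql"):
        return tdml.execute_sql(query)
    return tdml.get_context().execute(query)
```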
v1.2.0 (2024-03-06)
minor bugfixes, Multi-Label Binarizer
added
- Preprocessing functions
- MultiLabelBinarizer (a text column with multiple different values separated by a delimiter can be encoded with multiple binary variables)
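The encoding MultiLabelBinarizer performs can be illustrated in plain Python (illustrative only; tdprepview generates the equivalent logic as SQL):

```python
def multilabel_binarize(value, classes, delimiter=","):
    """Encode a delimited text value as one binary indicator per class,
    e.g. 'red,blue' with classes [red, green, blue] -> 1, 0, 1."""
    tokens = {t.strip() for t in value.split(delimiter)} if value else set()
    return {c: int(c in tokens) for c in classes}

multilabel_binarize("red, blue", ["red", "green", "blue"])
```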
v1.3.0 (2024-03-06)
column_exclude option for inputs in steps, Cast function
added
- Preprocessing functions
- Cast (analogous to SQL, useful to convert everything but target and key to float as last step)
v1.3.1 (2024-03-26)
added
- Preprocessing functions
- PowerTransformer (analogous to sklearn PowerTransformer)
v1.3.2 (2024-03-28)
Fitted Pipeline can now be persisted in a serialised file and reused later.
added
- Pipeline can now be serialized using mydict = mypipeline.to_dict() or mypipeline.to_json("mypipeline.json")
- Pipelines can now analogously be re-created from a serialized representation using mypipeline = Pipeline.from_dict(mydict) or mypipeline = Pipeline.from_json("mypipeline.json")
v1.4.0 (2024-04-08)
Introducing automatic pipeline creation based on heuristics, either via Pipeline.from_DataFrame(...) or via auto_code(...)
added
- Pipeline can now be created automatically based on a tdml.DataFrame
mypipeline = Pipeline.from_DataFrame(DF, non_feature_cols=["rowid","target"], fit_pipeline=True)
- It uses heuristics based on data types and distributions to decide which preprocessing functions make sense.
- If you only want to see the code for the steps parameter, you can use:
steps_str = auto_code(DF, non_feature_cols=["rowid","target"])
print(steps_str) # see steps and adjust if needed...
steps = eval(steps_str)
mypipeline = Pipeline(steps)
Hashes for tdprepview-1.4.1-py2.py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | c2d6589833d2fd8e8cedfb44a8b4a6e5c82557c06dfe7c9b98d49da4ca997c00 |
| MD5 | 3dac829aa4e50d14eaae46615317eeee |
| BLAKE2b-256 | 2eac8cbaabbf085ba9ac0e9b31e62e132d0133b389fe3ec094cae15d0a958cc2 |