Python Package that creates Data Preparation Pipeline in Teradata-SQL in Views
Project description
tdprepview
Python Package that creates Data Preparation Pipelines in Views written in Teradata-SQL.
Installation
pip install tdprepview
Features
- Pipeline class that allows creating in-DB preprocessing pipelines
- Several Preprocessor functions
- API similar to sklearn.Pipeline
- Pipeline can be saved and loaded as dict/json
- Pipeline can be automatically created based on data types and distributions.
Quickstart
1. setting up the pipeline
import teradataml as tdml
database_credentials = {
# ...
}
DbCon = tdml.create_context(**database_credentials)
myschema, table_train, table_score = "...", "...", "..."
DF_train = tdml.DataFrame(tdml.in_schema(myschema,table_train))
DF_score = tdml.DataFrame(tdml.in_schema(myschema,table_score))
from tdprepview import Pipeline
from tdprepview import (
StandardScaler, IterativeImputer, PCA, DecisionTreeBinning )
steps = [
({"pattern":"^floatfeat_[1-5]$"}, [StandardScaler(), IterativeImputer()], {"suffix":"_imputed"}),
({"suffix":"_imputed"}, PCA(n_components=2), {"prefix":"mypca_"}),
({"prefix":"mypca_"}, DecisionTreeBinning(target_var="target_class", no_bins=3))
]
mypipeline = Pipeline(steps=steps)
2.a transform DataFrames directly
this is suitable if you are working in a sandbox Jupyter Notebook
# fit() calculates all necessary statistics in DB
mypipeline.fit(DF=DF_train)
# transform generates SQL syntax based on the chosen Preprocessors and the statistics from fit()
DF_train_transf = mypipeline.transform(DF_train)
DF_score_transf = mypipeline.transform(DF_score)
#... go on with some modelling (e.g. sklearn, TD_DecisionForest,...)
# and scoring (via BYOM, TD_DecisionForestPredict, ...)
2.b inspect produced queries
if you want to check the SQL code, that is generated py tdprepview
mypipeline.fit(DF=DF_train)
query_train = mypipeline.transform(DF_train, return_type = "str")
print(query_train)
the output: (Note how columns not part of the pipeline are simply forwarded)
WITH preprocessing_steps AS
(
SELECT
row_id AS c_i_35,
floatfeat_1 AS c_i_36,
floatfeat_2 AS c_i_37,
floatfeat_3 AS c_i_38,
target_class AS c_i_39,
ZEROIFNULL( (( c_i_36 ) - -0.2331302 ) / NULLIF( 1.747159 , 0) ) AS c_i_40,
ZEROIFNULL( (( c_i_37 ) - 0.03895576 ) / NULLIF( 1.722347 , 0) ) AS c_i_41,
ZEROIFNULL( (( c_i_38 ) - 0.2363556 ) / NULLIF( 2.808312 , 0) ) AS c_i_42,
-1.118451e-08 + (-0.07714142) * (COALESCE(c_i_41, 1.610355e-09)) + (-0.1758817) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_43,
4.261288e-09 + (-0.0431946) * (COALESCE(c_i_40, -1.045776e-08)) + (0.6412595) * (COALESCE(c_i_42, -4.838372e-09)) AS c_i_44,
-7.079888e-09 + (-0.118112) * (COALESCE(c_i_40, -1.045776e-08)) + (0.624912) * (COALESCE(c_i_41, 1.610355e-09)) AS c_i_45,
(0.2604757) * (c_i_43) + (-0.681657) * (c_i_44) + (-0.6837369) * (c_i_45) AS c_i_46,
(-0.1047098) * (c_i_43) + (0.6840609) * (c_i_44) + (-0.7218702) * (c_i_45) AS c_i_47,
CASE WHEN c_i_46 < -2.0 THEN 0 WHEN c_i_46 < -1.236351 THEN 1 WHEN c_i_46 < -1.182989 THEN 2 ELSE 3 END AS c_i_48,
CASE WHEN c_i_47 < -2.0 THEN 0 WHEN c_i_47 < -0.3139175 THEN 1 WHEN c_i_47 < 0.2286314 THEN 2 ELSE 3 END AS c_i_49
FROM
<input_schema>.<input_table_view> t
)
SELECT
c_i_35 AS row_id,
c_i_48 AS mypca_pc_1,
c_i_49 AS mypca_pc_2,
c_i_39 AS target_class
FROM
preprocessing_steps t
if you want to inspect the pipeline as a chart
mypipeline.plot_sankey()
Output:
2.c persist transformed data as a view for later use
this is suitable and compliant with the Teradata ModelOps-Framework, where training.py and scoring.py are separate scripts.
in training.py
:
view_train_transf = table_train+"_transf_v"
view_score_transf = table_score+"_transf_v"
mypipeline.fit(schema_name = myschema,
table_name = table_train)
# 3. transform: create views for pipelines
# DF_train_transf is already based on the newly created View
DF_train_transf = mypipeline.transform(
schema_name = myschema,
table_name = table_train,
# this triggeres the creation of a VIEW, thus the transformation
# pipeline is persited
create_replace_view = True,
output_schema_name = myschema,
output_view_name= view_train_transf)
# 3.b create view for scoring table - no need for further inspection, thus no return
mypipeline.transform( schema_name = myschema,
table_name = table_score,
return_type = None,
create_replace_view = True,
output_schema_name = myschema,
output_view_name= view_score_transf)
# further steps:
# Model training with DF_train_transf,
# e.g. local and use BYOM, or in-DB with TD_DecisionForest
# save model in DB
in scoring.py
:
view_score_transf = table_score+"_transf_v"
# 1. get DataFrame based on View, transform is happening as per view definition
DF_score_transf = tdml.DataFrame(tdml.in_schema(myschema, view_score_transf))
# 2. Model Scoring with trained model
# (e.g. PMMLPredict, TD_DecisionForestPredict,...)
# get pointer for model from DB + execute scoring in DB
# 3. Save Scored Data in DB
History
v0.1.0 (2023-02-15)
added
- First release on PyPI.
- Pipeline with fit and transform functions
- Preprocessor Functions
- Impute
- ImputeText
- TryCast
- Scale
- CutOff
- FixedWidthBinning
- ThresholdBinarizer
- ListBinarizer
- VariableWidthBinning
- LabelEncoder
- CustomTransformer
- Notebooks for tests
- Demo Notebook
v0.1.2 (2023-02-15)
fixed
- added *.sql to MANIFEST.ln such that SQL templates are also part of the distribution.
changed
- HISTORY and README file from rst to Markdown
v0.1.3 (2023-02-16)
added
- Quickstart in README file
v0.1.4 (2023-02-17)
added
- DecisionTreeBinning as Preprocessing function
v1.0.2 (2023-03-06)
Major Overhaul of tdprepview. It now supports transformations that change the schema of the Input DataFrame, like OneHotEncoding. This implementation is based on a directed acyclic graph (DAG).
added
- plot_sankey() Function for Pipeliine class, which plots the DAG as sankey chart.
- Preprocessing functions
- SimpleImputer (adapted from sklearn, based on Impute)
- IterativeImputer (adapted from sklearn)
- StandardScaler (adapted from sklearn, based on Scale)
- MaxAbsScaler (adapted from sklearn, based on Scale)
- MinMaxScaler (adapted from sklearn, based on Scale)
- RobustScaler (adapted from sklearn, based on Scale)
- Normalizer (adapted from sklearn)
- QuantileTransformer (adapted from sklearn, based on VariableWidthBinning)
- Binarizer (adapted from sklearn, based on ThresholdBinarizer)
- PolynomialFeatures (adapted from sklearn)
- OneHotEncoder (adapted from sklearn)
- PCA (adapted from sklearn)
changed
- exactly one query is generated, with one WITH AS part, that contains all preprocessing, one final SELECT, that contains all final columns with the correct column names.
- the steps in the steps argument of Pipeline are more flexible. They are either tuples with two elements (input_column(s), preprocessor(s)) or tuples with three elements (input_column(s), preprocessor(s), options)
- input_column(s): string, list of strings or NEW dictionary for dynamic filtering:
{'pattern':<regex pattern to search for column names>}
{'prefix':<string column names start with>}
{'suffix':<string column names end with>}
{'dtype_include':<list of tdtypes to include>}
{'dtype_exclude':<list of tdtypes to exclude>}
- filter conditions can be combined
- preprocessor(s): one or list of Preprocessors.
- options: dict for renaming column names after applying preprocessor. Useful for filtering them in a later step
{'prefix':<adds string to beginning of column names>}
{'suffix':<adds string to end of column names>}
- input_column(s): string, list of strings or NEW dictionary for dynamic filtering:
v1.1.0 (2024-03-01)
bugfixes, compatibility with newer teradataml version, optional requirements, feature hashing
added
- Preprocessing functions
- SimpleHashEncoder (a text column can be hash encoded using in-DB HASHROW)
changed
- when a pipeline is crystallised in a view, the package checks the teradataml version and uses either
tdml.get_context().execute(q)
ortdml.execute_sql(q)
. - bugfix:
IterativeImputer
query generation - plotly & seaborn are now optional requirements. Only when
plot_sankey()
is called, it checks whether the packages are available.
v1.2.0 (2024-03-06)
minor bugfixes, Multi-Label Binarizer
added
- Preprocessing functions
- MultiLabelBinarizer (a text column with multiple different values separated by a delimiter can be encoded with multiple binary variables)
v1.3.0 (2024-03-06)
column_exclude
options for inputs in steps, Cast function
added
- Preprocessing functions
- Cast (analogous to SQL, useful to convert everything but target and key to float as last step)
v1.3.1 (2024-03-26)
added
- Preprocessing functions
- PowerTransformer (analogous to sklearn PowerTransformer)
v1.3.2 (2024-03-28)
Fitted Pipeline can now be persisted in a serialised file and reused later.
added
- Pipeline can now be serialized using the
mydict = mypipeline.to_dict()
ormypipeline.to_json("mypipeline.json")
functions
- and Pipelines can now analogously be re-created from a serialized representation using the
mypipeline = Pipeline.from_dict(mydict)
ormypipeline = Pipeline.from_json("mypipeline.json")
or functions.
v1.4.0 (2024-04-08)
Introducing automatic Pipeline creation based on heuristics. Either via Pipeline.from_DataFrame(...)
or via auto_code(...)
added
- Pipeline can now automatically created based on a tdml.DataFrame
mypipeline = Pipeline.from_DataFrame(DF, non_feature_cols=["rowid","target"], fit_pipeline=True)
- It'll use heuristics based on datatypes and distributions to decide which preprocessing function would make sense.
- If you only want to see the code for the
steps
parameter, you can usesteps_str = auto_code(DF, non_feature_cols=["rowid","target"])
print(steps_str) # see steps and adjust if needed...
steps = eval(steps_str)
mypipeline = Pipeline(steps)
v1.5.0 (2024-09-19)
adding TargetEncoder
added
- Preprocessing Function
TargetEncoder
as in sklearn: Each category is encoded based on a shrunk estimate of the average target values for observations belonging to the category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for tdprepview-1.5.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0aa1abf8b1b61bbedf093e3b55a38b5750d80ff68551761a9939a99fb3919fa0 |
|
MD5 | 9c211205d636c65e83c244723aec4fdd |
|
BLAKE2b-256 | 89ad64548631bfcf9c01a9f75b0733114d56362b475b663531da31c75629e55b |