A package for Script Table Operator that applies set theory to machine learning in Python.
tdstone2 Package
Overview
tdstone2 operationalizes Python code for machine learning and data analysis on Teradata Vantage using the Script Table Operator (STO). It leverages Teradata's MPP architecture to run hundreds of Python scripts in parallel across hundreds of data partitions — enabling hyper-segmented model deployment with full lineage, versioning, and minimal data movement.
Features
- Hyper-segmented Model Deployment: Train one independent model per data partition in parallel across Teradata AMPs
- Scikit-learn Pipeline Integration: Auto-generate STO scripts from sklearn Pipelines (classifier, regressor, anomaly detection)
- Feature Engineering: Deploy custom or reducer-based feature engineering per partition
- Vector Embeddings: Install HuggingFace/ONNX models and compute embeddings in-database
- Seq2Seq Inference: Deploy summarization and translation models (e.g., flan-t5) via STO
- Model Lineage & Versioning: Temporal tables track every version of every trained model
- Two Execution Backends: Script Table Operator (Vantage Enterprise + Vantage Cloud Enterprise) and Apply (Vantage Cloud Lake / OAF), exposed through the same HyperModel API and selected via use_apply=True
- Per-batch Timing Instrumentation: every scored row carries SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME, PROCESS_TIME, PRINT_TIME, and BATCH_NO, so you can diagnose which phase dominated wall-clock time, per partition, with a single SELECT
- HTML Reports: train(report=True) / score(report=True) emit inline reports with partition-duration histograms, top-10 longest/fastest partitions, input/output row-count distributions, and a model-size histogram
Installation
pip install tdstone2
Requires access to a Teradata Vantage system with the Script Table Operator enabled.
Quick Start
1. Generate Test Data
import teradataml as tdml
from tdstone2.dataset_generation import GenerateEquallyDistributedDataSet
tdml.create_context(**Param) # Param = {'host': ..., 'user': ..., 'password': ...}
# Generate a synthetic partitioned dataset (21.6M rows, 216 partitions)
dataset = GenerateEquallyDistributedDataSet(n_partitions=216, n_rows=100000)
dataset.to_sql('dataset_00', schema_name=Param['database'])
Output schema: Partition_ID, ID, X1–X9 (features), Y1, Y2 (targets), flag, FOLD
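For experimenting with the same schema without a Vantage connection, a roughly equivalent partitioned dataset can be sketched locally with pandas and NumPy. The generator below is a hypothetical local stand-in for GenerateEquallyDistributedDataSet, not tdstone2 code; only the column names mirror the output schema above.

```python
import numpy as np
import pandas as pd

def generate_partitioned_dataset(n_partitions=4, n_rows=100, seed=0):
    """Local stand-in: one block of rows per partition, matching the schema above."""
    rng = np.random.default_rng(seed)
    frames = []
    for p in range(n_partitions):
        df = pd.DataFrame(rng.normal(size=(n_rows, 9)),
                          columns=[f'X{i}' for i in range(1, 10)])
        df.insert(0, 'Partition_ID', p)
        df.insert(1, 'ID', range(p * n_rows, (p + 1) * n_rows))
        df['Y1'] = df['X1'] + rng.normal(scale=0.1, size=n_rows)  # numeric target
        df['Y2'] = (df['X2'] > 0).astype(int)                     # categorical target
        df['flag'] = rng.integers(0, 2, size=n_rows)
        df['FOLD'] = np.where(rng.random(n_rows) < 0.8, 'train', 'test')
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

local_ds = generate_partitioned_dataset()
```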
2. Initialize the Framework
from tdstone2.tdstone import TDStone
sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'])
sto.setup() # creates repository tables + installs STO files in Vantage
For Vantage Cloud Lake (OpenAF / STO-via-OAF):
sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'], oaf_env='my_env')
sto.setup()
For Vantage Cloud Lake via the Apply backend (recommended on Lake — STO is not always available there):
sto = TDStone(
schema_name = Param['database'],
use_apply = True, # route train/score/FE through tdml.Apply
apply_env_name = 'tdstone2_sklearn', # OAF user-environment with sklearn installed
compute_group = 'CG_BusGrpB_ANL', # sets QueryBand 'compute=...' for ACC routing
connect_kwargs = { # explicit cluster pinning (avoids the
'host': Param['host'], # multi-cluster trap when several VCL_*_HOST
'user': Param['user'], # env vars are set at once)
'password': Param['password'],
'database': Param['database'],
},
)
sto.setup() # creates OFS-resident repository tables + installs Apply scripts in the env
The Apply path stores all repository tables with STORAGE = TD_OFSSTORAGE, replaces the
PERIOD FOR ValidPeriod temporal column with a plain CREATION_DATE (OFS rejects temporal
periods), and runs the analogous tds_*_apply.py scripts inside the user environment. The
public HyperModel / FeatureEngineering API is identical to the STO path.
Hyper-segmented Models
Deploy a Scikit-learn Classifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tdstone2.tdshypermodel import HyperModel
steps = [
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(max_depth=5, n_estimators=95))
]
model_parameters = {
"target": 'Y2',
"column_categorical": ['flag', 'Y2'],
"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'flag']
}
model = HyperModel(
tdstone=sto,
metadata={'project': 'my_project'},
skl_pipeline_steps=steps,
model_parameters=model_parameters,
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
id_fold='FOLD',
fold_training='train',
convert_to_onnx=False, # set True for ONNX export
store_pickle=True,
)
model.train() # trains 216 independent models in parallel (one per partition)
model.score() # scores all data using each partition's model
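Conceptually, train() fits one independent pipeline per partition and score() applies each partition's own model back to its rows. A minimal local sketch of that idea with pandas and scikit-learn (the toy data frame and loop below are illustrative, not tdstone2 internals):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Partition_ID': np.repeat([0, 1, 2], 50),
    'X1': rng.normal(size=150),
    'X2': rng.normal(size=150),
})
df['Y2'] = (df['X1'] + df['X2'] > 0).astype(int)

# "train": one fitted pipeline per partition, as on the AMPs
models = {}
for pid, part in df.groupby('Partition_ID'):
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('classifier', RandomForestClassifier(max_depth=5, n_estimators=95))])
    pipe.fit(part[['X1', 'X2']], part['Y2'])
    models[pid] = pipe

# "score": each row is predicted by its own partition's model
df['prediction'] = -1
for pid, part in df.groupby('Partition_ID'):
    df.loc[part.index, 'prediction'] = models[pid].predict(part[['X1', 'X2']])
```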
Deploy a Scikit-learn Regressor
from sklearn.ensemble import RandomForestRegressor
steps = [
('scaler', StandardScaler()),
('regressor', RandomForestRegressor(max_depth=5, n_estimators=95))
]
model_parameters = {
"target": 'Y1',
"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9']
}
model = HyperModel(
tdstone=sto,
metadata={'project': 'regression'},
skl_pipeline_steps=steps,
model_parameters=model_parameters,
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
id_fold='FOLD',
fold_training='train',
)
model.train()
model.score()
Deploy Anomaly Detection (OneClassSVM)
from sklearn.svm import OneClassSVM
steps = [
('scaler', StandardScaler()),
('anomaly', OneClassSVM(nu=0.05))
]
model = HyperModel(
tdstone=sto,
metadata={'project': 'anomaly_detection'},
skl_pipeline_steps=steps,
model_parameters={"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
id_fold='FOLD',
fold_training='train',
)
model.train()
model.score()
# Output: anomaly flag (1=inlier, -1=outlier), decision_function, anomaly_score
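The 1/-1 convention follows scikit-learn's OneClassSVM, which can be checked locally on synthetic data (this is plain scikit-learn, not tdstone2 output):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # mostly inliers
X[:5] += 8                      # a few obvious outliers

pipe = Pipeline([('scaler', StandardScaler()), ('anomaly', OneClassSVM(nu=0.05))])
pipe.fit(X)

labels = pipe.predict(X)             # 1 = inlier, -1 = outlier
scores = pipe.decision_function(X)   # negative for points on the outlier side
```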
Deploy LassoLarsCV (Feature Selection + Regression)
from sklearn.linear_model import LassoLarsCV
steps = [
('scaler', StandardScaler()),
('lasso', LassoLarsCV())
]
model = HyperModel(
tdstone=sto,
metadata={'project': 'lasso_regression'},
skl_pipeline_steps=steps,
model_parameters={"target": 'Y1', "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
id_fold='FOLD',
fold_training='train',
)
model.train()
model.score()
Retrieve Predictions
# Denormalized view: predictions joined with original features
predictions = model.get_model_predictions()
# Raw (normalized) predictions table
predictions_raw = model.get_model_predictions(denormalized_view=False)
# Include per-batch timing columns (SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME,
# PROCESS_TIME, PRINT_TIME, BATCH_NO) — useful for diagnosing per-partition slowness
predictions_with_timing = model.get_model_predictions(include_timing=True)
# Trained model metadata
trained_models = model.get_trained_models()
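Once fetched, the timing columns can be summarized client-side to find the dominant phase per batch. A small pandas sketch with toy values (the frame below is illustrative, not real scoring output):

```python
import pandas as pd

# toy stand-in for the timing columns returned with include_timing=True
timing = pd.DataFrame({
    'BATCH_NO': [1, 1, 2],
    'IMPORT_TIME': [0.8, 0.7, 0.9],
    'LOAD_TIME': [2.5, 0.3, 0.4],
    'PROCESS_TIME': [0.4, 3.1, 0.5],
    'PRINT_TIME': [0.1, 0.1, 0.1],
})
phases = ['IMPORT_TIME', 'LOAD_TIME', 'PROCESS_TIME', 'PRINT_TIME']

# which phase dominated each batch's wall-clock time
dominant = timing[phases].idxmax(axis=1)
```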
Inline HTML Reports
Pass report=True to train() or score() to render an inline HTML report with
partition-duration histograms, top-10 longest/fastest partitions, input/output
row-count distributions, and (for training) a model-size histogram:
model.train(report=True)
model.score(report=True)
Inspect the Underlying SQL
# View the generated Script Table Operator SQL
print(model.mapper_scoring.generate_sto_query())
Reload an Existing Hyper-segmented Model
# List all registered hyper-models
sto.list_hyper_models()
# Reload by UUID
existing_model = HyperModel(tdstone=sto)
existing_model.download(id='0286d259-ecde-4cd0-ae4a-bcb3191383d1')
# Retrain and rescore (no code needed — everything is stored in Vantage)
existing_model.train()
existing_model.score()
Feature Engineering
Dimensionality Reduction (Reducer)
from tdstone2.tdsfeature_engineering import FeatureEngineering
fe = FeatureEngineering(
tdstone=sto,
feature_engineering_type='feature engineering reducer',
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
id_fold='FOLD',
fold_training='train',
metadata={'project': 'pca_reduce'},
)
fe.reduce()
reduced = fe.get_reduced_features()
Custom Feature Engineering
Provide a Python script that computes new columns from existing ones:
fe = FeatureEngineering(
tdstone=sto,
feature_engineering_type='feature engineering',
script_path='path/to/Feature_Interactions.py',
dataset=tdml.in_schema(Param['database'], 'dataset_00'),
id_row='ID',
id_partition='Partition_ID',
metadata={'project': 'interactions'},
)
fe.transform()
# All original + derived features denormalized
features = fe.get_computed_features()
# Raw (normalized) features table
features_raw = fe.get_computed_features(denormalized_view=False)
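A script such as Feature_Interactions.py essentially derives new columns from existing ones. A hypothetical local equivalent of such a transform, using only pandas (the function name and column-naming scheme are assumptions for illustration):

```python
from itertools import combinations

import pandas as pd

def feature_interactions(df, columns):
    """Add a pairwise product column a_x_b for every pair of the given features."""
    out = df.copy()
    for a, b in combinations(columns, 2):
        out[f'{a}_x_{b}'] = out[a] * out[b]
    return out

df = pd.DataFrame({'X1': [1.0, 2.0], 'X2': [3.0, 4.0], 'X3': [5.0, 6.0]})
features = feature_interactions(df, ['X1', 'X2', 'X3'])
```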
Vector Embeddings
Install a HuggingFace Model (ONNX)
from tdstone2.tdsgenai import (
install_model_in_vantage_from_name,
customize_existing_model,
install_zip_in_vantage,
list_installed_files_byom,
get_model_dimension,
)
# Download, convert to ONNX, customize tokenizer, upload
install_model_in_vantage_from_name(
model_name='intfloat/multilingual-e5-small',
model_task='sentence-similarity',
upload=True,
generate_zip=True,
)
list_installed_files_byom() # verify installation
get_model_dimension(model_name='intfloat/multilingual-e5-small') # e.g., 384
For Vantage Cloud Lake (OAF):
from tdstone2.tdsgenai_lake import install_model_in_vantage_from_name
install_model_in_vantage_from_name(model_name='intfloat/multilingual-e5-small', ...)
Compute Vector Embeddings In-Database (BYOM)
from tdstone2.tdsgenai import compute_vector_embedding_byom
embeddings = compute_vector_embedding_byom(
model='tdstone2_emb_384_intfloat_multilingual_e5_small',
dataset=text_dataframe,
text_column='content',
accumulate_columns=['id'],
schema_name=Param['database'],
table_name='embeddings',
primary_index=['id'],
)
Output: table with id + 384-dimensional embedding vector per row.
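Downstream, such vectors are typically compared with cosine similarity. A minimal NumPy sketch over toy 384-dimensional vectors (illustrative only, not part of tdstone2):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 for identical directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
emb_a = rng.normal(size=384)
emb_b = rng.normal(size=384)

sim_self = cosine_similarity(emb_a, emb_a)  # identical vectors
sim_ab = cosine_similarity(emb_a, emb_b)
```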
Seq2Seq Models (Summarization / Translation / Language Detection)
from tdstone2.tdsgenai_seq2seq import install_seq2seq_model, run_seq2seq
# Install model (e.g., flan-t5-small for summarization)
install_seq2seq_model(
model_name='google/flan-t5-small',
model_task='summarization',
upload=True,
)
# Run via Script Table Operator
results = run_seq2seq(
dataset=text_dataframe,
text_column='content',
model='tdstone2_seq2seq_google_flan_t5_small',
schema_name=Param['database'],
)
Lineage & Registry
sto.list_codes() # registered Python scripts/classes
sto.list_models() # model configs with all arguments
sto.list_mappers() # training/scoring/feature-engineering mappers
sto.list_hyper_models() # hyper-segmented model registrations
sto.list_feature_engineering_models() # feature engineering pipeline registrations
Local Validation / Debugging
Execute the generated model code locally before deploying to Vantage:
code_and_data = model.get_code_and_data(partition_id=1)
exec(code_and_data['code'])  # defines the model class (MyModel) in the local namespace
local_model = MyModel(**code_and_data['arguments'])
df_local = code_and_data['data']
df_local['flag'] = df_local['flag'].astype('category')  # restore categorical dtypes
df_local['Y2'] = df_local['Y2'].astype('category')
local_model.fit(df_local)
local_model.score(df_local)
Demo Notebooks
The demos/ folder contains end-to-end worked examples:
| Series | Location | Content |
|---|---|---|
| Core workflow | demos/notebooks/ | Data generation → setup → HyperModel → feature engineering → retrieval (01–16) |
| Scikit-learn models | demos/notebooks Demo Hypermodel Scikit-Learn/ | Classifier, Regressor, Anomaly, LassoLarsCV (STO path) |
| Scikit-learn models on VCL Apply | demos/notebooks Demo Hypermodel Scikit-Learn with OAF/ | Same flow on Vantage Cloud Lake via use_apply=True (VCL1 / VCL2 variants) |
| Script Table Operator | demos/demo script csae/notebooks/ | Raw STO usage with anomaly detection |
| BYOM Vector Embedding | demos/notebooks Demo BYOM - Vector Embedding/ | HuggingFace ONNX install + BYOM embedding |
| OAF Vector Embedding | demos/notebooks Demo OAF - Vector Embedding/ | HuggingFace model on OpenAF platform |
| STO Vector Embedding | demos/notebooks Demo STO - Vector Embedding/ | Embedding via Script Table Operator |
| Seq2Seq | demos/notebooks Demo Seq2Seq/ | Summarization and language detection |
| Chat / Semantic Search | demos/demo chat/notebooks/ | Chunking, embedding, vector search |
Repository Tables (Default Names)
| Object | Table |
|---|---|
| Code | TDS_CODE_REPOSITORY |
| Model | TDS_MODEL_REPOSITORY |
| Trained Model | TDS_TRAINED_MODEL_REPOSITORY |
| Mapper | TDS_MAPPER_REPOSITORY |
| HyperModel | TDS_HYPER_MODEL_REPOSITORY |
| Feature Engineering | TDS_FEATURE_ENGINEERING_PROCESS_REPOSITORY |
All names are overridable via TDStone.__init__() kwargs.