A package for the Script Table Operator that applies set theory to machine learning in Python.

Project description

tdstone2 Package

Overview

tdstone2 operationalizes Python code for machine learning and data analysis on Teradata Vantage using the Script Table Operator (STO). It leverages Teradata's MPP architecture to run hundreds of Python scripts in parallel across hundreds of data partitions — enabling hyper-segmented model deployment with full lineage, versioning, and minimal data movement.

Features

  • Hyper-segmented Model Deployment: Train one independent model per data partition in parallel across Teradata AMPs
  • Scikit-learn Pipeline Integration: Auto-generate STO scripts from sklearn Pipelines (classifier, regressor, anomaly detection)
  • Feature Engineering: Deploy custom or reducer-based feature engineering per partition
  • Vector Embeddings: Install HuggingFace/ONNX models and compute embeddings in-database
  • Seq2Seq Inference: Deploy summarization and translation models (e.g., flan-t5) via STO
  • Model Lineage & Versioning: Temporal tables track every version of every trained model
  • Two Execution Backends: Script Table Operator (Vantage Enterprise + Vantage Cloud Enterprise) and Apply (Vantage Cloud Lake / OAF) — same HyperModel API, picked via use_apply=True
  • Per-batch Timing Instrumentation: every scored row carries SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME, PROCESS_TIME, PRINT_TIME, and BATCH_NO, so a single SELECT shows which phase dominated wall-clock time in each partition
  • HTML Reports: train(report=True) / score(report=True) emit inline reports — partition-duration histograms, top-10 longest/fastest, input/output row-count distributions, model-size histogram

Installation

pip install tdstone2

Requires access to a Teradata Vantage system with the Script Table Operator enabled.


Quick Start

1. Generate Test Data

import teradataml as tdml
from tdstone2.dataset_generation import GenerateEquallyDistributedDataSet

tdml.create_context(**Param)  # Param = {'host': ..., 'user': ..., 'password': ...}

# Generate a synthetic partitioned dataset (21.6M rows, 216 partitions)
dataset = GenerateEquallyDistributedDataSet(n_partitions=216, n_rows=100000)
dataset.to_sql('dataset_00', schema_name=Param['database'])

Output schema: Partition_ID, ID, X1–X9 (features), Y1, Y2 (targets), flag, FOLD

2. Initialize the Framework

from tdstone2.tdstone import TDStone

sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'])
sto.setup()  # creates repository tables + installs STO files in Vantage

For Vantage Cloud Lake (OpenAF / STO-via-OAF):

sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'], oaf_env='my_env')
sto.setup()

For Vantage Cloud Lake via the Apply backend (recommended on Lake — STO is not always available there):

sto = TDStone(
    schema_name    = Param['database'],
    use_apply      = True,                  # route train/score/FE through tdml.Apply
    apply_env_name = 'tdstone2_sklearn',    # OAF user-environment with sklearn installed
    compute_group  = 'CG_BusGrpB_ANL',      # sets QueryBand 'compute=...' for ACC routing
    connect_kwargs = {                      # explicit cluster pinning (avoids the
        'host':     Param['host'],          # multi-cluster trap when several VCL_*_HOST
        'user':     Param['user'],          # env vars are set at once)
        'password': Param['password'],
        'database': Param['database'],
    },
)
sto.setup()  # creates OFS-resident repository tables + installs Apply scripts in the env

The Apply path stores all repository tables with STORAGE = TD_OFSSTORAGE, replaces the PERIOD FOR ValidPeriod temporal column with a plain CREATION_DATE (OFS rejects temporal periods), and runs the analogous tds_*_apply.py scripts inside the user environment. The public HyperModel / FeatureEngineering API is identical to the STO path.


Hyper-segmented Models

Deploy a Scikit-learn Classifier

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tdstone2.tdshypermodel import HyperModel

steps = [
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y2',
    "column_categorical": ['flag', 'Y2'],
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'flag']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'my_project'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    convert_to_onnx=False,   # set True for ONNX export
    store_pickle=True,
)

model.train()   # trains 216 independent models in parallel (one per partition)
model.score()   # scores all data using each partition's model

Deploy a Scikit-learn Regressor

from sklearn.ensemble import RandomForestRegressor

steps = [
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y1',
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'regression'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()

Deploy Anomaly Detection (OneClassSVM)

from sklearn.svm import OneClassSVM

steps = [
    ('scaler', StandardScaler()),
    ('anomaly', OneClassSVM(nu=0.05))
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'anomaly_detection'},
    skl_pipeline_steps=steps,
    model_parameters={"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()
# Output: anomaly flag (1=inlier, -1=outlier), decision_function, anomaly_score
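
Once scored, the output can be pulled client-side and filtered on the anomaly flag. A minimal sketch on a synthetic sample (the column names anomaly and anomaly_score mirror the output described above, but check get_model_predictions() for the actual names in your release):

```python
import pandas as pd

# Hypothetical sample mimicking the scored output: 1 = inlier, -1 = outlier.
scored = pd.DataFrame({
    'ID':            [1, 2, 3, 4, 5],
    'Partition_ID':  [1, 1, 1, 2, 2],
    'anomaly':       [1, 1, -1, 1, -1],
    'anomaly_score': [0.12, 0.30, -0.45, 0.22, -0.10],
})

# Keep only the rows OneClassSVM flagged as outliers.
outliers = scored[scored['anomaly'] == -1]

# Outlier rate per partition — a quick sanity check against nu=0.05.
rate = scored.groupby('Partition_ID')['anomaly'].apply(lambda s: (s == -1).mean())
```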

Deploy LassoLarsCV (Feature Selection + Regression)

from sklearn.linear_model import LassoLarsCV

steps = [
    ('scaler', StandardScaler()),
    ('lasso', LassoLarsCV())
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'lasso_regression'},
    skl_pipeline_steps=steps,
    model_parameters={"target": 'Y1', "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()

Retrieve Predictions

# Denormalized view: predictions joined with original features
predictions = model.get_model_predictions()

# Raw (normalized) predictions table
predictions_raw = model.get_model_predictions(denormalized_view=False)

# Include per-batch timing columns (SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME,
# PROCESS_TIME, PRINT_TIME, BATCH_NO) — useful for diagnosing per-partition slowness
predictions_with_timing = model.get_model_predictions(include_timing=True)

# Trained model metadata
trained_models = model.get_trained_models()
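
With include_timing=True, a quick client-side aggregation shows which phase dominates in each partition. A sketch on a synthetic sample (the timing column names match the real output; the partition column name Partition_ID is assumed from the Quick Start dataset, and the real data would come from predictions_with_timing.to_pandas()):

```python
import pandas as pd

# Synthetic stand-in for predictions_with_timing.to_pandas():
# partition 1 is dominated by interpreter start-up, partition 2 by scoring.
timing = pd.DataFrame({
    'Partition_ID': [1, 1, 2, 2],
    'IMPORT_TIME':  [2.5, 2.5, 0.9, 0.9],
    'LOAD_TIME':    [0.3, 0.3, 0.2, 0.2],
    'PROCESS_TIME': [1.5, 1.6, 6.2, 6.0],
    'PRINT_TIME':   [0.1, 0.1, 0.1, 0.1],
})

phases = ['IMPORT_TIME', 'LOAD_TIME', 'PROCESS_TIME', 'PRINT_TIME']

# Mean time per phase for each partition, plus the dominant phase.
per_partition = timing.groupby('Partition_ID')[phases].mean()
dominant = per_partition.idxmax(axis=1)
```

A high IMPORT_TIME share typically points at per-call interpreter and library start-up cost rather than model complexity.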

Inline HTML Reports

Pass report=True to train() or score() to render an inline HTML report with partition-duration histograms, top-10 longest/fastest partitions, input/output row-count distributions, and (for training) a model-size histogram:

model.train(report=True)
model.score(report=True)

Inspect the Underlying SQL

# View the generated Script Table Operator SQL
print(model.mapper_scoring.generate_sto_query())

Reload an Existing Hyper-segmented Model

# List all registered hyper-models
sto.list_hyper_models()

# Reload by UUID
existing_model = HyperModel(tdstone=sto)
existing_model.download(id='0286d259-ecde-4cd0-ae4a-bcb3191383d1')

# Retrain and rescore (no code needed — everything is stored in Vantage)
existing_model.train()
existing_model.score()

Feature Engineering

Dimensionality Reduction (Reducer)

from tdstone2.tdsfeature_engineering import FeatureEngineering

fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering reducer',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    metadata={'project': 'pca_reduce'},
)
fe.reduce()
reduced = fe.get_reduced_features()

Custom Feature Engineering

Provide a Python script that computes new columns from existing ones:

fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering',
    script_path='path/to/Feature_Interactions.py',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    metadata={'project': 'interactions'},
)
fe.transform()

# All original + derived features denormalized
features = fe.get_computed_features()

# Raw (normalized) features table
features_raw = fe.get_computed_features(denormalized_view=False)

Vector Embeddings

Install a HuggingFace Model (ONNX)

from tdstone2.tdsgenai import (
    install_model_in_vantage_from_name,
    customize_existing_model,
    install_zip_in_vantage,
    list_installed_files_byom,
    get_model_dimension,
)

# Download, convert to ONNX, customize tokenizer, upload
install_model_in_vantage_from_name(
    model_name='intfloat/multilingual-e5-small',
    model_task='sentence-similarity',
    upload=True,
    generate_zip=True,
)

list_installed_files_byom()  # verify installation
get_model_dimension(model_name='intfloat/multilingual-e5-small')  # e.g., 384

For Vantage Cloud Lake (OAF):

from tdstone2.tdsgenai_lake import install_model_in_vantage_from_name
install_model_in_vantage_from_name(model_name='intfloat/multilingual-e5-small', ...)

Compute embeddings on Vantage Cloud Lake (OAF)

from tdstone2.tdsgenai_lake import compute_vector_embedding

# device='cuda' triggers a pre-flight Apply probe that verifies torch.cuda.is_available()
# in the OAF env *before* launching the real call. With cuda_strict=True (default), the
# call raises RuntimeError with an actionable diagnostic instead of silently falling back
# to CPU when torch is built for a different CUDA major than the cluster's NVIDIA driver
# supports (e.g. cu13 wheel vs CUDA 12 driver).
embeddings = compute_vector_embedding(
    model_name         = 'models--BAAI--bge-small-en-v1.5',
    dataset            = text_dataframe,
    text_column        = 'txt',
    accumulate_columns = ['id'],
    hash_columns       = ['id'],
    oaf_env            = 'bq20251218',         # see OAF env setup below
    schema_name        = Param['database'],
    table_name         = 'embeddings_bge',
    device             = 'cuda',
    cuda_strict        = True,   # raise on real CUDA failure; pass-through on Apply throttling
    compute_group      = 'GPU_Cluster',        # routes Apply to the GPU compute group
)

OAF environment setup for Tesla T4 (CUDA 12.x driver)

The OAF package mirror evolves independently of Vantage compute-node drivers. As of 2026, the mirror's default torch resolves to 2.11.0+cu130, which requires CUDA 13.0 — but current Lake GPU nodes run driver 12.7/12.8. Use an env that already carries torch 2.9.1+cu128 (e.g. bq20251218), or create a new one and install the cu128 build before any other torch package touches the env.

Required packages (install in order via tdml.get_env(env_name).install_lib([...])):

Package Why
torch==2.9.1 cu128 wheel — matches CUDA 12.x driver
sentence-transformers ST 5.x
pydantic>=2.0 ST 5.x requires pydantic v2; base image ships only v1
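
The install sequence above can be sketched with teradataml's user-environment API (one package at a time so the pinned torch wheel is resolved first; the env name is the one from the text, and the install is wrapped in a function because it needs a live Vantage session):

```python
# Packages in the required install order: the cu128 torch wheel must land
# before anything drags in the mirror's default cu130 build.
ORDERED_PACKAGES = [
    'torch==2.9.1',           # cu128 wheel — matches the CUDA 12.x driver
    'sentence-transformers',  # ST 5.x
    'pydantic>=2.0',          # ST 5.x requires pydantic v2
]

def provision_gpu_env(env_name='bq20251218'):
    """Install the packages, in order, into an existing OAF user environment."""
    import teradataml as tdml  # requires an active tdml.create_context(...)
    env = tdml.get_env(env_name)
    for pkg in ORDERED_PACKAGES:
        env.install_lib([pkg])  # install one at a time to preserve ordering
```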

Known compatibility patches (already baked into tds_vector_embedding_lake.py):

  • huggingface-hub version guard — the mirror pins hub at 1.2.3 but ST 5.x requires >=1.5.0. ST checks via importlib.metadata.version, not __version__, so both must be spoofed before importing ST.
  • types.UnionType in hub validator — transformers 5.x uses Python 3.10+ str | None union syntax in its config dataclasses. Hub 1.2.3's dataclasses.py type-validator has no handler for types.UnionType and raises TypeError: Unsupported type for field 'transformers_version': str | None. Fixed by registering types.UnionType in _BASIC_TYPE_VALIDATORS before model load.

Both patches are applied automatically inside the STO script; no user action needed. pydantic v2 must be installed in the env for ST 5.x's model-config schema to load.
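
For illustration, the version-guard idea boils down to intercepting importlib.metadata.version before sentence-transformers is imported. This is a simplified sketch of the mechanism, not the package's actual patch in tds_vector_embedding_lake.py:

```python
import importlib.metadata

_real_version = importlib.metadata.version

def _spoofed_version(name):
    # Report a hub version that satisfies sentence-transformers' >=1.5.0
    # check; every other distribution resolves normally.
    if name in ('huggingface-hub', 'huggingface_hub'):
        return '1.9.0'
    return _real_version(name)

# Must happen before `import sentence_transformers`, which reads the hub
# version via importlib.metadata.version rather than __version__.
importlib.metadata.version = _spoofed_version
```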

Validated configuration (Tesla T4, CUDA driver 12.7, VCL2 GPU_Cluster):

Component Version
torch 2.9.1+cu128
sentence-transformers 5.4.1
huggingface-hub (mirror) 1.2.3 (spoofed to 1.9.0 at runtime)
transformers 5.7.0
pydantic 2.13.3
BGE model BAAI/bge-small-en-v1.5 (384-dim)
Throughput ~1 000 rows / 69 s on a single T4

See demos/notebooks Demo OAF - Vector Embedding/ for the full provisioning notebooks.

Compute Vector Embeddings In-Database (BYOM)

from tdstone2.tdsgenai import compute_vector_embedding_byom

embeddings = compute_vector_embedding_byom(
    model='tdstone2_emb_384_intfloat_multilingual_e5_small',
    dataset=text_dataframe,
    text_column='content',
    accumulate_columns=['id'],
    schema_name=Param['database'],
    table_name='embeddings',
    primary_index=['id'],
)

Output: table with id + 384-dimensional embedding vector per row.
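
A common follow-up is a client-side similarity lookup over the stored vectors. A minimal numpy sketch on synthetic 384-dimensional embeddings (pulling the real table, e.g. via tdml.DataFrame(...).to_pandas(), is assumed; the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 rows of 384-dim embeddings, one per text row.
embeddings = rng.normal(size=(5, 384))

def top_k_cosine(query_vec, matrix, k=3):
    """Indices of the k rows most similar to query_vec by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity against every row
    return np.argsort(sims)[::-1][:k]  # highest similarity first

best = top_k_cosine(embeddings[0], embeddings)
# The query row itself is always the top match.
```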


Seq2Seq Models (Summarization / Translation / Language Detection)

from tdstone2.tdsgenai_seq2seq import install_seq2seq_model, run_seq2seq

# Install model (e.g., flan-t5-small for summarization)
install_seq2seq_model(
    model_name='google/flan-t5-small',
    model_task='summarization',
    upload=True,
)

# Run via Script Table Operator
results = run_seq2seq(
    dataset=text_dataframe,
    text_column='content',
    model='tdstone2_seq2seq_google_flan_t5_small',
    schema_name=Param['database'],
)

Lineage & Registry

sto.list_codes()                       # registered Python scripts/classes
sto.list_models()                      # model configs with all arguments
sto.list_mappers()                     # training/scoring/feature-engineering mappers
sto.list_hyper_models()                # hyper-segmented model registrations
sto.list_feature_engineering_models()  # feature engineering pipeline registrations

Local Validation / Debugging

Execute the generated model code locally before deploying to Vantage:

code_and_data = model.get_code_and_data(partition_id=1)

exec(code_and_data['code'])                          # defines the model class (here: MyModel)
local_model = MyModel(**code_and_data['arguments'])
df_local = code_and_data['data']
df_local['flag'] = df_local['flag'].astype('category')  # match in-database categorical handling
df_local['Y2'] = df_local['Y2'].astype('category')
local_model.fit(df_local)
local_model.score(df_local)

Demo Notebooks

The demos/ folder contains end-to-end worked examples:

Series Location Content
Core workflow demos/notebooks/ Data generation → setup → HyperModel → feature engineering → retrieval (01–16)
Scikit-learn models demos/notebooks Demo Hypermodel Scikit-Learn/ Classifier, Regressor, Anomaly, LassoLarsCV (STO path)
Scikit-learn models on VCL Apply demos/notebooks Demo Hypermodel Scikit-Learn with OAF/ Same flow on Vantage Cloud Lake via use_apply=True (VCL1 / VCL2 variants)
Script Table Operator demos/demo script csae/notebooks/ Raw STO usage with anomaly detection
BYOM Vector Embedding demos/notebooks Demo BYOM - Vector Embedding/ HuggingFace ONNX install + BYOM embedding
OAF Vector Embedding demos/notebooks Demo OAF - Vector Embedding/ HuggingFace model on OpenAF platform (VCL1 + VCL2 variants, dedicated tdstone2_embeddings env, CUDA pre-flight)
STO Vector Embedding demos/notebooks Demo STO - Vector Embedding/ Embedding via Script Table Operator
Seq2Seq demos/notebooks Demo Seq2Seq/ Summarization and language detection
Chat / Semantic Search demos/demo chat/notebooks/ Chunking, embedding, vector search

Repository Tables (Default Names)

Object Table
Code TDS_CODE_REPOSITORY
Model TDS_MODEL_REPOSITORY
Trained Model TDS_TRAINED_MODEL_REPOSITORY
Mapper TDS_MAPPER_REPOSITORY
HyperModel TDS_HYPER_MODEL_REPOSITORY
Feature Engineering TDS_FEATURE_ENGINEERING_PROCESS_REPOSITORY

All names are overridable via TDStone.__init__() kwargs.
