A package for the Script Table Operator that applies set theory to machine learning in Python.

Project description

tdstone2 Package

Overview

tdstone2 operationalizes Python code for machine learning and data analysis on Teradata Vantage using the Script Table Operator (STO). It leverages Teradata's MPP architecture to run hundreds of Python scripts in parallel across hundreds of data partitions — enabling hyper-segmented model deployment with full lineage, versioning, and minimal data movement.

Features

  • Hyper-segmented Model Deployment: Train one independent model per data partition in parallel across Teradata AMPs
  • Scikit-learn Pipeline Integration: Auto-generate STO scripts from sklearn Pipelines (classifier, regressor, anomaly detection)
  • Feature Engineering: Deploy custom or reducer-based feature engineering per partition
  • Vector Embeddings: Install HuggingFace/ONNX models and compute embeddings in-database
  • Seq2Seq Inference: Deploy summarization and translation models (e.g., flan-t5) via STO
  • Model Lineage & Versioning: Temporal tables track every version of every trained model
  • Two Execution Backends: Script Table Operator (Vantage Enterprise + Vantage Cloud Enterprise) and Apply (Vantage Cloud Lake / OAF) — same HyperModel API, picked via use_apply=True
  • Per-batch Timing Instrumentation: every scored row carries SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME, PROCESS_TIME, PRINT_TIME, and BATCH_NO, so a single SELECT shows which phase dominated wall-clock time in each partition
  • HTML Reports: train(report=True) / score(report=True) emit inline reports — partition-duration histograms, top-10 longest/fastest, input/output row-count distributions, model-size histogram

Installation

pip install tdstone2

Requires access to a Teradata Vantage system with the Script Table Operator enabled.


Quick Start

1. Generate Test Data

import teradataml as tdml
from tdstone2.dataset_generation import GenerateEquallyDistributedDataSet

tdml.create_context(**Param)  # Param = {'host': ..., 'user': ..., 'password': ...}

# Generate a synthetic partitioned dataset (21.6M rows, 216 partitions)
dataset = GenerateEquallyDistributedDataSet(n_partitions=216, n_rows=100000)
dataset.to_sql('dataset_00', schema_name=Param['database'])

Output schema: Partition_ID, ID, X1–X9 (features), Y1, Y2 (targets), flag, FOLD

2. Initialize the Framework

from tdstone2.tdstone import TDStone

sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'])
sto.setup()  # creates repository tables + installs STO files in Vantage

For Vantage Cloud Lake (OpenAF / STO-via-OAF):

sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'], oaf_env='my_env')
sto.setup()

For Vantage Cloud Lake via the Apply backend (recommended on Lake — STO is not always available there):

sto = TDStone(
    schema_name    = Param['database'],
    use_apply      = True,                  # route train/score/FE through tdml.Apply
    apply_env_name = 'tdstone2_sklearn',    # OAF user-environment with sklearn installed
    compute_group  = 'CG_BusGrpB_ANL',      # sets QueryBand 'compute=...' for ACC routing
    connect_kwargs = {                      # explicit cluster pinning (avoids the
        'host':     Param['host'],          # multi-cluster trap when several VCL_*_HOST
        'user':     Param['user'],          # env vars are set at once)
        'password': Param['password'],
        'database': Param['database'],
    },
)
sto.setup()  # creates OFS-resident repository tables + installs Apply scripts in the env

The Apply path stores all repository tables with STORAGE = TD_OFSSTORAGE, replaces the PERIOD FOR ValidPeriod temporal column with a plain CREATION_DATE (OFS rejects temporal periods), and runs the analogous tds_*_apply.py scripts inside the user environment. The public HyperModel / FeatureEngineering API is identical to the STO path.


Hyper-segmented Models

Deploy a Scikit-learn Classifier

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tdstone2.tdshypermodel import HyperModel

steps = [
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y2',
    "column_categorical": ['flag', 'Y2'],
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'flag']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'my_project'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    convert_to_onnx=False,   # set True for ONNX export
    store_pickle=True,
)

model.train()   # trains 216 independent models in parallel (one per partition)
model.score()   # scores all data using each partition's model

Deploy a Scikit-learn Regressor

from sklearn.ensemble import RandomForestRegressor

steps = [
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y1',
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'regression'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()

Deploy Anomaly Detection (OneClassSVM)

from sklearn.svm import OneClassSVM

steps = [
    ('scaler', StandardScaler()),
    ('anomaly', OneClassSVM(nu=0.05))
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'anomaly_detection'},
    skl_pipeline_steps=steps,
    model_parameters={"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()
# Output: anomaly flag (1=inlier, -1=outlier), decision_function, anomaly_score
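
Once scored, the output can be pulled client-side and filtered on the anomaly flag. A minimal sketch on a synthetic sample (the column names anomaly and anomaly_score mirror the output described above, but check get_model_predictions() for the actual names in your release):

```python
import pandas as pd

# Hypothetical sample mimicking the scored output: 1 = inlier, -1 = outlier.
scored = pd.DataFrame({
    'ID':            [1, 2, 3, 4, 5],
    'Partition_ID':  [1, 1, 1, 2, 2],
    'anomaly':       [1, 1, -1, 1, -1],
    'anomaly_score': [0.12, 0.30, -0.45, 0.22, -0.10],
})

# Keep only the rows OneClassSVM flagged as outliers.
outliers = scored[scored['anomaly'] == -1]

# Outlier rate per partition — a quick sanity check against nu=0.05.
rate = scored.groupby('Partition_ID')['anomaly'].apply(lambda s: (s == -1).mean())
```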

Deploy LassoLarsCV (Feature Selection + Regression)

from sklearn.linear_model import LassoLarsCV

steps = [
    ('scaler', StandardScaler()),
    ('lasso', LassoLarsCV())
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'lasso_regression'},
    skl_pipeline_steps=steps,
    model_parameters={"target": 'Y1', "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()

Retrieve Predictions

# Denormalized view: predictions joined with original features
predictions = model.get_model_predictions()

# Raw (normalized) predictions table
predictions_raw = model.get_model_predictions(denormalized_view=False)

# Include per-batch timing columns (SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME,
# PROCESS_TIME, PRINT_TIME, BATCH_NO) — useful for diagnosing per-partition slowness
predictions_with_timing = model.get_model_predictions(include_timing=True)

# Trained model metadata
trained_models = model.get_trained_models()
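
With include_timing=True, a quick client-side aggregation shows which phase dominates in each partition. A sketch on a synthetic sample (the timing column names match the real output; the partition column name Partition_ID is assumed from the Quick Start dataset, and the real data would come from predictions_with_timing.to_pandas()):

```python
import pandas as pd

# Synthetic stand-in for predictions_with_timing.to_pandas():
# partition 1 is dominated by interpreter start-up, partition 2 by scoring.
timing = pd.DataFrame({
    'Partition_ID': [1, 1, 2, 2],
    'IMPORT_TIME':  [2.5, 2.5, 0.9, 0.9],
    'LOAD_TIME':    [0.3, 0.3, 0.2, 0.2],
    'PROCESS_TIME': [1.5, 1.6, 6.2, 6.0],
    'PRINT_TIME':   [0.1, 0.1, 0.1, 0.1],
})

phases = ['IMPORT_TIME', 'LOAD_TIME', 'PROCESS_TIME', 'PRINT_TIME']

# Mean time per phase for each partition, plus the dominant phase.
per_partition = timing.groupby('Partition_ID')[phases].mean()
dominant = per_partition.idxmax(axis=1)
```

A high IMPORT_TIME share typically points at per-call interpreter and library start-up cost rather than model complexity.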

Inline HTML Reports

Pass report=True to train() or score() to render an inline HTML report with partition-duration histograms, top-10 longest/fastest partitions, input/output row-count distributions, and (for training) a model-size histogram:

model.train(report=True)
model.score(report=True)

Inspect the Underlying SQL

# View the generated Script Table Operator SQL
print(model.mapper_scoring.generate_sto_query())

Reload an Existing Hyper-segmented Model

# List all registered hyper-models
sto.list_hyper_models()

# Reload by UUID
existing_model = HyperModel(tdstone=sto)
existing_model.download(id='0286d259-ecde-4cd0-ae4a-bcb3191383d1')

# Retrain and rescore (no code needed — everything is stored in Vantage)
existing_model.train()
existing_model.score()

Feature Engineering

Dimensionality Reduction (Reducer)

from tdstone2.tdsfeature_engineering import FeatureEngineering

fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering reducer',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    metadata={'project': 'pca_reduce'},
)
fe.reduce()
reduced = fe.get_reduced_features()

Custom Feature Engineering

Provide a Python script that computes new columns from existing ones:

fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering',
    script_path='path/to/Feature_Interactions.py',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    metadata={'project': 'interactions'},
)
fe.transform()

# All original + derived features denormalized
features = fe.get_computed_features()

# Raw (normalized) features table
features_raw = fe.get_computed_features(denormalized_view=False)

Vector Embeddings

Install a HuggingFace Model (ONNX)

from tdstone2.tdsgenai import (
    install_model_in_vantage_from_name,
    customize_existing_model,
    install_zip_in_vantage,
    list_installed_files_byom,
    get_model_dimension,
)

# Download, convert to ONNX, customize tokenizer, upload
install_model_in_vantage_from_name(
    model_name='intfloat/multilingual-e5-small',
    model_task='sentence-similarity',
    upload=True,
    generate_zip=True,
)

list_installed_files_byom()  # verify installation
get_model_dimension(model_name='intfloat/multilingual-e5-small')  # e.g., 384

For Vantage Cloud Lake (OAF):

from tdstone2.tdsgenai_lake import install_model_in_vantage_from_name
install_model_in_vantage_from_name(model_name='intfloat/multilingual-e5-small', ...)

Compute embeddings on Vantage Cloud Lake (OAF)

from tdstone2.tdsgenai_lake import compute_vector_embedding

# device='cuda' triggers a pre-flight Apply probe that verifies torch.cuda.is_available()
# in the OAF env *before* launching the real call. With cuda_strict=True (default), the
# call raises RuntimeError with an actionable diagnostic instead of silently falling back
# to CPU when torch is built for a different CUDA major than the cluster's NVIDIA driver
# supports (e.g. cu13 wheel vs CUDA 12 driver).
embeddings = compute_vector_embedding(
    model_name         = 'models--BAAI--bge-small-en-v1.5',
    dataset            = text_dataframe,
    text_column        = 'txt',
    accumulate_columns = ['id'],
    hash_columns       = ['id'],
    oaf_env            = 'bq20251218',         # see OAF env setup below
    schema_name        = Param['database'],
    table_name         = 'embeddings_bge',
    device             = 'cuda',
    cuda_strict        = True,   # raise on real CUDA failure; pass-through on Apply throttling
    compute_group      = 'GPU_Cluster',        # routes Apply to the GPU compute group
)

OAF environment setup for Tesla T4 (CUDA 12.x driver)

The OAF package mirror evolves independently of Vantage compute-node drivers. As of 2026, the mirror's default torch resolves to 2.11.0+cu130, which requires CUDA 13.0 — but current Lake GPU nodes run driver 12.7/12.8. Use an env that already carries torch 2.9.1+cu128 (e.g. bq20251218), or create a new one and install the cu128 build before any other torch package touches the env.

Required packages (install in order via tdml.get_env(env_name).install_lib([...])):

Package Why
torch==2.9.1 cu128 wheel — matches CUDA 12.x driver
sentence-transformers ST 5.x
pydantic>=2.0 ST 5.x requires pydantic v2; base image ships only v1
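
The install sequence above can be sketched with teradataml's user-environment API (one package at a time so the pinned torch wheel is resolved first; the env name is the one from the text, and the install is wrapped in a function because it needs a live Vantage session):

```python
# Packages in the required install order: the cu128 torch wheel must land
# before anything drags in the mirror's default cu130 build.
ORDERED_PACKAGES = [
    'torch==2.9.1',           # cu128 wheel — matches the CUDA 12.x driver
    'sentence-transformers',  # ST 5.x
    'pydantic>=2.0',          # ST 5.x requires pydantic v2
]

def provision_gpu_env(env_name='bq20251218'):
    """Install the packages, in order, into an existing OAF user environment."""
    import teradataml as tdml  # requires an active tdml.create_context(...)
    env = tdml.get_env(env_name)
    for pkg in ORDERED_PACKAGES:
        env.install_lib([pkg])  # install one at a time to preserve ordering
```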

Known compatibility patches (already baked into tds_vector_embedding_lake.py):

  • huggingface-hub version guard — the mirror pins hub at 1.2.3 but ST 5.x requires >=1.5.0. ST checks via importlib.metadata.version, not __version__, so both must be spoofed before importing ST.
  • types.UnionType in hub validator — transformers 5.x uses Python 3.10+ str | None union syntax in its config dataclasses. Hub 1.2.3's dataclasses.py type-validator has no handler for types.UnionType and raises TypeError: Unsupported type for field 'transformers_version': str | None. Fixed by registering types.UnionType in _BASIC_TYPE_VALIDATORS before model load.

Both patches are applied automatically inside the STO script; no user action needed. pydantic v2 must be installed in the env for ST 5.x's model-config schema to load.
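
For illustration, the version-guard idea boils down to intercepting importlib.metadata.version before sentence-transformers is imported. This is a simplified sketch of the mechanism, not the package's actual patch in tds_vector_embedding_lake.py:

```python
import importlib.metadata

_real_version = importlib.metadata.version

def _spoofed_version(name):
    # Report a hub version that satisfies sentence-transformers' >=1.5.0
    # check; every other distribution resolves normally.
    if name in ('huggingface-hub', 'huggingface_hub'):
        return '1.9.0'
    return _real_version(name)

# Must happen before `import sentence_transformers`, which reads the hub
# version via importlib.metadata.version rather than __version__.
importlib.metadata.version = _spoofed_version
```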

Validated configuration (Tesla T4, CUDA driver 12.7, VCL2 GPU_Cluster):

Component Version
torch 2.9.1+cu128
sentence-transformers 5.4.1
huggingface-hub (mirror) 1.2.3 (spoofed to 1.9.0 at runtime)
transformers 5.7.0
pydantic 2.13.3
BGE model BAAI/bge-small-en-v1.5 (384-dim)
Throughput ~1 000 rows / 69 s on a single T4

See demos/notebooks Demo OAF - Vector Embedding/ for the full provisioning notebooks.

Compute Vector Embeddings In-Database (BYOM)

from tdstone2.tdsgenai import compute_vector_embedding_byom

embeddings = compute_vector_embedding_byom(
    model='tdstone2_emb_384_intfloat_multilingual_e5_small',
    dataset=text_dataframe,
    text_column='content',
    accumulate_columns=['id'],
    schema_name=Param['database'],
    table_name='embeddings',
    primary_index=['id'],
)

Output: table with id + 384-dimensional embedding vector per row.
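
A common follow-up is a client-side similarity lookup over the stored vectors. A minimal numpy sketch on synthetic 384-dimensional embeddings (pulling the real table, e.g. via tdml.DataFrame(...).to_pandas(), is assumed; the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 rows of 384-dim embeddings, one per text row.
embeddings = rng.normal(size=(5, 384))

def top_k_cosine(query_vec, matrix, k=3):
    """Indices of the k rows most similar to query_vec by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q                       # cosine similarity against every row
    return np.argsort(sims)[::-1][:k]  # highest similarity first

best = top_k_cosine(embeddings[0], embeddings)
# The query row itself is always the top match.
```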


Seq2Seq Models (Summarization / Translation / Language Detection)

from tdstone2.tdsgenai_seq2seq import install_seq2seq_model, run_seq2seq

# Install model (e.g., flan-t5-small for summarization)
install_seq2seq_model(
    model_name='google/flan-t5-small',
    model_task='summarization',
    upload=True,
)

# Run via Script Table Operator
results = run_seq2seq(
    dataset=text_dataframe,
    text_column='content',
    model='tdstone2_seq2seq_google_flan_t5_small',
    schema_name=Param['database'],
)

Lineage & Registry

sto.list_codes()                       # registered Python scripts/classes
sto.list_models()                      # model configs with all arguments
sto.list_mappers()                     # training/scoring/feature-engineering mappers
sto.list_hyper_models()                # hyper-segmented model registrations
sto.list_feature_engineering_models()  # feature engineering pipeline registrations

Local Validation / Debugging

Execute the generated model code locally before deploying to Vantage:

code_and_data = model.get_code_and_data(partition_id=1)

exec(code_and_data['code'])                          # defines the model class (here: MyModel)
local_model = MyModel(**code_and_data['arguments'])
df_local = code_and_data['data']
df_local['flag'] = df_local['flag'].astype('category')  # match in-database categorical handling
df_local['Y2'] = df_local['Y2'].astype('category')
local_model.fit(df_local)
local_model.score(df_local)

Demo Notebooks

The demos/ folder contains end-to-end worked examples:

Series Location Content
Core workflow demos/notebooks/ Data generation → setup → HyperModel → feature engineering → retrieval (01–16)
Scikit-learn models demos/notebooks Demo Hypermodel Scikit-Learn/ Classifier, Regressor, Anomaly, LassoLarsCV (STO path)
Scikit-learn models on VCL Apply demos/notebooks Demo Hypermodel Scikit-Learn with OAF/ Same flow on Vantage Cloud Lake via use_apply=True (VCL1 / VCL2 variants)
Script Table Operator demos/demo script csae/notebooks/ Raw STO usage with anomaly detection
BYOM Vector Embedding demos/notebooks Demo BYOM - Vector Embedding/ HuggingFace ONNX install + BYOM embedding
OAF Vector Embedding demos/notebooks Demo OAF - Vector Embedding/ HuggingFace model on OpenAF platform (VCL1 + VCL2 variants, dedicated tdstone2_embeddings env, CUDA pre-flight)
STO Vector Embedding demos/notebooks Demo STO - Vector Embedding/ Embedding via Script Table Operator
Seq2Seq demos/notebooks Demo Seq2Seq/ Summarization and language detection
Chat / Semantic Search demos/demo chat/notebooks/ Chunking, embedding, vector search

Repository Tables (Default Names)

Object Table
Code TDS_CODE_REPOSITORY
Model TDS_MODEL_REPOSITORY
Trained Model TDS_TRAINED_MODEL_REPOSITORY
Mapper TDS_MAPPER_REPOSITORY
HyperModel TDS_HYPER_MODEL_REPOSITORY
Feature Engineering TDS_FEATURE_ENGINEERING_PROCESS_REPOSITORY

All names are overridable via TDStone.__init__() kwargs.
