tdfs4ds — A Feature Store Library for Data Scientists working with ClearScape Analytics
tdfs4ds (Teradata Feature Store for Data Scientists) is a Python package for managing temporal feature stores in Teradata Vantage databases. It provides easy-to-use functions for creating, registering, storing, and retrieving features — with full time-travel support, lineage tracking, and process operationalization.
Installation
```shell
pip install tdfs4ds
```
Quick Start
Import tdfs4ds after establishing a teradataml connection so the package can auto-detect your default database:
```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
# tdfs4ds.SCHEMA is auto-set from the teradataml context;
# override if needed: tdfs4ds.SCHEMA = 'my_database'

# Data domain management — use the dedicated functions:
tdfs4ds.create_data_domain('MY_PROJECT')  # create and activate a new domain
# or
tdfs4ds.select_data_domain('MY_PROJECT')  # activate an existing domain
# or
tdfs4ds.get_data_domains()                # list all available domains (* marks the active one)
```
Core API
| Function | Description |
|---|---|
| `tdfs4ds.setup(database)` | Create feature catalog, process catalog, and follow-up tables in `database` |
| `tdfs4ds.upload_features(df, entity_id, feature_names, metadata={})` | Ingest features from a teradataml DataFrame into the feature store |
| `tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment='dataset')` | Assemble a dataset view from registered features |
| `tdfs4ds.run(process_id)` | Re-execute a registered feature engineering process |
| `tdfs4ds.roll_out(...)` | Operationalize processes at scale |
| `tdfs4ds.connect(database)` | Connect to an existing feature store |
`entity_id` must map each entity column to its SQL data type (a dict, not a list):

```python
entity_id = {'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'}  # correct
entity_id = ['CUSTOMER_ID', 'EVENT_DATE']                    # wrong
```
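If you want to catch this mistake early, a small guard like the following can be run before calling `upload_features` or `build_dataset`. This helper is hypothetical — it is not part of tdfs4ds — and only illustrates the dict-with-SQL-types contract described above.

```python
# Hypothetical helper (not part of tdfs4ds): validate an entity_id mapping
# before passing it to upload_features / build_dataset.
def validate_entity_id(entity_id):
    """Ensure entity_id maps column names to non-empty SQL type strings."""
    if not isinstance(entity_id, dict):
        raise TypeError(
            "entity_id must be a dict mapping column names to SQL types, "
            f"got {type(entity_id).__name__}"
        )
    for col, sql_type in entity_id.items():
        if not isinstance(sql_type, str) or not sql_type.strip():
            raise ValueError(f"Missing SQL type for entity column {col!r}")
    return entity_id

validate_entity_id({'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'})  # passes
```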
Walkthrough Example
Step 1 — Set up a feature store
```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
tdfs4ds.setup(database='my_database')
```
Step 2 — Configure the active context
```python
tdfs4ds.SCHEMA = 'my_database'               # override if not auto-detected

# Use dedicated functions to manage the data domain:
tdfs4ds.create_data_domain('DATA_QUALITY')   # create and activate (first time)
# tdfs4ds.select_data_domain('DATA_QUALITY') # activate an existing domain
# tdfs4ds.get_data_domains()                 # list all domains
```
Step 3 — Define your feature engineering view
```python
df = tdml.DataFrame(tdml.in_schema('my_database', 'my_feature_view'))

# If teradataml created intermediate views, make them permanent first:
# tdfs4ds.crystallize_view(df)
```
Step 4 — Upload and operationalize
```python
entity_id = {'EVENT_DT': 'DATE', 'ID': 'BIGINT'}
feature_names = ['KPI1', 'KPI2']

tdfs4ds.upload_features(
    df=df,
    entity_id=entity_id,
    feature_names=feature_names,
    metadata={'project': 'data quality'}
)
```
This registers entities and features (if not already present), registers a feature engineering process in the process catalog, and writes the feature values into the feature store.
Step 5 — Re-run a process
```python
# List all registered processes to find the process ID
tdfs4ds.process_catalog()

# Re-execute by process ID
tdfs4ds.run(process_id)
```
Step 6 — Build a dataset
```python
selected_features = {
    'KPI1': '<process_uuid>',
    'KPI2': '<process_uuid>',
}

dataset = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset',
    comment='Dataset for churn model'
)
```
selected_features maps each feature name to the UUID of the process that computed it.
Configuration
Programmatic (in-session)
```python
tdfs4ds.SCHEMA = 'my_database'       # target database (auto-set from context)
# Data domain: use tdfs4ds.create_data_domain() / select_data_domain() / get_data_domains()

tdfs4ds.FEATURE_STORE_TIME = None    # None = current; '2024-01-01 00:00:00' = time travel
tdfs4ds.DISPLAY_LOGS = True          # verbose logging
tdfs4ds.DEBUG_MODE = False
tdfs4ds.STORE_FEATURE = 'MERGE'      # 'MERGE' or 'UPDATE_INSERT'

# GenAI documentation
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'  # or 'bedrock'
tdfs4ds.INSTRUCT_MODEL_MODEL = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY = 'sk-...'   # prefer env var instead (see below)
```
Config file (persistent per-project or per-user)
Create a tdfs4ds.json file in your project directory (or ~/.tdfs4ds/config.json for user-wide defaults) to avoid repeating the setup cell in every notebook:
```json
{
  "schema": "MY_DATABASE",
  "data_domain": "MY_PROJECT",
  "display_logs": true,
  "store_feature": "MERGE",
  "varchar_size": 1024,
  "instruct_model_provider": "openai",
  "instruct_model_model": "gpt-4o",
  "instruct_model_url": null
}
```
Keys are case-insensitive. instruct_model_api_key is rejected from JSON config to prevent accidental commits — use a .env file or OS env var for credentials.
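The two rules above — case-insensitive keys and rejection of `instruct_model_api_key` — can be illustrated with a minimal loader sketch. This is not the actual tdfs4ds implementation, just a hypothetical stand-in showing the documented behaviour:

```python
import json

# Hypothetical sketch (not the real tdfs4ds loader): lowercases all keys and
# refuses to load an API key from JSON so credentials never end up committed.
def load_json_config(text):
    raw = json.loads(text)
    config = {key.lower(): value for key, value in raw.items()}
    if 'instruct_model_api_key' in config:
        raise ValueError(
            "instruct_model_api_key is not allowed in JSON config; "
            "use a .env file or the TDFS4DS_INSTRUCT_MODEL_API_KEY env var"
        )
    return config

cfg = load_json_config('{"Schema": "MY_DATABASE", "DISPLAY_LOGS": true}')
print(cfg)  # {'schema': 'MY_DATABASE', 'display_logs': True}
```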
.env file (local secrets and overrides)
Place a .env file in your project directory (or ~/.tdfs4ds/.env for user-wide defaults). Only TDFS4DS_* variables are read — the file is parsed without touching os.environ:
```text
TDFS4DS_SCHEMA=MY_DATABASE
TDFS4DS_DATA_DOMAIN=MY_PROJECT
TDFS4DS_INSTRUCT_MODEL_API_KEY=sk-...
TDFS4DS_INSTRUCT_MODEL_PROVIDER=openai
TDFS4DS_INSTRUCT_MODEL_MODEL=gpt-4o
```
Add .env to .gitignore to keep secrets out of source control. Quoted values and export KEY=VALUE syntax are supported.
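The parsing rules stated above (only `TDFS4DS_*` keys, quoted values, optional `export` prefix, `os.environ` untouched) can be sketched as a small pure-Python parser. This is an illustrative stand-in, not the library's actual code:

```python
# Hypothetical sketch (not the real tdfs4ds parser): keeps only TDFS4DS_*
# keys, accepts `export KEY=VALUE` and quoted values, and never writes to
# os.environ — results are returned as a plain dict.
def parse_dotenv(text):
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue  # skip blanks, comments, and malformed lines
        if line.startswith('export '):
            line = line[len('export '):]
        key, _, value = line.partition('=')
        key = key.strip()
        if not key.startswith('TDFS4DS_'):
            continue  # ignore variables that are not ours
        values[key] = value.strip().strip('"').strip("'")
    return values

env = parse_dotenv('export TDFS4DS_SCHEMA="MY_DATABASE"\nOTHER_VAR=ignored\n')
print(env)  # {'TDFS4DS_SCHEMA': 'MY_DATABASE'}
```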
Environment variables
All settings can also be set via TDFS4DS_<VAR_NAME> OS environment variables (useful in CI/CD):
| Variable | Corresponding setting |
|---|---|
| `TDFS4DS_SCHEMA` | `tdfs4ds.SCHEMA` |
| `TDFS4DS_DATA_DOMAIN` | `tdfs4ds.DATA_DOMAIN` |
| `TDFS4DS_DISPLAY_LOGS` | `tdfs4ds.DISPLAY_LOGS` |
| `TDFS4DS_DEBUG_MODE` | `tdfs4ds.DEBUG_MODE` |
| `TDFS4DS_STORE_FEATURE` | `tdfs4ds.STORE_FEATURE` |
| `TDFS4DS_VARCHAR_SIZE` | `tdfs4ds.VARCHAR_SIZE` |
| `TDFS4DS_INSTRUCT_MODEL_PROVIDER` | `tdfs4ds.INSTRUCT_MODEL_PROVIDER` |
| `TDFS4DS_INSTRUCT_MODEL_MODEL` | `tdfs4ds.INSTRUCT_MODEL_MODEL` |
| `TDFS4DS_INSTRUCT_MODEL_URL` | `tdfs4ds.INSTRUCT_MODEL_URL` |
| `TDFS4DS_INSTRUCT_MODEL_API_KEY` | `tdfs4ds.INSTRUCT_MODEL_API_KEY` |
load_config() — explicit reload
```python
# Reload from default search paths
tdfs4ds.load_config()

# Point at specific files
tdfs4ds.load_config(
    path='/configs/feature_store.json',
    dotenv_path='/project/.env.production',
)
```
Priority chain
```text
programmatic (tdfs4ds.X = value)
  > OS environment variable (TDFS4DS_X)
  > .env file (./.env or ~/.tdfs4ds/.env)
  > JSON config file (./tdfs4ds.json or ~/.tdfs4ds/config.json)
  > teradataml auto-detection (SCHEMA only)
  > built-in defaults
```
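In pseudocode, resolving one setting through this chain looks like the sketch below. It is a hypothetical illustration (not tdfs4ds internals), it omits the teradataml auto-detection step, and it assumes the `.env` and JSON dicts have already been normalised to lowercase keys:

```python
import os

# Hypothetical sketch: programmatic value wins, then the OS environment,
# then the parsed .env file, then the JSON config, then the default.
def resolve(name, programmatic=None, dotenv=None, json_cfg=None, default=None):
    if programmatic is not None:
        return programmatic
    env_value = os.environ.get(f'TDFS4DS_{name.upper()}')
    if env_value is not None:
        return env_value
    for source in (dotenv or {}, json_cfg or {}):
        if name.lower() in source:
            return source[name.lower()]
    return default

schema = resolve('schema', json_cfg={'schema': 'MY_DATABASE'}, default='DEMO_USER')
print(schema)  # MY_DATABASE (unless TDFS4DS_SCHEMA is set in your environment)
```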
Time Travel
All catalogs and feature stores are temporal. Point-in-time queries are available via:
```python
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'  # query historical state
tdfs4ds.FEATURE_STORE_TIME = None                   # back to current state
```
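Conceptually, this setting controls the temporal qualifier prepended to queries against the VALIDTIME tables. The sketch below is a hypothetical illustration of that mapping, not the library's actual query builder:

```python
# Hypothetical sketch: map FEATURE_STORE_TIME to a Teradata VALIDTIME
# qualifier — None means "current state", a timestamp pins the query to
# the historical state at that moment.
def validtime_prefix(feature_store_time):
    if feature_store_time is None:
        return 'CURRENT VALIDTIME'
    return f"VALIDTIME AS OF TIMESTAMP '{feature_store_time}'"

print(validtime_prefix(None))                   # CURRENT VALIDTIME
print(validtime_prefix('2024-01-01 00:00:00'))  # VALIDTIME AS OF TIMESTAMP '2024-01-01 00:00:00'
```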
Package Structure
```text
tdfs4ds/
├── __init__.py — Global config variables & re-exported public API
├── config.py — External config loading (JSON, .env, env vars); load_config()
├── lifecycle.py — setup(), connect()
├── execution.py — run(), upload_features(), roll_out()
├── catalog.py — feature_catalog(), process_catalog(), dataset_catalog()
├── data_domain.py — get_data_domains(), select_data_domain(), create_data_domain()
├── datasets.py — Utility dataset helpers
├── feature_store/
│   ├── entity_management.py — register_entity(), remove_entity()
│   ├── feature_data_processing.py — prepare_feature_ingestion(), store_feature(), apply_collect_stats()
│   ├── feature_query_retrieval.py — get_list_features(), get_available_features(), get_feature_versions()
│   └── feature_store_management.py — register_features(), feature_store_table_creation()
├── process_store/
│   ├── process_followup.py — followup_open(), followup_close(), follow_up_report()
│   ├── process_query_administration.py — list_processes(), get_process_id(), remove_process()
│   ├── process_registration_management.py — register_process_view()
│   └── process_store_catalog_management.py — process_store_catalog_creation()
├── dataset/
│   ├── builder.py — build_dataset(), build_dataset_opt(), augment_source_with_features()
│   ├── dataset.py — Dataset class
│   └── dataset_catalog.py — DatasetCatalog class
├── genai/
│   └── documentation.py — LLM-powered auto-documentation of SQL processes (OpenAI / Bedrock)
├── lineage/
│   ├── lineage.py — SQL query parsing, DDL analysis
│   ├── network.py — Dependency graph construction
│   └── indexing.py — Lineage indexing utilities
└── utils/
    ├── query_management.py — execute_query(), execute_query_wrapper()
    ├── filter_management.py — FilterManager class
    ├── time_management.py — TimeManager class
    ├── lineage.py — crystallize_view(), analyze_sql_query(), generate_view_dependency_network()
    ├── info.py — update_varchar_length(), get_column_types(), seconds_to_dhms()
    └── visualization.py — plot_graph(), visualize_graph(), display_table()
```
GenAI Documentation
The genai module provides two complementary ways to document the feature store.
LLM-powered process documentation
document_process() calls an LLM (OpenAI, Azure, vLLM, or AWS Bedrock) to generate:
- Business-logic description of the SQL query
- Entity description and per-column annotations
- EXPLAIN-plan quality score (1–5) with warnings and recommendations
```python
import tdfs4ds
from tdfs4ds.genai import document_process

# Configure the LLM (or use TDFS4DS_INSTRUCT_MODEL_* env vars / .env file)
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'
tdfs4ds.INSTRUCT_MODEL_MODEL = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY = 'sk-...'

process_info = document_process(process_id='<UUID>', show_explain_plan=True)
```
Business dictionary (no LLM required)
Two temporal tables store business-oriented descriptions for any database object or column — independently of the process documentation workflow.
| Table | Purpose |
|---|---|
| `FS_BUSINESS_DICTIONARY_OBJECTS` | One description per table or view (`OBJECT_TYPE`: `'T'`/`'V'`) |
| `FS_BUSINESS_DICTIONARY_COLUMNS` | One description per column |
Both tables are VALIDTIME temporal and provisioned automatically by tdfs4ds.connect(create_if_missing=True).
```python
import pandas as pd
from tdfs4ds.genai import upload_business_dictionary_objects, upload_business_dictionary_columns

# Object-level descriptions
upload_business_dictionary_objects(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'OBJECT_NAME'         : 'CUSTOMER',
        'OBJECT_TYPE'         : 'T',
        'BUSINESS_DESCRIPTION': 'Core customer table. Each row represents a unique enrolled customer.',
    },
    {
        'DATABASE_NAME'       : 'MY_DB',
        'OBJECT_NAME'         : 'V_CUSTOMER_ORDERS',
        'OBJECT_TYPE'         : 'V',
        'BUSINESS_DESCRIPTION': 'View joining customers with their order history for reporting.',
    },
]))

# Column-level descriptions
upload_business_dictionary_columns(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'TABLE_NAME'          : 'CUSTOMER',
        'COLUMN_NAME'         : 'CUSTOMER_ID',
        'BUSINESS_DESCRIPTION': 'Unique customer identifier assigned at enrolment.',
    },
    {
        'DATABASE_NAME'       : 'MY_DB',
        'TABLE_NAME'          : 'CUSTOMER',
        'COLUMN_NAME'         : 'BIRTH_DATE',
        'BUSINESS_DESCRIPTION': 'Customer date of birth, used for age-band segmentation.',
    },
]))
```
Both functions validate that all required columns are present and perform a CURRENT VALIDTIME MERGE — re-running them updates existing descriptions and preserves the full change history.
Discover Registered Features
```python
from tdfs4ds.feature_store.feature_query_retrieval import (
    get_list_entity,
    get_list_features,
    get_available_features,
    get_feature_versions,
)
```
Lineage
The lineage module builds end-to-end dependency graphs from a SQL query or a dataset view DDL.
Dependency graph
```python
from tdfs4ds.lineage import build_teradata_dependency_graph, plot_lineage_sankey, show_plotly_robust

# Start from a dataset view DDL (obtained via SHOW VIEW)
sql = tdml.execute_sql("SHOW VIEW DATASET_CUSTOMER").fetchall()[0][0]

graph = build_teradata_dependency_graph(sql_query=sql)
# Returns: {"nodes": {...}, "edges": [...], "roots": [...]}
```
By default (`expand_datasets_via_process_catalog=True`), dataset nodes are resolved through the process catalog: `FEATURE_VERSION` UUIDs embedded in the dataset DDL are matched to `PROCESS_ID` entries in `FS_V_PROCESS_CATALOG`, and edges are drawn directly to the registered feature-engineering views:

```text
DATASET_CUSTOMER → FEAT_ENG_CUST → DB_SOURCE.TRANSACTIONS
```

Set `expand_datasets_via_process_catalog=False` to connect the dataset directly to the raw feature-store storage tables (previous behaviour).
```python
fig = plot_lineage_sankey(graph, title="Customer Dataset Lineage")
show_plotly_robust(fig)
```
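Once you have the `{"nodes": ..., "edges": ..., "roots": ...}` dict, you can also traverse it with plain Python. The sketch below is hypothetical — it assumes each edge is a `{"from": ..., "to": ...}` dict, matching the edge shape shown in the migration manifest section, and follows a simple linear chain:

```python
# Hypothetical sketch: walk a lineage graph of the documented shape,
# following one outgoing edge per node to print the dependency chain.
def downstream_chain(graph, start):
    """Return the chain of objects reachable from `start` via edges."""
    targets = {edge['from']: edge['to'] for edge in graph['edges']}
    chain = [start]
    while chain[-1] in targets:
        chain.append(targets[chain[-1]])
    return chain

graph = {
    'nodes': {},
    'edges': [
        {'from': 'DATASET_CUSTOMER', 'to': 'FEAT_ENG_CUST'},
        {'from': 'FEAT_ENG_CUST', 'to': 'DB_SOURCE.TRANSACTIONS'},
    ],
    'roots': ['DATASET_CUSTOMER'],
}
print(' → '.join(downstream_chain(graph, 'DATASET_CUSTOMER')))
# DATASET_CUSTOMER → FEAT_ENG_CUST → DB_SOURCE.TRANSACTIONS
```

Note the one-outgoing-edge assumption: a real graph with branching would need a breadth-first traversal instead.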
Migration manifest
`graph_to_migration_manifest` converts any lineage graph into a flat, JSON-serialisable dict — useful for planning a feature store migration.
```python
from tdfs4ds.lineage import graph_to_migration_manifest
import json

# All databases
manifest = graph_to_migration_manifest(graph)

# Scoped to the feature store schema only (cross-boundary edges excluded)
manifest_fs = graph_to_migration_manifest(graph, filter_database=tdfs4ds.SCHEMA)

print(json.dumps(manifest_fs, indent=2))
# {
#   "views": [{"database": "demo_user", "name": "DATASET_CUSTOMER", "type": "dataset"},
#             {"database": "demo_user", "name": "FEAT_ENG_CUST", "type": "view"}],
#   "tables": [],
#   "edges": [{"from": "demo_user.DATASET_CUSTOMER", "to": "demo_user.FEAT_ENG_CUST"}]
# }

with open("migration_manifest.json", "w") as f:
    json.dump(manifest_fs, f, indent=2)
```
Requirements
- Python >= 3.6
- teradataml >= 17.20
- Active Teradata Vantage connection
- VALIDTIME temporal tables must be enabled on the Teradata Vantage system; all feature catalogs, process catalogs, and feature stores rely on VALIDTIME support