A Python package to simplify the use of a feature store on Teradata Vantage.
tdfs4ds — A Feature Store Library for Data Scientists working with ClearScape Analytics
tdfs4ds (Teradata Feature Store for Data Scientists) is a Python package for managing temporal feature stores in Teradata Vantage databases. It provides easy-to-use functions for creating, registering, storing, and retrieving features — with full time-travel support, lineage tracking, and process operationalization.
Installation
pip install tdfs4ds
Quick Start
Import tdfs4ds after establishing a teradataml connection so the package can auto-detect your default database:
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)
import tdfs4ds
# tdfs4ds.SCHEMA is auto-set from the teradataml context;
# override if needed: tdfs4ds.SCHEMA = 'my_database'
# Data domain management — use the dedicated functions:
tdfs4ds.create_data_domain('MY_PROJECT') # create and activate a new domain
# or
tdfs4ds.select_data_domain('MY_PROJECT') # activate an existing domain
# or
tdfs4ds.get_data_domains() # list all available domains (* marks the active one)
Core API
| Function | Description |
|---|---|
| tdfs4ds.setup(database) | Create the feature catalog, process catalog, and follow-up tables in database |
| tdfs4ds.upload_features(df, entity_id, feature_names, metadata={}) | Ingest features from a teradataml DataFrame into the feature store |
| tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment='dataset') | Assemble a dataset view from registered features |
| tdfs4ds.run(process_id) | Re-execute a registered feature engineering process |
| tdfs4ds.roll_out(...) | Operationalize processes at scale |
| tdfs4ds.connect(database) | Connect to an existing feature store |
entity_id must map each key column to its SQL data type (a dict, not a list):
entity_id = {'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'} # correct
entity_id = ['CUSTOMER_ID', 'EVENT_DATE'] # wrong
Walkthrough Example
Step 1 — Set up a feature store
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)
import tdfs4ds
tdfs4ds.setup(database='my_database')
Step 2 — Configure the active context
tdfs4ds.SCHEMA = 'my_database' # override if not auto-detected
# Use dedicated functions to manage the data domain:
tdfs4ds.create_data_domain('DATA_QUALITY') # create and activate (first time)
# tdfs4ds.select_data_domain('DATA_QUALITY') # activate an existing domain
# tdfs4ds.get_data_domains() # list all domains
Step 3 — Define your feature engineering view
df = tdml.DataFrame(tdml.in_schema('my_database', 'my_feature_view'))
# If teradataml created intermediate views, make them permanent first:
# tdfs4ds.crystallize_view(df)
Step 4 — Upload and operationalize
entity_id = {'EVENT_DT': 'DATE', 'ID': 'BIGINT'}
feature_names = ['KPI1', 'KPI2']
tdfs4ds.upload_features(
df=df,
entity_id=entity_id,
feature_names=feature_names,
metadata={'project': 'data quality'}
)
This registers entities and features (if not already present), registers a feature engineering process in the process catalog, and writes the feature values into the feature store.
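A quick way to confirm the registration is to inspect the catalogs right after the upload. A minimal sketch, assuming feature_catalog() and process_catalog() (listed in the package structure below) are exported at the package level like the other catalog helpers:
# Hedged sketch: check what upload_features() registered
print(tdfs4ds.feature_catalog())   # registered entities and features
print(tdfs4ds.process_catalog())   # registered feature engineering processes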
Step 5 — Re-run a process
# List all registered processes to find the process ID
tdfs4ds.process_catalog()
# Re-execute by process ID
tdfs4ds.run(process_id)
Step 6 — Build a dataset
selected_features = {
'KPI1': '<process_uuid>',
'KPI2': '<process_uuid>',
}
dataset = tdfs4ds.build_dataset(
entity_id={'ID': 'BIGINT'},
selected_features=selected_features,
view_name='my_dataset',
comment='Dataset for churn model'
)
selected_features maps each feature name to the UUID of the process that computed it.
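One way to find those UUIDs is the process catalog. A minimal sketch, assuming the catalog lists a PROCESS_ID column (the UUID) alongside the feature engineering view it registered:
# Hedged sketch: locate the process UUID to reference in selected_features
proc = tdfs4ds.process_catalog()
print(proc)   # look up the PROCESS_ID (UUID) of the process that computed each feature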
Configuration
Programmatic (in-session)
tdfs4ds.SCHEMA = 'my_database' # target database (auto-set from context)
# Data domain: use tdfs4ds.create_data_domain() / select_data_domain() / get_data_domains()
tdfs4ds.FEATURE_STORE_TIME = None # None = current; '2024-01-01 00:00:00' = time travel
tdfs4ds.DISPLAY_LOGS = True # verbose logging
tdfs4ds.DEBUG_MODE = False
tdfs4ds.STORE_FEATURE = 'MERGE' # 'MERGE' or 'UPDATE_INSERT'
# GenAI documentation
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai' # or 'bedrock', 'vllm', 'azure'
tdfs4ds.INSTRUCT_MODEL_MODEL = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY = 'sk-...' # prefer env var instead (see below)
# Embedding model (consumer agent vector index — falls back to INSTRUCT_MODEL_* if unset)
tdfs4ds.EMBEDDING_MODEL_PROVIDER = 'vllm'
tdfs4ds.EMBEDDING_MODEL_URL = 'https://api.example.com/v1/e5'
tdfs4ds.EMBEDDING_MODEL_MODEL = 'text-embedding-3-small'
tdfs4ds.EMBEDDING_MODEL_DIM = 1536
# Chroma vector store
tdfs4ds.CHROMA_MODE = 'local' # 'local' or 'server'
tdfs4ds.CHROMA_PATH = './tdfs4ds_chroma' # persist directory (local mode)
Config file (persistent per-project or per-user)
Create a tdfs4ds.json file in your project directory (or ~/.tdfs4ds/config.json for user-wide defaults) to avoid repeating the setup cell in every notebook:
{
"schema": "MY_DATABASE",
"data_domain": "MY_PROJECT",
"display_logs": true,
"store_feature": "MERGE",
"varchar_size": 1024,
"instruct_model_provider": "openai",
"instruct_model_model": "gpt-4o",
"instruct_model_url": null,
"embedding_model_provider": "vllm",
"embedding_model_model": "text-embedding-3-small",
"embedding_model_dim": 1536,
"chroma_mode": "local",
"chroma_path": "./tdfs4ds_chroma"
}
Keys are case-insensitive. instruct_model_api_key is rejected from JSON config to prevent accidental commits — use a .env file or OS env var for credentials.
.env file (local secrets and overrides)
Place a .env file in your project directory (or ~/.tdfs4ds/.env for user-wide defaults). Only TDFS4DS_* variables are read — the file is parsed without touching os.environ:
TDFS4DS_SCHEMA=MY_DATABASE
TDFS4DS_DATA_DOMAIN=MY_PROJECT
TDFS4DS_INSTRUCT_MODEL_API_KEY=sk-...
TDFS4DS_INSTRUCT_MODEL_PROVIDER=openai
TDFS4DS_INSTRUCT_MODEL_MODEL=gpt-4o
TDFS4DS_EMBEDDING_MODEL_PROVIDER=vllm
TDFS4DS_EMBEDDING_MODEL_URL=https://api.example.com/v1/e5
TDFS4DS_EMBEDDING_MODEL_MODEL=text-embedding-3-small
TDFS4DS_EMBEDDING_MODEL_DIM=1536
TDFS4DS_CHROMA_MODE=local
TDFS4DS_CHROMA_PATH=./tdfs4ds_chroma
Add .env to .gitignore to keep secrets out of source control. Quoted values and export KEY=VALUE syntax are supported.
Environment variables
All settings can also be set via TDFS4DS_<VAR_NAME> OS environment variables (useful in CI/CD):
| Variable | Corresponding setting |
|---|---|
| TDFS4DS_SCHEMA | tdfs4ds.SCHEMA |
| TDFS4DS_DATA_DOMAIN | tdfs4ds.DATA_DOMAIN |
| TDFS4DS_DISPLAY_LOGS | tdfs4ds.DISPLAY_LOGS |
| TDFS4DS_DEBUG_MODE | tdfs4ds.DEBUG_MODE |
| TDFS4DS_STORE_FEATURE | tdfs4ds.STORE_FEATURE |
| TDFS4DS_VARCHAR_SIZE | tdfs4ds.VARCHAR_SIZE |
| TDFS4DS_INSTRUCT_MODEL_PROVIDER | tdfs4ds.INSTRUCT_MODEL_PROVIDER |
| TDFS4DS_INSTRUCT_MODEL_MODEL | tdfs4ds.INSTRUCT_MODEL_MODEL |
| TDFS4DS_INSTRUCT_MODEL_URL | tdfs4ds.INSTRUCT_MODEL_URL |
| TDFS4DS_INSTRUCT_MODEL_API_KEY | tdfs4ds.INSTRUCT_MODEL_API_KEY |
| TDFS4DS_EMBEDDING_MODEL_PROVIDER | tdfs4ds.EMBEDDING_MODEL_PROVIDER |
| TDFS4DS_EMBEDDING_MODEL_MODEL | tdfs4ds.EMBEDDING_MODEL_MODEL |
| TDFS4DS_EMBEDDING_MODEL_URL | tdfs4ds.EMBEDDING_MODEL_URL |
| TDFS4DS_EMBEDDING_MODEL_API_KEY | tdfs4ds.EMBEDDING_MODEL_API_KEY |
| TDFS4DS_EMBEDDING_MODEL_DIM | tdfs4ds.EMBEDDING_MODEL_DIM |
| TDFS4DS_CHROMA_MODE | tdfs4ds.CHROMA_MODE |
| TDFS4DS_CHROMA_PATH | tdfs4ds.CHROMA_PATH |
| TDFS4DS_CHROMA_HOST | tdfs4ds.CHROMA_HOST |
| TDFS4DS_CHROMA_PORT | tdfs4ds.CHROMA_PORT |
load_config() — explicit reload
# Reload from default search paths
tdfs4ds.load_config()
# Point at specific files
tdfs4ds.load_config(
path='/configs/feature_store.json',
dotenv_path='/project/.env.production',
)
Priority chain
programmatic (tdfs4ds.X = value)
> OS environment variable (TDFS4DS_X)
> .env file (./.env or ~/.tdfs4ds/.env)
> JSON config file (./tdfs4ds.json or ~/.tdfs4ds/config.json)
> teradataml auto-detection (SCHEMA only)
> built-in defaults
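In practice this means a programmatic assignment always wins over an OS environment variable, which in turn wins over the .env and JSON files. A minimal sketch of that behaviour, assuming load_config() re-applies the full chain shown above:
import os
import tdfs4ds
os.environ['TDFS4DS_SCHEMA'] = 'CI_DATABASE'   # OS env var: beats .env and tdfs4ds.json
tdfs4ds.load_config()                          # SCHEMA now resolves to 'CI_DATABASE'
tdfs4ds.SCHEMA = 'DEV_DATABASE'                # programmatic assignment: highest priority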
Time Travel
All catalogs and feature stores are temporal. Point-in-time queries are available via:
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00' # query historical state
tdfs4ds.FEATURE_STORE_TIME = None # back to current state
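For example, a dataset view can be assembled against a historical state and the session then returned to the present. A hedged sketch reusing selected_features from the walkthrough above, and assuming build_dataset honours FEATURE_STORE_TIME like the temporal catalogs do:
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'
dataset_jan = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset_2024_01_01',
    comment='Churn dataset as of 2024-01-01'
)
tdfs4ds.FEATURE_STORE_TIME = None  # back to the current state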
Package Structure
tdfs4ds/
├── __init__.py — Global config variables & re-exported public API
├── config.py — External config loading (JSON, .env, env vars); load_config()
├── lifecycle.py — setup(), connect()
├── execution.py — run(), upload_features(), roll_out()
├── catalog.py — feature_catalog(), process_catalog(), dataset_catalog()
├── data_domain.py — get_data_domains(), select_data_domain(), create_data_domain()
├── datasets.py — Utility dataset helpers
├── agent/
│ ├── __init__.py — Public exports: consumer_agent, query_optimizer, aquery_optimizer, display_optimization_result, …
│ ├── consumer_agent.py — Intent classifier, 7 skills, feature doc resolver, LLM helpers
│ ├── graph.py — LangGraph StateGraph: classify → detect_domain → skill → synthesize
│ ├── query_optimizer.py — skill_optimize_query(), query_optimizer(), aquery_optimizer(), display_optimization_result()
│ ├── embedding.py — get_embeddings(), list_embedding_models()
│ ├── vector_index.py — build_vector_index(), search_vector_index() (Chroma)
│ └── chatbot.py — launch_chatbot(), launch_chatbot_with_index() Gradio UI
├── feature_store/
│ ├── entity_management.py — register_entity(), remove_entity()
│ ├── feature_data_processing.py — prepare_feature_ingestion(), store_feature(), apply_collect_stats()
│ ├── feature_query_retrieval.py — get_list_features(), get_available_features(), get_feature_versions()
│ └── feature_store_management.py — register_features(), feature_store_table_creation()
├── process_store/
│ ├── process_followup.py — followup_open(), followup_close(), follow_up_report()
│ ├── process_query_administration.py — list_processes(), get_process_id(), remove_process()
│ ├── process_registration_management.py — register_process_view()
│ └── process_store_catalog_management.py — process_store_catalog_creation()
├── dataset/
│ ├── builder.py — build_dataset(), build_dataset_opt(), augment_source_with_features()
│ ├── dataset.py — Dataset class
│ └── dataset_catalog.py — DatasetCatalog class
├── genai/
│ └── documentation.py — LLM-powered auto-documentation of SQL processes (OpenAI / Bedrock)
├── lineage/
│ ├── lineage.py — SQL query parsing, DDL analysis
│ ├── network.py — Dependency graph construction
│ └── indexing.py — Lineage indexing utilities
└── utils/
├── query_management.py — execute_query(), execute_query_wrapper()
├── filter_management.py — FilterManager class
├── time_management.py — TimeManager class
├── lineage.py — crystallize_view(), analyze_sql_query(), generate_view_dependency_network()
├── info.py — update_varchar_length(), get_column_types(), seconds_to_dhms()
└── visualization.py — plot_graph(), visualize_graph(), display_table()
GenAI Documentation
The genai module provides two complementary ways to document the feature store.
LLM-powered process documentation
document_process() calls an LLM (OpenAI, Azure, vLLM, or AWS Bedrock) to generate:
- Business-logic description of the SQL query
- Entity description and per-column annotations
- EXPLAIN-plan quality score (1–5) with warnings and recommendations
import tdfs4ds
from tdfs4ds.genai import document_process
# Configure the LLM (or use TDFS4DS_INSTRUCT_MODEL_* env vars / .env file)
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'
tdfs4ds.INSTRUCT_MODEL_MODEL = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY = 'sk-...'
process_info = document_process(process_id='<UUID>', show_explain_plan=True)
LLM-powered dataset documentation
document_dataset_incremental() documents a dataset by walking its full lineage bottom-up:
- Source tables — uses the business dictionary if available
- Intermediate views — auto-documented via LLM if undocumented
- Process views — actively calls document_process_incremental if undocumented
- Feature/entity column descriptions are propagated from process docs (no extra LLM call)
- A single JSON-constrained LLM call generates five structured sections for the dataset
from tdfs4ds.genai import document_dataset_incremental
result = document_dataset_incremental(
dataset_id = '<UUID>', # from dataset_catalog()
force_update = False,
upload = True,
)
# result['DATASET_SECTIONS'] contains:
# OVERVIEW, ENTITY, FEATURE_THEMES, BUSINESS_QUESTIONS, INTENDED_AUDIENCE
Each section is stored as an independent row in FS_BUSINESS_DICTIONARY_SECTIONS — no chunking needed for RAG retrieval.
Full-store documentation in one call
document_feature_store_incremental() documents every registered process and dataset in a single optimised pass. Objects are processed in dependency order (leaves first, roots last) so upstream context is always available. A shared pair of visited-sets ensures each process view is documented at most once, even when referenced by multiple datasets.
from tdfs4ds.genai import document_feature_store_incremental
summary = document_feature_store_incremental(
language = 'English',
force_update = False,
upload = True,
)
# summary keys: processes_documented, datasets_documented,
# processes_skipped, datasets_skipped
After documentation, process descriptions are automatically mirrored to the business dictionary (object overview + column-level feature descriptions) so the consumer agent can resolve them without any extra step.
Business dictionary (no LLM required)
Three temporal tables store business-oriented descriptions for any database object, its columns, and its documentation sections. They form a 3-level hierarchy designed for chunking-free hierarchical RAG:
| Level | Table | Key | Purpose |
|---|---|---|---|
| 0 | FS_BUSINESS_DICTIONARY_OBJECTS | (DATABASE_NAME, OBJECT_NAME) | One summary per object (OBJECT_TYPE: 'T'/'V'/'D') |
| 1 | FS_BUSINESS_DICTIONARY_SECTIONS | (DATABASE_NAME, OBJECT_NAME, SECTION_NAME) | One row per documentation section per object |
| 2 | FS_BUSINESS_DICTIONARY_COLUMNS | (DATABASE_NAME, TABLE_NAME, COLUMN_NAME) | One description per column |
All tables are VALIDTIME temporal and provisioned automatically by tdfs4ds.connect(create_if_missing=True).
import pandas as pd
from tdfs4ds.genai import (
upload_business_dictionary_objects,
upload_business_dictionary_columns,
upload_business_dictionary_sections,
)
# Level 0 — Object-level descriptions
upload_business_dictionary_objects(pd.DataFrame([
{
'DATABASE_NAME' : 'MY_DB',
'OBJECT_NAME' : 'CUSTOMER',
'OBJECT_TYPE' : 'T',
'BUSINESS_DESCRIPTION': 'Core customer table. Each row represents a unique enrolled customer.',
},
]))
# Level 1 — Section-level descriptions (typically LLM-generated for datasets)
upload_business_dictionary_sections(pd.DataFrame([
{
'DATABASE_NAME' : 'MY_DB',
'OBJECT_NAME' : 'DATASET_CUSTOMER',
'SECTION_NAME' : 'OVERVIEW',
'SECTION_CONTENT': 'Customer-level analytical dataset combining spending and category features...',
},
]))
# Level 2 — Column-level descriptions
upload_business_dictionary_columns(pd.DataFrame([
{
'DATABASE_NAME' : 'MY_DB',
'TABLE_NAME' : 'CUSTOMER',
'COLUMN_NAME' : 'CUSTOMER_ID',
'BUSINESS_DESCRIPTION': 'Unique customer identifier assigned at enrolment.',
},
]))
All three functions validate required columns and perform a CURRENT VALIDTIME MERGE — re-running them updates existing descriptions and preserves the full change history.
Consumer Agent (Chatbot)
The agent module provides a conversational interface for business consumers. Non-technical users can ask natural-language questions about features, datasets, definitions, data freshness, usage guidance, data lineage, and calculation logic — in English or French.
Architecture
User question
→ Intent classifier (Pydantic structured output)
→ DATA_DOMAIN detector (finds which domain owns the feature; remembered across turns)
→ Skill dispatcher (7 skills)
→ Plain-language answer
The agent uses LangGraph StateGraph with MemorySaver for multi-turn conversations. Conversation context is persisted across turns:
| State field | What is remembered |
|---|---|
| resolved_data_domain | Which DATA_DOMAIN owns the last named feature/dataset |
| resolved_object_name | Last explicitly named feature or dataset |
| resolved_feature_triplet | Full resolution: feature name, entity, process ID, view name |
| resolved_entity_name | Entity type in focus (e.g. CustomerID) |
| resolved_feature_list | Feature names currently in focus (one or many) |
| resolved_column_sources | Column→source-table map from the last EXPLAIN result (used by DEFINITION drill-down) |
Follow-up questions that omit an explicit feature name (e.g. "when was it last updated?", "how is it calculated?") automatically reuse the previously resolved feature, entity, and domain — no need to repeat yourself. When a feature name is shared across multiple entity types, the remembered entity silently disambiguates without asking for clarification.
After an EXPLAIN turn, the agent lists every variable involved in the formula with its source table. Asking "what is <column>?" immediately after an EXPLAIN resolves the column through the business dictionary — even if it is not a registered feature. Vague references (e.g. "what does the date mean?") are fuzzy-matched against remembered column and table names.
Feature descriptions are resolved via the process documentation chain:
entity → features → process_id → VIEW_NAME → FS_BUSINESS_DICTIONARY_COLUMNS
Quick start
import tdfs4ds
from tdfs4ds.agent import launch_chatbot_with_index
# Configure LLM and embedding model
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'vllm'
tdfs4ds.INSTRUCT_MODEL_URL = 'https://api.example.com/v1'
tdfs4ds.INSTRUCT_MODEL_API_KEY = 'my-key'
tdfs4ds.INSTRUCT_MODEL_MODEL = 'mistral-7b-instruct'
tdfs4ds.EMBEDDING_MODEL_URL = 'https://api.example.com/v1/e5'
tdfs4ds.EMBEDDING_MODEL_MODEL = 'bge-m3'
# Build vector index (incremental) then launch the Gradio chatbot — one call
demo = launch_chatbot_with_index(port=7860)
Or call the agent programmatically:
from tdfs4ds.agent import consumer_agent
answer = consumer_agent("What features are available?", thread_id="session-1")
answer = consumer_agent("How is nb_days_since_last_transactions calculated?", thread_id="session-1")
answer = consumer_agent("When was it last updated?", thread_id="session-1") # feature + entity remembered
answer = consumer_agent("What about for CustomerID?", thread_id="session-1") # entity remembered, new feature group
Skills
| Intent | Trigger examples | What happens |
|---|---|---|
| SEARCH | "What features analyse customer spending?" | Semantic search across vector index + feature catalog |
| DEFINITION | "What does total_amount measure?" | Resolves feature → process view → column doc |
| USAGE | "How do I use avg_amount in Tableau?" | Audience, granularity, regulatory guidance |
| FRESHNESS | "Is total_amount up to date?" | Checks follow-up execution history |
| SUMMARY | "List all available features" | Full feature list with entity and description per feature |
| LINEAGE | "Where does total_amount come from?" | Walks upstream dependency graph via build_teradata_dependency_graph |
| EXPLAIN | "How is total_amount calculated?" | Fetches SHOW VIEW DDL → LLM explains the logic in plain language and lists source columns so you can drill into any variable |
Notebooks
- 09 - Consumer Agent Chatbot with tdfs4ds.ipynb — architecture walkthrough, 7-intent test suite, 4-turn multi-turn demo
- 10 - Launch Consumer Agent Chatbot.ipynb — minimal 4-cell one-command launch
Gradio trace panel
The chatbot includes a collapsible Agent Trace accordion showing, for each turn:
- Intent Classification — detected intent, object name, domain
- DATA_DOMAIN Detection — available domains, resolved domain, source (detected / remembered)
- Skill Executed — skill name and inputs
- Skill Result — structured output summary; errors include per-step diagnostic messages
Model listing
from tdfs4ds.agent import list_instruct_models, list_embedding_models
list_instruct_models() # models on INSTRUCT_MODEL_URL
list_embedding_models(sub_paths=['e5', 'code']) # models on EMBEDDING_MODEL_URL sub-paths
Query Optimizer Agent
The query_optimizer module analyses and rewrites Teradata SQL feature-engineering queries for better performance. It is process-aware — when a process_id is supplied it pulls the registered SQL and any stored EXPLAIN documentation directly from the tdfs4ds process catalog, avoiding redundant LLM calls.
Pipeline (9 steps)
| Step | What happens |
|---|---|
| 0 | Process context — SQL + stored EXPLAIN analysis fetched from the catalog (skipped when no process_id) |
| 0.5 | SQL simplification — structural compaction pass merges unnecessary nesting layers into CTE + single outer SELECT, giving the LLM a cleaner baseline; accepts the simplified form only when its EXPLAIN score ≥ original |
| 1 | Structured EXPLAIN analysis — document_sql_query_explain scores the simplified query 1–5 with [You]-prefixed author-actionable warnings and recommendations |
| 2 | Lineage graph — Primary Index + partition columns collected for every underlying object |
| 3 | DDL fetch — SHOW TABLE / SHOW VIEW for every referenced object |
| 4 | Candidate generation — LLM proposes up to N rewrites focused on [You]-actionable items (N = tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATES, default 5) |
| 5 | Candidate EXPLAIN — document_sql_query_explain run per candidate |
| 6 | Plan comparison — LLM selects the best plan by score delta and resolved warnings |
| 7 | FilterManager check — partitioned-but-unfiltered objects are flagged for incremental processing |
| 8 | Final report — Markdown with Score Summary, Simplification section, 3-stage query comparison (Input → After Simplification → After Optimization), Candidates, Selected Optimisation, FilterManager |
Quick start
import tdfs4ds
# Configure LLM — vllm / OpenAI / Azure / Bedrock
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'vllm' # 'openai' does not require INSTRUCT_MODEL_URL
tdfs4ds.INSTRUCT_MODEL_MODEL = '...'
tdfs4ds.INSTRUCT_MODEL_API_KEY = '...'
# tdfs4ds.INSTRUCT_MODEL_URL = '...' # required for vllm/azure; omit for openai/bedrock
tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATES = 5 # max valid rewrites to evaluate (default 5)
tdfs4ds.QUERY_OPTIMIZER_MAX_FAILURES = 5 # max EXPLAIN/syntax failures before stopping (default 5)
tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATE_TOKENS = None # cap completion tokens per candidate call (None = model default; set e.g. 4096 for small-context models)
# Process-aware — SQL and stored EXPLAIN docs pulled from the catalog
result = tdfs4ds.query_optimizer(process_id='<UUID>', thread_id='session-1')
# Or pass raw SQL directly
result = tdfs4ds.query_optimizer(
sql_query="SELECT ... FROM db.tbl",
thread_id='session-1',
)
Multi-turn follow-ups sharing the same thread_id use the MemorySaver singleton — the agent retrieves the SQL from history and re-runs the pipeline with additional context:
result = tdfs4ds.query_optimizer(
"Would adding a Secondary Index on the join column improve the plan?",
thread_id='session-1',
)
Inside a Jupyter notebook use the async entry point to avoid background-thread overhead:
from tdfs4ds.agent import aquery_optimizer
result = await aquery_optimizer(process_id='<UUID>', thread_id='session-async')
Result keys
| Key | Content |
|---|---|
| answer | Structured Markdown optimisation report (Summary, Score Summary, Lineage, …) |
| best_sql | Recommended SQL (original if already optimal) |
| score_delta | Before/after score comparison — see below |
| original_analysis | Scored EXPLAIN: explanation, user_score, global_score, warnings, recommendations |
| candidates | List of candidate dicts: sql, strategy, rationale, analysis |
| comparison | Plan comparison: best_index, reasoning, business_logic_preserved |
| filtermanager_applicable | True if a FilterManager loop was recommended |
| process_info | Full process catalog record (when process_id is supplied) |
| steps | Every pipeline step with inputs and outputs |
Score comparison (score_delta)
After each optimization run, score_delta captures exactly how much the rewrite improved the query:
sd = result['score_delta']
# {
# 'optimized': True,
# 'best_strategy': 'partition_pruning',
# 'original_user_score': 2,
# 'original_global_score': 3,
# 'best_user_score': 4,
# 'best_global_score': 4,
# 'user_score_delta': 2.0, # +2 improvement
# 'global_score_delta': 1.0, # +1 improvement
# 'business_logic_preserved': True,
# }
user_score measures the quality of what the SQL author controls (1–5). global_score also includes infrastructure factors (Primary Index placement, statistics, etc.). When the bottleneck is infrastructure rather than the SQL itself, a rewrite can improve user_score but not global_score.
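A small sketch of how score_delta can drive the follow-up action, using only the keys documented above:
# Hedged sketch: decide whether the next step is a rewrite or an infrastructure change
sd = result['score_delta']
if sd['optimized'] and sd['user_score_delta'] > 0:
    print(f"Adopt the rewrite ({sd['best_strategy']}): user score improved by {sd['user_score_delta']}")
if sd['global_score_delta'] == 0 and sd['best_global_score'] < 5:
    print('Remaining bottleneck is likely infrastructure (Primary Index, statistics), not the SQL')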
Standalone simplification
The simplification pass can be called independently of the full optimizer:
result = tdfs4ds.simplify_query(sql_query="SELECT ...")
# or from a registered process
result = tdfs4ds.simplify_query(process_id='<UUID>')
# result keys: simplified_sql, original_sql, simplified (bool),
# original_score (1-5), simplified_score (1-5)
if result["simplified"]:
print(result["simplified_sql"])
Notebook display
display_optimization_result renders a score comparison widget at the top of the report, followed by the full Markdown analysis. The widget shows a before/after table with colour-coded Δ Change cells (green for improvement, red for regression) and a footer line for strategy, optimization status, and business-logic preservation.
from tdfs4ds.agent import display_optimization_result
display_optimization_result(result)
FilterManager integration
When filtermanager_applicable is True, the report includes a ready-to-use code snippet for iterating over partitions one at a time — reducing per-run spool usage and enabling full partition elimination:
fm = tdfs4ds.FilterManager(
schema_name = tdfs4ds.SCHEMA,
view_name = 'TRANSACTIONS',
col_names = ['transaction_date'],
)
for filter_id in range(fm.nb_filters):
fm.update(filter_id)
tdfs4ds.run(process_id)
Notebook
11 - Query Optimizer Agent with tdfs4ds.ipynb (notebook dev/genai/) walks through the full pipeline end-to-end — process-aware entry, multi-turn follow-ups, score comparison widget, and FilterManager recommendation.
Discover Registered Features
from tdfs4ds.feature_store.feature_query_retrieval import (
get_list_entity,
get_list_features,
get_available_features,
get_feature_versions,
)
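A hedged usage sketch; the exact signatures of these helpers are not documented on this page, so the no-argument calls below are illustrative only:
# Hedged sketch: browse what is registered in the active data domain
print(get_list_entity())         # entities known to the feature store
print(get_list_features())       # registered feature names
print(get_available_features())  # features available for dataset building
print(get_feature_versions())    # feature name to process version (UUID) mapping
# (these helpers may take filters such as entity_id; check their docstrings)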
Lineage
The lineage module builds end-to-end dependency graphs from a SQL query or a dataset view DDL.
Dependency graph
from tdfs4ds.lineage import build_teradata_dependency_graph, plot_lineage_sankey, show_plotly_robust
# Start from a dataset view DDL (obtained via SHOW VIEW)
sql = tdml.execute_sql("SHOW VIEW DATASET_CUSTOMER").fetchall()[0][0]
graph = build_teradata_dependency_graph(sql_query=sql)
# Returns: {"nodes": {...}, "edges": [...], "roots": [...]}
By default (expand_datasets_via_process_catalog=True) dataset nodes are resolved through
the process catalog: FEATURE_VERSION UUIDs embedded in the dataset DDL are matched to
PROCESS_ID entries in FS_V_PROCESS_CATALOG, and edges are drawn directly to the
registered feature-engineering views.
DATASET_CUSTOMER → FEAT_ENG_CUST → DB_SOURCE.TRANSACTIONS
Set expand_datasets_via_process_catalog=False to connect the dataset directly to the
raw feature-store storage tables (previous behaviour).
fig = plot_lineage_sankey(graph, title="Customer Dataset Lineage")
show_plotly_robust(fig)
Migration manifest
graph_to_migration_manifest converts any lineage graph into a flat, JSON-serialisable
dict — useful for planning a feature store migration.
from tdfs4ds.lineage import graph_to_migration_manifest
import json
# All databases
manifest = graph_to_migration_manifest(graph)
# Scoped to the feature store schema only (cross-boundary edges excluded)
manifest_fs = graph_to_migration_manifest(graph, filter_database=tdfs4ds.SCHEMA)
print(json.dumps(manifest_fs, indent=2))
# {
# "views": [{"database": "demo_user", "name": "DATASET_CUSTOMER", "type": "dataset"},
# {"database": "demo_user", "name": "FEAT_ENG_CUST", "type": "view"}],
# "tables": [],
# "edges": [{"from": "demo_user.DATASET_CUSTOMER", "to": "demo_user.FEAT_ENG_CUST"}]
# }
with open("migration_manifest.json", "w") as f:
json.dump(manifest_fs, f, indent=2)
Requirements
- Python >= 3.6
- teradataml >= 17.20
- Active Teradata Vantage connection
- VALIDTIME temporal tables must be enabled on the Teradata Vantage system — all feature catalogs, process catalogs, and feature stores rely on VALIDTIME support