A python package to simplify the usage of feature store using Teradata Vantage ...

Project description

tdfs4ds : A Feature Store Library for Data Scientists working with ClearScape Analytics

The tdfs4ds library is a Python package designed for managing and using Feature Stores in a Teradata Database. With a set of easy-to-use functions, tdfs4ds enables the efficient creation, registration, and storage of features. It also simplifies preparing feature data for ingestion, building datasets for data analysis, and retrieving existing features.

Getting Started

Install the tdfs4ds package via pip:

pip install tdfs4ds

To utilize the functionality of the tdfs4ds package, import it in your Python script:

import tdfs4ds

It is recommended to import it after creating the context with teradataml, so that the default database of the connection is used as the feature store database. Otherwise, you can specify it as follows:

tdfs4ds.SCHEMA = 'your_database'  # name of your feature store database

Getting started

The tdfs4ds package aims to make it simple and straightforward to start a feature store in a Vantage system, and especially in your datalab. To start, you only need to master a small number of functions.

tdfs4ds.setup(database, if_exists='fail'): Creates a feature catalog table and a process catalog table in the Teradata database you specify.
tdfs4ds.upload_features(df, entity_id, feature_names, metadata = {}): Ingests the features computed by the teradata dataframe df. entity_id identifies the columns that define the unique ID in the result set, together with their data types: it is a dictionary with the column name as key and the data type as value, e.g. {'ID': 'BIGINT'}. feature_names is the list of column names corresponding to the features you want to ingest in the feature store. Use the metadata argument to document your features.
tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment = 'dataset'): Creates a dataset view from the feature store. entity_id is the list of column names defining the entity; selected_features is a dictionary with feature names as keys and feature versions (meaning the id of the process that computes these features) as values. view_name is the name of the dataset view to create. Use the comment argument to comment the view in the database.

These three functions are the core of the package. They register entities and features for you when needed. They also manage the process catalog to simplify the operationalization of your feature engineering process.
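
As a minimal sketch of how the arguments described above fit together, the snippet below prepares them in plain Python before any call to tdfs4ds. The column names and the helper check_upload_arguments are hypothetical and are not part of the package; the check only guards against a column being listed both as an entity key and as a feature.

```python
# Hypothetical column names; adapt them to the output columns of your own view.
entity_id = {'ID': 'BIGINT'}            # column name -> Teradata data type
feature_names = ['KPI1', 'KPI2']        # columns to ingest as features
metadata = {'project': 'data quality'}  # free-form documentation


def check_upload_arguments(entity_id, feature_names):
    """Return the feature names after basic consistency checks.

    Illustrative helper only; it is not provided by tdfs4ds.
    """
    overlap = set(entity_id) & set(feature_names)
    if overlap:
        raise ValueError(f"columns used both as entity and feature: {sorted(overlap)}")
    if len(set(feature_names)) != len(feature_names):
        raise ValueError("feature_names contains duplicates")
    return list(feature_names)


checked_features = check_upload_arguments(entity_id, feature_names)
print(checked_features)
```

Running such a check first can surface a mistyped column list before anything is written to the database.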

The feature catalog, the process catalog and the feature stores are all temporal, meaning that you can time travel by changing the tdfs4ds.FEATURE_STORE_TIME variable (the format is '9999-01-01 00:00:00', or None for the current time).
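
The timestamp format can be produced from a Python datetime, as in the following sketch. The helper to_feature_store_time is hypothetical and not part of tdfs4ds; the assignments to tdfs4ds.FEATURE_STORE_TIME are left as comments so the snippet runs without a Vantage connection.

```python
from datetime import datetime


def to_feature_store_time(moment=None):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS', or return None for the current time."""
    if moment is None:
        return None
    return moment.strftime('%Y-%m-%d %H:%M:%S')


as_of = to_feature_store_time(datetime(2024, 1, 1))
print(as_of)

# With the package imported, the value would then be assigned as described above:
# tdfs4ds.FEATURE_STORE_TIME = as_of   # look at the store as of that timestamp
# tdfs4ds.FEATURE_STORE_TIME = None    # back to the current time
```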

Finally, the package manages data domains to avoid conflicts in feature and entity names across multiple use cases. The active data domain is stored in tdfs4ds.DATA_DOMAIN.
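
When several use cases share one session, it can be convenient to switch the active data domain temporarily and restore it afterwards. The sketch below shows one possible pattern; the context manager data_domain and the domain names are hypothetical, and a SimpleNamespace stands in for the imported tdfs4ds module so the example runs on its own. It assumes the package reads DATA_DOMAIN at call time.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def data_domain(module, name):
    """Temporarily set module.DATA_DOMAIN to name, restoring the previous value on exit."""
    previous = module.DATA_DOMAIN
    module.DATA_DOMAIN = name
    try:
        yield module
    finally:
        module.DATA_DOMAIN = previous


# Stand-in for the imported tdfs4ds module (assumption made for illustration).
fake_tdfs4ds = SimpleNamespace(DATA_DOMAIN='DATA_QUALITY')

with data_domain(fake_tdfs4ds, 'CHURN'):
    inside = fake_tdfs4ds.DATA_DOMAIN   # code run here would see the 'CHURN' domain
after = fake_tdfs4ds.DATA_DOMAIN        # the previous domain is active again
print(inside, after)
```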

Example

Imagine you already have a feature engineering process implemented as a view in Teradata Vantage. If you do not have a feature store, the first step is to set up one:

Step 1: set up a feature store

After having created a context with teradataml, just type:

import tdfs4ds
tdfs4ds.setup(database=my_database)

This will create the feature and process catalogs for you in the database named my_database.

Step 2: connect to your feature store

Now we specify the active database and the data domain our features belong to.

tdfs4ds.DATA_DOMAIN = 'DATA_QUALITY'
tdfs4ds.SCHEMA      = my_database

These two parameters are initialized with the default database when the tdfs4ds package is imported after the create_context call of teradataml that establishes the connection with Vantage.

Step 3: feature engineering

In Vantage, almost any feature engineering process, whether written in SQL or involving external languages or engines, can be implemented in a SQL view. A view can be handled as a teradata dataframe (a cousin of the pandas dataframe).

df = tdml.DataFrame(tdml.in_schema(my_database, my_view))

Here, tdml is the alias of the teradataml import, and my_view is the view that implements the feature engineering process. If you apply additional transformations with teradataml, use tdfs4ds.utils.lineage.crystallize_view to make the views generated by teradataml permanent.

Step 4: upload & operationalise

We define among the output columns of the view the columns dealing with the entity description, and the features to be ingested. We also add some metadata to describe our project.

from tdfs4ds import upload_features
# Specify the entity ids in the view columns (+ data types)
entity_id       = ['EVENT_DT', 'ID']
# Specify the columns that contain the actual features
feature_names   = ['KPI1', 'KPI2']
# Attach informative metadata to document your process
metadata        = {'project': 'data quality'}
# Upload & operationalise
upload_features(
    df=df,
    entity_id=entity_id,
    feature_names=feature_names,
    metadata=metadata
)

Here we go! This command will register the entities, register the features in the data domain if they are not yet registered, register a feature engineering process in the process catalog and register the features in the feature store, maintaining the lineage.

This function also outputs a teradata dataframe corresponding to the dataset you have just registered, i.e. the results of my_view.

You also get a process id. This process id can also be retrieved with:

from tdfs4ds.process_store.process_query_administration import list_processes
list_processes()

You can also run the process again and ingest new features in the feature store by using this process id as follows:

from tdfs4ds import run
run(process_id)

So there is no need to keep the code that builds the feature engineering process: the process id is all you need.

No worries if you are computing features that are already present in the feature store: the feature store is temporal, so it avoids duplicating features and versions the feature values when needed.
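
The idea of recording a new version only when a value changes can be illustrated with a toy, in-memory structure. This is not how tdfs4ds or Teradata temporal tables are implemented; the class ToyTemporalStore and its methods are hypothetical and only mimic the behaviour described above, assuming values arrive in chronological order.

```python
from datetime import datetime


class ToyTemporalStore:
    """Keeps, per (entity, feature), a history of (valid_from, value) pairs."""

    def __init__(self):
        self.history = {}

    def upsert(self, entity, feature, value, when):
        versions = self.history.setdefault((entity, feature), [])
        if versions and versions[-1][1] == value:
            return False          # same value as the latest version: nothing is duplicated
        versions.append((when, value))
        return True               # value is new or changed: a new version is recorded

    def value_as_of(self, entity, feature, when):
        """Return the value that was valid at 'when', or None if none existed yet."""
        result = None
        for valid_from, value in self.history.get((entity, feature), []):
            if valid_from <= when:
                result = value
        return result


store = ToyTemporalStore()
store.upsert(1, 'KPI1', 10.0, datetime(2024, 1, 1))
store.upsert(1, 'KPI1', 10.0, datetime(2024, 2, 1))   # identical value, ignored
store.upsert(1, 'KPI1', 12.5, datetime(2024, 3, 1))   # changed value, versioned
print(len(store.history[(1, 'KPI1')]))
print(store.value_as_of(1, 'KPI1', datetime(2024, 2, 15)))
```

Querying the toy store with an earlier timestamp plays the role that tdfs4ds.FEATURE_STORE_TIME plays for the real feature store.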

Building a new dataset with existing features

Now that your feature store is populated, you can build any dataset knowing the entity_ids, the features and the process_id (or feature version) of your choice.

from tdfs4ds import build_dataset
mydataset = build_dataset(
    entity_id = ['customer_id'],
    selected_features = selected_features,
    view_name = 'mydataset', 
    comment = 'dataset for CHURN')

selected_features is a dictionary where the keys are the feature names and the values are the ids of the processes used to compute the feature values.
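
A minimal sketch of such a dictionary, for the common case where all features come from the same process, is shown below. The feature names and the process id are placeholders, not values from a real feature store.

```python
# Hypothetical values: replace them with feature names and a process id from your own store.
feature_names = ['KPI1', 'KPI2']
process_id = 'my-process-id'   # placeholder, e.g. taken from the output of list_processes()

# Every feature is taken in the version computed by the same process.
selected_features = {name: process_id for name in feature_names}
print(selected_features)
```

Features computed by different processes can be mixed by assigning a different process id to each key.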

If you do not know the registered entities, features and corresponding feature versions, you can use the functions get_list_entity, get_list_features, get_available_features and get_feature_versions in tdfs4ds.feature_store.feature_query_retrieval.

Here is the structure of the package:

.tdfs4ds
    ├── datasets.py
    │   └── Function: outstanding_amounts_dataset
    │   └── Function: upload_outstanding_amounts_dataset
    ├── __init__.py
    │   └── Function: _build_time_series
    │   └── Function: _upload_features
    │   └── Function: build_dataset
    │   └── Function: build_dataset_time_series
    │   └── Function: connect
    │   └── Function: feature_catalog
    │   └── Function: process_catalog
    │   └── Function: roll_out
    │   └── Function: run
    │   └── Function: setup
    │   └── Function: upload_features
    │   └── Function: upload_tdstone2_scores
    └── data
    └── feature_store
        ├── entity_management.py
        │   └── Function: register_entity
        │   └── Function: remove_entity
        │   └── Function: tdstone2_entity_id
        ├── feature_data_processing.py
        │   └── Function: _store_feature_merge
        │   └── Function: _store_feature_update_insert
        │   └── Function: prepare_feature_ingestion
        │   └── Function: prepare_feature_ingestion_tdstone2
        │   └── Function: store_feature
        ├── feature_query_retrieval.py
        │   └── Function: get_available_features
        │   └── Function: get_entity_tables
        │   └── Function: get_feature_store_content
        │   └── Function: get_feature_store_table_name
        │   └── Function: get_feature_versions
        │   └── Function: get_list_entity
        │   └── Function: get_list_features
        │   └── Function: list_features
        ├── feature_store_management.py
        │   └── Function: GetAlreadyExistingFeatureNames
        │   └── Function: GetTheLargestFeatureID
        │   └── Function: Gettdtypes
        │   └── Function: delete_feature
        │   └── Function: feature_store_catalog_creation
        │   └── Function: feature_store_table_creation
        │   └── Function: register_features
        │   └── Function: remove_feature
        │   └── Function: tdstone2_Gettdtypes
        ├── __init__.py
    └── process_store
        ├── process_query_administration.py
        │   └── Function: get_process_id
        │   └── Function: list_processes
        │   └── Function: remove_process
        ├── process_registration_management.py
        │   └── Function: register_process_tdstone
        │   └── Function: register_process_view
        ├── process_store_catalog_management.py
        │   └── Function: process_store_catalog_creation
        ├── __init__.py
    └── utils
        ├── info.py
        │   └── Function: get_column_types
        │   └── Function: get_column_types_simple
        ├── lineage.py
        │   └── Function: _analyze_sql_query
        │   └── Function: analyze_sql_query
        │   └── Function: crystallize_view
        │   └── Function: generate_view_dependency_network
        │   └── Function: generate_view_dependency_network_fs
        │   └── Function: get_ddl
        ├── query_management.py
        │   └── Function: execute_query
        │   └── Function: execute_query_wrapper
        │   └── Function: is_version_greater_than
        ├── time_management.py
        │   └── Class: TimeManager
        ├── visualization.py
        │   └── Function: display_table
        │   └── Function: linear_depth_layout
        │   └── Function: plot_graph
        │   └── Function: prepare_plotly_traces
        │   └── Function: radial_layout
        │   └── Function: segmented_linear_layout
        │   └── Function: visualize_graph
        ├── __init__.py

Project details

Release history

0.2.2.84

Nov 4, 2024

0.2.2.83

Nov 4, 2024

0.2.2.82

Nov 4, 2024

0.2.2.81

Oct 30, 2024

0.2.2.80

Oct 29, 2024

0.2.2.79

Oct 29, 2024

0.2.2.78

Oct 28, 2024

0.2.2.77

Oct 28, 2024

0.2.2.76

Oct 25, 2024

0.2.2.75

Oct 25, 2024

0.2.2.74

Oct 25, 2024

0.2.2.73

Oct 25, 2024

0.2.2.72

Oct 15, 2024

0.2.2.71

Oct 3, 2024

0.2.2.70

Oct 3, 2024

0.2.2.69

Sep 25, 2024

0.2.2.68

Sep 25, 2024

0.2.2.67

Jul 18, 2024

0.2.2.66

Jul 17, 2024

0.2.2.65

Jul 10, 2024

0.2.2.64

Jul 10, 2024

0.2.2.63

Jul 10, 2024

0.2.2.62

Jul 10, 2024

0.2.2.61

Jul 10, 2024

0.2.2.60

Jul 8, 2024

0.2.2.59

Jul 6, 2024

0.2.2.58

Jul 6, 2024

0.2.2.57

Jul 6, 2024

0.2.2.56

Jul 6, 2024

0.2.2.55

Jul 5, 2024

0.2.2.54

Jul 5, 2024

0.2.2.53

Jul 5, 2024

0.2.2.52

Jul 5, 2024

0.2.2.51

Jul 5, 2024

0.2.2.50

Jul 4, 2024

0.2.2.49

Jul 3, 2024

0.2.2.48

Jul 3, 2024

0.2.2.47

Jul 3, 2024

0.2.2.46

Jun 28, 2024

0.2.2.45

Jun 28, 2024

0.2.2.44

Jun 28, 2024

0.2.2.43

Jun 28, 2024

0.2.2.42

Jun 28, 2024

0.2.2.41

Jun 27, 2024

0.2.2.40

Jun 27, 2024

0.2.2.39

Jun 27, 2024

0.2.2.38

Jun 27, 2024

This version

0.2.2.37

Jun 27, 2024

0.2.2.36

Jun 19, 2024

0.2.2.35

Jun 17, 2024

0.2.2.34

Jun 14, 2024

0.2.2.33

Jun 13, 2024

0.2.2.32

Jun 13, 2024

0.2.2.31

Jun 12, 2024

0.2.2.30

Jun 10, 2024

0.2.2.29

Jun 10, 2024

0.2.2.28

May 29, 2024

0.2.2.27

May 21, 2024

0.2.2.26

May 21, 2024

0.2.2.25

May 21, 2024

0.2.2.24

May 21, 2024

0.2.2.23

May 21, 2024

0.2.2.22

May 16, 2024

0.2.2.21

May 16, 2024

0.2.2.20

May 14, 2024

0.2.2.19

May 14, 2024

0.2.2.18

May 14, 2024

0.2.2.17

May 14, 2024

0.2.2.16

May 14, 2024

0.2.2.15

Apr 26, 2024

0.2.2.14

Apr 25, 2024

0.2.2.13

Apr 11, 2024

0.2.2.12

Apr 11, 2024

0.2.2.11

Apr 8, 2024

0.2.2.10

Apr 5, 2024

0.2.2.8

Mar 27, 2024

0.2.2.7

Mar 26, 2024

0.2.2.6

Mar 26, 2024

0.2.2.5

Mar 19, 2024

0.2.2.4

Mar 19, 2024

0.2.2.3

Mar 14, 2024

0.2.2.2

Mar 14, 2024

0.2.2.1

Mar 8, 2024

0.2.2.0

Mar 8, 2024

0.2.1.28

Mar 4, 2024

0.2.1.27

Mar 4, 2024

0.2.1.26

Feb 15, 2024

0.2.1.25

Feb 15, 2024

0.2.1.24

Feb 14, 2024

0.2.1.23

Feb 14, 2024

0.2.1.22

Feb 14, 2024

0.2.1.21

Feb 13, 2024

0.2.1.20

Feb 13, 2024

0.2.1.19

Feb 13, 2024

0.2.1.18

Feb 13, 2024

0.2.1.17

Feb 12, 2024

0.2.1.16

Feb 12, 2024

0.2.1.15

Feb 12, 2024

0.2.1.14

Feb 9, 2024

0.2.1.13

Feb 9, 2024

0.2.1.12

Feb 8, 2024

0.2.1.11

Feb 7, 2024

0.2.1.9

Feb 7, 2024

0.2.1.8

Feb 6, 2024

0.2.1.7

Feb 6, 2024

0.2.1.6

Feb 5, 2024

0.2.1.5

Feb 5, 2024

0.2.1.4

Feb 5, 2024

0.2.1.3

Feb 5, 2024

0.2.1.2

Feb 5, 2024

0.2.1.1

Feb 2, 2024

0.2.0.1

Feb 2, 2024

0.1.0.26

Jan 29, 2024

0.1.0.25

Jan 18, 2024

0.1.0.24

Jan 18, 2024

0.1.0.22

Jan 15, 2024

0.1.0.21

Jan 10, 2024

0.1.0.20

Dec 22, 2023

0.1.0.19

Dec 22, 2023

0.1.0.18

Dec 21, 2023

0.1.0.17

Dec 20, 2023

0.1.0.16

Dec 20, 2023

0.1.0.15

Dec 20, 2023

0.1.0.14

Dec 19, 2023

0.1.0.13

Dec 19, 2023

0.1.0.12

Dec 5, 2023

0.1.0.11

Dec 1, 2023

0.1.0.10

Dec 1, 2023

0.1.0.9

Nov 30, 2023

0.1.0.8

Nov 15, 2023

0.1.0.7

Nov 10, 2023

0.1.0.6

Sep 13, 2023

0.1.0.5

Sep 13, 2023

0.1.0.4

Sep 13, 2023

0.1.0.3

Sep 12, 2023

0.1.0.2

Sep 11, 2023

0.1.0.1

Jul 5, 2023

0.1.0.0

Jul 5, 2023

Download files


Source Distributions

No source distribution files available for this release.

Built Distribution

tdfs4ds-0.2.2.37-py3-none-any.whl (166.9 kB)

Uploaded Jun 27, 2024 Python 3

Hashes for tdfs4ds-0.2.2.37-py3-none-any.whl

Algorithm	Hash digest
SHA256	`05eec5a595f35bb0ba72d955e2674e1e5197bfb7c621a0558b64352610395a52`
MD5	`55c10662010d33e2ab06a54aad60e8ec`
BLAKE2b-256	`1c3edf02bcff6b801c126a52efca66717646b82bb97da4cbfb912027dbf9e721`
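
A downloaded wheel can be checked against the published SHA256 digest with the Python standard library. The sketch below is one way to do it; the helper names sha256_of_file and wheel_matches are illustrative, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at 'path', read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest published for tdfs4ds-0.2.2.37-py3-none-any.whl
EXPECTED_SHA256 = '05eec5a595f35bb0ba72d955e2674e1e5197bfb7c621a0558b64352610395a52'


def wheel_matches(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()


# Usage, assuming the wheel was downloaded to the current directory:
# wheel_matches('tdfs4ds-0.2.2.37-py3-none-any.whl')
```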