A utility to generate ML features from yaml

feature store utils

A lightweight package that lets you express ML features in simple YAML, build a training dataset, and then write the features to a feature store.

some general thoughts on building a training dataset

https://docs.google.com/presentation/d/1tVkrwCLVwFp8cZC7CmAHSNFhsJrcTdC20MlZfptkSBE/edit?usp=sharing

options for use

  1. Clone this repo, create features.yaml, and follow the demo notebook. Do not check your changes back in.
  2. Create your own repo and install this project as a package (currently on TestPyPI). See https://github.com/BenMacKenzie/churn_model_demo for an example. Note that you must create a .env file in the folder that contains the features.yaml file.

Notes

  1. The current version is experimental. It is not clear that Jinja is the right way to write parameterized SQL; it might be better to do this in Python.
  2. The current version is not optimized. Each feature is calculated individually, whereas multiple aggregation features could be calculated simultaneously when their table, filters, and time windows are identical.
  3. I believe there are around a dozen standard feature types. The most common have been implemented. Note that views can fill many of the gaps if they are encountered. Still missing:
  • type 1 lookup
  • 1st-order aggregations over time series (i.e., just treat the time series like a fact table)
  • 2nd-order aggregations over time series, e.g., max monthly job DBU over a 6-month window
  • time in state, e.g., how long a ticket was open, based on a type 2 table
  • time to event in a fact table, e.g., time since the last call to customer support
  • scalar functions of two or more features, e.g., time in days between two dates
  • number of state changes over an interval (rare)
  • functions of features (e.g., ratio of growth in job DBU to interactive DBU). Arguably this is not needed for boosted trees. It might be useful for neural nets, but why use a neural net on heterogeneous data? (That said, this kind of feature can be good for model explainability.)
  4. Need to illustrate adding features from a related dimension table using a foreign key (the machinery to do so is in place).
  5. The current version illustrates creating a pipeline that uses the API. It would be nice to just generate the code and write it to a notebook so that the package is invisible in production (like bamboolib).
  6. The demo repo (https://github.com/BenMacKenzie/churn_model_demo) illustrates 'hyper-features', which are features with variable parameters.
  7. Connecting 'hyper-features' to the feature store needs to be worked out. Currently the options are to add all of them or to specify individual versions by their (generated) names.
  8. Fix the observation dates used when generating features for the feature store. Align them with the grain of the feature, e.g., if the grain is monthly, make sure the feature store contains an observation on the first of the month.
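The notes above about doing parameterized SQL in Python and about computing several aggregation features at once can be sketched together. The following is only an illustration under stated assumptions, not the package's implementation: the `AggFeature` fields, the `observations` table, and the `customer_id` and `event_date` column names are all hypothetical, and the generated SQL assumes Spark SQL's `DATE_SUB(date, days)`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AggFeature:
    """Hypothetical feature spec; these fields are NOT the package's yaml schema."""
    name: str         # output column name
    table: str        # source fact table
    column: str       # column being aggregated
    func: str         # SQL aggregate, e.g. "SUM", "AVG", "MAX"
    filter: str       # extra SQL predicate, "" for none
    window_days: int  # look-back window ending at the observation date


def build_grouped_queries(features, key="customer_id", date_col="event_date"):
    """Return one SQL string per distinct (table, filter, window) group.

    Features in the same group share a single scan of the source table.
    Plain string formatting is used for brevity, so the inputs must be trusted.
    """
    groups = defaultdict(list)
    for f in features:
        groups[(f.table, f.filter, f.window_days)].append(f)

    queries = []
    for (table, flt, window), members in groups.items():
        select_list = ",\n       ".join(
            f"{m.func}(t.{m.column}) AS {m.name}" for m in members
        )
        # Keep the filter in the ON clause so the LEFT JOIN still returns
        # observations that have no matching rows in the fact table.
        extra = f"\n AND ({flt})" if flt else ""
        queries.append(
            f"SELECT o.{key}, o.observation_date,\n       {select_list}\n"
            f"FROM observations o\n"
            f"LEFT JOIN {table} t\n"
            f"  ON t.{key} = o.{key}\n"
            f" AND t.{date_col} < o.observation_date\n"
            f" AND t.{date_col} >= DATE_SUB(o.observation_date, {window})"
            f"{extra}\n"
            f"GROUP BY o.{key}, o.observation_date"
        )
    return queries


features = [
    AggFeature("total_calls_30d", "calls", "call_id", "COUNT", "", 30),
    AggFeature("avg_duration_30d", "calls", "duration", "AVG", "", 30),
    AggFeature("max_duration_90d", "calls", "duration", "MAX", "", 90),
]
queries = build_grouped_queries(features)
for q in queries:
    print(q, end="\n\n")
```

Here the two 30-day features end up in one query and the 90-day feature in another, so three features need only two scans of the hypothetical `calls` table.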

Building

python3 -m build  
python3 -m twine upload --repository testpypi dist/*

python3 -m twine upload dist/*

Running unit tests on databricks

  1. Install the Databricks extension for VS Code.
  2. Use this repo as a template. Note the following files:
     • remote_test_harness/pytest_databricks.py
     • .vscode/launch.json
  3. Write tests as usual (see tests/time_series/time_series_test.py for an example).
