A utility to generate ML features from YAML
Project description
feature store utils
A light-weight package that allows you to express ML features in simple YAML, build a training data set, and then write the features to a feature store.
Some general thoughts on building a training dataset: https://docs.google.com/presentation/d/1tVkrwCLVwFp8cZC7CmAHSNFhsJrcTdC20MlZfptkSBE/edit?usp=sharing
Options for use
- clone this repo, create features.yaml, and follow the demo notebook. Do not check your changes back in.
- create your own repo and install this package (currently on TestPyPI). See https://github.com/BenMacKenzie/churn_model_demo as an example. Note that you must create a .env file in the folder that contains the features.yaml file (a sketch of loading one appears after this list).
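What goes into the .env depends on your workspace setup; the variable names below (DATABRICKS_HOST, DATABRICKS_TOKEN) are assumptions for illustration, not something this package documents. A minimal sketch of loading the file with python-dotenv:

```python
# Minimal sketch: load the .env that sits next to features.yaml.
# The variable names below are assumptions -- use whatever your
# workspace / feature-store connection actually requires.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

host = os.environ["DATABRICKS_HOST"]    # assumed variable name
token = os.environ["DATABRICKS_TOKEN"]  # assumed variable name
```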
Notes
- The current version is experimental. It is not clear that Jinja is the right way to write parameterized SQL; it might be better to do it in Python (a sketch of the Jinja approach follows this list).
- The current version is not optimized: each feature is calculated individually, whereas multiple aggregation features could be calculated in a single pass when their table, filters, and time windows are identical (see the grouping sketch following this list).
- I believe there are around a dozen standard feature types; the most common have been implemented, and views can fill in a lot of the gaps where a type is missing. Still missing:
- type 1 lookup.
- 1st order aggregations over time series (e.g., just treat it like a fact table)
- 2nd order aggregations over time series, e.g., max monthly job dbu over a 6-month window.
- time in state, e.g., how long a ticket was open (based on a type 2 table)
- time to event in a fact table, e.g., time since the last call to customer support
- scalar functions of two or more features, e.g., time in days between two dates
- num state changes over interval (rare)
- functions of features (e.g., ratio of growth in job dbu to interactive dbu). Arguably this is not needed for boosted trees. It might be useful for neural nets, but why use a neural net on heterogeneous data? (Actually, this kind of thing can be good for model explainability.)
- Need to illustrate adding features from a related dimension table using a foreign key (the machinery is in place to do so).
- The current version illustrates creating a pipeline that uses the API, but it would be nice to just generate the code and write it to a notebook so that the package is invisible in production (like bamboolib).
- The demo repo (https://github.com/BenMacKenzie/churn_model_demo) illustrates 'hyper-features', which are features with variable parameters.
- Connecting 'hyper-features' to the feature store still needs to be worked out. Currently the options are to add all of them or to specify individual versions by their (generated) names.
- Fix feature store feature generation observation dates: align them with the grain of the feature, e.g., if the grain is monthly, make sure the feature store contains an observation on the first of each month (see the observation-date sketch following this list).
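On the Jinja question in the first note, here is a minimal sketch of the kind of Jinja-templated SQL being debated. The template text and parameter names are illustrative only, not the package's actual templates:

```python
# Minimal sketch of Jinja-templated SQL for one aggregation feature.
# Template text and parameter names are illustrative, not from this package.
from jinja2 import Template

feature_sql = Template("""
SELECT
  {{ entity_key }},
  {{ agg }}({{ value_col }}) AS {{ feature_name }}
FROM {{ source_table }}
WHERE {{ timestamp_col }} >= date_sub('{{ observation_date }}', {{ window_days }})
  AND {{ timestamp_col }} < '{{ observation_date }}'
GROUP BY {{ entity_key }}
""")

print(feature_sql.render(
    entity_key="customer_id",
    agg="sum",
    value_col="dbu",
    feature_name="job_dbu_90d",
    source_table="billing.usage",
    timestamp_col="usage_date",
    observation_date="2023-06-01",
    window_days=90,
))
```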
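On the optimization note, aggregation features that share a source table, filter, and time window can be grouped and emitted as a single GROUP BY query. A rough sketch of the grouping step, using hypothetical feature dicts rather than the package's internal representation:

```python
# Sketch of batching aggregation features that share (table, filter, window).
# The feature dicts are hypothetical, not the package's internal representation.
from collections import defaultdict

features = [
    {"name": "job_dbu_90d",   "table": "billing.usage",   "filter": "sku = 'JOBS'", "window_days": 90, "agg": "sum(dbu)"},
    {"name": "job_count_90d", "table": "billing.usage",   "filter": "sku = 'JOBS'", "window_days": 90, "agg": "count(*)"},
    {"name": "tickets_30d",   "table": "support.tickets", "filter": None,           "window_days": 30, "agg": "count(*)"},
]

# Features with identical (table, filter, window) can share one scan / GROUP BY.
groups = defaultdict(list)
for f in features:
    groups[(f["table"], f["filter"], f["window_days"])].append(f)

for (table, flt, window), group in groups.items():
    select_list = ", ".join(f"{f['agg']} AS {f['name']}" for f in group)
    filter_clause = f" AND {flt}" if flt else ""
    print(
        f"SELECT customer_id, {select_list}\n"
        f"FROM {table}\n"
        f"WHERE event_date >= date_sub(observation_date, {window}){filter_clause}\n"
        f"GROUP BY customer_id\n"
    )
```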
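On the observation-date note, generating first-of-month observation dates for a monthly-grain feature could look roughly like this (a sketch, not the package's implementation):

```python
# Sketch: generate first-of-month observation dates so a monthly-grain feature
# has exactly one feature-store observation per month.
from datetime import date

def month_starts(start: date, end: date) -> list[date]:
    """First-of-month observation dates within [start, end]."""
    year, month = start.year, start.month
    if start.day > 1:  # snap forward to the next month boundary
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    dates = []
    while (year, month, 1) <= (end.year, end.month, end.day):
        dates.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return dates

print(month_starts(date(2023, 1, 15), date(2023, 6, 30)))
# [date(2023, 2, 1), date(2023, 3, 1), ..., date(2023, 6, 1)]
```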
Building
python3 -m build
python3 -m twine upload --repository testpypi dist/*
python3 -m twine upload dist/*
Running unit tests on Databricks
- install the Databricks extension for VS Code
- use this repo as a template. Note the following:
- remote_test_harness/pytest_databricks.py (a minimal sketch of such a harness appears after this list)
- .vscode/launch.json
- write tests as usual (see tests/time_series/time_series_test.py as an example)
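The harness file referenced above is essentially an entry point that invokes pytest on the cluster. A minimal sketch of what such a harness might look like; the actual remote_test_harness/pytest_databricks.py in this repo may differ:

```python
# Minimal sketch of a pytest entry point run on a Databricks cluster via the
# VS Code extension. The real remote_test_harness/pytest_databricks.py may differ.
import sys
import pytest

# Avoid stale .pyc files between repeated runs on the cluster.
sys.dont_write_bytecode = True

# Run the suite; disable the cache plugin since the workspace filesystem is remote.
exit_code = pytest.main(["-v", "-p", "no:cacheprovider", "tests"])

if exit_code != pytest.ExitCode.OK:
    raise RuntimeError(f"pytest failed with exit code {exit_code}")
```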
Project details
Download files
Source Distribution
- feature_store_utils-0.0.3.tar.gz (11.0 kB)
Built Distribution
- feature_store_utils-0.0.3-py3-none-any.whl (13.6 kB)
File details
Details for the file feature_store_utils-0.0.3.tar.gz.
File metadata
- Download URL: feature_store_utils-0.0.3.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.11
File hashes
Algorithm | Hash digest
---|---
SHA256 | bf5652b44c98009fd70de674a2c6749544fcf2ece77c4540aac0b9b239c378f0
MD5 | e53ff1b19d22b0cee799c5214c457ca5
BLAKE2b-256 | 88004e3fbeab07724c0b3b5ab9b5ae484d4a2f80aded68375602df928a4c2ba2
File details
Details for the file feature_store_utils-0.0.3-py3-none-any.whl.
File metadata
- Download URL: feature_store_utils-0.0.3-py3-none-any.whl
- Upload date:
- Size: 13.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.11
File hashes
Algorithm | Hash digest
---|---
SHA256 | 233f1be9002c540da4e2b81087642b25e2932c6b62fe1213697f85017499b5cf
MD5 | a5909be9289db57af6e5945ea5668ccf
BLAKE2b-256 | 838c44b534f15ad944ef7d5f61589141981c6e81e6c2975add9927f4a42e5515