A utility to generate ML features from YAML
feature store utils
A lightweight package that allows you to express ML features in simple YAML, build a training dataset from them, and then write the features to a feature store.
some general thoughts on building a training dataset
https://docs.google.com/presentation/d/1tVkrwCLVwFp8cZC7CmAHSNFhsJrcTdC20MlZfptkSBE/edit?usp=sharing
options for use
- Clone this repo, create features.yaml, and follow the demo notebook. Do not check your changes back in.
- Install as a Python package. See https://github.com/BenMacKenzie/churn_model_demo for an example. Note that you must create a .env file in the folder that contains the features.yaml file.
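The folder layout for the package install can be checked with a short script. This is only a sketch: `check_project_dir` is a hypothetical helper, not part of this package, and it only verifies that the two files sit side by side. It does not know which variables the .env file has to define.

```python
from pathlib import Path


def check_project_dir(project_dir):
    """Return the path to features.yaml, or raise if the folder is incomplete.

    Hypothetical helper for illustration; it is not part of feature_store_utils.
    """
    project_dir = Path(project_dir)
    features_file = project_dir / "features.yaml"
    env_file = project_dir / ".env"
    if not features_file.is_file():
        raise FileNotFoundError(f"no features.yaml in {project_dir}")
    if not env_file.is_file():
        # The .env file has to live in the same folder as features.yaml.
        raise FileNotFoundError(f".env must sit next to {features_file}")
    return features_file
```

Calling `check_project_dir(".")` from the project folder before running the pipeline gives an early, readable error if either file is missing.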
Notes
- The current version is experimental. It is not clear that Jinja is the right way to write parameterized SQL; it might be better to build the SQL in Python.
- The current version is not optimized. Each feature is calculated individually, whereas multiple aggregation features could be calculated in a single query when their table, filters and time windows are identical.
- I believe there are around a dozen standard feature types. The most common have been implemented, and views can fill many of the gaps when they are encountered. Still missing:
  - Type 1 lookup.
  - 1st-order aggregations over time series (e.g., just treat the series like a fact table).
  - 2nd-order aggregations over time series, e.g., max monthly job DBU over a 6-month window.
  - Time in state, e.g., how long a ticket was open. Based on a type 2 table.
  - Time to event in a fact table, e.g., time since the last call to customer support.
  - Scalar functions of two or more features, e.g., time in days between two dates.
  - Number of state changes over an interval (rare).
  - Functions of features (e.g., ratio of growth in job DBU to interactive DBU). Arguably this is not needed for boosted trees. It might be useful for neural nets, but why use a neural net on heterogeneous data? (Actually, this kind of feature can be good for model explainability.)
- Need to illustrate adding features from a related dimension table using a foreign key (the machinery is in place to do so).
- The current version illustrates creating a pipeline that uses the API. It would be nice to instead generate the code and write it to a notebook, so that the package is invisible in production (like bamboolib).
- The demo repo (https://github.com/BenMacKenzie/churn_model_demo) illustrates 'hyper-features', which are features with variable parameters.
- Connecting 'hyper-features' to the feature store needs to be worked out. Currently the options are to add all of them or to specify individual versions by their (generated) names.
- Fix the observation dates used when generating features for the feature store. They should align with the grain of the feature, e.g., if the grain is monthly, make sure the feature store contains an observation on the first of each month.
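The optimization mentioned in the notes (one query per group of features that share a table, filter and time window) could look roughly like the sketch below. It builds the SQL in plain Python rather than Jinja. Everything here is an assumption made for illustration: the `AggFeature` fields are not the package's actual YAML schema, the `observations` table and the `entity_id`, `observation_date` and `event_date` columns are placeholder names, and `date_sub` assumes Spark SQL.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AggFeature:
    # Hypothetical feature spec; field names are illustrative only.
    name: str
    table: str
    column: str
    func: str         # SQL aggregate, e.g. "sum", "avg", "max"
    filter: str       # SQL predicate, or "" for no filter
    window_days: int  # look-back window ending at the observation date


def group_features(features):
    """Group features that share table, filter and time window."""
    groups = defaultdict(list)
    for f in features:
        groups[(f.table, f.filter, f.window_days)].append(f)
    return dict(groups)


def build_query(table, filter_, window_days, features):
    """Build one SQL statement that computes every feature in a group."""
    select_list = ",\n  ".join(
        f"{f.func}(t.{f.column}) AS {f.name}" for f in features
    )
    # The filter goes into the join condition so that observations
    # without matching rows are kept (with NULL feature values).
    extra = f"\n  AND ({filter_})" if filter_ else ""
    return (
        f"SELECT o.entity_id, o.observation_date,\n  {select_list}\n"
        f"FROM observations o\n"
        f"LEFT JOIN {table} t\n"
        f"  ON t.entity_id = o.entity_id\n"
        f"  AND t.event_date < o.observation_date\n"
        f"  AND t.event_date >= date_sub(o.observation_date, {window_days})"
        f"{extra}\n"
        f"GROUP BY o.entity_id, o.observation_date"
    )


features = [
    AggFeature("total_dbu_90d", "usage", "dbu", "sum", "", 90),
    AggFeature("max_dbu_90d", "usage", "dbu", "max", "", 90),
    AggFeature("total_dbu_30d", "usage", "dbu", "sum", "", 30),
]
queries = [
    build_query(*key, feats) for key, feats in group_features(features).items()
]
```

Here the three features produce two queries instead of three, because the two 90-day features share a table, filter and window and are computed in the same `SELECT`.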
Building
python3 -m build
python3 -m twine upload --repository testpypi dist/*
python3 -m twine upload dist/*
Running unit tests on databricks
- Install the Databricks extension for VS Code.
- Use this repo as a template. Note the following:
  - remote_test_harness/pytest_databricks.py
  - .vscode/launch.json
- Write tests as usual (see tests/time_series/time_series_test.py as an example).
Download files
Source Distribution
feature_store_utils-0.0.5.tar.gz (11.1 kB)
Built Distribution
feature_store_utils-0.0.5-py3-none-any.whl (13.6 kB)
File details
Details for the file feature_store_utils-0.0.5.tar.gz.
File metadata
- Download URL: feature_store_utils-0.0.5.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0aabb3d6bb266a8f3b322f16360e708877aeb6870ebf43fdf8574ca899aeee9d
MD5 | 82949bb708321ef8ef74b2a4b8c3eb84
BLAKE2b-256 | 2eeb3e3ca73608c08d3598ef3e36f74d1443dfd5cbb59ee7a8443fbcc3a94ec1
File details
Details for the file feature_store_utils-0.0.5-py3-none-any.whl.
File metadata
- Download URL: feature_store_utils-0.0.5-py3-none-any.whl
- Upload date:
- Size: 13.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | b4234392cf4632d44eedf74dae3c505be40273efae4be8198bcbf6deb9ca7611
MD5 | de2aa9fa04b0213505847d26c354a87e
BLAKE2b-256 | 7982a4d378e056255fbe0a22156c6c2da3444bdbe40964cc616f0eefd3079daa