A package for machine learning with tabular data
TabML: a Machine Learning pipeline for tabular data
Introduction
This is an active project that aims to create a general machine learning framework for working with tabular data.
Key features:
- One of the most important tasks in working with tabular data is handling feature extraction. TabML allows users to define each feature in isolation, without worrying about other features. This helps reduce code conflicts when multiple team members develop different features simultaneously. In addition, if one feature needs to be updated, unrelated features can be left untouched, so the computation cost is relatively small (compared with running a pipeline to regenerate all other features).
- Parameters are specified in a config file. This config file is automatically saved into an experiment folder after each training run for reproducibility.
- Supports multiple ML packages for tabular data.
Installation
pip install tabml
Main components
In the TRAINING step:
- The FeatureManager class is responsible for loading raw data and engineering it into relevant features for model training and analysis. If a fit step, e.g. imputation, is required for a feature, the fitted parameters are stored for later use in the transform step. One such use is in the serving step, where there is only a transform step. For each project, there is one feature_manager.py file which specifies how each feature is computed (example). The computation order as well as feature dependencies are specified in a yaml config file (example).
- The DataLoader loads training and validation data for model training and analysis. In a typical project, tabml already takes care of this class; users only need to specify the configuration in the pipeline config file (example). In that file, the features and label used for training need to be specified. In addition, a set of boolean features is used as conditions for selecting training and validation data. Only rows in the dataset that meet all training/validation conditions are selected.
- The ModelWrapper class defines the model, how to train it, and other methods for loading the model and making predictions.
- The ModelAnalysis class analyzes the model on different metrics at user-defined dimensions. Analyzing metrics on different slices of the data can reveal whether the trained model is biased toward some feature value, or whether there are slices where model performance could be improved.
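The row selection and per-slice analysis described above can be illustrated with a small, library-free sketch. It does not use TabML's DataLoader or ModelAnalysis classes; the column names (is_validation, has_label, country) and the helpers select_rows and accuracy_by_slice are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Toy dataset: each row carries boolean condition columns, a slice column,
# a label and a model prediction. All names here are hypothetical.
rows = [
    {"is_validation": True, "has_label": True, "country": "A", "label": 1, "prediction": 1},
    {"is_validation": True, "has_label": True, "country": "A", "label": 0, "prediction": 1},
    {"is_validation": True, "has_label": True, "country": "B", "label": 0, "prediction": 0},
    {"is_validation": False, "has_label": True, "country": "B", "label": 1, "prediction": 0},
    {"is_validation": True, "has_label": False, "country": "B", "label": 0, "prediction": 1},
]


def select_rows(rows, conditions):
    """Keep only the rows for which every boolean condition column is True."""
    return [row for row in rows if all(row[name] for name in conditions)]


def accuracy_by_slice(rows, slice_column):
    """Compute accuracy separately for each distinct value of slice_column."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[slice_column]
        total[key] += 1
        correct[key] += int(row["label"] == row["prediction"])
    return {key: correct[key] / total[key] for key in total}


validation_rows = select_rows(rows, ["is_validation", "has_label"])
slice_accuracy = accuracy_by_slice(validation_rows, "country")
```

A large gap between slices, such as between countries A and B here, is the kind of signal that suggests where the model could be improved.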
In the SERVING step, raw data is fed into the fitted FeatureManager to get the transformed features that the trained model can use. The model then makes predictions on the transformed features.
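To make the fit/transform split concrete, here is a minimal sketch of a mean imputer whose fitted parameter is learned on training data and reused unchanged at serving time. The MeanImputer class is hypothetical and is not TabML's FeatureManager API; it only illustrates the idea of storing fitted parameters for a later transform-only step.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical feature transformer: fills missing values with the training mean."""

    def __init__(self):
        # Fitted parameters, keyed by column name, kept for later transform calls.
        self.fitted_params = {}

    def fit(self, rows, column):
        # Learn the fill value from observed (non-missing) training values only.
        observed = [row[column] for row in rows if row[column] is not None]
        self.fitted_params[column] = mean(observed)
        return self

    def transform(self, rows, column):
        # Reuse the stored fill value; nothing is re-estimated here.
        fill_value = self.fitted_params[column]
        return [
            {**row, column: fill_value if row[column] is None else row[column]}
            for row in rows
        ]


# Training: fit, then transform.
training_rows = [{"age": 20}, {"age": None}, {"age": 40}]
imputer = MeanImputer().fit(training_rows, "age")
train_features = imputer.transform(training_rows, "age")

# Serving: transform only, using the parameters fitted during training.
serving_rows = [{"age": None}]
serving_features = imputer.transform(serving_rows, "age")
```

Because serving never calls fit, a single incoming row with a missing value still receives the fill value learned from the training data.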
Examples
Please check the examples folder for several example projects. For each project:
python feature_manager.py # to generate features
python pipelines.py # to train the model
You can change some parameters in the config file, then run python pipelines.py again.
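Since the config file is saved into an experiment folder after each training run, every rerun with changed parameters leaves its own record. The sketch below shows one way such a snapshot could work; it is not TabML's implementation. The snapshot_config helper, the timestamped folder naming and the JSON config format are assumptions made to keep the example dependency-free.

```python
import json
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def snapshot_config(config_path: Path, experiments_root: Path) -> Path:
    """Copy the config used for a run into a new, timestamped experiment folder."""
    run_dir = experiments_root / datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    run_dir.mkdir(parents=True, exist_ok=False)
    saved_path = run_dir / config_path.name
    shutil.copy2(config_path, saved_path)
    return saved_path


# Usage with a throwaway directory and a hypothetical config file.
workdir = Path(tempfile.mkdtemp())
config_file = workdir / "pipeline_config.json"
config_file.write_text(json.dumps({"learning_rate": 0.1, "num_rounds": 100}))
saved = snapshot_config(config_file, workdir / "experiments")
```

Editing the original config afterwards does not affect the saved copy, so the parameters of each past run stay recoverable.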
In most projects, users only need to focus their efforts on designing features. The feature dependency is defined in a yaml config file and the feature implementation is stored in feature_manager.py.
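As an illustration of how declared feature dependencies can determine both the computation order and which features need recomputing after a change, here is a sketch using the standard library's graphlib module (Python 3.9+). The feature names, the dictionary format and both helper functions are hypothetical; TabML's actual yaml schema and scheduling logic may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency declaration: feature name -> features it depends on.
feature_dependencies = {
    "age": set(),
    "income": set(),
    "age_bucket": {"age"},
    "income_per_age_bucket": {"income", "age_bucket"},
}


def computation_order(dependencies):
    """Return an order in which every feature comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def features_to_recompute(dependencies, changed):
    """Return the changed feature plus everything that depends on it, directly or not."""
    affected = {changed}
    # Walking in dependency order ensures a feature is checked only after
    # all of its dependencies have been classified.
    for name in computation_order(dependencies):
        if dependencies[name] & affected:
            affected.add(name)
    return affected


order = computation_order(feature_dependencies)
stale_after_age_change = features_to_recompute(feature_dependencies, "age")
```

In this toy graph, updating age leaves income untouched, which matches the idea that unrelated features do not need to be regenerated.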
Setup for development
Add path to this repo
Add the following lines to your shell config file (~/.bashrc, ~/.zshrc or any shell config file of your choice):
export TABML=<local_path_to_this_git_repo>
alias 2tabml='cd $TABML; source bashrc; source tabml_env/bin/activate; python3 setup.py install'
Create the environment
cd $TABML
python3 -m venv tabml_env
source tabml_env/bin/activate
pip3 install -r requirements.txt
Setup pre-commit to auto format code when creating a git commit:
pre-commit install
Check that everything is working by running the tests
2tabml
python3 -m pytest ./tests ./examples
Author's notes
How to release a new version
- Increase version in setup.py as in this PR example.
- Generate tar file:
python setup.py sdist
- Upload tar file:
twine upload dist/tabml-x.x.xx.tar.gz
Common errors
- SHAP
SHAP might not work on macOS if the Xcode version is < 13; try upgrading to Xcode 13. Related issue.
- LightGBM
pip install lightgbm might not work on macOS; try following the official installation guide for Mac.
If you find a bug or want to request a feature, feel free to create an issue. Any Pull Request would be much appreciated.