Pipelines and primitives for machine learning and data science.
An Open Source Project from the Data to AI Lab, at MIT
MLPrimitives
- Documentation: https://MLBazaar.github.io/MLPrimitives
- Github: https://github.com/MLBazaar/MLPrimitives
- License: MIT
- Development Status: Pre-Alpha
Overview
This repository contains primitive annotations to be used by the MLBlocks library, as well as the necessary Python code to make some of them fully compatible with the MLBlocks API requirements.
There is also a collection of custom primitives contributed directly to this library, which either combine third party tools or implement new functionalities from scratch.
Why did we create this library?
- Too many libraries in a fast growing field
- Huge societal need to build machine learning apps
- Domain expertise resides at several places (knowledge of math)
- No documented information about hyperparameters, behavior...
Installation
Requirements
MLPrimitives has been developed and tested on Python 3.8, 3.9, 3.10, and 3.11.
Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where MLPrimitives is run.
Install with pip
The easiest and recommended way to install MLPrimitives is using pip:
pip install mlprimitives
This will pull and install the latest stable release from PyPI.
If you want to install from source or contribute to the project please read the Contributing Guide.
Quickstart
This section is a short series of tutorials to help you get started with MLPrimitives.
In the following steps you will learn how to load and run a primitive on some data.
Later on you will learn how to evaluate and improve the performance of a primitive by tuning its hyperparameters.
Running a Primitive
In this first tutorial, we will be executing a single primitive for data transformation.
1. Load a Primitive
The first step in order to run a primitive is to load it.
This will be done using the mlprimitives.load_primitive function, which will load the indicated primitive as an MLBlock object from MLBlocks.
In this case, we will load the mlprimitives.custom.feature_extraction.CategoricalEncoder primitive.
from mlprimitives import load_primitive
primitive = load_primitive('mlprimitives.custom.feature_extraction.CategoricalEncoder')
2. Load some data
The CategoricalEncoder is a transformation primitive which applies one-hot encoding to all the categorical columns of a pandas.DataFrame.
So, in order to be able to run our primitive, we will first load some data that contains categorical columns.
This can be done with the mlprimitives.datasets.load_census function:
from mlprimitives.datasets import load_census
dataset = load_census()
This dataset object has an attribute data which contains a table with several categorical columns.
We can have a look at this table by executing dataset.data.head(), which will return a table like this:
0 1 2
age 39 50 38
workclass State-gov Self-emp-not-inc Private
fnlwgt 77516 83311 215646
education Bachelors Bachelors HS-grad
education-num 13 13 9
marital-status Never-married Married-civ-spouse Divorced
occupation Adm-clerical Exec-managerial Handlers-cleaners
relationship Not-in-family Husband Not-in-family
race White White White
sex Male Male Male
capital-gain 2174 0 0
capital-loss 0 0 0
hours-per-week 40 13 40
native-country United-States United-States United-States
3. Fit the primitive
In order to run our primitive, we first need to fit it.
This is the process where it analyzes the data to detect which columns are categorical.
This is done by calling its fit method and passing the dataset.data as X.
primitive.fit(X=dataset.data)
4. Produce results
Once the primitive is fit, we can process the data by calling the produce method of the primitive instance and passing the data again as X.
transformed = primitive.produce(X=dataset.data)
After this is done, we can see how the transformed data contains the newly generated one-hot vectors:
0 1 2 3 4
age 39 50 38 53 28
fnlwgt 77516 83311 215646 234721 338409
education-num 13 13 9 7 13
capital-gain 2174 0 0 0 0
capital-loss 0 0 0 0 0
hours-per-week 40 13 40 40 40
workclass= Private 0 0 1 1 1
workclass= Self-emp-not-inc 0 1 0 0 0
workclass= Local-gov 0 0 0 0 0
workclass= ? 0 0 0 0 0
workclass= State-gov 1 0 0 0 0
workclass= Self-emp-inc 0 0 0 0 0
... ... ... ... ... ...
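To make the fit/produce split concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not the CategoricalEncoder implementation: the helper names fit_categories and one_hot are hypothetical, the rows are plain dictionaries instead of a pandas.DataFrame, and a column is treated as categorical simply because it holds strings.

```python
def fit_categories(rows):
    """Learn, for every string-valued column, the distinct values it takes."""
    categories = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str):
                seen = categories.setdefault(column, [])
                if value not in seen:
                    seen.append(value)
    return categories


def one_hot(rows, categories):
    """Replace each categorical column with one 0/1 column per learned value."""
    encoded = []
    for row in rows:
        # Keep the non-categorical columns untouched.
        new_row = {c: v for c, v in row.items() if c not in categories}
        for column, values in categories.items():
            for value in values:
                new_row[f"{column}={value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded


# Toy rows taken from the first columns of the table shown earlier.
rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Self-emp-not-inc"},
    {"age": 38, "workclass": "Private"},
]

categories = fit_categories(rows)     # the "fit" step
encoded = one_hot(rows, categories)   # the "produce" step
print(encoded[0])
```

The learned categories are kept separate from the encoding step so that the same columns can be generated later for data that was not seen during fit.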
Tuning a Primitive
In this short tutorial we will teach you how to evaluate the performance of a primitive and improve its performance by modifying its hyperparameters.
To do so, we will load a primitive that can learn from the transformed data that we just generated and later on make predictions based on new data.
1. Load another primitive
First of all, we will load the xgboost.XGBClassifier primitive that we will use afterwards.
primitive = load_primitive('xgboost.XGBClassifier')
2. Split the dataset
Before being able to evaluate the primitive performance, we need to split the data in two parts: train, which will be used for the primitive to learn, and test, which will be used to make the predictions that later on will be evaluated.
In order to do this, we will get the first 75% of rows from the transformed data that we obtained above and call it X_train, and then set the remaining 25% of rows as X_test.
train_size = int(len(transformed) * 0.75)
X_train = transformed.iloc[:train_size]
X_test = transformed.iloc[train_size:]
Similarly, we need to obtain the y_train and y_test variables containing the corresponding output values.
y_train = dataset.target[:train_size]
y_test = dataset.target[train_size:]
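The positional split above can be illustrated on a plain list. This is only a sketch: positional_split is a hypothetical helper, not part of MLPrimitives, and it shows how int() truncates the train size when the row count is not a multiple of four.

```python
def positional_split(items, fraction=0.75):
    """Split a sequence by position: the first part for train, the rest for test."""
    train_size = int(len(items) * fraction)  # int() truncates towards zero
    return items[:train_size], items[train_size:]


rows = list(range(10))
train, test = positional_split(rows)

print(len(train), len(test))
```

With 10 rows, 10 * 0.75 is 7.5, so 7 rows go to train and the remaining 3 go to test. Because the split is positional, the rows are not shuffled.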
3. Fit the new primitive
Once we have split the data, we can fit the primitive by passing X_train and y_train to its fit method.
primitive.fit(X=X_train, y=y_train)
4. Make predictions
Once the primitive has been fitted, we can produce predictions using the X_test data as input.
predictions = primitive.produce(X=X_test)
5. Evaluate the performance
We can now evaluate how good the predictions from our primitive are by using the score method from the dataset object on both the expected output and the real output from the primitive:
dataset.score(y_test, predictions)
This will output a float value between 0 and 1 indicating how good the predictions are, with 0 being the worst possible score and 1 the best.
In this case we will obtain a score around 0.866.
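As an illustration of a score in the 0 to 1 range, the sketch below computes plain accuracy, the fraction of predictions that match the expected values. That dataset.score uses accuracy for this dataset is an assumption made here, not something stated above, and the accuracy function is a hypothetical helper.

```python
def accuracy(expected, predicted):
    """Return the fraction of positions where predicted equals expected."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot score empty inputs")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


# Toy labels, not taken from the census dataset.
y_true = [0, 1, 0, 0]
y_pred = [0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 3 of 4 predictions match -> 0.75
```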
6. Set new hyperparameter values
In order to improve the performance of our primitive we will try to modify a couple of its hyperparameters.
First we will see which hyperparameter values the primitive has by calling its get_hyperparameters method.
primitive.get_hyperparameters()
which will return a dictionary like this:
{
"n_jobs": -1,
"n_estimators": 100,
"max_depth": 3,
"learning_rate": 0.1,
"gamma": 0,
"min_child_weight": 1
}
Next, we will see which values are valid for each of those hyperparameters by calling its get_tunable_hyperparameters method:
primitive.get_tunable_hyperparameters()
For example, we will see that the max_depth hyperparameter has the following specification:
{
"type": "int",
"default": 3,
"range": [
3,
10
]
}
Next, we will choose a valid value, for example 7, and set it into the primitive using the set_hyperparameters method:
primitive.set_hyperparameters({'max_depth': 7})
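Before setting a value, it can be checked against a specification such as the one shown for max_depth. The sketch below is a hypothetical helper, not part of MLPrimitives or MLBlocks. It assumes that both ends of range are inclusive and only handles the "int" and "float" types.

```python
def is_valid_value(spec, value):
    """Check a candidate value against a tunable hyperparameter specification.

    Assumes spec["range"] is [low, high] with both bounds inclusive.
    """
    kind = spec.get("type")
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        return False
    if kind == "int":
        if not isinstance(value, int):
            return False
    elif kind == "float":
        if not isinstance(value, (int, float)):
            return False
    else:
        raise ValueError(f"unsupported hyperparameter type: {kind!r}")
    low, high = spec["range"]
    return low <= value <= high


# The max_depth specification shown above.
max_depth_spec = {"type": "int", "default": 3, "range": [3, 10]}

print(is_valid_value(max_depth_spec, 7))   # inside the range
print(is_valid_value(max_depth_spec, 11))  # above the upper bound
```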
7. Re-evaluate the performance
Once the new hyperparameter value has been set, we repeat the fit/produce/score cycle to evaluate the performance of this new hyperparameter value:
primitive.fit(X=X_train, y=y_train)
predictions = primitive.produce(X=X_test)
dataset.score(y_test, predictions)
This time we should see that the performance has improved over the previous score of around 0.866.
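The set/fit/produce/score cycle can be repeated for several candidate values in a loop. The sketch below shows the idea with a toy stand-in: ThresholdPrimitive, evaluate_values and accuracy are hypothetical names introduced for illustration, and only the method names set_hyperparameters, fit and produce mirror the ones used in this tutorial.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the prediction equals the expected value."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


class ThresholdPrimitive:
    """Toy stand-in exposing the method names used in this tutorial."""

    def __init__(self):
        self._hyperparameters = {"threshold": 2}

    def set_hyperparameters(self, hyperparameters):
        self._hyperparameters.update(hyperparameters)

    def fit(self, X, y):
        # Nothing to learn in this toy example.
        pass

    def produce(self, X):
        threshold = self._hyperparameters["threshold"]
        return [int(x >= threshold) for x in X]


def evaluate_values(primitive, name, values, X_train, y_train, X_test, y_test, score):
    """Run the set/fit/produce/score cycle once per candidate value."""
    scores = {}
    for value in values:
        primitive.set_hyperparameters({name: value})
        primitive.fit(X=X_train, y=y_train)
        predictions = primitive.produce(X=X_test)
        scores[value] = score(y_test, predictions)
    return scores


scores = evaluate_values(
    ThresholdPrimitive(), "threshold", [2, 3, 4],
    X_train=[1, 4], y_train=[0, 1],
    X_test=[1, 2, 3, 4], y_test=[0, 0, 1, 1],
    score=accuracy,
)
best_value = max(scores, key=scores.get)
print(scores, best_value)
```

With a real primitive, the same loop would take the primitive loaded with load_primitive, the X_train/X_test and y_train/y_test variables, and dataset.score as the scoring function.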
What's Next?
Do you want to learn more about the project, find out how to contribute to it, or browse the API Reference? Please check the corresponding sections of the documentation!
History
0.4.1 - 2024-11-15
Primitive Improvements
- SimpleImputer primitive update – Issue #280 by @sarahmish
0.4.0 - 2024-03-22
General Improvements
- Upgrade python versions 3.9, 3.10, and 3.11 - Issue #279 by @sarahmish
- Adapt to statsmodels.tsa.arima_model.ARIMA deprecation - Issue #253 by @sarahmish
0.3.5 - 2023-04-14
General Improvements
- Update mlblocks cap - Issue #278 by @sarahmish
0.3.4 - 2023-01-24
General Improvements
- Update mlblocks cap - Issue #277 by @sarahmish
0.3.3 - 2023-01-20
General Improvements
- Update dependencies - Issue #276 by @sarahmish
Adapter Improvements
- Building model within fit in keras adapter - Issue #267 by @sarahmish
0.3.2 - 2021-11-09
Adapter Improvements
- Inferring data shapes with single dimension for keras adapter - Issue #265 by @sarahmish
0.3.1 - 2021-10-07
Adapter Improvements
- Dynamic target_shape in keras adapter - Issue #263 by @sarahmish
- Save keras primitives in Windows environment - Issue #261 by @sarahmish
General Improvements
- Update TensorFlow and NumPy dependency - Issue #259 by @sarahmish
0.3.0 - 2021-01-09
New Primitives
- Add primitive sklearn.naive_bayes.GaussianNB - Issue #242 by @sarahmish
- Add primitive sklearn.linear_model.SGDClassifier - Issue #241 by @sarahmish
Primitive Improvements
- Add offset to rolling_window_sequence primitive - Issue #251 by @skyeeiskowitz
- Rename the time_index column to time - Issue #252 by @pvk-developer
- Update featuretools dependency - Issue #250 by @pvk-developer
General Improvements
- Update dependencies and add python3.8 - Issue #246 by @csala
- Drop Python35 - Issue #244 by @csala
0.2.5 - 2020-07-29
Primitive Improvements
- Accept timedelta window_size in cutoff_window_sequences - Issue #239 by @joanvaquer
Bug Fixes
- ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via pip install tensorflow - Issue #237 by @joanvaquer
New Primitives
- Add pandas.DataFrame.set_index primitive - Issue #222 by @JDTheRipperPC
0.2.4 - 2020-01-30
New Primitives
- Add RangeScaler and RangeUnscaler primitives - Issue #232 by @csala
Primitive Improvements
- Extract input_shape from X in keras.Sequential - Issue #223 by @csala
Bug Fixes
- mlprimitives.custom.text.TextCleaner fails if text is empty - Issue #228 by @csala
- Error when loading the reviews dataset - Issue #230 by @csala
- Curate dependencies: specify an explicit prompt-toolkit version range - Issue #224 by @csala
0.2.3 - 2019-11-14
New Primitives
- Add primitive to make window_sequences based on cutoff times - Issue #217 by @csala
- Create a keras LSTM based TimeSeriesClassifier primitive - Issue #218 by @csala
- Add pandas DataFrame primitives - Issue #214 by @csala
- Add featuretools.EntitySet.normalize_entity primitive - Issue #209 by @csala
Primitive Improvements
- Make featuretools.EntitySet.entity_from_dataframe entityset arg optional - Issue #208 by @csala
- Add text regression dataset - Issue #206 by @csala
Bug Fixes
- pandas.DataFrame.resample crash when grouping by integer columns - Issue #211 by @csala
0.2.2 - 2019-10-08
New Primitives
- Add primitives for GAN based time-series anomaly detection - Issue #200 by @AlexanderGeiger
- Add numpy.reshape and numpy.ravel primitives - Issue #197 by @AlexanderGeiger
- Add feature selection primitive based on Lasso - Issue #194 by @csala
Primitive Improvements
- feature_extraction.CategoricalEncoder support dtype category - Issue #196 by @csala
0.2.1 - 2019-09-09
New Primitives
- Timeseries Intervals to Mask Primitive - Issue #186 by @AlexanderGeiger
- Add new primitive: Arima model - Issue #168 by @AlexanderGeiger
Primitive Improvements
- Curate PCA primitive hyperparameters - Issue #190 by @AlexanderGeiger
- Add option to drop rolling window sequences - Issue #186 by @AlexanderGeiger
Bug Fixes
- scikit-image==0.14.3 crashes when installed on Mac - Issue #188 by @csala
0.2.0
New Features
- Publish the pipelines as an entry_point Issue #175 by @csala
Primitive Improvements
- Improve pandas.DataFrame.resample primitive Issue #177 by @csala
- Improve feature_extractor primitives Issue #183 by @csala
- Improve find_anomalies primitive Issue #180 by @AlexanderGeiger
Bug Fixes
- Typo in the primitive keras.Sequential.LSTMTimeSeriesRegressor Issue #176 by @DanielCalvoCerezo
0.1.10
New Features
- Add function to run primitives without a pipeline Issue #43 by @csala
New Pipelines
- Add pipelines for all the MLBlocks examples Issue #162 by @csala
Primitive Improvements
- Add Early Stopping to keras.Sequential.LSTMTimeSeriesRegressor primitive Issue #156 by @csala
- Make FeatureExtractor primitives accept Numpy arrays Issue #165 by @csala
- Add window size and pruning to the timeseries_anomalies.find_anomalies primitive Issue #160 by @csala
0.1.9
New Features
- Add a single table binary classification dataset Issue #141 by @csala
New Primitives
- Add Multilayer Perceptron (MLP) primitive for binary classification Issue #140 by @Hector-hedb12
- Add primitive for Sequence classification with LSTM Issue #150 by @Hector-hedb12
- Add VGG-like convnet primitive Issue #149 by @Hector-hedb12
- Add Multilayer Perceptron (MLP) primitive for multi-class softmax classification Issue #139 by @Hector-hedb12
- Add primitive to count feature matrix columns Issue #146 by @csala
Primitive Improvements
- Add additional fit and predict arguments to keras.Sequential Issue #161 by @csala
- Add support for keras.Sequential Callbacks Issue #159 by @csala
- Add fixed hyperparam to control keras.Sequential verbosity Issue #143 by @csala
0.1.8
New Primitives
- mlprimitives.custom.timeseries_preprocessing.time_segments_average - Issue #137
New Features
- Add target_index output in timeseries_preprocessing.rolling_window_sequences - Issue #136
0.1.7
General Improvements
- Validate JSON format in make lint - Issue #133
- Add demo datasets - Issue #131
- Improve featuretools.dfs primitive - Issue #127
New Primitives
- pandas.DataFrame.resample - Issue #123
- pandas.DataFrame.unstack - Issue #124
- featuretools.EntitySet.add_relationship - Issue #126
- featuretools.EntitySet.entity_from_dataframe - Issue #126
Bug Fixes
- Bug in timeseries_anomalies.py - Issue #119
0.1.6
General Improvements
- Add Contributing Documentation
- Remove upper bound in pandas version given new release of featuretools v0.6.1
- Improve LSTMTimeSeriesRegressor hyperparameters
New Primitives
- mlprimitives.candidates.dsp.SpectralMask
- mlprimitives.custom.timeseries_anomalies.find_anomalies
- mlprimitives.custom.timeseries_anomalies.regression_errors
- mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences
- mlprimitives.custom.timeseries_preprocessing.time_segments_average
- sklearn.linear_model.ElasticNet
- sklearn.linear_model.Lars
- sklearn.linear_model.Lasso
- sklearn.linear_model.MultiTaskLasso
- sklearn.linear_model.Ridge
0.1.5
New Primitives
- sklearn.impute.SimpleImputer
- sklearn.preprocessing.MinMaxScaler
- sklearn.preprocessing.MaxAbsScaler
- sklearn.preprocessing.RobustScaler
- sklearn.linear_model.LinearRegression
General Improvements
- Separate curated from candidate primitives
- Setup entry_points in setup.py to improve compatibility with MLBlocks
- Add a test-pipelines command to test all the existing pipelines
- Clean sklearn example pipelines
- Change the author entry to a contributors list
- Change the name of mlblocks_primitives folder
- Pip install requirements_dev.txt fail documentation
Bug Fixes
- Fix LSTMTimeSeriesRegressor primitive. Issue #90
- Fix timeseries primitives. Issue #91
- Negative index anomalies in timeseries_errors. Issue #89
- Keep pandas version below 0.24.0. Issue #87
0.1.4
New Primitives
- mlprimitives.timeseries primitives for timeseries data preprocessing
- mlprimitives.timeseries_error primitives for timeseries anomaly detection
- keras.Sequential.LSTMTimeSeriesRegressor
- sklearn.neighbors.KNeighbors Classifier and Regressor
- several sklearn.decomposition primitives
- several sklearn.ensemble primitives
Bug Fixes
- Fix typo in mlprimitives.text.TextCleaner primitive
- Fix bug in index handling in featuretools.dfs primitive
- Fix bug in SingleLayerCNNImageClassifier annotation
- Remove old validation tags from JSON annotations
0.1.3
New Features
- Fix and re-enable featuretools.dfs primitive.
0.1.2
New Features
- Add pipeline specification language and Evaluation utilities.
- Add pipelines for graph, text and tabular problems.
- New primitives ClassEncoder and ClassDecoder
- New primitives UniqueCounter and VocabularyCounter
Bug Fixes
- Fix TrivialPredictor bug when working with numpy arrays
- Change XGB default learning rate and number of estimators
0.1.1
New Features
- Add more keras.applications primitives.
- Add a Text Cleanup primitive.
Bug Fixes
- Add keywords to keras.preprocessing primitives.
- Fix the image_transform method.
- Add epoch as a fixed hyperparameter for keras.Sequential primitives.
0.1.0
- First release on PyPI.