Pipelines and primitives for machine learning and data science.
MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any Python library through a simple, common and uniform interface.
- Free software: MIT license
- Documentation: https://HDI-Project.github.io/MLBlocks
Installation
The simplest and recommended way to install MLBlocks is using pip:
pip install mlblocks
Alternatively, you can clone the repository and install it from source:
git clone git@github.com:HDI-Project/MLBlocks.git
cd MLBlocks
pip install -e .
Usage Example
Below is a short example of how to use MLBlocks to create a simple pipeline, fit it using demo data and use it to make predictions.
For advanced usage and a more detailed explanation of each component, please have a look at the documentation.
Additional Libraries
In order to be able to execute the given code snippets, you will need to install a couple of additional libraries, which you can do by running:
pip install mlblocks[demo]
Creating a pipeline
With MLBlocks, creating a pipeline is as simple as specifying a list of primitives and passing them to the MLPipeline class:
>>> from mlblocks import MLPipeline
>>> primitives = [
...     'sklearn.preprocessing.StandardScaler',
...     'xgboost.XGBClassifier'
... ]
>>> pipeline = MLPipeline(primitives)
Optionally, specific hyperparameters can also be set by passing them in a dictionary keyed by primitive name:
>>> hyperparameters = {
...     'xgboost.XGBClassifier': {
...         'learning_rate': 0.1
...     }
... }
>>> pipeline = MLPipeline(primitives, hyperparameters)
If you want to see which hyperparameters a particular pipeline is using, you can do so by calling its get_hyperparameters method:
>>> import json
>>> hyperparameters = pipeline.get_hyperparameters()
>>> print(json.dumps(hyperparameters, indent=4))
{
    "sklearn.preprocessing.StandardScaler#1": {
        "with_mean": true,
        "with_std": true
    },
    "xgboost.XGBClassifier#1": {
        "n_jobs": -1,
        "learning_rate": 0.1,
        "n_estimators": 10,
        "max_depth": 3,
        "gamma": 0,
        "min_child_weight": 1
    }
}
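Note that the keys in this output carry a `#1` suffix, while the hyperparameters were passed in keyed by the plain primitive name. The following is a minimal sketch of how such a mapping could work, not MLBlocks' actual implementation: the helper names `name_blocks` and `merge_hyperparameters` are hypothetical, the default values are placeholders, and the assumption that a repeated primitive would be numbered `#2`, `#3`, and so on is inferred from the output above rather than confirmed by it.

```python
from collections import Counter


def name_blocks(primitives):
    """Return one unique block name per primitive, in pipeline order.

    Hypothetical helper: each primitive gets a per-name counter suffix,
    so a primitive used twice would yield '<name>#1' and '<name>#2'.
    """
    counts = Counter()
    names = []
    for primitive in primitives:
        counts[primitive] += 1
        names.append(f"{primitive}#{counts[primitive]}")
    return names


def merge_hyperparameters(primitives, defaults, overrides):
    """Build {block_name: hyperparameters}; overrides win over defaults."""
    merged = {}
    for primitive, block_name in zip(primitives, name_blocks(primitives)):
        values = dict(defaults.get(primitive, {}))   # copy, do not mutate defaults
        values.update(overrides.get(primitive, {}))  # user-supplied values take priority
        merged[block_name] = values
    return merged


primitives = [
    "sklearn.preprocessing.StandardScaler",
    "xgboost.XGBClassifier",
]
# Placeholder defaults for illustration only, not the real primitive defaults.
defaults = {
    "sklearn.preprocessing.StandardScaler": {"with_mean": True, "with_std": True},
    "xgboost.XGBClassifier": {"learning_rate": 0.5, "n_estimators": 10},
}
overrides = {"xgboost.XGBClassifier": {"learning_rate": 0.1}}

merged = merge_hyperparameters(primitives, defaults, overrides)
```

In this sketch, values that are not overridden keep their defaults, which mirrors how the output above lists more hyperparameters than the single `learning_rate` that was set explicitly.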
Making predictions
Once we have created the pipeline with the desired hyperparameters we can fit it and then use it to make predictions on new data.
To do this, we first call the fit method, passing the training data and the corresponding labels.
>>> from mlblocks.datasets import load_iris
>>> dataset = load_iris()
>>> pipeline.fit(dataset.train_data, dataset.train_target)
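Conceptually, fitting a pipeline of this kind means fitting each step in order and passing the transformed data on to the next step. The sketch below illustrates that idea with pure Python and two toy steps. All class and method names here (`Standardize`, `ThresholdClassifier`, `ToyPipeline`, `produce`) are hypothetical and chosen for illustration; this is not how MLPipeline is implemented.

```python
class Standardize:
    """Toy preprocessing step: scale a list of numbers to zero mean, unit variance."""

    def fit(self, X, y=None):
        n = len(X)
        self.mean = sum(X) / n
        variance = sum((x - self.mean) ** 2 for x in X) / n
        self.std = variance ** 0.5 or 1.0  # avoid dividing by zero on constant data

    def produce(self, X):
        return [(x - self.mean) / self.std for x in X]


class ThresholdClassifier:
    """Toy estimator: predict 1 above the midpoint between the two class means."""

    def fit(self, X, y):
        positives = [x for x, label in zip(X, y) if label == 1]
        negatives = [x for x, label in zip(X, y) if label == 0]
        self.threshold = (
            sum(positives) / len(positives) + sum(negatives) / len(negatives)
        ) / 2

    def produce(self, X):
        return [1 if x > self.threshold else 0 for x in X]


class ToyPipeline:
    """Chain steps: every step but the last transforms data for the next one."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            step.fit(X, y)
            X = step.produce(X)  # feed transformed data forward
        self.steps[-1].fit(X, y)

    def predict(self, X):
        for step in self.steps:
            X = step.produce(X)
        return X


toy = ToyPipeline([Standardize(), ThresholdClassifier()])
toy.fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
toy_predictions = toy.predict([1.5, 8.5])
```

The point of the sketch is only the ordering: the scaler learns its statistics from the training data, and the classifier is then fitted on the scaled values, so the same transformations are replayed at prediction time.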
Once we have fitted our model to our data, we can call the predict method, passing new data to obtain predictions from the pipeline.
>>> predictions = pipeline.predict(dataset.test_data)
>>> predictions
array([2, 0, 1, 0, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 2, 1, 1, 0, 1,
0, 2, 0, 1, 0, 0, 1, 0, 1, 1, 1, 2, 2, 1, 2, 2])
>>> dataset.score(dataset.test_target, predictions)
0.9736842105263158
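The predictions array above has 38 entries, and the reported score is equal to 37/38 (about 0.9737), which is consistent with plain classification accuracy: the fraction of predictions that match the expected labels. Assuming that this is what the score call computes for this dataset, the arithmetic can be sketched as follows. The `accuracy` function and the sample labels are illustrative, not part of MLBlocks.

```python
def accuracy(expected, predicted):
    """Return the fraction of positions where predicted equals expected."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)


# Illustrative labels: 38 samples, of which exactly one is misclassified.
expected_labels = [0] * 38
predicted_labels = [0] * 37 + [1]

accuracy_score = accuracy(expected_labels, predicted_labels)
```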
History
In its first iteration in 2015, MLBlocks was designed only for multi-table, multi-entity temporal data. A good reference for our design rationale at that time is Bryan Collazo’s thesis:
- Machine learning blocks. Bryan Collazo. Masters thesis, MIT EECS, 2015.
With the recent availability of a multitude of libraries and tools, we decided it was time to integrate them and expand the library to address other data types (images, text, graphs and time series) and to integrate with deep learning libraries.
Changelog
0.2.2 - MLPipeline Load/Save
- Implement save and load methods for MLPipelines
- Add more datasets
0.2.1 - New Documentation
- Add mlblocks.datasets module with demo data download functions.
- Extensive documentation, including multiple pipeline examples.
0.2.0 - New MLBlocks API
A new MLBlocks API and Primitive format.
This is a summary of the changes:
- Primitive JSONs and Python code have been moved to a different repository, called MLPrimitives.
- Optional usage of multiple JSON primitive folders.
- JSON format has been changed to allow more flexibility and features:
- input and output arguments, as well as argument types, can be specified for each method
- both classes and functions are supported as primitives
- multitype and conditional hyperparameters fully supported
- data modalities and primitive classifiers introduced
- metadata such as documentation, description and author fields added
- Parsers are removed, and now the MLBlock class is responsible for loading and reading the JSON primitive.
- Multiple blocks of the same primitive are supported within the same pipeline.
- Arbitrary inputs and outputs for both pipelines and blocks are allowed.
- Shared variables during pipeline execution, usable by multiple blocks.
0.1.9 - Bugfix Release
- Disable some NetworkX functions due to incompatibilities with some types of graphs.
0.1.8 - New primitives and some improvements
- Improve the NetworkX primitives.
- Add String Vectorization and Datetime Featurization primitives.
- Refactor some Keras primitives to work with single dimension y arrays and be compatible with pickle.
- Add XGBClassifier and XGBRegressor primitives.
- Add some keras.applications pretrained networks as preprocessing primitives.
- Add helper class to allow function primitives.
0.1.7 - Nested hyperparams dicts
- Support passing hyperparams as nested dicts.
0.1.6 - Text and Graph Pipelines
- Add LSTM classifier and regressor primitives.
- Add OneHotEncoder and MultiLabelEncoder primitives.
- Add several NetworkX graph featurization primitives.
- Add community.best_partition primitive.
0.1.5 - Collaborative Filtering Pipelines
- Add LightFM primitive.
0.1.4 - Image pipelines improved
- Allow passing init_params on MLPipeline creation.
- Fix bug with MLHyperparam types and Keras.
- Rename produce_params as predict_params.
- Add SingleCNN Classifier and Regressor primitives.
- Simplify and improve Trivial Predictor
0.1.3 - Multi Table pipelines improved
- Improve RandomForest primitive ranges
- Improve DFS primitive
- Add Tree Based Feature Selection primitives
- Fix bugs in TrivialPredictor
- Improved documentation
0.1.2 - Bugfix release
- Fix bug in TrivialMedianPredictor
- Fix bug in OneHotLabelEncoder
0.1.1 - Single Table pipelines improved
- New project structure and primitives for integration into MIT-TA2.
- MIT-TA2 default pipelines and single table pipelines fully working.
0.1.0
- First release on PyPI.