Pipelines and primitives for machine learning and data science.

An Open Source Project from the Data to AI Lab at MIT.

MLBlocks

Overview

MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by combining tools from any Python library behind a simple, common and uniform interface.

Features include:

  • Build Machine Learning Pipelines that combine any Machine Learning library in Python.
  • Access a repository with hundreds of primitives and pipelines, carefully curated by Machine Learning and domain experts, that are ready to use with little to no Python code.
  • Extract machine-readable information about which hyperparameters can be tuned and within which ranges, allowing automated integration with Hyperparameter Optimization tools like BTB.
  • Build complex multi-branch pipelines and DAG configurations, with an unlimited number of inputs and outputs per primitive.
  • Easily save and load pipelines using JSON annotations.
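
To illustrate what machine-readable hyperparameter information makes possible, here is a minimal sketch that draws a random configuration from a specification dictionary. The dictionary layout (the `type`, `default`, `range` and `values` keys), the block name and the `sample_configuration` helper are assumptions made for this example, not output produced by MLBlocks. A real tuner such as BTB would replace the random sampling with a smarter search.

```python
import random

# Hypothetical specification, hand-written for this sketch (not MLBlocks output).
TUNABLE = {
    "xgboost.XGBClassifier#1": {
        "n_estimators": {"type": "int", "default": 100, "range": [10, 1000]},
        "learning_rate": {"type": "float", "default": 0.1, "range": [0.0, 1.0]},
        "booster": {"type": "str", "default": "gbtree", "values": ["gbtree", "dart"]},
    }
}


def sample_configuration(tunable, rng):
    """Draw one random value per hyperparameter, respecting its declared range."""
    config = {}
    for block, hyperparams in tunable.items():
        config[block] = {}
        for name, spec in hyperparams.items():
            if "values" in spec:
                # Categorical hyperparameter: pick one of the allowed values.
                config[block][name] = rng.choice(spec["values"])
            elif spec["type"] == "int":
                low, high = spec["range"]
                config[block][name] = rng.randint(low, high)
            else:
                low, high = spec["range"]
                config[block][name] = rng.uniform(low, high)
    return config


config = sample_configuration(TUNABLE, random.Random(0))
print(config)
```

Because the specification is plain data, the sampler needs no knowledge of the underlying library, which is the property that lets an external optimizer drive the search.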

Install

Requirements

MLBlocks has been developed and tested on Python 3.8, 3.9, 3.10, 3.11, 3.12 and 3.13.

Install with pip

The easiest and recommended way to install MLBlocks is using pip:

pip install mlblocks

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project, please read the Contributing Guide.

MLPrimitives

In order to be usable, MLBlocks requires a compatible primitives library.

The official library, which is required to follow the MLBlocks tutorial below, is MLPrimitives. You can install it with this command:

pip install mlprimitives
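
After installing, a quick way to confirm that both packages are visible to your interpreter is to look them up without importing them. The sketch below uses only the standard library; the `missing_packages` helper is a name introduced for this example.

```python
from importlib import util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if util.find_spec(name) is None]


missing = missing_packages(["mlblocks", "mlprimitives"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("MLBlocks and MLPrimitives are importable.")
```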

Quickstart

The short example below uses MLBlocks to solve the Adult Census Dataset classification problem with a pipeline that combines primitives from MLPrimitives, scikit-learn and xgboost.

import pandas as pd
from mlblocks import MLPipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Adult Census data and separate the target column.
dataset = pd.read_csv('http://mlblocks.s3.amazonaws.com/census.csv')
label = dataset.pop('label')

# Hold out a test set, preserving the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(dataset, label, stratify=label)

# Primitives run in the order listed.
primitives = [
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder',
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'mlprimitives.custom.preprocessing.ClassDecoder'
]
pipeline = MLPipeline(primitives)

# Fit the whole pipeline on the training split, then predict on the test split.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# Fraction of test rows classified correctly.
print(accuracy_score(y_test, predictions))
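
Conceptually, fitting a pipeline trains each step in order and feeds its output to the next one, and predicting replays the same chain without refitting. The toy classes below (`Scale`, `Threshold` and `ToyPipeline`) are hypothetical stand-ins written for this illustration. They borrow the fit and produce vocabulary used in the changelog but do not reproduce how MLBlocks is implemented.

```python
class Scale:
    """Toy transformer: divides every value by the maximum seen during fit."""

    def fit(self, X, y=None):
        self.max_ = max(X)

    def produce(self, X):
        return [x / self.max_ for x in X]


class Threshold:
    """Toy classifier: predicts 1 when a value exceeds the mean seen during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)

    def produce(self, X):
        return [int(x > self.mean_) for x in X]


class ToyPipeline:
    """Runs its steps in order, passing each output on as the next input."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for step in self.steps:
            step.fit(X, y)
            X = step.produce(X)

    def predict(self, X):
        for step in self.steps:
            X = step.produce(X)
        return X


pipeline = ToyPipeline([Scale(), Threshold()])
pipeline.fit([1.0, 2.0, 3.0, 4.0])
print(pipeline.predict([1.0, 4.0]))  # prints [0, 1]
```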

What's Next?

If you want to learn more about how to tune the pipeline hyperparameters, save and load the pipelines using JSON annotations or build complex multi-branched pipelines, please check our documentation site.
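
As a rough picture of what saving and loading a pipeline as JSON involves, the sketch below writes a pipeline description to disk and reads it back with the standard `json` module. The annotation layout (the `primitives` and `init_params` keys) and the `save_annotation` and `load_annotation` helpers are assumptions made for this example. In practice the save and load methods of MLPipeline mentioned in the changelog would take their place, and the documentation describes the actual format.

```python
import json
import os
import tempfile

# Hypothetical annotation, hand-written for this sketch.
annotation = {
    "primitives": [
        "sklearn.impute.SimpleImputer",
        "xgboost.XGBClassifier",
    ],
    "init_params": {
        "xgboost.XGBClassifier": {"n_estimators": 50},
    },
}


def save_annotation(annotation, path):
    """Write the pipeline description to `path` as indented JSON."""
    with open(path, "w") as f:
        json.dump(annotation, f, indent=4)


def load_annotation(path):
    """Read a pipeline description back from a JSON file."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.json")
    save_annotation(annotation, path)
    restored = load_annotation(path)

print(restored == annotation)  # prints True
```

Keeping the description as plain JSON means a pipeline can be versioned, reviewed and shared like any other text file.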

Also do not forget to have a look at the notebook tutorials!

Citing MLBlocks

If you use MLBlocks for your research, please consider citing our related papers.

For the current design of MLBlocks and its usage within the larger Machine Learning Bazaar project at the MIT Data To AI Lab, please see:

Micah J. Smith, Carles Sala, James Max Kanter, and Kalyan Veeramachaneni. "The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development." arXiv Preprint 1905.08942. 2019.

@article{smith2019mlbazaar,
  author = {Smith, Micah J. and Sala, Carles and Kanter, James Max and Veeramachaneni, Kalyan},
  title = {The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development},
  journal = {arXiv e-prints},
  year = {2019},
  eid = {arXiv:1905.08942},
  pages = {arXiv:1905.08942},
  archivePrefix = {arXiv},
  eprint = {1905.08942},
}

For the first MLBlocks version from 2015, designed only for multi-table, multi-entity temporal data, please refer to Bryan Collazo’s thesis.

With the recent availability of a multitude of libraries and tools, we decided it was time to integrate them and expand the library to address other data types (images, text, graphs and time series) and to integrate with deep learning libraries.

Changelog

0.6.1 - 2023-09-26

  • Add python 3.11 to MLBlocks - Issue #143 by @sarahmish

0.6.0 - 2023-04-14

  • Support python 3.9 and 3.10 - Issue #141 by @sarahmish

0.5.0 - 2023-01-22

  • Update numpy dependency and isolate tests - Issue #139 by @sarahmish

0.4.1 - 2021-10-08

  • Update NumPy dependency - Issue #136 by @sarahmish
  • Support dynamic inputs and outputs - Issue #134 by @pvk-developer

0.4.0 - 2021-01-09

  • Stop pipeline fitting after the last block - Issue #131 by @sarahmish
  • Add memory debug and profiling - Issue #130 by @pvk-developer
  • Update Python support - Issue #129 by @csala
  • Get execution time for each block - Issue #127 by @sarahmish
  • Allow loading a primitive or pipeline directly from the JSON path - Issue #114 by @csala
  • Pipeline Diagrams - Issue #113 by @erica-chiu
  • Get Pipeline Inputs - Issue #112 by @erica-chiu

0.3.4 - 2019-11-01

  • Ability to return intermediate context - Issue #110 by @csala
  • Support for static or class methods - Issue #107 by @csala

0.3.3 - 2019-09-09

  • Improved intermediate outputs management - Issue #105 by @csala

0.3.2 - 2019-08-12

  • Allow passing fit and produce arguments as init_params - Issue #96 by @csala
  • Support optional fit and produce args and arg defaults - Issue #95 by @csala
  • Isolate primitives from their hyperparameters dictionary - Issue #94 by @csala
  • Add functions to explore the available primitives and pipelines - Issue #90 by @csala
  • Add primitive caching - Issue #22 by @csala

0.3.1 - Pipelines Discovery

  • Support flat hyperparameter dictionaries - Issue #92 by @csala
  • Load pipelines by name and register them as entry_points - Issue #88 by @csala
  • Implement partial re-fit - Issue #61 by @csala
  • Move argument parsing to MLBlock - Issue #86 by @csala
  • Allow getting intermediate outputs - Issue #58 by @csala

0.3.0 - New Primitives Discovery

  • New primitives discovery system based on entry_points.
  • Conditional Hyperparameters filtering in MLBlock initialization.
  • Improved logging and exception reporting.

0.2.4 - New Datasets and Unit Tests

  • Add a new multi-table dataset.
  • Add Unit Tests up to 50% coverage.
  • Improve documentation.
  • Fix minor bug in newsgroups dataset.

0.2.3 - Demo Datasets

  • Add new methods to Dataset class.
  • Add documentation for the datasets module.

0.2.2 - MLPipeline Load/Save

  • Implement save and load methods for MLPipelines
  • Add more datasets

0.2.1 - New Documentation

  • Add mlblocks.datasets module with demo data download functions.
  • Extensive documentation, including multiple pipeline examples.

0.2.0 - New MLBlocks API

A new MLBlocks API and Primitive format.

This is a summary of the changes:

  • Primitive JSONs and Python code have been moved to a different repository, called MLPrimitives.
  • Optional usage of multiple JSON primitive folders.
  • JSON format has been changed to allow more flexibility and features:
    • input and output arguments, as well as argument types, can be specified for each method
    • both classes and functions are supported as primitives
    • multitype and conditional hyperparameters fully supported
    • data modalities and primitive classifiers introduced
    • metadata such as documentation, description and author fields added
  • Parsers are removed, and now the MLBlock class is responsible for loading and reading the JSON primitive.
  • Multiple blocks of the same primitive are supported within the same pipeline.
  • Arbitrary inputs and outputs for both pipelines and blocks are allowed.
  • Shared variables during pipeline execution, usable by multiple blocks.

0.1.9 - Bugfix Release

  • Disable some NetworkX functions for incompatibilities with some types of graphs.

0.1.8 - New primitives and some improvements

  • Improve the NetworkX primitives.
  • Add String Vectorization and Datetime Featurization primitives.
  • Refactor some Keras primitives to work with single dimension y arrays and be compatible with pickle.
  • Add XGBClassifier and XGBRegressor primitives.
  • Add some keras.applications pretrained networks as preprocessing primitives.
  • Add helper class to allow function primitives.

0.1.7 - Nested hyperparams dicts

  • Support passing hyperparams as nested dicts.

0.1.6 - Text and Graph Pipelines

  • Add LSTM classifier and regressor primitives.
  • Add OneHotEncoder and MultiLabelEncoder primitives.
  • Add several NetworkX graph featurization primitives.
  • Add community.best_partition primitive.

0.1.5 - Collaborative Filtering Pipelines

  • Add LightFM primitive.

0.1.4 - Image pipelines improved

  • Allow passing init_params on MLPipeline creation.
  • Fix bug with MLHyperparam types and Keras.
  • Rename produce_params as predict_params.
  • Add SingleCNN Classifier and Regressor primitives.
  • Simplify and improve Trivial Predictor

0.1.3 - Multi Table pipelines improved

  • Improve RandomForest primitive ranges
  • Improve DFS primitive
  • Add Tree Based Feature Selection primitives
  • Fix bugs in TrivialPredictor
  • Improved documentation

0.1.2 - Bugfix release

  • Fix bug in TrivialMedianPredictor
  • Fix bug in OneHotLabelEncoder

0.1.1 - Single Table pipelines improved

  • New project structure and primitives for integration into MIT-TA2.
  • MIT-TA2 default pipelines and single table pipelines fully working.

0.1.0

  • First release on PyPI.
