A library to parse PMML models into Scikit-learn estimators.
sklearn-pmml-model
A Python library that imports PMML models into all major estimator classes of the popular machine learning library scikit-learn.
Installation
The easiest way is to use pip:
$ pip install sklearn-pmml-model
Status
This library is in beta, and not all models are supported yet. The following models are currently supported:
Model | Classification | Regression | Categorical features |
---|---|---|---|
Decision Trees | ✅ | ✅ | ✅¹ |
Random Forests | ✅ | ✅ | ✅¹ |
Gradient Boosting | ✅ | ✅ | ✅¹ |
Linear Regression | ✅ | ✅ | ✅³ |
Ridge | ✅² | ✅ | ✅³ |
Lasso | ✅² | ✅ | ✅³ |
ElasticNet | ✅² | ✅ | ✅ |
Gaussian Naive Bayes | ✅ | | ✅³ |
Support Vector Machines | ✅ | ✅ | ✅³ |
¹ Categorical feature support using slightly modified internals, based on scikit-learn#12866.
² These models differ only in training characteristics; the resulting model is of the same form. Classification is supported using PMMLLogisticRegression for regression models and PMMLRidgeClassifier for general regression models.
³ By one-hot encoding categorical features automatically.
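To illustrate what the one-hot encoding mentioned in the last footnote means, here is a minimal standard-library sketch. It is not the library's internal code: the `one_hot` helper, the `column=category` naming scheme and the sample rows are all hypothetical, chosen only to show the shape of the transformation.

```python
def one_hot(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    # Sort the categories so the output column order is deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data with one continuous and one categorical feature.
rows = [
    {"age": 31, "colour": "red"},
    {"age": 45, "colour": "blue"},
    {"age": 27, "colour": "red"},
]
encoded = one_hot(rows, "colour")
print(encoded[0])
```

Each categorical value becomes its own numeric column, which is the form that linear models and support vector machines can consume directly.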
The following parts of the PMML specification are covered:
- Array (including typed variants)
- SparseArray (including typed variants)
  - Indices
  - Entries (including typed variants)
- DataDictionary
  - DataField (continuous, categorical, ordinal)
    - Value
    - Interval
- TransformationDictionary / LocalTransformations
  - DerivedField
- TreeModel
  - SimplePredicate
  - SimpleSetPredicate
- Segmentation ('majorityVote' for Random Forests, 'modelChain' and 'sum' for Gradient Boosting)
- Regression
  - RegressionTable
    - NumericPredictor
    - CategoricalPredictor
- GeneralRegressionModel (only linear models)
  - PPMatrix
    - PPCell
  - ParamMatrix
    - PCell
- NaiveBayesModel
  - BayesInputs
    - BayesInput
      - TargetValueStats
        - TargetValueStat
          - GaussianDistribution
      - PairCounts
        - TargetValueCounts
          - TargetValueCount
- SupportVectorMachineModel
  - LinearKernelType
  - PolynomialKernelType
  - RadialBasisKernelType
  - SigmoidKernelType
  - VectorDictionary
    - VectorFields
    - VectorInstance
  - SupportVectorMachine
    - SupportVectors
      - SupportVector
    - Coefficients
      - Coefficient
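To make the element names in this list more concrete, the sketch below parses a small PMML fragment with Python's standard `xml.etree.ElementTree` and reports which of the listed elements it contains. The fragment is hand-written and abbreviated for illustration (it is not schema-valid, for example it has no MiningSchema), and `LISTED` is only a subset of the names above. Structural elements such as Node are not named in the list, so they are simply not checked here.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated PMML fragment for illustration only.
PMML = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="categorical" dataType="string">
      <Value value="a"/>
      <Value value="b"/>
    </DataField>
  </DataDictionary>
  <TreeModel functionName="classification">
    <Node score="a">
      <True/>
      <Node score="b">
        <SimplePredicate field="x" operator="greaterThan" value="1.5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

# A subset of the element names from the coverage list above.
LISTED = {
    "DataDictionary", "DataField", "Value", "Interval",
    "TreeModel", "SimplePredicate", "SimpleSetPredicate",
}

root = ET.fromstring(PMML.strip())
# ElementTree prefixes tags with "{namespace}", so keep only the local name.
tags = {element.tag.split("}")[-1] for element in root.iter()}
found = sorted(tags & LISTED)
print(found)
```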
Example
A minimal working example is shown below:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn_pmml_model.ensemble import PMMLForestClassifier
# Prepare data
iris = load_iris()
X = pd.DataFrame(iris.data)
X.columns = np.array(iris.feature_names)
y = pd.Series(np.array(iris.target_names)[iris.target])
y.name = "Class"
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=123)
# Load the random forest from a PMML file, then predict and score on the test split
clf = PMMLForestClassifier(pmml="models/randomForest.pmml")
clf.predict(Xte)
clf.score(Xte, yte)
More examples can be found in the following packages: tree, ensemble, linear_model and naive_bayes.
Benchmark
Depending on the data set and model, sklearn-pmml-model is between 5 and 1,000 times faster than competing libraries, by leveraging the optimization and industry-tested robustness of sklearn. Source code for this benchmark can be found in the corresponding Jupyter notebook.
Running times (load + predict, in seconds)
Dataset | Library | Linear model | Naive Bayes | Decision tree | Random Forest | Gradient boosting |
---|---|---|---|---|---|---|
Wine | PyPMML | 0.773291 | 0.77384 | 0.777425 | 0.895204 | 0.902355 |
Wine | sklearn-pmml-model | 0.005813 | 0.006357 | 0.002693 | 0.108882 | 0.121823 |
Breast cancer | PyPMML | 3.849855 | 3.878448 | 3.83623 | 4.16358 | 4.13766 |
Breast cancer | sklearn-pmml-model | 0.015723 | 0.011278 | 0.002807 | 0.146234 | 0.044016 |
Improvement
Dataset | | Linear model | Naive Bayes | Decision tree | Random Forest | Gradient boosting |
---|---|---|---|---|---|---|
Wine | Improvement | 133× | 122× | 289× | 8× | 7× |
Breast cancer | Improvement | 245× | 344× | 1,367× | 28× | 94× |
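The improvement factors appear to be the PyPMML running time divided by the sklearn-pmml-model running time, rounded to the nearest integer. The short check below recomputes them from the running times reported above. The `TIMES` layout and the `speedups` helper are introduced here for illustration and are not taken from the benchmark notebook.

```python
# Running times (load + predict, in seconds), copied from the tables above.
MODELS = ["Linear model", "Naive Bayes", "Decision tree", "Random Forest", "Gradient boosting"]
TIMES = {
    "Wine": {
        "PyPMML": [0.773291, 0.77384, 0.777425, 0.895204, 0.902355],
        "sklearn-pmml-model": [0.005813, 0.006357, 0.002693, 0.108882, 0.121823],
    },
    "Breast cancer": {
        "PyPMML": [3.849855, 3.878448, 3.83623, 4.16358, 4.13766],
        "sklearn-pmml-model": [0.015723, 0.011278, 0.002807, 0.146234, 0.044016],
    },
}


def speedups(times):
    """Return the rounded ratio of PyPMML time to sklearn-pmml-model time per model."""
    return [
        round(slow / fast)
        for slow, fast in zip(times["PyPMML"], times["sklearn-pmml-model"])
    ]


for dataset, times in TIMES.items():
    print(dataset, dict(zip(MODELS, speedups(times))))
```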
Development
Prerequisites
Tests can be run using Py.test. Grab a local copy of the source:
$ git clone http://github.com/iamDecode/sklearn-pmml-model
$ cd sklearn-pmml-model
create a virtual environment and activate it:
$ python3 -m venv venv
$ source venv/bin/activate
and install the dependencies:
$ pip install -r requirements.txt
The final step is to build the Cython extensions:
$ python setup.py build_ext --inplace
Testing
You can execute tests with py.test by running:
$ python setup.py pytest
Contributing
Feel free to make a contribution. Please read CONTRIBUTING.md for more details.
License
This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.