Pandas Adapters For Scikit-Learn

Project description

Ami Tavory, Shahar Azulay, Tali Raveh-Sadka

This library aims for two (somewhat independent) goals:

  • providing pandas adapters for estimators conforming to the scikit-learn protocol, in particular those of scikit-learn itself

  • providing easier, and more succinct, ways of combining estimators, features, and pipelines

(You might also want to check out the excellent sklearn-pandas, which has the same aims, but takes a very different approach.)

The full documentation at Read the Docs covers these matters in detail, but the library has an extremely small interface.

TL;DR

The following short example shows the main points of the library. It is an adaptation of the scikit-learn example Concatenating multiple feature extraction methods. In this example, we build a classifier for the Iris dataset using a combination of PCA, univariate feature selection, and a support vector machine classifier.

We first load the Iris dataset into a pandas DataFrame.

>>> import numpy as np
>>> from sklearn import datasets
>>> import pandas as pd
>>>
>>> iris = datasets.load_iris()
>>> features, targets, iris = iris['feature_names'], iris['target_names'], pd.DataFrame(
...     np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names']+['class'])
>>> iris['class'] = iris['class'].map(pd.Series(targets))
>>>
>>> iris.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
<BLANKLINE>
    class
0  setosa
1  setosa
2  setosa
3  setosa
4  setosa

Now, we import the relevant steps. Note that, in this example, we import them from ibex.sklearn rather than sklearn.

>>> from ibex.sklearn.svm import SVC as PdSVC
>>> from ibex.sklearn.feature_selection import SelectKBest as PdSelectKBest
>>> from ibex.sklearn.decomposition import PCA as PdPCA

(Of course, it’s possible to import steps from sklearn as well, and use them alongside the steps of ibex.sklearn.)
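
For example, a plain sklearn estimator can be wrapped with ibex’s frame function to obtain a pandas-aware version. Here is a minimal sketch; StandardScaler is just an arbitrary choice of step:

>>> from sklearn.preprocessing import StandardScaler
>>> from ibex import frame
>>>
>>> # frame adapts a plain sklearn estimator to consume and produce DataFrames
>>> PdStandardScaler = frame(StandardScaler)
>>> PdStandardScaler().fit_transform(iris[features]).head() # doctest: +SKIP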

Finally, we construct a pipeline that, given a DataFrame of features:

  • horizontally concatenates a 2-component PCA DataFrame and a best-feature DataFrame into a single DataFrame

  • then passes the result to a support-vector machine classifier, outputting a pandas Series:

    >>> clf = PdPCA(n_components=2) + PdSelectKBest(k=1) | PdSVC(kernel="linear")
    

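For comparison, the original scikit-learn example builds the same pipeline explicitly from FeatureUnion and Pipeline; the expression above is shorthand for roughly the following (standard sklearn, shown here only for reference):

>>> from sklearn.pipeline import Pipeline, FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.svm import SVC
>>>
>>> # the verbose equivalent of PdPCA(...) + PdSelectKBest(...) | PdSVC(...)
>>> np_clf = Pipeline([
...     ('featureunion', FeatureUnion([
...         ('pca', PCA(n_components=2)),
...         ('selectkbest', SelectKBest(k=1))])),
...     ('svc', SVC(kernel='linear'))])

Note how the step names ('featureunion', 'pca', 'selectkbest', 'svc') match the keys of the parameter grid used below.
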
clf is now a pandas-aware classifier, but otherwise can be used pretty much like any sklearn estimator. For example,

>>> param_grid = dict(
...     featureunion__pca__n_components=[1, 2, 3],
...     featureunion__selectkbest__k=[1, 2],
...     svc__C=[0.1, 1, 10])
>>> try:
...     from ibex.sklearn.model_selection import GridSearchCV as PdGridSearchCV
... except ImportError: # Accommodate older versions of sklearn
...     from ibex.sklearn.grid_search import GridSearchCV as PdGridSearchCV
>>> PdGridSearchCV(clf, param_grid=param_grid).fit(iris[features], iris['class']) # doctest: +SKIP

So what does this add to the original version?

  1. The estimators perform verification and processing on their inputs and outputs. They verify column names in calls following fit, and index their results according to those of their inputs. This helps catch bugs.
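
    For instance, predicting on a DataFrame whose column names differ from those seen in fit should fail loudly, rather than silently rely on column order. A minimal sketch (renamed is a hypothetical mislabeled input, and the exact exception raised is the library's choice):

    >>> # rename a column so it no longer matches what fit saw
    >>> renamed = iris[features].rename(columns={'sepal length (cm)': 'sepal_len'})
    >>> PdSVC(kernel="linear").fit(iris[features], iris['class']).predict(renamed) # doctest: +SKIP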

  2. The results are much more interpretable:

    >>> svc = PdSVC(kernel="linear", probability=True)
    

    Find the coefficients of the boundaries between the different classes:

    >>> svc.fit(iris[features], iris['class']).coef_
                sepal length (cm)  sepal width (cm)  petal length (cm)  \
    setosa              -0.046259          0.521183          -1.003045
    versicolor          -0.007223          0.178941          -0.538365
    virginica            0.595498          0.973900          -2.031000
    <BLANKLINE>
                petal width (cm)
    setosa             -0.464130
    versicolor         -0.292393
    virginica          -2.006303
    

    Predict class membership probabilities:

    >>> svc.fit(iris[features], iris['class']).predict_proba(iris[features])
        setosa  versicolor  virginica
    0    0.97...    0.01...   0.00...
    ...
    

    Find the coefficients of the boundaries between the different classes in a pipeline:

    >>> clf = PdPCA(n_components=2) + PdSelectKBest(k=1) | svc
    >>> clf = clf.fit(iris[features], iris['class'])
    >>> svc.coef_
                    pca                 selectkbest
                comp_0    comp_1 petal length (cm)
    setosa     -0.757016  ...0.376680         -0.575197
    versicolor -0.351218  ...0.141699         -0.317562
    virginica  -1.529320  ...1.472771         -1.509391
    
  3. It allows writing pandas-munging estimators (see also Multiple-Row Features In The Movielens Dataset).
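
    A custom step that consumes and produces DataFrames can be written by combining scikit-learn's base classes with ibex.FrameMixin. The following is a rough sketch of the idea (TotalLength is a hypothetical transformer, not part of the library):

    >>> from sklearn import base
    >>> import ibex
    >>>
    >>> class TotalLength(base.BaseEstimator, base.TransformerMixin, ibex.FrameMixin):
    ...     # hypothetical transformer adding a column summing the two length features
    ...     def fit(self, X, y=None):
    ...         self.x_columns = X.columns # record fit-time columns (FrameMixin bookkeeping)
    ...         return self
    ...     def transform(self, X):
    ...         res = X.copy()
    ...         res['total length (cm)'] = X['sepal length (cm)'] + X['petal length (cm)']
    ...         return res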

  4. Using DataFrame metadata, it allows writing more complex meta-learning algorithms, such as stacking and nested labeled and stratified cross validation.
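
    Because fitted steps return pandas objects indexed like their inputs, out-of-fold predictions can be aligned back onto the original DataFrame as a new feature. A hedged sketch of the stacking idea, using plain sklearn's cross_val_predict (svc_oof is a hypothetical column name):

    >>> from sklearn.model_selection import cross_val_predict
    >>>
    >>> # out-of-fold predictions, re-attached as a feature for a second-level model
    >>> stacked = iris[features].copy()
    >>> stacked['svc_oof'] = cross_val_predict(clf, iris[features], iris['class']) # doctest: +SKIP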

  5. The pipeline syntax is succinct and clear (see Motivation For Shorter Combinations).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ibex-0.1.3.tar.gz (24.5 kB)

Uploaded Source

File details

Details for the file ibex-0.1.3.tar.gz.

File metadata

  • Download URL: ibex-0.1.3.tar.gz
  • Upload date:
  • Size: 24.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for ibex-0.1.3.tar.gz

  • SHA256: 989db0004432c50e237c34d99912d10c00552e2bb57e162de11799378d9bd40e
  • MD5: 7d6fe430b0dd5f985d6e4ef23e175786
  • BLAKE2b-256: aab66f018cc4f13b230775a7d1f0028e0132987e7f7743472008d16af75b1843
