Skip to main content

Pyplearnr is a tool designed to easily and more elegantly build, validate (nested k-fold cross-validation), and test scikit-learn pipelines.

Project description

# What
Pyplearnr is a tool designed to easily and more elegantly build, select, and validate scikit-learn pipelines using nested k-fold cross-validation.

# How
### Use
See the [demo](https://nbviewer.jupyter.org/github/JaggedParadigm/pyplearnr/blob/master/pyplearnr_demo.ipynb) for use of pyplearnr.

### Installation
##### Dependencies

pyplearnr requires:

Python (>= 2.7 or >= 3.3)
scikit-learn (>= 0.18.2)
numpy (>= 1.13.0)
scipy (>= 0.19.1)
pandas (>= 0.20.2)
matplotlib (>= 2.0.2)

For use in Jupyter notebooks and the conda installation, I recommend having nb_conda (>= 2.2.0).

### User installation
Install by using pip:

```
pip install pyplearnr
```

For conda, you can issue the same command above within a conda environment or you can include in your environment.yml file this:

```
- pip:
- git+https://github.com/JaggedParadigm/pyplearnr.git#egg=pyplearnr
```

and then either generate a new environment from the terminal using:

```
conda env create
```

or update an existing one (environment_name) using:

```
conda env update -n=environment_name -f=./environment.yml
```

Another option is to simply clone the respository, link to the location in your code, and import it.

# bleh

One core aspect of pyplearnr is the combinatorial pipeline schematic, a flexible diagram of every step (e.g. estimator), step option (e.g. knn, logistic regression, etc.), and parameter option (e.g. n_neighbors for knn and C for logistic regression) combination. Any scikit-learn class instance you would use in a normal pipeline can be inserted or one can be chosen from a list of supported ones.

Here's an example with optional scaling, PCA (directly from the sklearn object), selection of the number of principal components to use, and the use of k-nearest neighbors with different values for the number of neighbors:
```python
pipeline_schematic = [
{'scaler': {
'none': {},
'min_max': {},
'standard': {}
}
},
{'transform': {
'pca': {
'sklo': sklearn.decomposition.PCA,
'n_components': [feature_count]
}
}
},
{'feature_selection': {
'select_k_best': {
'k': range(1, feature_count+1)
}
}
},
{'estimator': {
'knn': {
'n_neighbors': range(1,31)
}
}
}
]
```

The core validation method is nested k-fold cross-validation (stratified if for classification). Pyplearnr divides the data into k validation outer-folds and their corresponding training sets into k test inner-folds, picks the best pipeline as that having the highest score (median by default) for the inner-folds for each outer-fold, chooses the winning pipeline as that with the most wins, and uses the validation outer-folds to give an estimate of the ultimate winner's out-of-sample scores. This final pipeline can then be used to make predictions.

# Why
I wanted a way to do what GridSearchCV does for specific estimators with any estimator in a repeatable way.



Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyplearnr-1.0.10.1.tar.gz (27.5 kB view hashes)

Uploaded Source

Built Distribution

pyplearnr-1.0.10.1-py2-none-any.whl (33.0 kB view hashes)

Uploaded Python 2

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page