# What
Pyplearnr is a tool designed to easily and more elegantly build, select, and validate scikit-learn pipelines using nested k-fold cross-validation.
# How
### Use
See the [demo](https://nbviewer.jupyter.org/github/JaggedParadigm/pyplearnr/blob/master/pyplearnr_demo.ipynb) for use of pyplearnr.
### Installation
##### Dependencies
pyplearnr requires:
- Python (>= 2.7 or >= 3.3)
- scikit-learn (>= 0.18.2)
- numpy (>= 1.13.0)
- scipy (>= 0.19.1)
- pandas (>= 0.20.2)
- matplotlib (>= 2.0.2)
For use in Jupyter notebooks with the conda installation, I recommend also having nb_conda (>= 2.2.0).
### User installation
Install by using pip:
```
pip install pyplearnr
```
For conda, you can run the same command within a conda environment, or add the following to your environment.yml file:
```
- pip:
- git+https://github.com/JaggedParadigm/pyplearnr.git#egg=pyplearnr
```
and then either generate a new environment from the terminal using:
```
conda env create
```
or update an existing one (environment_name) using:
```
conda env update -n environment_name -f ./environment.yml
```
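For reference, a complete environment.yml combining the dependencies above with the pip entry might look like this (the environment name is illustrative, and the pins simply mirror the minimum versions listed earlier):

```yaml
name: pyplearnr_env  # illustrative name
channels:
  - defaults
dependencies:
  - python>=3.3
  - scikit-learn>=0.18.2
  - numpy>=1.13.0
  - scipy>=0.19.1
  - pandas>=0.20.2
  - matplotlib>=2.0.2
  - pip
  - pip:
    - git+https://github.com/JaggedParadigm/pyplearnr.git#egg=pyplearnr
```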
Another option is to simply clone the repository, point to its location in your code, and import it.
# Details
One core aspect of pyplearnr is the combinatorial pipeline schematic: a flexible specification of every combination of step (e.g. estimator), step option (e.g. knn, logistic regression, etc.), and parameter option (e.g. n_neighbors for knn or C for logistic regression). Any scikit-learn class instance you would use in a normal pipeline can be inserted directly, or one can be chosen from a list of supported options.
Here's an example with optional scaling, PCA (directly from the sklearn object), selection of the number of principal components to use, and the use of k-nearest neighbors with different values for the number of neighbors:
```python
import sklearn.decomposition

# feature_count is the number of features in your dataset
pipeline_schematic = [
    {'scaler': {
        'none': {},
        'min_max': {},
        'standard': {}
        }
    },
    {'transform': {
        'pca': {
            'sklo': sklearn.decomposition.PCA,
            'n_components': [feature_count]
            }
        }
    },
    {'feature_selection': {
        'select_k_best': {
            'k': range(1, feature_count + 1)
            }
        }
    },
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31)
            }
        }
    }
]
```
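For comparison, the same kind of combinatorial search can be expressed with plain scikit-learn's `GridSearchCV` over a `Pipeline`. This is a hedged sketch, not pyplearnr's own API: the iris dataset and step names are illustrative, and PCA and k selection are simplified relative to the schematic above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
feature_count = X.shape[1]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('transform', PCA()),
    ('estimator', KNeighborsClassifier()),
])

# Each pipeline step is itself a searchable option; 'passthrough'
# plays the role of the schematic's 'none' scaler option.
param_grid = {
    'scaler': ['passthrough', MinMaxScaler(), StandardScaler()],
    'transform__n_components': [feature_count],
    'estimator__n_neighbors': range(1, 31),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The double-underscore syntax (`estimator__n_neighbors`) routes each parameter to its pipeline step, mirroring how the schematic nests parameter options under step options.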
The core validation method is nested k-fold cross-validation (stratified for classification). Pyplearnr divides the data into k validation outer-folds and divides each outer-fold's corresponding training set into k test inner-folds. For each outer-fold, the best pipeline is the one with the highest score (the median score by default) across the inner-folds; the overall winner is the pipeline with the most outer-fold wins. The validation outer-folds then provide an estimate of the winner's out-of-sample scores, and this final pipeline can be used to make predictions.
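The inner/outer structure described above can be sketched with plain scikit-learn. This is an illustrative approximation, not pyplearnr's implementation: scikit-learn selects by mean inner-fold score rather than median, and the dataset and model are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Stratified splits, as used for classification.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: pick the best hyperparameters on each outer training set.
search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': range(1, 31)},
    cv=inner_cv,
)

# Outer loop: estimate out-of-sample performance of the whole
# selection procedure, not just of one fitted model.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean())
```

Because model selection happens entirely inside each outer training set, the outer-fold scores are an honest estimate of how the selected pipeline generalizes.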
# Why
I wanted a way to do what GridSearchCV does for specific estimators with any estimator in a repeatable way.