Find threshold for fine-tuning output from predict_proba
Project description
Thresher - THRESHold EvaluatoR for Python
A bare pandas implementation of a tool for finding the threshold that maximizes the accuracy of predict_proba-like outputs (e.g. from scikit-learn) with respect to the provided ground truth (labels).
Note: you can jump directly to the Sample usage section below.
The main method of interest is optimize_threshold(scores, actual_classes), which is available from the Thresher class. Given a list of scores and the actual classes, it returns the threshold that yields the highest fraction of correctly classified samples.
optimize_threshold parameters:
scores:list
The list of scores.
actual_classes:list
The list of ground truth (correct) classes.
Classes are represented as -1 and 1.
returns:
threshold:float
The threshold value that yields the highest fraction of correctly classified
samples. If multiple thresholds give the optimal fraction, any one of them may be returned.
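To make "fraction of correctly classified samples" concrete, here is a minimal sketch in plain Python of how a single threshold could be scored. The helper name accuracy_at_threshold and the tie rule (a score equal to the threshold counts as positive) are assumptions made for illustration; they are not part of the Thresher API.

```python
def accuracy_at_threshold(scores, actual_classes, threshold):
    """Fraction of samples classified correctly for one threshold.

    Assumption: scores >= threshold are predicted as 1, the rest as -1.
    """
    predicted = [1 if s >= threshold else -1 for s in scores]
    correct = sum(p == a for p, a in zip(predicted, actual_classes))
    return correct / len(scores)


scores = [0.1, 0.3, 0.4, 0.7]
actual_classes = [-1, -1, 1, 1]

print(accuracy_at_threshold(scores, actual_classes, 0.4))  # 1.0
print(accuracy_at_threshold(scores, actual_classes, 0.5))  # 0.75
```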
An oracle mechanism
We implemented a meta-optimizer, an 'oracle' mechanism, which chooses a suitable algorithm based on the provided data. This is the default behaviour, and it can be overridden with the algorithm parameter of the Thresher constructor. See the source code of oracle.py and interface.py for more details.
Implemented algorithms
Linear search
This is the most basic, iterative approach, recommended for smaller datasets. Every threshold present in the input (in the scores list) is evaluated by calculating the exact accuracy of the split it produces. The threshold that produces the most accurate split is then returned.
List of parameters to customize:
- n_jobs (default: 1) - set to -1 to use all available processors except one; any value of 2 or more enables multiprocessing, while the default value of 1 disables it
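The linear search described above can be sketched in a few lines of plain Python. This is an illustrative, single-process version, not the library's implementation: the function name linear_search and the tie rule (score >= threshold means class 1) are assumptions, and n_jobs is not modelled.

```python
def linear_search(scores, actual_classes):
    """Try every distinct score as a threshold and keep the most accurate one."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        # Assumption: scores >= candidate are predicted as 1, the rest as -1.
        predicted = [1 if s >= candidate else -1 for s in scores]
        accuracy = sum(p == a for p, a in zip(predicted, actual_classes)) / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


scores = [0.1, 0.3, 0.4, 0.7]
actual_classes = [-1, -1, 1, 1]
print(linear_search(scores, actual_classes))  # (0.4, 1.0)
```

Each candidate is checked against every sample, so the cost of this sketch grows quadratically with the input size.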
2-dim Stochastic Gradient Descent
tbd
List of parameters to customize:
- num_of_iters (default: 200) - number of iterations during which the algorithm tries to converge
- stop_thresh (default: 0.001) - minimal value of improvement, below which the algorithm stops
- alpha (default: 0.01)
Evolutionary algorithm
This is a simulation approach which uses an evolutionary algorithm. It works by simulating multiple generations of a "population" of candidate solutions. During every iteration of a single generation, the algorithm stochastically evaluates each candidate solution. At the end of a generation, the least fit agents (solutions) are removed from the population, and the remaining solutions are crossed over to produce new "offspring" candidate solutions. The offspring may also mutate, which adds a further element of randomness.
List of parameters to customize:
- population_size (default: 30) - number of agents in the simulation
- number_of_generations (default: 20) - number of generations
- number_of_iterations (default: 10) - number of iterations per generation
- sus_factor (default: 2) - how many least-fit agents should be childless at the end of a generation
- stoch_ratio (default: 0.02) - percentage of data used to evaluate the fit of a single agent per iteration
- optimized_start (default: True)
- mutation_chance (default: 0.05)
- mutation_factor (default: 0.10)
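The following is a simplified sketch of such an evolutionary loop in plain Python, meant only to illustrate the idea. It reuses the parameter names listed above, but the details are assumptions: crossover is taken to be the average of two surviving parents, mutation is a uniform shift scaled by the score range, sus_factor is read as the number of agents replaced per generation, and optimized_start is not modelled. The function name evolve_threshold is hypothetical.

```python
import random


def evolve_threshold(scores, actual_classes, population_size=30,
                     number_of_generations=20, number_of_iterations=10,
                     sus_factor=2, stoch_ratio=0.02,
                     mutation_chance=0.05, mutation_factor=0.10, seed=None):
    rng = random.Random(seed)
    n = len(scores)
    low, high = min(scores), max(scores)
    sample_size = max(1, int(n * stoch_ratio))

    def accuracy(threshold, indices):
        # Assumption: scores >= threshold are predicted as 1, the rest as -1.
        hits = sum((1 if scores[i] >= threshold else -1) == actual_classes[i]
                   for i in indices)
        return hits / len(indices)

    # Start from random candidate thresholds within the observed score range.
    population = [rng.uniform(low, high) for _ in range(population_size)]
    for _ in range(number_of_generations):
        fitness = [0.0] * len(population)
        for _ in range(number_of_iterations):
            for k, agent in enumerate(population):
                # Stochastic evaluation: score the agent on a random subsample.
                sample = rng.sample(range(n), sample_size)
                fitness[k] += accuracy(agent, sample)
        order = sorted(range(len(population)), key=lambda k: fitness[k], reverse=True)
        # Drop the sus_factor least fit agents.
        survivors = [population[k] for k in order[:population_size - sus_factor]]
        children = []
        for _ in range(sus_factor):
            parent_a, parent_b = rng.sample(survivors, 2)
            child = (parent_a + parent_b) / 2  # crossover (assumed: averaging)
            if rng.random() < mutation_chance:
                child += rng.uniform(-mutation_factor, mutation_factor) * (high - low)
            children.append(min(max(child, low), high))
        population = survivors + children

    # Final pick: evaluate the last population on the full data.
    everything = range(n)
    best = max(population, key=lambda t: accuracy(t, everything))
    return best, accuracy(best, everything)


# Synthetic, perfectly separable data: class 1 whenever the score is >= 0.6.
data_rng = random.Random(0)
scores = [data_rng.random() for _ in range(1000)]
actual_classes = [1 if s >= 0.6 else -1 for s in scores]
threshold, acc = evolve_threshold(scores, actual_classes, seed=1)
print(threshold, acc)
```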
Grid search
Added in version 0.1.2. This algorithm works by generating a grid of possible solutions, with a granularity set by the no_of_decimal_places parameter. All candidate solutions are evaluated on the full data, and the best one is chosen at the end.
List of parameters to customize:
- no_of_decimal_places (default: 2) - generate the grid by rounding the number to the given number of decimal places
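A minimal sketch of the grid search idea in plain Python follows. It assumes the scores are probabilities in [0, 1] and that the grid consists of every value in that range with the given number of decimal places; how the library actually builds its grid may differ. The function name grid_search is hypothetical.

```python
def grid_search(scores, actual_classes, no_of_decimal_places=2):
    """Evaluate every grid point on the full data and return the best one."""
    steps = 10 ** no_of_decimal_places
    # Assumed grid: 0.00, 0.01, ..., 1.00 for no_of_decimal_places=2.
    grid = [round(i / steps, no_of_decimal_places) for i in range(steps + 1)]
    best_threshold, best_accuracy = None, -1.0
    for candidate in grid:
        # Assumption: scores >= candidate are predicted as 1, the rest as -1.
        predicted = [1 if s >= candidate else -1 for s in scores]
        accuracy = sum(p == a for p, a in zip(predicted, actual_classes)) / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


scores = [0.1, 0.3, 0.4, 0.7]
actual_classes = [-1, -1, 1, 1]
threshold, accuracy = grid_search(scores, actual_classes)
print(threshold, accuracy)
```

Unlike the linear search, the number of candidates here depends on the grid granularity rather than on the number of distinct scores.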
Stochastic Grid search
Added in version 0.1.2. This algorithm works similarly to the 'Grid search' method above, with the difference that every point of the grid is evaluated on only a part of the data (the size of which is controlled by the stoch_ratio parameter).
List of parameters to customize:
- no_of_decimal_places (default: 2) - generate the grid by rounding the number to the given number of decimal places
- stoch_ratio (default: 0.05) - percentage of data used to evaluate the fit of a candidate number in the grid
- reshuffle (default: False) - set whether the random projection should be recalculated at every step or not
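The stochastic variant can be sketched as below. This is an illustration under assumptions: the "random projection" is interpreted as a random subsample of the rows, the grid covers [0, 1], and the seed parameter exists only to make the example reproducible. The function name stochastic_grid_search is hypothetical.

```python
import random


def stochastic_grid_search(scores, actual_classes, no_of_decimal_places=2,
                           stoch_ratio=0.05, reshuffle=False, seed=None):
    rng = random.Random(seed)
    n = len(scores)
    sample_size = max(1, int(n * stoch_ratio))
    # Draw one subsample up front; it is redrawn per step only if reshuffle=True.
    indices = rng.sample(range(n), sample_size)
    steps = 10 ** no_of_decimal_places
    best_threshold, best_accuracy = None, -1.0
    for i in range(steps + 1):
        candidate = round(i / steps, no_of_decimal_places)
        if reshuffle:
            indices = rng.sample(range(n), sample_size)
        # Assumption: scores >= candidate are predicted as 1, the rest as -1.
        hits = sum((1 if scores[j] >= candidate else -1) == actual_classes[j]
                   for j in indices)
        accuracy = hits / sample_size
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    # Note: best_accuracy is measured on the subsample, not on the full data.
    return best_threshold, best_accuracy


# Synthetic, perfectly separable data: class 1 whenever the score is >= 0.6.
data_rng = random.Random(0)
scores = [data_rng.random() for _ in range(1000)]
actual_classes = [1 if s >= 0.6 else -1 for s in scores]
threshold, sampled_accuracy = stochastic_grid_search(
    scores, actual_classes, stoch_ratio=0.2, seed=1)
print(threshold, sampled_accuracy)
```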
How to set up?
The process is rather straightforward: you just need to decide whether to install from source (latest revision) or from the PyPI repository (stable release).
Requirements
Tested with Python 3.7+, on a standard Unix environment
Installation
Installation from source:
pip install git+https://github.com/oskar-j/thresher.git
Stable release using the pip tool:
pip install thresher-py
Custom parameters
It's possible to provide additional parameters in the Thresher constructor.
Thresher(algorithm='auto',
allow_parallel=True,
verbose=False,
progress_bar=False,
labels=(0,1))
Here is what each parameter does:
- algorithm (default value: 'auto') - lets you manually choose one of the available algorithms. The same effect can be achieved by calling the set_algorithm(algorithm_name) method on the Thresher instance. The default value, 'auto', means that the tool uses the oracle mechanism to choose a suitable algorithm automatically.
- allow_parallel (default value: True) - enables or disables multiprocessing for the algorithms
- verbose (default value: False) - enables verbose output
- progress_bar (default value: False) - shows a progress bar in the terminal (if supported by the algorithm)
- labels - necessary if your labels are different from (-1, 1); the first item of the tuple/list is the negative label, and the second item is the positive label
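As an illustration of what the labels parameter expresses, the sketch below maps a custom (negative, positive) label pair onto the default (-1, 1) representation. The helper normalize_labels is hypothetical and only shows the idea; it is not a function provided by the package, and the library may handle labels differently internally.

```python
def normalize_labels(actual_classes, labels=(0, 1)):
    """Map a custom (negative, positive) label pair onto (-1, 1)."""
    negative, positive = labels
    mapping = {negative: -1, positive: 1}
    # A KeyError here means the data contains a label outside the given pair.
    return [mapping[c] for c in actual_classes]


print(normalize_labels([0, 1, 1, 0], labels=(0, 1)))      # [-1, 1, 1, -1]
print(normalize_labels(["no", "yes"], labels=("no", "yes")))  # [-1, 1]
```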
Control parameters for the algorithms
Some of the algorithms described above have customizable parameters.
These should be provided as a dictionary in the algorithm_params parameter.
If no custom parameters are provided, the default values apply.
Examples:
t = thresher.Thresher(algorithm_params={'n_jobs': 3})
t = thresher.Thresher(algorithm_params={'no_of_decimal_places': 3,
'stoch_ratio': 0.10})
Sample usage
import thresher
t = thresher.Thresher()
print('Currently supported algorithms:')
print(t.get_supported_algorithms())
cases = [0.1, 0.3, 0.4, 0.7]
actual_labels = [-1, -1, 1, 1]
print(f'Optimization result: {t.optimize_threshold(cases, actual_labels)}')
See the examples directory for more sample code.
Performance tests
A very basic performance test (10 repeats on real-world anonymized data consisting of 10^6 rows) can be found in the accompanying Notebook.
Future work
- adding more algorithms,
- publishing on conda,
- more heavy test loads,
- python docs,
- CI/CD pipeline for automated tests.