Find threshold for fine-tuning output from predict_proba

Thresher - THRESHold EvaluatoR for Python

A bare pandas implementation of a tool that finds the threshold maximizing the accuracy of predict_proba-like outputs (e.g. from scikit-learn) with respect to the provided ground truth (labels).

Note: to get started quickly, jump directly to the "Sample usage" section.

Project description

The main method of interest is optimize_threshold(scores, actual_classes), available on the Thresher class. Given a list of scores and the corresponding actual classes, it returns the threshold that yields the highest fraction of correctly classified samples.

optimize_threshold parameters:
  scores: list
    The list of scores.
  actual_classes: list
    The list of ground truth (correct) classes.
    Classes are represented as -1 and 1.
returns:
  threshold: float
    The threshold value that yields the highest fraction of correctly classified
    samples. If multiple thresholds give the optimal fraction, any one of them may be returned.
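To make the optimized quantity concrete, here is a minimal sketch of how the accuracy of a single threshold could be computed. The helper name accuracy_at is hypothetical, and the rule "score >= threshold means class 1" is an assumption; the library may treat the boundary differently.

```python
def accuracy_at(threshold, scores, actual_classes):
    """Fraction of samples classified correctly by a single threshold.

    Assumption: a score >= threshold is predicted as 1, otherwise as -1.
    """
    correct = 0
    for score, actual in zip(scores, actual_classes):
        predicted = 1 if score >= threshold else -1
        if predicted == actual:
            correct += 1
    return correct / len(scores)


scores = [0.1, 0.3, 0.4, 0.7]
actual_classes = [-1, -1, 1, 1]

print(accuracy_at(0.4, scores, actual_classes))  # 1.0 (all four samples correct)
print(accuracy_at(0.2, scores, actual_classes))  # 0.75 (0.3 is misclassified)
```

Under this convention, optimize_threshold is looking for the threshold value that maximizes this fraction.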

An oracle mechanism

We implemented a meta-optimizer, an 'oracle' mechanism, which chooses a suitable algorithm based on the provided data. This is the default behaviour; it can be overridden through the algorithm parameter of the Thresher constructor. See the source code of oracle.py and interface.py for more details.

Implemented algorithms

Linear search

This is the most basic, iterative approach, recommended for smaller datasets. Every threshold present in the input (the scores list) is evaluated by calculating the exact accuracy of the split it produces. The threshold that produces the most accurate split is returned.

List of parameters to customize:

  • n_jobs (default: 1) - set to -1 to use all available processors except one; any value of 2 or more enables multiprocessing, while the default value of 1 disables it
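The linear search described above can be sketched as follows. This is a single-process illustration, not the library's implementation: the function names are hypothetical, n_jobs is not modelled, and the rule "score >= threshold means class 1" is an assumption.

```python
def accuracy_at(threshold, scores, actual_classes):
    # Assumption: score >= threshold is predicted as 1, otherwise as -1.
    hits = sum((1 if s >= threshold else -1) == y
               for s, y in zip(scores, actual_classes))
    return hits / len(scores)


def linear_search(scores, actual_classes):
    """Try every distinct score as a threshold and keep the most accurate one."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        acc = accuracy_at(candidate, scores, actual_classes)
        if acc > best_accuracy:  # strict '>' keeps the first of several ties
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy


print(linear_search([0.1, 0.3, 0.4, 0.7], [-1, -1, 1, 1]))  # (0.4, 1.0)
```

Each candidate costs one pass over the data, so the total work grows quadratically with the number of distinct scores, which is consistent with the recommendation for smaller datasets.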

2-dim Stochastic Gradient Descent

tbd

List of parameters to customize:

  • num_of_iters (default: 200) - number of iterations during which algorithm tries to converge
  • stop_thresh (default: 0.001) - minimal value of improvement, below which algorithm stops
  • alpha (default: 0.01)

Evolutionary algorithm

This is a simulation approach which uses an evolutionary algorithm. It works by simulating multiple generations of a "population" of candidate solutions. During every iteration of a single generation, the algorithm stochastically evaluates each candidate solution. At the end of a generation, the least fit agents (solutions) are removed from the population, and the remaining solutions are crossed over to produce new "offspring" candidate solutions. Offspring may also mutate, which adds further randomness.

List of parameters to customize:

  • population_size (default: 30) - number of agents in the simulation
  • number_of_generations (default: 20) - number of generations
  • number_of_iterations (default: 10) - number of iterations per generation
  • sus_factor (default: 2) - how many of the least-fit agents should be childless at the end of a generation
  • stoch_ratio (default: 0.02) - fraction of the data used to evaluate the fit of a single agent per iteration
  • optimized_start (default: True)
  • mutation_chance (default: 0.05)
  • mutation_factor (default: 0.10)
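The generational loop described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the function name, the averaging crossover, the interpretation of sus_factor as the number of agents dropped per generation, the mutation as a bounded random shift, and the seed parameter. The optimized_start option is not modelled.

```python
import random


def accuracy_at(threshold, scores, labels):
    # Assumption: score >= threshold is predicted as 1, otherwise as -1.
    hits = sum((1 if s >= threshold else -1) == y for s, y in zip(scores, labels))
    return hits / len(scores)


def evolve_threshold(scores, labels, population_size=30, number_of_generations=20,
                     number_of_iterations=10, sus_factor=2, stoch_ratio=0.02,
                     mutation_chance=0.05, mutation_factor=0.10, seed=0):
    rng = random.Random(seed)  # 'seed' is a hypothetical parameter for repeatability
    lo, hi = min(scores), max(scores)
    sample_size = max(1, int(len(scores) * stoch_ratio))
    population = [rng.uniform(lo, hi) for _ in range(population_size)]

    for _ in range(number_of_generations):
        fitness = [0.0] * population_size
        for _ in range(number_of_iterations):
            # Stochastic evaluation: score every agent on a small random subsample.
            idx = rng.sample(range(len(scores)), sample_size)
            sub_scores = [scores[i] for i in idx]
            sub_labels = [labels[i] for i in idx]
            for k, agent in enumerate(population):
                fitness[k] += accuracy_at(agent, sub_scores, sub_labels)

        # Selection: drop the sus_factor least-fit agents.
        ranked = [a for _, a in sorted(zip(fitness, population), reverse=True)]
        survivors = ranked[:population_size - sus_factor]

        # Crossover (averaging two parents) with occasional mutation of the child.
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            child = (parent_a + parent_b) / 2
            if rng.random() < mutation_chance:
                child += rng.uniform(-1, 1) * mutation_factor * (hi - lo)
                child = min(hi, max(lo, child))
            children.append(child)
        population = survivors + children

    # Final pick: the agent with the best accuracy on the full data.
    return max(population, key=lambda a: accuracy_at(a, scores, labels))


scores = [i / 1000 for i in range(1000)]
labels = [1 if i >= 500 else -1 for i in range(1000)]
best = evolve_threshold(scores, labels)
print(best, accuracy_at(best, scores, labels))
```

Because each agent is scored on only a small subsample per iteration, the result is approximate and depends on the random seed.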

Grid search

Added in version 0.1.2. This algorithm works by generating a grid of possible solutions, with the granularity set by the no_of_decimal_places parameter. All candidate solutions are evaluated thoroughly and the best one is chosen at the end.

List of parameters to customize:

  • no_of_decimal_places (default: 2) - generate the grid by rounding the number to the given number of decimal places
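A minimal sketch of the grid idea follows. It assumes that scores lie in [0, 1] and that the grid is every multiple of 10^-no_of_decimal_places in that range; how the library actually builds its grid is not stated here, so treat the construction, the function names, and the "score >= threshold means class 1" rule as assumptions.

```python
def accuracy_at(threshold, scores, actual_classes):
    # Assumption: score >= threshold is predicted as 1, otherwise as -1.
    hits = sum((1 if s >= threshold else -1) == y
               for s, y in zip(scores, actual_classes))
    return hits / len(scores)


def grid_search(scores, actual_classes, no_of_decimal_places=2):
    """Evaluate every grid point on the full data and keep the best one."""
    steps = 10 ** no_of_decimal_places
    best_threshold, best_accuracy = None, -1.0
    for i in range(steps + 1):
        candidate = i / steps  # 0.00, 0.01, ..., 1.00 for two decimal places
        acc = accuracy_at(candidate, scores, actual_classes)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy


threshold, accuracy = grid_search([0.1, 0.3, 0.4, 0.7], [-1, -1, 1, 1])
print(threshold, accuracy)  # any threshold in (0.3, 0.4] separates this data perfectly
```

The number of candidates depends only on the chosen precision, not on the number of distinct scores, which is the main difference from the linear search.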

Stochastic Grid search

Added in version 0.1.2. This algorithm works similarly to the 'Grid search' method above, with the difference that every point in the grid is evaluated only partially, on a portion of the data controlled by the stoch_ratio parameter.

List of parameters to customize:

  • no_of_decimal_places (default: 2) - generate the grid by rounding the number to the given number of decimal places
  • stoch_ratio (default: 0.05) - fraction of the data used to evaluate the fit of each candidate in the grid
  • reshuffle (default: False) - whether the random projection should be recalculated at every step
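The partial evaluation can be sketched as below. This is an illustration under stated assumptions: the grid covers [0, 1], "random projection" is interpreted as a random subsample of the data, reshuffle redraws that subsample for every grid point, and the function names and the seed parameter are hypothetical.

```python
import random


def accuracy_at(threshold, scores, actual_classes):
    # Assumption: score >= threshold is predicted as 1, otherwise as -1.
    hits = sum((1 if s >= threshold else -1) == y
               for s, y in zip(scores, actual_classes))
    return hits / len(scores)


def stochastic_grid_search(scores, actual_classes, no_of_decimal_places=2,
                           stoch_ratio=0.05, reshuffle=False, seed=0):
    rng = random.Random(seed)  # hypothetical parameter, for repeatable runs
    n = len(scores)
    sample_size = max(1, int(n * stoch_ratio))

    def draw_subsample():
        idx = rng.sample(range(n), sample_size)
        return [scores[i] for i in idx], [actual_classes[i] for i in idx]

    sub_scores, sub_classes = draw_subsample()
    steps = 10 ** no_of_decimal_places
    best_threshold, best_accuracy = None, -1.0
    for i in range(steps + 1):
        if reshuffle:
            # Redraw the subsample for every grid point.
            sub_scores, sub_classes = draw_subsample()
        candidate = i / steps
        acc = accuracy_at(candidate, sub_scores, sub_classes)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy


# Synthetic, perfectly separable data: class 1 exactly when the score is >= 0.5.
scores = [i / 1000 for i in range(1000)]
actual_classes = [1 if i >= 500 else -1 for i in range(1000)]

threshold, sampled_accuracy = stochastic_grid_search(scores, actual_classes,
                                                     stoch_ratio=0.2)
print(threshold, sampled_accuracy)
```

The returned accuracy is measured on the subsample only, so it can overestimate the accuracy on the full data; that is the price paid for the cheaper evaluation.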

How to set up?

The process is straightforward: you only need to decide whether to install from source (latest revision) or from the PyPI repository (stable release).

Requirements

Tested with Python 3.7+ in a standard Unix environment.

Installation

Installation from source:

pip install git+https://github.com/oskar-j/thresher.git

Stable release using the pip tool:

pip install thresher-py

Custom parameters

It's possible to provide additional parameters in the Thresher constructor.

Thresher(algorithm='auto',
         allow_parallel=True,
         verbose=False, 
         progress_bar=False,
         labels=(0,1))

Here is a description of what each parameter does:

  • algorithm (default value: 'auto') - lets you manually choose the algorithm from the list of available algorithms. The same effect can be achieved by calling set_algorithm(algorithm_name) on the Thresher instance. The default value is 'auto', which means that the tool uses the oracle mechanism to automatically choose a suitable algorithm.
  • allow_parallel (default value: True) - enables/disables multiprocessing for algorithms
  • verbose (default value: False) - enables verbose output
  • progress_bar (default value: False) - shows a progress bar in the terminal (if supported by the algorithm)
  • labels - required if your labels differ from (-1, 1); the first item of the tuple/list is the negative label, and the second item is the positive label
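To illustrate the labels convention (negative label first, positive label second), here is a small sketch that maps custom labels onto the -1/1 representation. The helper to_signed is hypothetical and only shows the mapping; it is not part of the library.

```python
def to_signed(actual_classes, labels=(0, 1)):
    """Map custom (negative, positive) labels onto -1 and 1."""
    negative, positive = labels
    mapping = {negative: -1, positive: 1}
    return [mapping[c] for c in actual_classes]


print(to_signed([0, 0, 1, 1]))  # [-1, -1, 1, 1]
print(to_signed(['ham', 'spam'], labels=('ham', 'spam')))  # [-1, 1]
```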

Control parameters for the algorithms

Some of the algorithms described above have configurable parameters. These should be passed as a dictionary through the algorithm_params parameter. If no custom parameters are provided, the default values apply.

Examples:

t = thresher.Thresher(algorithm_params={'n_jobs': 3})
t = thresher.Thresher(algorithm_params={'no_of_decimal_places': 3,
                                        'stoch_ratio': 0.10})

Sample usage

import thresher

t = thresher.Thresher()

print('Currently supported algorithms:')
print(t.get_supported_algorithms())

cases = [0.1, 0.3, 0.4, 0.7]
actual_labels = [-1, -1, 1, 1]

print(f'Optimization result: {t.optimize_threshold(cases, actual_labels)}')

See the examples directory for more sample code.

Performance tests

A very basic performance test (with 10 repeats, on real-world anonymized data consisting of 10^6 rows) can be found in the Notebook located here.

Future work

  • adding more algorithms,
  • publishing on conda,
  • heavier test loads,
  • Python docs,
  • CI/CD pipeline for automated tests.
