ML Oversampler for binary classification tasks based on Artificial Immune Systems
AIS-Oversampler
This is an oversampler for handling class imbalance in binary classification tasks. It uses the AISOv algorithm, which is based on Artificial Immune Systems.
The oversampler can be set to use different mathematical modes of operation that alter how the final resampled data is generated. The AISOv algorithm has been tested on a variety of datasets and performs comparably to other common and proven oversampling techniques.
Installation
The AIS Oversampler requires:
- Python (>= 3)
- scikit-learn (>= 1.2)
- pandas (>= 1.4)
How to use it
This is a simple code example:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from AIS_Oversampler import ArtificialImmuneSystem
# create an instance of the oversampler
oversample_AIS = ArtificialImmuneSystem.ArtificialImmuneSystem()
# import a dataset
df = pd.read_csv("../datasets/dataset2.csv", index_col=0)
# separate the features and labels (the label column "5" is the last column here)
features = df.drop(["5"], axis=1)
label = df.drop(df.columns[0:-1], axis=1)
# initialize a classifier
randomForest = RandomForestClassifier()
# call the main function AIS_Resample with only the required parameters
df_after_oversampling = oversample_AIS.AIS_Resample(features, label, model=randomForest)
# call the main function AIS_Resample with the required and optional parameters
df_after_oversampling = oversample_AIS.AIS_Resample(features, label, max_rounds=50, stopping_cond=20, model=randomForest, K_folds=5, scorer='f1', min_change=0.005, use_lof=False, mutation_rate=1.0)
Main Function Parameters
Parameter | Required | Data Type | Purpose |
---|---|---|---|
Features | Yes | A Pandas DataFrame | Contains normalized, scaled data (each column either binary or float) with the labels removed. |
Labels | Yes | A Pandas DataFrame | Contains the label data, prepared in the same way as the features. |
model | Yes | A scikit-learn classifier | The model used to evaluate each population during resampling. Random Forest and Gradient Boosting produced |
max_rounds | No | Integer (e.g. 50, 100) | The maximum number of rounds (loops) the oversampler will run for. |
stopping_cond | No | Integer (e.g. 20, 50) | The number of rounds without change after which the algorithm stops. |
K_folds | No | Integer (e.g. 3, 5) | The number of folds used during k-fold cross-validation. |
scorer | No | A scikit-learn scoring function name as a string (e.g. 'f1') | The scoring metric used when evaluating a given population. |
min_change | No | Float (e.g. 0.005, 0.001) | The minimum score change (as a fraction, 0.001 = 0.1%) required for a given population to count as a distinct population. |
use_lof | No | Boolean | If set to True, the oversampler will use local outlier factor (an outlier detection method) when evaluating antibody populations. This yields better results but increases the runtime. |
mutation_rate | No | Float (e.g. 1.0) | A value that modulates the amount by which antibodies can mutate in a given round. |
Note: Optional parameters have default values, but they should be set explicitly and fine-tuned for optimal results.
How it works
The AISOv algorithm takes in an imbalanced dataset and preprocesses it to determine how many antibodies to create and which parameters the other parts of the function need. It then creates an initial antibody population before entering the main program loop. In each round of the loop, the antibody population is mutated, then a fitness function and an evaluation function are applied to find the current best antibody population. Once the termination condition for the loop is met, the current best antibody population is returned.
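The loop described above can be illustrated with a small sketch. This is not the package's implementation: the helper names (`antibodies_needed`, `mutate`, `toy_fitness`, `resample_sketch`) are hypothetical, the number of antibodies is assumed to be the gap between the two class counts, `min_change` is treated as an absolute score improvement, and a simple distance-based score stands in for the cross-validated model score.

```python
import random
from collections import Counter


def antibodies_needed(labels):
    """Assumed rule: create enough samples to bring the minority class up to the majority."""
    (_, major), (_, minor) = Counter(labels).most_common(2)
    return major - minor


def mutate(population, mutation_rate, rng):
    """Add Gaussian noise scaled by mutation_rate to every feature, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, x + rng.gauss(0.0, 0.05 * mutation_rate))) for x in antibody]
        for antibody in population
    ]


def toy_fitness(population, minority):
    """Stand-in score: higher when antibodies lie closer to the minority-class centroid."""
    dims = len(minority[0])
    centroid = [sum(row[d] for row in minority) / len(minority) for d in range(dims)]
    mean_dist = sum(
        sum((x - c) ** 2 for x, c in zip(antibody, centroid)) ** 0.5
        for antibody in population
    ) / len(population)
    return 1.0 / (1.0 + mean_dist)


def resample_sketch(minority, labels, fitness, max_rounds=50, stopping_cond=20,
                    min_change=0.005, mutation_rate=1.0, seed=0):
    rng = random.Random(seed)
    # Initial population: random copies of existing minority samples.
    best = [list(rng.choice(minority)) for _ in range(antibodies_needed(labels))]
    best_score = fitness(best)
    rounds_without_change = 0
    for _ in range(max_rounds):
        candidate = mutate(best, mutation_rate, rng)
        score = fitness(candidate)
        if score - best_score > min_change:
            # The candidate is distinctly better: keep it and reset the counter.
            best, best_score, rounds_without_change = candidate, score, 0
        else:
            rounds_without_change += 1
            if rounds_without_change >= stopping_cond:
                break  # termination condition met
    return best, best_score


minority = [[0.2, 0.8], [0.25, 0.75], [0.3, 0.9]]
labels = [0] * 10 + [1] * 3
population, score = resample_sketch(minority, labels, lambda p: toy_fitness(p, minority))
print(len(population))  # 7 antibodies: 10 majority samples minus 3 minority samples
```

In the real oversampler the fitness step would be driven by the `model`, `K_folds` and `scorer` parameters rather than a distance to the centroid; the sketch only shows how `max_rounds`, `stopping_cond`, `min_change` and `mutation_rate` could interact.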
You can read more about it here: https://docs.google.com/document/d/1JGAjYWz2Wp95ArWZQnXeLTWyc3ArYzahvMBh8H3XbPA/edit?usp=sharing
About
This algorithm was originally developed by myself, Nikhil Pyndiah, alongside my teammates Adam Jansen and Jacob King for our Honours project. Our aim with the project was to create an easy-to-use oversampling algorithm which could be used as a drop-in replacement for other similar oversamplers. The extensive testing we did is described in the report linked above, and the original code can be found here: https://github.com/nikhil815/Artificial-Immune-System-For-Class-Imbalance