Kern-Rowduction - A package to reduce the number of rows of (imbalanced) datasets, i.e. to undersample them, by graph kernelization methods.
Kern-Rowduction: undersampling by graph kernelization
What is it?
Kern-Rowduction is a ready-to-use package that improves the quality of your data by deleting near-duplicates from your data set. It converts the data set into a directed graph and extracts a quasi-kernel of that graph, which represents the reduced data. Using the reduced data instead of the original data can improve the computational and statistical performance of your machine learning algorithm.
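To make the idea of a quasi-kernel concrete, here is a minimal sketch in plain Python. It is not the package's implementation. It assumes a hypothetical graph construction (an arc from an earlier row to a later row whenever their Euclidean distance is at most epsilon) and uses the classic two-pass construction of a quasi-kernel: the kept rows are pairwise non-adjacent, and every dropped row can be reached from a kept row in at most two steps. The helper names epsilon_graph and quasi_kernel are illustrative only.

```python
from itertools import combinations
from math import dist


def epsilon_graph(rows, epsilon):
    """Build a digraph: arc i -> j (i < j) when rows i and j are within epsilon.

    The orientation (earlier row -> later row) is an assumption made for
    this sketch, not the rule used by Kern-Rowduction.
    """
    successors = {i: set() for i in range(len(rows))}
    for i, j in combinations(range(len(rows)), 2):
        if dist(rows[i], rows[j]) <= epsilon:
            successors[i].add(j)
    return successors


def quasi_kernel(n, successors):
    """Return a quasi-kernel of a digraph on vertices 0..n-1.

    successors[v] is the set of vertices w with an arc v -> w.
    """
    alive = [True] * n
    # Pass 1 (increasing order): each surviving vertex removes its
    # out-neighbours that come later in the order.
    for v in range(n):
        if alive[v]:
            for w in successors.get(v, ()):
                if w > v:
                    alive[w] = False
    # Pass 2 (decreasing order): each surviving vertex removes its
    # out-neighbours that come earlier in the order.
    for v in reversed(range(n)):
        if alive[v]:
            for w in successors.get(v, ()):
                if w < v:
                    alive[w] = False
    return [v for v in range(n) if alive[v]]


# Two pairs of near-duplicate rows and one isolated row.
rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 1.0)]
graph = epsilon_graph(rows, epsilon=0.5)
kept = quasi_kernel(len(rows), graph)
print(kept)  # [0, 2, 4]: one representative per near-duplicate pair
```

In this toy example rows 1 and 3 are dropped because each lies within epsilon of a kept row, while the isolated row 4 is retained.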
Why use it?
The Kern-Rowduction package has the following goals:
- Increase the quality of a data set
- Reduce datasets and computational time / cost
- Undersample imbalanced datasets and over-represented cohorts
- Improve statistics and predictive models' performances
Below are some use cases of the Kern-Rowduction package:
- Rebalance the populations of 0 and 1 in a binary classification problem on an imbalanced population, for example one with too many 0s
- Undersample over-represented classes for multi classification
- Reduce the influence of given ranges of values in the case of a regression
- Reduce the size of datasets without losing their 'significant' values in order to improve computational time / cost
- Improve feature engineering and machine learning models in general
Installation
The source code is currently hosted on GitHub at: https://github.com/kern-rowduction/kern-rowduction
Binary installers for the latest released version are available at the Python Package Index (PyPI):
pip install kern_rowduction
Dependencies
Documentation
The official documentation is hosted on GitHub: https://kern-rowduction.github.io/Kern-Rowduction/
Sample Usage
Below is a usage example in which a simple DataFrame is 'rowductioned':
import kern_rowduction as krd
import pandas as pd

# Small example data set with a binary label column.
df = pd.DataFrame(
    {
        'A': [20, 21, 6, 5, 6, 91],
        'B': [11, 12, 1, 14, 113, 1],
        'C': [51, 50, 2, 21, 40, 95],
        'D': [63, 65, 54, 12, 70, 98],
        'Label': [0, 0, 1, 1, 1, 0]
    },
    index=['0', '1', '2', '3', '4', '5'])

# Apply the rowduction to the DataFrame.
rowductioned_df = krd.rowduct(
    df,
    rowduction_target=[0, 1],
    epsilon=0.5,
    label_col='Label',
    rowduction_method='separately',
    remove_isolated_points=False)
Getting Help
If you have usage questions or have found a bug, the best place to go is the issue tracker of the GitHub repository, where you can create an issue. For other matters, you can send an email to kern.rowduction@gmail.com.
Contributing to Kern-Rowduction
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. Most development discussions take place on GitHub in this repo or by email between the contributors.
In order to:
- test the code, execute
make test
in the root folder.
- lint the code, execute
make lint
in the root folder.
- update the Sphinx documentation, execute
make html
in the docs folder.
Feel free to ask questions or make suggestions; both are welcome!
License
Copyright (c) 2021, Kern-Rowduction. Work released under MIT License.
Initial authors:
- Hichem Boughattas : hichem.boughattas@protonmail.com
- Hamza Bouanani : h.bouanani97@gmail.com
Source Distribution
File details
Details for the file Kern-Rowduction-0.0.4.tar.gz.
File metadata
- Download URL: Kern-Rowduction-0.0.4.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.8.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | ec20470a2b124cf45011a9429eff81420c396aa9b0c835c589ea49529f5e2319
MD5 | 24a510faefdef7cb70580aa51fa693ef
BLAKE2b-256 | cb4e0aebd202341a7917b708020d99988fb30752ae3d52560526a6cd5c60f95c