
Kern-Rowduction: a package that reduces the number of rows in (imbalanced) datasets, i.e. undersamples them, using graph kernelization methods.


Kern-Rowduction: undersampling by graph kernelization

What is it?

Kern-Rowduction is a ready-to-use package that improves the quality of a data set by deleting near-duplicate rows. It converts the data set into a directed graph and extracts a quasi-kernel of that graph, which represents the reduced data. Using the reduced data instead of the original data can improve the computational and statistical performance of a machine learning algorithm.
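
The graph idea above can be illustrated with a minimal sketch. It is not the package's implementation: the function names build_digraph and quasi_kernel, the edge rule (two rows are linked when their Euclidean distance is at most epsilon) and the orientation (from the lower row index to the higher one) are all assumptions made for illustration. A quasi-kernel is taken here to be a set of vertices with no edge between them such that every other vertex reaches the set in at most two steps; the two-pass construction follows the classical existence argument of Chvátal and Lovász.

```python
import math
from itertools import combinations


def build_digraph(points, epsilon):
    """Hypothetical edge rule: i -> j (with i < j) when the two points
    lie within `epsilon` of each other (Euclidean distance)."""
    succ = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= epsilon:
            succ[i].add(j)
    return succ


def quasi_kernel(succ):
    """Return an independent set that every other vertex reaches in <= 2 steps.

    `succ` maps each vertex to the set of its out-neighbours.
    """
    # Build the in-neighbour map from the out-neighbour map.
    pred = {v: set() for v in succ}
    for u, outs in succ.items():
        for w in outs:
            pred[w].add(u)

    # Pass 1: pick the next surviving vertex, then discard its in-neighbours.
    removed, picked = set(), []
    for v in succ:
        if v in removed:
            continue
        picked.append(v)
        removed.add(v)
        removed |= pred[v]

    # Pass 2: walk the picked vertices backwards and keep a vertex only if
    # it has no out-neighbour that was already kept.
    kernel = set()
    for v in reversed(picked):
        if not (succ[v] & kernel):
            kernel.add(v)
    return kernel


# Two pairs of near-duplicate points plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 10.0)]
kept = quasi_kernel(build_digraph(points, epsilon=0.5))
print(sorted(kept))  # [1, 3, 4]
```

In this toy example one point of each near-duplicate pair is dropped and the isolated point is kept, so five rows are reduced to three.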

Why use it?

The Kern-Rowduction package has the following goals:

  • Increase the quality of a data set
  • Reduce datasets and computational time / cost
  • Undersample imbalanced datasets and over-represented cohorts
  • Improve the performance of statistical and predictive models

Below are some use cases of the Kern-Rowduction package:

  • Rebalance the 0 and 1 classes in a binary classification problem on an imbalanced population, for example one with far too many 0s
  • Undersample over-represented classes in multiclass classification
  • Reduce the influence of given ranges of values in a regression
  • Reduce the size of a dataset without losing its 'significant' values, in order to improve computational time / cost
  • Improve feature engineering and machine learning models in general

Installation

The source code is currently hosted on GitHub at: https://github.com/kern-rowduction/kern-rowduction

The latest released version is available from the Python Package Index (PyPI):

pip install kern_rowduction

Dependencies

Documentation

The official documentation is hosted on GitHub: https://kern-rowduction.github.io/Kern-Rowduction/

Sample Usage

Below is a usage example in which a simple DataFrame is 'rowductioned':

import kern_rowduction as krd
import pandas as pd

# Small example DataFrame with four features and a binary 'Label' column.
df = pd.DataFrame(
    {
        'A': [20, 21, 6, 5, 6, 91],
        'B': [11, 12, 1, 14, 113, 1],
        'C': [51, 50, 2, 21, 40, 95],
        'D': [63, 65, 54, 12, 70, 98],
        'Label': [0, 0, 1, 1, 1, 0],
    },
    index=['0', '1', '2', '3', '4', '5'],
)

rowductioned_df = krd.rowduct(
    df,
    rowduction_target=[0, 1],
    epsilon=0.5,
    label_col='Label',
    rowduction_method='separately',
    remove_isolated_points=False,
)

Getting Help

If you have usage questions or have found a bug, the best way to get help is to create an issue on the GitHub repository. For anything else, you can send an email to kern.rowduction@gmail.com.

Contributing to Kern-Rowduction

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. Most development discussions take place on GitHub in this repo or by email between the contributors.

To:

  • test the code: run make test in the root folder.
  • lint the code: run make lint in the root folder.
  • update the Sphinx documentation: run make html in the docs folder.

Feel free to ask questions or make suggestions; they are welcome!

License

Copyright (c) 2021, Kern-Rowduction. Work released under MIT License.

Initial authors :
