
A method for selecting samples by spreading the training data evenly.


Kennard Stone


What is this?

This package implements the Kennard-Stone algorithm, which partitions data so that the selected samples are spread evenly, behind a scikit-learn-like interface. (See References for details of the algorithm.)
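For orientation, the classic Kennard-Stone selection from the literature can be sketched in a few lines of plain Python. This is a minimal illustration of the general max-min idea, not this package's implementation: the function name `kennard_stone_order` is hypothetical, only Euclidean distance is used, and the choice of starting points may differ from what the package does.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def kennard_stone_order(points):
    """Return sample indices in classic Kennard-Stone selection order.

    Hypothetical helper for illustration only; not part of the package.
    """
    n = len(points)
    if n < 2:
        return list(range(n))

    # Full pairwise distance table; this O(n^2) step dominates the cost.
    d = [[dist(p, q) for q in points] for p in points]

    # Seed the selection with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = set(range(n)) - {first, second}

    # min_dist[k]: distance from sample k to its nearest selected sample.
    min_dist = {k: min(d[k][first], d[k][second]) for k in remaining}

    while remaining:
        # Pick the sample that is farthest from everything selected so far.
        nxt = max(sorted(remaining), key=lambda k: min_dist[k])
        order.append(nxt)
        remaining.remove(nxt)
        for k in remaining:
            min_dist[k] = min(min_dist[k], d[k][nxt])
    return order


print(kennard_stone_order([(0.0,), (1.0,), (2.0,), (10.0,)]))  # [0, 3, 2, 1]
```

Samples early in the ordering are the most spread out, so a common convention is to take them as the training set and leave the rest for testing.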

[Figure: simulation GIF]

How to install

PyPI

pip install kennard-stone

The project site is here.

Anaconda

conda install -c conda-forge kennard-stone

The project site is here.

numpy and scikit-learn are required to run this package.

How to use

The functions and classes can be used like their scikit-learn counterparts.

See example for details.

In the following, X denotes arbitrary explanatory variables, y denotes an arbitrary objective variable, and estimator denotes an arbitrary prediction model that conforms to the scikit-learn API.

train_test_split

kennard_stone

from kennard_stone import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scikit-learn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=334
)

KFold

kennard_stone

from kennard_stone import KFold

# Always shuffled and uniquely determined for a data set.
kf = KFold(n_splits=5)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]

scikit-learn

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=334)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]

Other usages

Wherever scikit-learn accepts a cv argument, you can pass a KFold object from this package, so it works with many scikit-learn functions.

An example is cross_validate.

kennard_stone

from kennard_stone import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5)
print(cross_validate(estimator, X, y, cv=kf))

scikit-learn

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5, shuffle=True, random_state=334)
print(cross_validate(estimator, X, y, cv=kf))

OR

from sklearn.model_selection import cross_validate

print(cross_validate(estimator, X, y, cv=5))

Notes

There is no notion of random_state or shuffle because the partitioning is uniquely determined by the dataset. Passing these arguments does not raise an error, but they have no effect on the result, so take care not to rely on them.
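One way to confirm this behavior on your own data is to call a split function several times and compare the results. The helper below is a small sketch for that purpose; `split_is_repeatable` and `_as_list` are hypothetical names, not part of this package or of scikit-learn, and the helper works with any function that has a `train_test_split`-like signature.

```python
def _as_list(part):
    """Convert an array-like (NumPy array, list, tuple) to a plain list."""
    return part.tolist() if hasattr(part, "tolist") else list(part)


def split_is_repeatable(split_fn, X, y, repeats=3, **kwargs):
    """Return True if split_fn(X, y, **kwargs) yields identical parts on every call."""
    reference = [_as_list(part) for part in split_fn(X, y, **kwargs)]
    for _ in range(repeats - 1):
        current = [_as_list(part) for part in split_fn(X, y, **kwargs)]
        if current != reference:
            return False
    return True
```

With the package installed, `split_is_repeatable(train_test_split, X, y, test_size=0.2)` using `train_test_split` from kennard_stone is expected to return True, whereas the scikit-learn version called without random_state would typically return False.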

To run the notebook in the example directory, you need to install pandas, matplotlib, seaborn, tqdm, and jupyter in addition to the packages in requirements.txt.

Parallelization (since v2.1.0)

This algorithm is computationally intensive and can take a long time on large datasets. To address this, I have optimized the algorithm and implemented parallelization since v2.1.0. As in scikit-learn, parallelization is controlled with the n_jobs argument.

from kennard_stone import KFold, train_test_split

# Parallelized KFold
kf = KFold(n_splits=5, n_jobs=-1)

# Parallelized train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, n_jobs=-1
)
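The changelog notes that the parallelized step is the distance calculation. As a rough illustration of what n_jobs and metric control at that stage, scikit-learn's own `pairwise_distances` accepts both arguments. Whether this package calls it internally is an assumption here, and the helper name `distance_matrix` is hypothetical.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def distance_matrix(X, metric="euclidean", n_jobs=None):
    """Full pairwise distance matrix (hypothetical helper, for illustration).

    n_jobs follows the scikit-learn convention: None means a single job,
    -1 means use all available CPU cores.
    """
    X = np.asarray(X, dtype=float)
    return pairwise_distances(X, metric=metric, n_jobs=n_jobs)


D = distance_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(D.shape)  # (3, 3)
```

For small inputs the overhead of starting worker processes usually outweighs the gain, so n_jobs=-1 tends to pay off only when the number of samples is large.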

LICENSE

MIT Licence

Copyright (c) 2021 yu9824

References

Papers

Sites

History

v2.0.0

  • Define the Extended Kennard-Stone algorithm (multi-class), i.e. improve the KFold algorithm.
  • Remove the alternate argument from KFold.
  • Remove the pandas requirement.

v2.0.1

  • Fix a bug with Python 3.7.

v2.1.0

  • Optimize the algorithm.
  • Handle large datasets.
    • Parallelize the distance calculation (add n_jobs argument).
    • Replace recursive functions with for-loops.
  • Support distance metrics other than "euclidean" (add metric argument).

v2.1.1

  • Fix a bug when metric="nan_euclidean".
