A method for selecting samples by spreading the training data evenly.

Project description

Kennard Stone


What is this?

This package implements an algorithm for evenly partitioning data, exposed through a scikit-learn-like interface. (See References for details of the algorithm.)

[Animation: simulation of Kennard-Stone sample selection]
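For intuition, the selection rule works like farthest-point sampling: start from the two most distant samples, then repeatedly add the sample whose minimum distance to the already-selected set is largest. Below is a minimal NumPy sketch of that idea; it is illustrative only, not the package's optimized implementation.

import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone_indices(X, n_select):
    # Illustrative Kennard-Stone selection (not the package's optimized code).
    dist = cdist(X, X)  # full pairwise distance matrix
    # Seed with the two mutually farthest samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = set(range(len(X))) - set(selected)
    while len(selected) < n_select:
        rem = np.fromiter(remaining, dtype=int)
        # Distance from each remaining sample to its nearest selected sample.
        min_d = dist[np.ix_(rem, selected)].min(axis=1)
        nxt = int(rem[np.argmax(min_d)])  # the farthest-from-selected sample
        selected.append(nxt)
        remaining.remove(nxt)
    return selected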

How to install

PyPI

pip install kennard-stone

The project page on PyPI is https://pypi.org/project/kennard-stone/.

Anaconda

conda install -c conda-forge kennard-stone

The project page on conda-forge is https://anaconda.org/conda-forge/kennard-stone.

You need numpy>=1.20 and scikit-learn to run it.

How to use

You can use these classes and functions just like their scikit-learn counterparts.

See examples for details.

In the following, X denotes an arbitrary explanatory variable, y an arbitrary objective variable, and estimator an arbitrary prediction model conforming to the scikit-learn API.

train_test_split

kennard_stone

from kennard_stone import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scikit-learn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=334
)
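Putting the kennard_stone version together, here is a self-contained, runnable example (scikit-learn's diabetes dataset is used purely for illustration):

from kennard_stone import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)

# Deterministic split: no random_state is needed (or honored).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))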

KFold

kennard_stone

from kennard_stone import KFold

# Effectively shuffled, but uniquely determined by the dataset.
kf = KFold(n_splits=5)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]

scikit-learn

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=334)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]
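As with scikit-learn splitters, split yields integer position indices, so if X and y are pandas objects, index by position with .iloc:

# Assuming X is a pandas DataFrame and y a pandas Series.
for i_train, i_test in kf.split(X, y):
    X_train, X_test = X.iloc[i_train], X.iloc[i_test]
    y_train, y_test = y.iloc[i_train], y.iloc[i_test]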

Other usages

Wherever scikit-learn accepts a cv argument, you can pass a kennard_stone.KFold object and use it with various functions.

An example is cross_validate.

kennard_stone

from kennard_stone import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5)
print(cross_validate(estimator, X, y, cv=kf))

scikit-learn

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5, shuffle=True, random_state=334)
print(cross_validate(estimator, X, y, cv=kf))

OR

from sklearn.model_selection import cross_validate

print(cross_validate(estimator, X, y, cv=5))

Notes

There is no notion of random_state or shuffle because the partitioning is uniquely determined by the dataset. Passing these arguments does not raise an error; they simply have no effect on the result, so be careful.
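For example, two successive splits of the same data are identical:

import numpy as np
from kennard_stone import train_test_split

X = np.random.rand(100, 5)
y = np.random.rand(100)

split1 = train_test_split(X, y, test_size=0.2)
split2 = train_test_split(X, y, test_size=0.2)

# The partitioning depends only on the data, so the results match exactly.
for a, b in zip(split1, split2):
    assert np.array_equal(a, b)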

If you want to run the notebooks in the examples directory, you will additionally need pandas, matplotlib, seaborn, tqdm, and jupyter, beyond the packages in requirements.txt.

Distance metrics

The metric argument selects the distance used for the partitioning, "euclidean" by default. The list of valid values follows sklearn.metrics.pairwise_distances:

  • From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. These metrics support sparse matrix inputs. 'nan_euclidean' is also accepted but does not yet support sparse matrices.
  • From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']. See the documentation of scipy.spatial.distance for details on these metrics. These metrics do not support sparse matrix inputs.
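For example (assuming, per the v2.1.0 changelog, that both KFold and train_test_split accept the keyword):

from kennard_stone import KFold, train_test_split

kf = KFold(n_splits=5, metric="manhattan")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, metric="mahalanobis"
)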

Parallelization (since v2.1.0)

This algorithm is computationally intensive and can take a long time. To address this, the implementation has been optimized and parallelized since v2.1.0. As in the scikit-learn API, n_jobs can be specified to enable parallelization.

# parallelization KFold
kf = KFold(n_splits=5, n_jobs=-1)

# parallelization train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, n_jobs=-1
)

The parallelization is applied when computing the distance matrix, so it does not conflict with running something like cross_validate in parallel while using KFold.

# OK: the two n_jobs settings do not conflict with each other
cross_validate(estimator, X, y, cv=KFold(5, n_jobs=-1), n_jobs=-1)

Using GPU

If you have a GPU and have installed PyTorch, you can use it to calculate Minkowski-family distances (Manhattan, Euclidean, and Chebyshev).

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, device="cuda"
)
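A common pattern is to fall back to the CPU when no GPU is available; note that the "cpu" fallback value here is an assumption, not documented above.

import torch
from kennard_stone import train_test_split

# Use the GPU when available; otherwise fall back to the CPU (assumed value).
device = "cuda" if torch.cuda.is_available() else "cpu"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, device=device
)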

LICENSE

MIT License

Copyright (c) 2021 yu9824

References

Papers

  • Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148.

Sites

Histories

v2.0.0 (deprecated)

  • Define the extended Kennard-Stone algorithm (multi-class), i.e., improve the KFold algorithm.
  • Delete the alternate argument in KFold.
  • Drop the pandas requirement.

v2.0.1

  • Fix a bug with Python 3.7.

v2.1.0 (deprecated)

  • Optimize the algorithm.
  • Handle large datasets.
    • Parallelize the distance calculation (add the n_jobs argument).
    • Replace recursive functions with for-loops.
  • Support metrics other than "euclidean" (add the metric argument).

v2.1.1 (deprecated)

  • Fix bug when metric="nan_euclidean".

v2.1.2 (deprecated)

  • Fix details.
    • Update docstrings and typings.

v2.1.3 (deprecated)

  • Fix details.
    • Update some typings (expose the list of strings that can be used for metric).

v2.1.4

  • Fix a bug when metric is "seuclidean" or "mahalanobis".
    • Add tests covering all metrics.
  • Require numpy>=1.20.

v2.1.5

  • Delete "klusinski" metric to support scipy>=1.11

v2.1.6

  • Improve typing in kennard_stone.train_test_split
  • Add some docstrings.

v2.2.0

  • Support GPU calculations (when metric is "euclidean", "manhattan", "chebyshev", or "minkowski").
  • Support Python 3.12.

v2.2.1

  • Fix setup.cfg
  • Update 'typing'

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kennard_stone-3.0.0rc1.tar.gz (18.9 kB)

Uploaded Source

Built Distribution

kennard_stone-3.0.0rc1-py3-none-any.whl (17.6 kB)

Uploaded Python 3

File details

Details for the file kennard_stone-3.0.0rc1.tar.gz.

File metadata

  • Download URL: kennard_stone-3.0.0rc1.tar.gz
  • Upload date:
  • Size: 18.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for kennard_stone-3.0.0rc1.tar.gz
Algorithm Hash digest
SHA256 9c19ef729560e477e86ab6efbccc9245747abfa534797d72c127e50f29b9e2fb
MD5 8023e149a14c8efe3772d1300f434022
BLAKE2b-256 32bd8f25acd7aa694407f3f85b92a5290df1c5e26dcd4ca4467ae5709b5beb20

See more details on using hashes here.

Provenance

The following attestation bundles were made for kennard_stone-3.0.0rc1.tar.gz:

Publisher: release-pypi.yml on yu9824/kennard_stone

Attestations:

File details

Details for the file kennard_stone-3.0.0rc1-py3-none-any.whl.

File metadata

File hashes

Hashes for kennard_stone-3.0.0rc1-py3-none-any.whl
Algorithm Hash digest
SHA256 a43a0a5dbc045968db11ba2d9dad8184ff569832c2dc252472271c29caf4f14d
MD5 977ed96cc3e68184441c9846b5123600
BLAKE2b-256 6299b56e35a079bcee48c9f152e168327bc14c08d436ed62c99f05a61e9e3778

See more details on using hashes here.

Provenance

The following attestation bundles were made for kennard_stone-3.0.0rc1-py3-none-any.whl:

Publisher: release-pypi.yml on yu9824/kennard_stone

Attestations:
