Similarity-Based Stratified Splitting Algorithm

Similarity Stratified Split

A Python implementation of the Similarity-Based Stratified Splitting algorithm described in "Similarity Based Stratified Splitting: an approach to train better classifiers" (Farias, Ludermir and Bastos-Filho, 2020).

Overview

The authors propose a Similarity-Based Stratified Splitting (SBSS) technique, which uses both output-space information (the labels) and input-space information (the features) to split a dataset. A similarity function is computed among samples, and similar samples are placed in different splits. As a result, each split represents the data better during training, which the authors report leads to a more realistic performance estimate for real-world applications.
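To make the idea concrete, here is a minimal NumPy sketch of one way to read it: within each class, take a sample together with its nearest same-class neighbours and send each member of that group to a different fold. The function name `similarity_stratified_folds` and its parameters are hypothetical, and the sketch is neither the code in this package nor necessarily the exact procedure in the paper.

```python
import numpy as np

def similarity_stratified_folds(X, y, n_splits, seed=0):
    """Illustrative only: give each sample a fold id so that near
    neighbours of the same class land in different folds."""
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1)
    for label in np.unique(y):
        remaining = list(np.flatnonzero(y == label))
        while remaining:
            # Pick a random pivot and its closest same-class samples
            # (Euclidean distance is an assumption of this sketch).
            pivot = remaining[rng.integers(len(remaining))]
            dists = np.linalg.norm(X[remaining] - X[pivot], axis=1)
            group = [remaining[i] for i in np.argsort(dists)[:n_splits]]
            # Spread the group of similar samples over distinct folds.
            for fold_id, idx in zip(rng.permutation(n_splits), group):
                folds[idx] = fold_id
            remaining = [i for i in remaining if i not in group]
    return folds

X_toy = np.random.default_rng(0).random((60, 4))
y_toy = np.repeat([0, 1, 2], 20)
folds = similarity_stratified_folds(X_toy, y_toy, n_splits=3)
```

Because every group contributes at most one sample per fold, the per-class fold sizes in this sketch differ by at most one, so the result stays stratified while similar samples are kept apart.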

Install

PyPI

pip install sbss

Local

git clone https://github.com/timothyckl/similarity-stratified-split.git
cd ./similarity-stratified-split
pip install -e .

Usage

import numpy as np
from scipy.spatial import distance
from sbss import SimilarityStratifiedSplit

def get_distances(x):
    """Return the (n, n) matrix of pairwise Euclidean distances between rows of x."""
    return distance.squareform(distance.pdist(x, metric='euclidean'))

# Example data; normalizing the inputs is recommended
X = np.random.rand(1000, 128)
y = np.random.randint(0, 10, (1000,))

n_splits = 3
s = SimilarityStratifiedSplit(n_splits, dist_func=get_distances)

# Each iteration yields the train and test indices of one split
for train_index, test_index in s.split(X, y):
    print(f"Train indices: {train_index}\nTest indices: {test_index}")
    print("=" * 100)
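The comment above recommends normalized inputs, and `dist_func` is a plain callable. As a rough sketch, the helpers below min-max scale the features and build an alternative cosine-distance function. Both `min_max_scale` and `cosine_distances` are hypothetical names, not part of sbss, and the sketch assumes `dist_func` only needs to map an (n, d) array to an (n, n) distance matrix, as `get_distances` does.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature column to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    low = X.min(axis=0)
    span = X.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (X - low) / span

def cosine_distances(X):
    """Return the (n, n) matrix of cosine distances between rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)
    # Clip to remove tiny negative values caused by rounding.
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

X_demo = min_max_scale([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0]])
D = cosine_distances(X_demo)
```

Under that assumption, the function could be passed as `dist_func=cosine_distances` in place of `get_distances`, with `min_max_scale(X)` supplied as the input array.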

sklearn Compatibility

SimilarityStratifiedSplit is compatible with sklearn's cross-validation utilities. It can be passed directly to cross_val_score, GridSearchCV, and similar tools:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sbss import SimilarityStratifiedSplit

# Reuses X, y and get_distances from the Usage example
splitter = SimilarityStratifiedSplit(n_splits=3, dist_func=get_distances)
cv_scores = cross_val_score(SVC(), X, y, cv=splitter)

Note that the SBSS algorithm always requires y; passing y=None raises a ValueError.
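Since the labels drive the stratification, it can be useful to inspect the folds a splitter produces. The helper below is a small sketch with a hypothetical name, `summarize_splits`, that is not part of sbss. It accepts any iterable of (train, test) index pairs, checks that the two sides do not overlap, and counts the labels in each test fold. The folds in the demo are written by hand purely for illustration.

```python
from collections import Counter

def summarize_splits(splits, y):
    """For each (train, test) pair, check disjointness and count test labels."""
    report = []
    for train_index, test_index in splits:
        overlap = set(train_index) & set(test_index)
        if overlap:
            raise ValueError(f"train/test overlap on indices {sorted(overlap)}")
        report.append(Counter(y[i] for i in test_index))
    return report

# Hand-written folds for illustration only
y_demo = [0, 0, 1, 1]
demo_splits = [([1, 3], [0, 2]), ([0, 2], [1, 3])]
report = summarize_splits(demo_splits, y_demo)
```

With the splitter from the Usage section, the same helper could be called as `summarize_splits(s.split(X, y), y)`; roughly equal label counts across folds would indicate that the class proportions are preserved.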

References

  • Farias, F., Ludermir, T. and Bastos-Filho, C. (2020) Similarity based stratified splitting: An approach to train better classifiers, arXiv.org. Available at: https://arxiv.org/abs/2010.06099 (Accessed: 27 November 2023).
