Similarity-Based Stratified Splitting Algorithm
Similarity Stratified Split
Implementation of the Similarity-Based Stratified Splitting algorithm described in "Similarity Based Stratified Splitting: an approach to train better classifiers" (Farias, Ludermir and Bastos-Filho, 2020).
Overview
The authors propose Similarity-Based Stratified Splitting (SBSS), a technique that uses information from both the output space (labels) and the input space (features) to split a dataset. Splits are generated with a similarity function over the samples, so that similar samples are placed in different splits. According to the authors, this makes each split more representative of the data during training and yields performance estimates that are closer to what a model achieves in real-world use.
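The idea of spreading similar samples across splits can be illustrated with a small NumPy sketch. This is a simplified illustration, not the package's implementation and not necessarily the exact procedure from the paper: the function name `sbss_sketch`, the choice of the first remaining sample as pivot, and the use of Euclidean distance are all assumptions made for the example.

```python
import numpy as np


def sbss_sketch(X, y, n_splits, seed=0):
    """Return a fold id per sample; similar same-class samples get different folds.

    Hypothetical helper for illustration only (not part of the sbss package).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1, dtype=int)

    for label in np.unique(y):
        # Work on one class at a time so every fold keeps the class balance.
        remaining = np.flatnonzero(y == label).tolist()
        while remaining:
            idx = np.array(remaining)
            pivot = idx[0]  # assumption: simplest possible pivot choice
            dist = np.linalg.norm(X[idx] - X[pivot], axis=1)
            # The pivot plus its nearest neighbours form one group of similar samples.
            group = idx[np.argsort(dist, kind="stable")[:n_splits]]
            # Each member of the group goes to a different, randomly ordered fold.
            folds[group] = rng.permutation(n_splits)[: len(group)]
            taken = set(group.tolist())
            remaining = [i for i in remaining if i not in taken]
    return folds


# Example: derive train/test indices for fold 0 from the fold assignment.
X_demo = np.random.default_rng(0).random((30, 4))
y_demo = np.repeat([0, 1, 2], 10)
fold_ids = sbss_sketch(X_demo, y_demo, n_splits=3)
test_index = np.flatnonzero(fold_ids == 0)
train_index = np.flatnonzero(fold_ids != 0)
```

Because every group of nearest neighbours is spread over distinct folds, the per-class sample counts of any two folds differ by at most one in this sketch.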
Install
PyPI
pip install sbss
Local
git clone https://github.com/timothyckl/similarity-stratified-split.git
cd ./similarity-stratified-split
pip install -e .
Usage
import numpy as np
from scipy.spatial import distance
from sbss import SimilarityStratifiedSplit
def get_distances(x):
    distances = distance.squareform(distance.pdist(x, metric='euclidean'))
    return distances
# inputs are recommended to be normalized
X = np.random.rand(1000, 128)
y = np.random.randint(0, 10, (1000,))
n_splits = 3
s = SimilarityStratifiedSplit(n_splits, dist_func=get_distances)
for train_index, test_index in s.split(X, y):
    print(f"Train indices: {train_index}\nTest indices: {test_index}")
    print("="*100)
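The usage example above passes a function that maps a 2-D array of samples to a square matrix of pairwise distances, and recommends normalized inputs. The following sketch shows one way to prepare both with NumPy only. The helper names `min_max_normalize` and `get_manhattan_distances` are hypothetical, and it is an assumption (based on the example above) that any function returning an `(n_samples, n_samples)` distance matrix can be used as `dist_func`.

```python
import numpy as np


def min_max_normalize(x):
    """Scale every feature column to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (x - lo) / span


def get_manhattan_distances(x):
    """Pairwise Manhattan (L1) distances as an (n_samples, n_samples) matrix."""
    x = np.asarray(x, dtype=float)
    # Broadcasting builds all pairwise differences; memory grows with n_samples**2.
    return np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)


X_raw = np.random.default_rng(0).random((100, 8)) * 50.0
X_norm = min_max_normalize(X_raw)
D = get_manhattan_distances(X_norm)
```

Under that assumption, `get_manhattan_distances` could be passed in place of `get_distances`. For larger datasets, `scipy.spatial.distance.pdist` with `metric='cityblock'` computes the same distances with less memory.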
sklearn Compatibility
SimilarityStratifiedSplit is compatible with sklearn's cross-validation utilities. It can be passed directly to cross_val_score, GridSearchCV, and similar tools:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
splitter = SimilarityStratifiedSplit(n_splits=3, dist_func=get_distances)
cv_scores = cross_val_score(SVC(), X, y, cv=splitter)
Note that the SBSS algorithm always requires y; passing y=None raises a ValueError.
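Since the splitter yields (train_index, test_index) pairs, the resulting folds can be inspected with plain NumPy before they are handed to a cross-validation utility. The sketch below counts how many test samples of each class land in each fold. The helper `class_counts_per_fold` is hypothetical, and the hand-made folds only stand in for the output of `split(X, y)`.

```python
import numpy as np


def class_counts_per_fold(splits, y):
    """Return an (n_folds, n_classes) array of test-set class counts."""
    y = np.asarray(y)
    classes = np.unique(y)
    rows = []
    for _, test_index in splits:
        test_labels = y[np.asarray(test_index)]
        rows.append([(test_labels == c).sum() for c in classes])
    return np.array(rows)


# Hand-made folds standing in for the pairs produced by a splitter.
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
demo_splits = [
    (np.array([1, 3, 5, 7]), np.array([0, 2, 4, 6])),
    (np.array([0, 2, 4, 6]), np.array([1, 3, 5, 7])),
]
counts = class_counts_per_fold(demo_splits, y_demo)
print(counts)
```

Rows that are roughly equal indicate that the class proportions are preserved across folds.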
References
- Farias, F., Ludermir, T. and Bastos-Filho, C. (2020) Similarity based stratified splitting: An approach to train better classifiers, arXiv.org. Available at: https://arxiv.org/abs/2010.06099 (Accessed: 27 November 2023).
File details
Details for the file sbss-0.0.5.tar.gz.
File metadata
- Download URL: sbss-0.0.5.tar.gz
- Upload date:
- Size: 6.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2f1804cf132ec10f0f5054bb8f9785ff33bd3efc4990dffb98ee363a44555b8 |
| MD5 | aef81162bc7816c55ee088d7738d5da0 |
| BLAKE2b-256 | 5fb2f546b1ce0c2f64944f7ffcad63e378a67726b4d6d9bb74888b1e6c5f5077 |
File details
Details for the file sbss-0.0.5-py3-none-any.whl.
File metadata
- Download URL: sbss-0.0.5-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 315457b3879b7d3a57a262b18daa8dd608dcab6a2733db86e5d8e269d50ab96c |
| MD5 | 8bfc0e63bb8d60e52079c1cf83fd12da |
| BLAKE2b-256 | b196a80f66b160f7d7a324dcd374c1c09b4a4d0bb0a1dd4cf77082a29e014e9f |