SOAK splitting utility

SOAK: Same/Other/All K-fold Cross-Validation

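SOAK estimates how similar the learnable patterns are across different subsets of a dataset. It extends traditional K-fold cross-validation: for each test fold drawn from one subset, the training set is built from the "Same" subset, from the "Other" subsets, or from "All" subsets. Comparing test error across these three training sets indicates how well patterns learned on one subset carry over to another.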
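
The Same/Other/All idea can be sketched with NumPy alone. This is a minimal illustration rather than the soakpy implementation: the function name `soak_like_splits` and its arguments are hypothetical, the fold assignment (balanced within each subset) is an assumption, and the down-sampled categories are left out.

```python
import numpy as np

def soak_like_splits(subset_vec, n_splits=2, seed=0):
    """Yield (subset_value, category, fold_id, train_idx, test_idx) tuples.

    Hypothetical helper: illustrates Same/Other/All splitting only.
    """
    subset_vec = np.asarray(subset_vec)
    rng = np.random.default_rng(seed)
    fold_vec = np.empty(len(subset_vec), dtype=int)
    # Assign fold ids within each subset so every subset appears in every fold.
    for value in np.unique(subset_vec):
        idx = np.flatnonzero(subset_vec == value)
        fold_vec[idx] = rng.permutation(np.arange(len(idx)) % n_splits) + 1
    for fold_id in range(1, n_splits + 1):
        is_test_fold = fold_vec == fold_id
        for value in np.unique(subset_vec):
            is_same = subset_vec == value
            # The test set is always the current fold of the current subset.
            test_idx = np.flatnonzero(is_same & is_test_fold)
            train_masks = {
                "same": is_same & ~is_test_fold,    # same subset, other folds
                "other": ~is_same & ~is_test_fold,  # other subsets, other folds
                "all": ~is_test_fold,               # every subset, other folds
            }
            for category, mask in train_masks.items():
                yield value, category, fold_id, np.flatnonzero(mask), test_idx

values = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14])
subset_vec = np.array(["even" if v % 2 == 0 else "odd" for v in values])
for value, category, fold_id, train_idx, test_idx in soak_like_splits(subset_vec):
    print(value, category, fold_id, values[train_idx], values[test_idx])
```

In this sketch the "all" training set is the union of the "same" and "other" training sets for the same test fold, so no training set ever contains a test row.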

Usage

Low-level: SOAK split only

import numpy as np
import soakpy

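# --- synthetic data: 13 samples, one feature column ---
# np.append flattens its inputs, so reshape only after appending
X = np.append(np.arange(10), [10, 12, 14]).reshape(-1, 1)
y = X.ravel()
subset_vec = np.array(['even' if x % 2 == 0 else 'odd' for x in X.ravel()])

# --- iterate over the SOAK train/test splits ---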
for subset_value, category, fold_id, random_seed, train_idx_final, test_same_idx in soakpy.split(subset_vec, n_splits=2, n_random_seeds=2):
    print(f"test subset: {subset_value:6s} --- category: {category:6s} --- test fold: {fold_id}")
    print(f"y_test : {y[test_same_idx]}")
    print(f"y_train: {y[train_idx_final]}")
    print("-"*50)

Output:

test subset: even   --- category: same   --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 6  8 12 14]
--------------------------------------------------
test subset: even   --- category: other  --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [1 9]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 6 14]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 8 14]
--------------------------------------------------
test subset: even   --- category: all    --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 1  6  8  9 12 14]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [ 1 14]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 1
y_test : [ 0  2  4 10]
y_train: [12 14]
--------------------------------------------------
test subset: odd    --- category: same   --- test fold: 1
y_test : [3 5 7]
y_train: [1 9]
--------------------------------------------------
test subset: odd    --- category: other  --- test fold: 1
y_test : [3 5 7]
y_train: [ 6  8 12 14]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 1
y_test : [3 5 7]
y_train: [12 14]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 1
y_test : [3 5 7]
y_train: [ 8 14]
--------------------------------------------------
test subset: odd    --- category: all    --- test fold: 1
y_test : [3 5 7]
y_train: [ 1  6  8  9 12 14]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 1
y_test : [3 5 7]
y_train: [8 9]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 1
y_test : [3 5 7]
y_train: [ 8 14]
--------------------------------------------------
test subset: even   --- category: same   --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 0  2  4 10]
--------------------------------------------------
test subset: even   --- category: other  --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [3 5 7]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [0 2]
--------------------------------------------------
test subset: even   --- category: same-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 4 10]
--------------------------------------------------
test subset: even   --- category: all    --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 0  2  3  4  5  7 10]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [0 5]
--------------------------------------------------
test subset: even   --- category: all-ds --- test fold: 2
y_test : [ 6  8 12 14]
y_train: [ 2 10]
--------------------------------------------------
test subset: odd    --- category: same   --- test fold: 2
y_test : [1 9]
y_train: [3 5 7]
--------------------------------------------------
test subset: odd    --- category: other  --- test fold: 2
y_test : [1 9]
y_train: [ 0  2  4 10]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 2
y_test : [1 9]
y_train: [0 4]
--------------------------------------------------
test subset: odd    --- category: other-ds --- test fold: 2
y_test : [1 9]
y_train: [ 2 10]
--------------------------------------------------
test subset: odd    --- category: all    --- test fold: 2
y_test : [1 9]
y_train: [ 0  2  3  4  5  7 10]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 2
y_test : [1 9]
y_train: [2 7]
--------------------------------------------------
test subset: odd    --- category: all-ds --- test fold: 2
y_test : [1 9]
y_train: [2 5]
--------------------------------------------------
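
In the output above, every category ending in `-ds` has a two-element training set, which is the size of the smallest training set in this toy example, and each such category is printed twice, presumably once per random seed (`n_random_seeds=2`). Assuming `-ds` denotes a training set down-sampled to that smaller size, the step could look like the sketch below. The helper name `downsample` and its arguments are hypothetical and are not part of soakpy.

```python
import numpy as np

def downsample(train_idx, target_size, random_seed):
    """Return a sorted random subset of train_idx with target_size elements.

    Hypothetical helper: samples without replacement, seeded for repeatability.
    """
    train_idx = np.asarray(train_idx)
    if target_size >= len(train_idx):
        # Nothing to remove: the set is already small enough.
        return np.sort(train_idx)
    rng = np.random.default_rng(random_seed)
    return np.sort(rng.choice(train_idx, size=target_size, replace=False))

# "all" training values for test subset "even", fold 1, copied from the output above
all_train = np.array([1, 6, 8, 9, 12, 14])
for random_seed in (1, 2):
    print(random_seed, downsample(all_train, target_size=2, random_seed=random_seed))
```

Down-sampling in this way lets the larger training sets be compared with the smallest one at equal sample size, so a difference in test error is not simply an effect of having more training rows.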

High-level: analyze a dataset and visualize the results

import soakpy
import pandas as pd

df = pd.read_csv("https://github.com/lamtung16/soak_regression/raw/refs/heads/main/data/WorkersCompensation.csv.xz")
soak_obj = soakpy.SOAK(df=df, subset_col="Gender", target_col="UltimateIncurredClaimCost")
soak_obj.analyze(model_list=["featureless", "tree"], n_splits=5, n_random_seeds=5, log_target=True)
soak_obj.visualize(subset_value='M', model="tree", metric="rmse", figsize=(12, 2.5))

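
The call above passes `"featureless"` as a model, `"rmse"` as a metric, and `log_target=True`. As a rough sketch of what such a baseline evaluation could compute, the code below assumes that "featureless" means always predicting the mean training target and that `log_target=True` applies a natural log to the target before fitting and scoring. Both are assumptions, and the function names are hypothetical, not soakpy's API.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def featureless_rmse(y_train, y_test, log_target=True):
    """RMSE of a baseline that ignores features and predicts the training mean."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    if log_target:
        # Assumption: natural log of a strictly positive target.
        y_train, y_test = np.log(y_train), np.log(y_test)
    prediction = np.full_like(y_test, y_train.mean())
    return rmse(y_test, prediction)

# Synthetic, strictly positive "claim costs" for illustration only.
rng = np.random.default_rng(0)
cost = rng.lognormal(mean=8.0, sigma=1.0, size=200)
print(featureless_rmse(cost[:150], cost[150:]))
```

A learned model such as a tree is only informative on a given split if its error falls below this featureless baseline on the same train/test indices.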
