Skip to main content

Group disjunct and stratified data splits

Project description

Machine learning data partitioning

Author

Uwe Reichel, audEERING GmbH, Gilching, Germany

Purpose

  • machine learning data splitting tool that allows for:
    • group-disjunct splits (e.g. different speakers in train, dev, and test partition)
    • stratification on multiple target and grouping variables (e.g. emotion, gender, language)

Installation

From PyPI

  • set up a virtual environment venv_splitutils, activate it, and install splitutils. For Linux this works e.g. as follows:
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install splitutils

From GitHub

$ git clone git@github.com:reichelu/spliutils.git
$ cd splitutils/
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
$ (venv_splitutils) $ pip install -r requirements.txt

Synopsis

optimize_traintest_split()

def optimize_traintest_split(X, y, split_on, stratify_on, weight=None,
                             test_size=.1, k=30, seed=42):

    ''' optimize group-disjunct split into training and test set which is guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - test set proportion in X should be close to test_size (which is the test
      proportion in set(split_on))

    Parameters:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
      if type(y[0]) in ["str", "int"]: y is assumed to be categorical, so that it is
      additionally tested that all partitions cover all classes.
      Else y is assumed to be numeric and no coverage test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
      on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
      the split should be stratified on (groupings could e.g. be sex, age class, etc).
      Dict-Values are np.array-s of length N that contain the variable values. All
      variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
      contribution to the optimization score. Uniform weighting by default. Additional
      key: "size_diff" defines how test size diff should be weighted.
    test_size: (float) test proportion in set(split_on), e.g. 10% of speakers to be
      held-out
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "size_testset_in_spliton": intended test_size
        "size_testset_in_X": optimized test proportion in X
        "p_ref_{c}": reference class distribution calculated from stratify_on[c]
        "p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
    '''

optimize_traindevtest_split()

def optimize_traindevtest_split(X, y, split_on, stratify_on, weight=None,
                                dev_size=.1, test_size=.1, k=30, seed=42):

    ''' optimize group-disjunct split into training, dev, and test set, which is
    guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - test set proportion in X should be close to test_size (which is the test
      proportion in set(split_on))

    Parameters:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
      if type(y[0]) in ["str", "int"]: y is assumed to be categorical, so
         that it is additionally tested that all partitions cover all classes.
         Else y is assumed to be numeric and no coverage test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
      on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
      the split should be stratified on (groupings could e.g. be sex, age class, etc).
      Dict-Values are np.array-s of length N that contain the variable values. All
      variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
      contribution to the optimization score. Uniform weighting by default. Additional
      key: "size_diff" defines how the corresponding size differences should be weighted.
    dev_size: (float) proportion in set(split_on) for dev set, e.g. 10% of speakers
      to be held-out
    test_size: (float) test proportion in set(split_on) for test set
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    dev_i: (np.array) dev set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "dev_size_in_spliton": intended grouping dev_size
        "dev_size_in_X": optimized dev proportion of observations in X
        "test_size_in_spliton": intended grouping test_size
        "test_size_in_X": optimized test proportion of observations in X
        "p_ref_{c}": reference class distribution calculated from stratify_on[c]
        "p_dev_{c}": dev set class distribution calculated from stratify_on[c][dev_i]
        "p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
    '''

binning()

def binning(x, nbins=2, lower_boundaries=None, seed=42):

    '''
    bins numeric data.

    If X is one-dimensional:
        binning is done either intrinsically into nbins classes
        based on an equidistant percentile split, or extrinsically
        by using the lower_boundaries values.
    If X is two-dimensional
        binning is done by kmeans clustering into nbins clusters

    Parameters:
    x: (list, np.array) with numeric data.
    nbins: (int) number of bins
    lower_boundaries: (list) of lower bin boundaries.
      If y is 1-dim and lower_boundaries is provided, nbins will be ignored
      and y is binned extrinsically. The first value of lower_boundaries
      is always corrected not to be higher than min(y).
    seed: (int) random seed for kmeans

    Returns:
    c: (np.array) integers as bin IDs
    '''

Reference

if you use this software for a publication please cite:

Reichel, U.: splitutils - machine learning data partitioning software, version 0.3.0. doi:10.5281/zenodo.10793086, 2024.

@Misc{splitutils,
  author =       {Reichel, U.},
  title =        {splitutils -- machine learning data partitioning software, version 0.3.0},
  howpublished = {doi:10.5281/zenodo.10793086},
  year =         {2024}
}

Usage

Example 1: Split dummy data into training and test partitions

  • see scripts/run_traintest_split.py
  • partitions are:
    • disjunct on categorical "split_var"
    • stratified on categorical "target", "strat_var1", "strat_var2"
    • each contain all levels of "target"
import numpy as np
import os
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import optimize_traindevtest_split

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# feature array
data = np.random.rand(100, 20)

# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["L", "M"], size=n, replace=True),
    "strat_var2": np.random.choice(["N", "O"], size=n, replace=True)
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_in
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}

# test partition proportion (from 0 to 1)
test_size = .2

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, test_i, info = optimize_traintest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Example 2: Split dummy data into training, development, and test partitions

  • see scripts/run_traindevtest_split.py
  • Partitions are
    • disjunct on categorical "split_var"
    • stratified on categorical "target", "strat_var1", "strat_var2"
    • each contain all levels of "target"
import numpy as np
import os
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import optimize_traindevtest_split

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# feature array
data = np.random.rand(100, 20)

# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["F", "G"], size=n, replace=True),
    "strat_var2": np.random.choice(["H", "I"], size=n, replace=True)
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_in
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}

# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric

  • see scripts/run_traindevtest_split_with_binning.py
  • Partitions are
    • disjunct on categorical "split_var"
    • stratified on numeric "target", and on 3 other numeric stratification variables
import numpy as np
import os
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import (
    binning,
    optimize_traindevtest_split
)

"""
example script how to split dummy data into training, development,
and test partitions that are
* disjunct on categorical "split_var"
* stratified on numeric "target", and on 3 other numeric stratification
  variables
"""

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# features
data = np.random.rand(n, 20)

# numeric target variable
num_target = np.random.rand(n)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# further numeric variables to stratify on
num_strat_vars = np.random.rand(n, 3)

# intrinsically bin target into 3 bins by equidistant
# percentile boundaries
binned_target = binning(num_target, nbins=3)

# ... alternatively, a variable can be extrinsically binned by
# specifying lower boundaries:
# binned_target = binning(num_target, lower_boundaries=[0, 0.33, 0.66])

# bin other stratification variables into a single variable with 6 bins
# (2-dim input is binned by StandardScaling and KMeans clustering)
binned_strat_var = binning(num_strat_vars, nbins=6)

# ... alternatively, each stratification variable could be binned
# individually - intrinsically or extrinsically the same way as num_target
# strat_var1 = binning(num_strat_vars[:,0], nbins=...) etc.

# now add the obtained categorical variable to stratification dict
stratif_vars = {
    "target": binned_target,
    "strat_var": binned_strat_var
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_in
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes
weights = {
    "target": 2,
    "strat_var": 1,
    "size_diff": 1
}

# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=num_target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Algorithm

  • find optimal train, dev, and test set split based on:
    • disjunct split of a categorical grouping variable G (e.g. speaker)
    • optimized joint stratification on an arbitrary amount of categorical target and grouping variables (e.g. emotion, gender, ...)
    • close match of partition proportions in G and underlying dataset X
  • brute-force optimization on k disjunct splits of G
  • score to be minimzed for train/test set split:
(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))

v: variables to be stratified on
w(v): their weight
irad(v): information radius between reference and test set distribution of factor levels in v
d: absolute difference between test proportions of X and G, i.e. between the proportion of test
   samples and the proportion of groups (e.g. speakers) that go into the test set
w(d): its weight
  • score to be minimzed for train / dev / test set split:
(sum_v[w(v) * max_irad(v)] + w(d) * max_d) / (sum_v[w(v)] + w(d))

v: variables to be stratified on
w(v): their weight
max_irad(v): maximum information radius of reference distribution of classes in v and
             - dev set distribution,
             - test set distribution
max_d: maximum of absolute difference between proportions of X and G (see above) calculated for
       the dev and test set
w(d): its weight

How to interpret the returned info dict

  • let's look at Example 2 above. There info becomes:
{
  'score': 0.030828359568603338,
  'size_devset_in_spliton': 0.1,
  'size_devset_in_X': 0.14,
  'size_testset_in_spliton': 0.1,
  'size_testset_in_X': 0.13,
  'p_target_ref': {'B': 0.49, 'A': 0.51},
  'p_target_dev': {'A': 0.5, 'B': 0.5},
  'p_target_test': {'A': 0.5384615384615384, 'B': 0.46153846153846156},
  'p_strat_var1_ref': {'G': 0.56, 'F': 0.44},
  'p_strat_var1_dev': {'G': 0.5714285714285714, 'F': 0.42857142857142855},
  'p_strat_var1_test': {'F': 0.5384615384615384, 'G': 0.46153846153846156},
  'p_strat_var2_ref': {'I': 0.48, 'H': 0.52},
  'p_strat_var2_dev': {'I': 0.5, 'H': 0.5},
  'p_strat_var2_test': {'I': 0.46153846153846156, 'H': 0.5384615384615384}
}
  • Explanations
    • score: see above, score to be minimzed for train / dev / test set split:
    • size_devset_in_spliton: proportion of to-be-split-on variable levels in development set
    • size_devset_in_X: proportion of rows in X in development set
    • size_testset_in_spliton: proportion of to-be-split-on variable levels in test set
    • size_testset_in_X: proportion of rows in X in test set
    • p_target_ref: reference target class distribution over all data
    • p_target_dev: target class distribution in development set
    • p_target_test: target class distribution in test set
    • p_strat_var1_ref: first stratification variable's reference distribution over all data
    • p_strat_var1_dev: first stratification variable's class distribution in development set
    • p_strat_var1_test: first stratification variable's class distribution in test set
    • p_strat_var2_ref: second stratification variable's reference distribution over all data
    • p_strat_var2_dev: second stratification variable's class distribution in development set
    • p_strat_var2_test: second stratification variable's class distribution in test set
  • Remarks
    • for splitutils.optimize_traintest_split() no development set results are reported
    • all *_strat_var* keys: key names derived from key names in stratify_on argument

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

splitutils-0.4.3.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

splitutils-0.4.3-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file splitutils-0.4.3.tar.gz.

File metadata

  • Download URL: splitutils-0.4.3.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for splitutils-0.4.3.tar.gz
Algorithm Hash digest
SHA256 ab74da235b2f95d0fe2559a23136507dc6304627e2d8ec65eeab06849a24ac25
MD5 32db22312cb300e08d57308aec2cef84
BLAKE2b-256 5781b5675fa21e303482e8b9e876572d1c14a465b0bc8b3a1277729d582a638b

See more details on using hashes here.

File details

Details for the file splitutils-0.4.3-py3-none-any.whl.

File metadata

  • Download URL: splitutils-0.4.3-py3-none-any.whl
  • Upload date:
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for splitutils-0.4.3-py3-none-any.whl
Algorithm Hash digest
SHA256 241909ab79a474354e81748117c26d814c4410b54ee1816079c3907d6b00c450
MD5 408297cac07bfdebad68ff78e3e40b9f
BLAKE2b-256 4ce8ce5dfa5148ac8ad315fcea063a21462f95c2f633fa6f5b503f293d5956b6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page