Group-disjunct and stratified data splits

Project description

Machine learning data partition tools

Author

Uwe Reichel, audEERING GmbH, Gilching, Germany

Purpose

  • machine learning data splitting tool that allows for:
    • group-disjunct splits (e.g. different speakers in train, dev, and test partition)
    • stratification on multiple target and grouping variables (e.g. emotion, gender, language)

Installation

From PyPI

  • set up a virtual environment venv_splitutils, activate it, and install splitutils. On Linux this works, for example, as follows:
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install splitutils

From GitHub

$ git clone git@github.com:reichelu/splitutils.git
$ cd splitutils/
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install -r requirements.txt

Synopsis

optimize_traintest_split()

def optimize_traintest_split(X, y, split_on, stratify_on, weight=None,
                             test_size=.1, k=30, seed=42):

    ''' optimizes a group-disjunct split into training and test sets, guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - test set proportion in X should be close to test_size (which is the test
      proportion in set(split_on))

    Parameters:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
      if the elements of y are strings or integers, y is assumed to be
      categorical, so that it is additionally tested that all partitions
      cover all classes. Else y is assumed to be numeric and no coverage
      test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
      on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
      the split should be stratified on (groupings could e.g. be sex, age class, etc).
      Dict-Values are np.array-s of length N that contain the variable values. All
      variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
      contribution to the optimization score. Uniform weighting by default. Additional
      key: "size_diff" defines how test size diff should be weighted.
    test_size: (float) test proportion in set(split_on), e.g. 10% of speakers to be
      held out
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "size_testset_in_spliton": intended test_size
        "size_testset_in_X": optimized test proportion in X
        "p_ref_{c}": reference class distribution calculated from stratify_on[c]
        "p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
    '''
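
A minimal call sketch with dummy arrays (hypothetical variable names; see Example 1 below for a full script). The weight argument is left at its uniform default:

import numpy as np
from splitutils import optimize_traintest_split

n = 50
X = np.random.rand(n, 5)                          # feature matrix
y = np.random.choice(["A", "B"], size=n)          # categorical target
spk = np.random.choice(["s1", "s2", "s3", "s4", "s5"], size=n)  # speaker IDs

# hold out about 20% of speakers; stratify on the target only
train_i, test_i, info = optimize_traintest_split(
    X, y, split_on=spk, stratify_on={"target": y},
    test_size=0.2, k=10, seed=42
)
print(info["size_testset_in_X"])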

optimize_traindevtest_split()

def optimize_traindevtest_split(X, y, split_on, stratify_on, weight=None,
                                dev_size=.1, test_size=.1, k=30, seed=42):

    ''' optimizes a group-disjunct split into training, dev, and test sets,
    guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - dev and test set proportions in X should be close to dev_size and
      test_size (the corresponding proportions in set(split_on))

    Parameters:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
      if the elements of y are strings or integers, y is assumed to be
         categorical, so that it is additionally tested that all partitions
         cover all classes. Else y is assumed to be numeric and no coverage
         test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
      on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
      the split should be stratified on (groupings could e.g. be sex, age class, etc).
      Dict-Values are np.array-s of length N that contain the variable values. All
      variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
      contribution to the optimization score. Uniform weighting by default. Additional
      key: "size_diff" defines how the corresponding size differences should be weighted.
    dev_size: (float) proportion in set(split_on) for dev set, e.g. 10% of speakers
      to be held out
    test_size: (float) test proportion in set(split_on) for test set
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    dev_i: (np.array) dev set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "dev_size_in_spliton": intended grouping dev_size
        "dev_size_in_X": optimized dev proportion of observations in X
        "test_size_in_spliton": intended grouping test_size
        "test_size_in_X": optimized test proportion of observations in X
        "p_ref_{c}": reference class distribution calculated from stratify_on[c]
        "p_dev_{c}": dev set class distribution calculated from stratify_on[c][dev_i]
        "p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
    '''
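
A minimal sketch, reusing the dummy arrays from the sketch above (see Example 2 below for a full script); the call differs from optimize_traintest_split() only in the additional dev_size argument and the extra dev_i return value:

train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X, y, split_on=spk, stratify_on={"target": y},
    dev_size=0.1, test_size=0.1, k=10, seed=42
)
print(len(train_i), len(dev_i), len(test_i))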

binning()

def binning(x, nbins=2, lower_boundaries=None, seed=42):

    '''
    Bins numeric data.

    If x is one-dimensional:
        binning is done either intrinsically into nbins classes
        based on an equidistant percentile split, or extrinsically
        by using the lower_boundaries values.
    If x is two-dimensional:
        binning is done by kmeans clustering into nbins clusters.

    Parameters:
    x: (list, np.array) with numeric data.
    nbins: (int) number of bins
    lower_boundaries: (list) of lower bin boundaries.
      If x is 1-dim and lower_boundaries is provided, nbins will be ignored
      and x is binned extrinsically. The first value of lower_boundaries
      is always corrected not to be higher than min(x).
    seed: (int) random seed for kmeans

    Returns:
    c: (np.array) integers as bin IDs
    '''
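
A minimal sketch of both binning modes on random dummy data (the boundary values are arbitrary):

import numpy as np
from splitutils import binning

x1 = np.random.rand(100)                 # 1-dim numeric data
b_intr = binning(x1, nbins=3)            # intrinsic: equidistant percentile split
b_extr = binning(x1, lower_boundaries=[0, 0.33, 0.66])  # extrinsic boundaries

x2 = np.random.rand(100, 3)              # 2-dim numeric data
b_clus = binning(x2, nbins=4)            # kmeans clustering into 4 bins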

Usage

Example 1: Split dummy data into training and test partitions

  • see scripts/run_traintest_split.py
  • partitions are:
    • disjunct on categorical "split_var"
    • stratified on categorical "target", "strat_var1", "strat_var2"
    • each contains all levels of "target"
import numpy as np
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import optimize_traintest_split

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# feature array
data = np.random.rand(n, 20)

# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["L", "M"], size=n, replace=True),
    "strat_var2": np.random.choice(["N", "O"], size=n, replace=True)
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between intended
# and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}

# test partition proportion (from 0 to 1)
test_size = .2

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, test_i, info = optimize_traintest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Example 2: Split dummy data into training, development, and test partitions

  • see scripts/run_traindevtest_split.py
  • Partitions are
    • disjunct on categorical "split_var"
    • stratified on categorical "target", "strat_var1", "strat_var2"
    • each contains all levels of "target"
import numpy as np
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import optimize_traindevtest_split

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# feature array
data = np.random.rand(n, 20)

# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["F", "G"], size=n, replace=True),
    "strat_var2": np.random.choice(["H", "I"], size=n, replace=True)
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between intended
# and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}

# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric

  • see scripts/run_traindevtest_split_with_binning.py
  • Partitions are
    • disjunct on categorical "split_var"
    • stratified on numeric "target", and on 3 other numeric stratification variables
import numpy as np
import sys

# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)

from splitutils import (
    binning,
    optimize_traindevtest_split
)

"""
example script how to split dummy data into training, development,
and test partitions that are
* disjunct on categorical "split_var"
* stratified on numeric "target", and on 3 other numeric stratification
  variables
"""

# set seed
seed = 42
np.random.seed(seed)

# DUMMY DATA
# size
n = 100

# features
data = np.random.rand(n, 20)

# numeric target variable
num_target = np.random.rand(n)

# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)

# further numeric variables to stratify on
num_strat_vars = np.random.rand(n, 3)

# intrinsically bin target into 3 bins by equidistant
# percentile boundaries
binned_target = binning(num_target, nbins=3)

# ... alternatively, a variable can be extrinsically binned by
# specifying lower boundaries:
# binned_target = binning(num_target, lower_boundaries=[0, 0.33, 0.66])

# bin other stratification variables into a single variable with 6 bins
# (2-dim input is standard-scaled, then binned by kmeans clustering)
binned_strat_var = binning(num_strat_vars, nbins=6)

# ... alternatively, each stratification variable could be binned
# individually - intrinsically or extrinsically the same way as num_target
# strat_var1 = binning(num_strat_vars[:,0], nbins=...) etc.

# now add the obtained categorical variable to stratification dict
stratif_vars = {
    "target": binned_target,
    "strat_var": binned_strat_var
}

# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between intended
# and obtained partition sizes
weights = {
    "target": 2,
    "strat_var": 1,
    "size_diff": 1
}

# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1

# number of disjunct splits to be tried out in brute force optimization
k = 30

# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=num_target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)

# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)

Algorithm

  • find optimal train, dev, and test set split based on:
    • disjunct split of a categorical grouping variable G (e.g. speaker)
    • optimized joint stratification on an arbitrary amount of categorical target and grouping variables (e.g. emotion, gender, ...)
    • close match of partition proportions in G and underlying dataset X
  • brute-force optimization on k disjunct splits of G
  • score to be minimized for train/test set split:
(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))

v: variables to be stratified on
w(v): their weight
irad(v): information radius between reference and test set distribution of factor levels in v
d: absolute difference between test proportions of X and G, i.e. between the proportion of test
   samples and the proportion of groups (e.g. speakers) that go into the test set
w(d): its weight
  • score to be minimized for train / dev / test set split (an illustrative Python sketch of both scores follows below):
(sum_v[w(v) * max_irad(v)] + w(d) * max_d) / (sum_v[w(v)] + w(d))

v: variables to be stratified on
w(v): their weight
max_irad(v): maximum information radius of reference distribution of classes in v and
             - dev set distribution,
             - test set distribution
max_d: maximum of absolute difference between proportions of X and G (see above) calculated for
       the dev and test set
w(d): its weight
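
The following is an illustrative sketch only, assuming irad() is the base-2 Jensen-Shannon divergence between two class distributions; the exact definitions used by the package are given in the PDF linked below:

import numpy as np

def irad(p_ref, p_part):
    # information radius (Jensen-Shannon divergence) between two class
    # distributions given as dicts {class: probability}
    classes = sorted(set(p_ref) | set(p_part))
    p = np.array([p_ref.get(c, 0.0) for c in classes])
    q = np.array([p_part.get(c, 0.0) for c in classes])
    m = (p + q) / 2

    def kl(a, b):
        nz = a > 0
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))

    return (kl(p, m) + kl(q, m)) / 2

def traintest_score(p_ref, p_test, weight, size_diff):
    # weighted mean of the per-variable irad values and the absolute
    # deviation d between intended and obtained test proportions
    num = sum(weight[v] * irad(p_ref[v], p_test[v]) for v in p_ref)
    num += weight["size_diff"] * size_diff
    den = sum(weight[v] for v in p_ref) + weight["size_diff"]
    return num / den

For the train / dev / test score, irad(v) and d are simply replaced by their maxima over the dev and test sets.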

How to interpret the returned info dict

  • let's look at Example 2 above, where info becomes:
{
  'score': 0.030828359568603338,
  'size_devset_in_spliton': 0.1,
  'size_devset_in_X': 0.14,
  'size_testset_in_spliton': 0.1,
  'size_testset_in_X': 0.13,
  'p_target_ref': {'B': 0.49, 'A': 0.51},
  'p_target_dev': {'A': 0.5, 'B': 0.5},
  'p_target_test': {'A': 0.5384615384615384, 'B': 0.46153846153846156},
  'p_strat_var1_ref': {'G': 0.56, 'F': 0.44},
  'p_strat_var1_dev': {'G': 0.5714285714285714, 'F': 0.42857142857142855},
  'p_strat_var1_test': {'F': 0.5384615384615384, 'G': 0.46153846153846156},
  'p_strat_var2_ref': {'I': 0.48, 'H': 0.52},
  'p_strat_var2_dev': {'I': 0.5, 'H': 0.5},
  'p_strat_var2_test': {'I': 0.46153846153846156, 'H': 0.5384615384615384}
}
  • Explanations
    • score: the score to be minimized for the train / dev / test set split (see Algorithm above)
    • size_devset_in_spliton: proportion of to-be-split-on variable levels in development set
    • size_devset_in_X: proportion of rows in X in development set
    • size_testset_in_spliton: proportion of to-be-split-on variable levels in test set
    • size_testset_in_X: proportion of rows in X in test set
    • p_target_ref: reference target class distribution over all data
    • p_target_dev: target class distribution in development set
    • p_target_test: target class distribution in test set
    • p_strat_var1_ref: first stratification variable's reference distribution over all data
    • p_strat_var1_dev: first stratification variable's class distribution in development set
    • p_strat_var1_test: first stratification variable's class distribution in test set
    • p_strat_var2_ref: second stratification variable's reference distribution over all data
    • p_strat_var2_dev: second stratification variable's class distribution in development set
    • p_strat_var2_test: second stratification variable's class distribution in test set
  • Remarks
    • for splitutils.optimize_traintest_split() no development set results are reported
    • all *_strat_var* keys: key names are derived from the key names in the stratify_on argument
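
To eyeball how closely a split matches the reference distributions, the ref/dev/test triples can be printed side by side; a small sketch over the info dict of Example 2 (variable names taken from that example):

for var in ["target", "strat_var1", "strat_var2"]:
    print(var)
    for part in ["ref", "dev", "test"]:
        print(" ", part, info[f"p_{var}_{part}"])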

Further documentation

  • Please find further details on the split scores and numeric variable binning in this PDF
