Matches datapoints from an imbalanced class to a balanced one.

General Class Balancer

This program finds a subset of your dataset in which confounding factors are balanced across classes (also known as "data matching"). It works with any combination of categorical and continuous confounding variables. Given a labeled dataset with any number of classes and a set of confounding factors for each datapoint, it matches datapoints across classes so that the distribution of each confounding factor is the same in every class. This may be used to sample a training set on which a deep learning model will not learn to use the confounding factors for its classification.
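To make the idea of matching concrete, here is a minimal sketch of exact matching on a single categorical confound, written in plain Python. It only illustrates the concept and is not the package's algorithm; the function name `exact_match` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def exact_match(labels, confound, seed=0):
    """Return a boolean mask keeping, for every confound value, the same
    number of datapoints from each class (hypothetical helper)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))

    # Group datapoint indices by (confound value, class label).
    groups = defaultdict(list)
    for i, (y, c) in enumerate(zip(labels, confound)):
        groups[(c, y)].append(i)

    keep = [False] * len(labels)
    for value in set(confound):
        # The rarest class at this confound value limits how many we can keep.
        n = min(len(groups[(value, y)]) for y in classes)
        for y in classes:
            for i in rng.sample(groups[(value, y)], n):
                keep[i] = True
    return keep


# Toy data: class 1 is mostly "F", class 0 is mostly "M".
labels = [1, 1, 1, 1, 0, 0, 0, 0]
sex = ["F", "F", "F", "M", "M", "M", "M", "F"]
mask = exact_match(labels, sex)
```

After matching, each class keeps one "F" and one "M" datapoint, so sex no longer separates the classes. Continuous confounds cannot be matched on exact equality like this, so they need a different criterion.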

To install:

pip3 install general-class-balancer

To use with a Pandas dataframe:

import pandas as pd
# Assumed import path, based on the package name; adjust if it differs
from general_class_balancer import class_balance

df = pd.read_pickle('dementia_diagnostic_records.pkl')

# Match the classes in the "Dementia" column by age and sex
selection = class_balance(df, class_col="Dementia", confounds=["Age", "Sex"], plim=0.1)

# selection is an array of booleans; use it to take the matched subset of the dataframe
df = df[selection]

This matches patients across the classes in the "Dementia" column by age and sex. Note that only class labels are supported, not continuous labels.
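Since only class labels are supported, a continuous label would need to be binned into classes first. The following is a minimal sketch of one way to do that in plain Python; the helper `to_class_label`, the scores, and the cut-offs are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def to_class_label(value, edges):
    """Map a continuous value to an integer class index using bin edges.

    With edges [e0, e1, ...], values below e0 get class 0, values in
    [e0, e1) get class 1, and so on.
    """
    return bisect_right(edges, value)


# Hypothetical continuous score binned into three classes.
scores = [12.5, 24.0, 27.9, 30.0, 18.3]
edges = [24.0, 28.0]  # illustrative cut-offs, not clinical thresholds
score_class = [to_class_label(s, edges) for s in scores]
```

With a dataframe, the same mapping could be stored in a new column (for example with `Series.map`, or with `pd.cut`) and that column passed as `class_col`.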

A version of this method was originally introduced in https://arxiv.org/abs/2002.07874 to ensure that deep learning classifications of sex based on brain activity did not take into account head motion or intracranial volume. Both differ statistically between the sexes and affect measurements of brain activity, but we did not want the machine learning model to consider them. A more detailed explanation of the method may be found in that paper. Below is a pictorial description of the algorithm (assuming plim = 0.10).

[Figure: pictorial description of the algorithm]
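The role of a threshold such as plim can be illustrated with a small sketch. It assumes that plim is a lower bound on the p-value of a statistical test comparing a confound between two classes; that reading is an assumption here, and the paper remains the authoritative description. The permutation test below is a stand-in chosen because it needs only the standard library, and the function name and age values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Stand-in for whatever test the package applies; returns the add-one
    corrected fraction of shufflings whose mean difference is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


plim = 0.10  # same threshold value as assumed for the figure

ages_class0 = [61, 64, 66, 70, 73, 75, 78, 80]
ages_matched = [62, 63, 67, 70, 72, 76, 77, 81]    # similar ages
ages_unmatched = [41, 44, 46, 50, 53, 55, 58, 60]  # 20 years younger

p_matched = permutation_p_value(ages_class0, ages_matched)
p_unmatched = permutation_p_value(ages_class0, ages_unmatched)
```

Under this reading, a pair of classes counts as balanced on age when the p-value exceeds plim, as it does for `ages_matched`, and as unbalanced when it falls below, as it does for `ages_unmatched`.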

Everything in the presented code operates on numpy arrays. The code is provided along with a script that simulates data from random variables. To try it, simply run:

python random_example_test.py
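For a sense of what simulated confounded data might look like, here is a minimal sketch using only the standard library. It does not reproduce the contents of random_example_test.py; the generating model, the column names, and the class sizes are assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)


def simulate_record(label):
    """Draw one record whose confounds depend on the class label."""
    # Assumed toy model: positives are older on average and more often female.
    age = rng.gauss(75 if label else 65, 8)
    sex = "F" if rng.random() < (0.6 if label else 0.4) else "M"
    return {"Dementia": label, "Age": age, "Sex": sex}


# Imbalanced classes: 200 negatives, 100 positives.
records = [simulate_record(0) for _ in range(200)]
records += [simulate_record(1) for _ in range(100)]

# Before matching, the mean age differs clearly between the classes.
mean_age_pos = mean(r["Age"] for r in records if r["Dementia"] == 1)
mean_age_neg = mean(r["Age"] for r in records if r["Dementia"] == 0)
```

Records like these could be loaded with `pd.DataFrame(records)` and passed to `class_balance` as in the usage example above, after which the age and sex distributions of the selected rows can be compared between classes.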
