Calculate case difficulty within datasets
CDmetrics
Case Difficulty (Instance Hardness) metrics in Python, with three ways to measure the difficulty of individual cases: CDmc, CDdm, and CDpu.
Case Difficulty Metrics
- Case Difficulty Model Complexity (CDmc)
  - CDmc is based on the complexity of the neural network required for accurate predictions.
- Case Difficulty Double Model (CDdm)
  - CDdm utilizes a pair of neural networks: one predicts a given case, and the other assesses the likelihood that the prediction made by the first model is correct.
- Case Difficulty Predictive Uncertainty (CDpu)
  - CDpu evaluates the variability of the neural network's predictions.
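To give a feel for what "variability of predictions" can mean, here is a minimal sketch that summarizes repeated predicted probabilities for one case by their mean and standard deviation. It only illustrates the general idea and is not the CDpu formula; the function name `prediction_spread` and the sample probabilities are hypothetical.

```python
from statistics import mean, pstdev


def prediction_spread(probabilities):
    """Return (mean, population standard deviation) of repeated predictions.

    `probabilities` is a list of predicted probabilities for the true class
    of ONE case, e.g. collected from several stochastic forward passes.
    Illustrative only: this is not how CDmetrics computes CDpu.
    """
    return mean(probabilities), pstdev(probabilities)


# Hypothetical predictions for two cases (made-up numbers).
stable = [0.91, 0.90, 0.92, 0.89, 0.90]    # model agrees with itself
unstable = [0.15, 0.85, 0.40, 0.95, 0.30]  # model disagrees with itself

for name, probs in [("stable", stable), ("unstable", unstable)]:
    avg, spread = prediction_spread(probs)
    print(f"{name}: mean={avg:.2f}, spread={spread:.2f}")
```

Under this toy view, the second case would be the harder one because its predictions vary far more from run to run.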
Getting Started
CDmetrics employs neural networks to measure the difficulty of individual cases in a dataset. The metrics are tailored to different definitions of prediction difficulty and are designed to perform well across various datasets.
Installation
The package was developed using Python. Below, we provide standard installation instructions and guidelines for using CDmetrics to calculate case difficulty on your own datasets.
For users
pip install CDmetrics
For developers
git clone https://github.com/data-intelligence-for-health-lab/CDmetrics.git
Anaconda environment
We strongly recommend using a separate Python environment. The repository includes an environment.yml file for creating a conda environment with all required dependencies:
conda env create --file environment.yml
Usage
Each metric requires certain parameters to run.
- CDmc requires number_of_NNs (the number of neural network models used to make predictions):
from CDmetrics import CDmc
CDmc.compute_metric(data, number_of_NNs, target_column)
- CDdm requires num_folds (the number of folds into which the data is divided):
from CDmetrics import CDdm
CDdm.compute_metric(data, num_folds, target_column, max_layers, max_units, resources)
- CDpu requires number_of_predictions (the number of prediction probabilities to generate):
from CDmetrics import CDpu
CDpu.compute_metric(data, target_column, number_of_predictions, max_layers, max_units, resources)
The hyperparameters are tuned using grid search with Ray. To change the hyperparameter search space, update search_space in the tune_parameters function in CDmetrics/utils.py.
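Because grid search tries every combination of the listed values, widening the search space multiplies the number of models trained. The sketch below shows that expansion with the standard library only. The keys and values in `search_space` are hypothetical and do not reflect the actual contents of CDmetrics/utils.py, and `expand_grid` is an illustrative helper, not part of CDmetrics or Ray.

```python
from itertools import product

# Hypothetical search space; the real one lives in CDmetrics/utils.py
# and may use different names, values, and Ray-specific wrappers.
search_space = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
    "num_layers": [1, 2, 3],
}


def expand_grid(space):
    """Return one dict per combination of the values in `space`."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


configs = expand_grid(search_space)
print(len(configs))  # 2 * 2 * 3 combinations
print(configs[0])
```

Adding one more value to any list increases the number of trials accordingly, so it may be worth keeping each list short when compute is limited.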
Guidelines for input dataset
Please follow the recommendations below:
- The dataset should be preprocessed (scaling, imputation, and encoding must be done before running CDmetrics).
- The data must be passed as a dataframe.
- Do not include any index column.
- The target column name must be clearly specified.
- The metrics only support classification problems with tabular data.
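The recommendations above can be checked before a long run is started. The following is a minimal sketch, assuming the dataframe is a pandas DataFrame; `check_input` is a hypothetical helper that is not part of CDmetrics, and the index-column test is only a name-based heuristic.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_input(data, target_column):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper, not part of CDmetrics.
    """
    if not isinstance(data, pd.DataFrame):
        return ["data is not a pandas DataFrame"]

    problems = []
    if target_column not in data.columns:
        problems.append(f"target column {target_column!r} not found")

    # Imputation must already be done, so no missing values are expected.
    if data.isna().any().any():
        problems.append("missing values present (impute first)")

    # Encoding must already be done, so every feature should be numeric.
    features = data.drop(columns=[target_column], errors="ignore")
    non_numeric = [c for c in features.columns if not is_numeric_dtype(features[c])]
    if non_numeric:
        problems.append(f"non-numeric feature columns (encode first): {non_numeric}")

    # Heuristic: column names that often come from a saved index.
    suspects = [c for c in data.columns if str(c).lower() in {"index", "unnamed: 0"}]
    if suspects:
        problems.append(f"possible index columns (drop them): {suspects}")

    return problems


clean = pd.DataFrame({"x1": [0.1, 0.5], "x2": [1, 0], "label": [0, 1]})
print(check_input(clean, "label"))
```

The helper cannot tell whether features were scaled or whether the task is a classification problem; those points still need to be confirmed by hand.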
Citation
If you're using CDmetrics in your research or application, please cite our paper:
Kwon, H., Greenberg, M., Josephson, C.B. and Lee, J., 2024. Measuring the prediction difficulty of individual cases in a dataset using machine learning. Scientific Reports, 14(1), p.10474.
@article{kwon2024measuring,
title={Measuring the prediction difficulty of individual cases in a dataset using machine learning},
author={Kwon, Hyunjin and Greenberg, Matthew and Josephson, Colin Bruce and Lee, Joon},
journal={Scientific Reports},
volume={14},
number={1},
pages={10474},
year={2024},
publisher={Nature Publishing Group UK London}
}