
Datagnosis

A Data-Centric AI library for measuring hardness categorization.


Please note: datagnosis does not handle missing data, so these values must be imputed first. HyperImpute can be used to do this.
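For example, a minimal sketch of imputing with HyperImpute before building a DataHandler (the Imputers registry interface and the "hyperimpute" plugin name follow the HyperImpute documentation; treat this as a sketch and check that documentation for details):

# Impute missing values with HyperImpute before handing data to datagnosis
import numpy as np
import pandas as pd
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

imputer = Imputers().get("hyperimpute")  # any registered imputer plugin works
X_imputed = imputer.fit_transform(X)     # DataFrame with the NaNs filled in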

🚀 Installation

The library can be installed from PyPI using

$ pip install datagnosis

or from source, using

$ pip install .

Other library extensions:

  • Install the library with unit-testing support:
    pip install datagnosis[testing]

💥 Sample Usage

# Load iris dataset from sklearn and create DataHandler object
from sklearn.datasets import load_iris
from datagnosis.plugins.core.datahandler import DataHandler
X, y = load_iris(return_X_y=True, as_frame=True)
datahandler = DataHandler(X, y, batch_size=32)

# Create the model and training parameters
from datagnosis.plugins.core.models.simple_mlp import SimpleMLP
import torch

model = SimpleMLP()

# Create the optimizer and loss function objects
learning_rate = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


# Get a plugin and fit it
from datagnosis.plugins import Plugins

hcm = Plugins().get(
    "vog",
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr=learning_rate,
    epochs=10,
    num_classes=3,
    logging_interval=1,
)
hcm.fit(
    datahandler=datahandler,
    use_caches_if_exist=True,
)

# Plot the resulting scores
hcm.plot_scores(axis=1, plot_type="scatter")
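Once fitted, the plugin holds one hardness score per training example. The snippet below sketches how the hardest examples might be ranked; the scores attribute is an assumption here, so check the plugin API for the exact accessor:

# Rank examples by hardness (for VoG, higher score = harder)
import numpy as np

scores = np.asarray(hcm.scores)          # assumed: one score per example
hardest = np.argsort(scores)[::-1][:10]  # indices of the 10 hardest examples
print(hardest, scores[hardest])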

🔑 Methods

Datagnosis builds on D-CAT, a Hardness Characterization Method benchmarking framework also from the van der Schaar lab.

For benchmarking of the methods below, see https://github.com/seedatnabeel/D-CAT.

Generic and image methods

| Method | Type | Description | Score | Reference |
| --- | --- | --- | --- | --- |
| Area Under the Margin (AUM) | Generic | Characterizes data examples based on the margin of a classifier, i.e. the difference between the logit values of the correct class and the next class. | Hard - low scores | AUM Paper |
| Confident Learning | Generic | Estimates the joint distribution of noisy and true labels, characterizing data as easy or hard with respect to mislabeling. | Hard - low scores | Confident Learning Paper |
| Conf Agree | Generic | Measures the agreement of predictions on the same example. | Hard - low scores | Conf Agree Paper |
| Data-IQ | Generic | Computes the aleatoric uncertainty and confidence to characterize the data into easy, ambiguous, and hard examples. | Hard - low confidence scores; high aleatoric-uncertainty scores define ambiguous | Data-IQ Paper |
| Data Maps | Generic | Measures variability (epistemic uncertainty) and confidence to characterize the data into easy, ambiguous, and hard examples. | Hard - low confidence scores; high epistemic-uncertainty scores define ambiguous | Data-Maps Paper |
| Gradient Normed (GraNd) | Generic | Measures the gradient norm to characterize data. | Hard - high scores | GraNd Paper |
| Error L2-Norm (EL2N) | Generic | Calculates the L2 norm of the error over training to characterize data for computational purposes. | Hard - high scores | EL2N Paper |
| Forgetting | Generic | Analyzes example transitions through training, i.e., when a sample learned correctly at one epoch is forgotten at a later epoch. | Hard - high scores | Forgetting Paper |
| Large Loss | Generic | Characterizes data based on sample-level loss magnitudes. | Hard - high scores | Large Loss Paper |
| Prototypicality | Generic | Uses the latent-space clustering distance of a sample to its class centroid as the metric to characterize data. | Hard - high scores | Prototypicality Paper |
| Variance of Gradients (VoG) | Generic | Estimates the variance of gradients for each sample over training. | Hard - high scores | VOG Paper |
| Active Learning Guided by Local Sensitivity and Hardness (ALLSH) | Images | Computes the KL divergence of softmax outputs between original and augmented samples to characterize data. | Hard - high scores | ALLSH Paper |

Generic type plugins can be used for tabular or image data. Image type plugins only work for images.
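As a concrete illustration of one of these scores, the snippet below sketches the per-example margin that AUM averages over training: the difference between the logit of the correct class and the largest other logit. This is a standalone illustration of the idea, not datagnosis's internal implementation:

import torch

def aum_margin(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Margin = correct-class logit minus the largest other logit.
    # Consistently low (or negative) margins over training flag hard
    # or mislabeled examples.
    correct = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, targets.unsqueeze(1), float("-inf"))  # hide correct class
    other = masked.max(dim=1).values
    return correct - other

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2])
print(aum_margin(logits, targets))  # tensor([1.5000, 0.1000])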

🔨 Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vvvsx tests/ --durations=50

Contributing to datagnosis

We want to make contributing to datagnosis as easy and transparent as possible. We hope to collaborate with as many people as we can.

Development installation

First create a new environment. It is recommended that you use conda. This can be done as follows:

conda create -n your-datagnosis-env python=3.11
conda activate your-datagnosis-env

Python versions 3.8, 3.9, 3.10, and 3.11 are all compatible, but it is best to use the most up-to-date version you can, as some models may not support older Python versions.

To get the development installation with all the necessary dependencies for linting, testing, auto-formatting, and pre-commit, run the following:

git clone https://github.com/vanderschaarlab/datagnosis.git
cd datagnosis
pip install -e .[testing]

Please check that pre-commit is properly installed for the repository by running:

pre-commit run --all-files

This checks that you are set up properly to contribute, such that you will match the code style in the rest of the project. This is covered in more detail below.

⌨️ Our Development Process

🏂 Code Style

We believe that having a consistent code style is incredibly important. Therefore datagnosis imposes certain rules on contributed code, and the automated tests will not pass if the style is not adhered to. Passing these tests is a requirement for a contribution to be merged.

However, we make adhering to this code style as simple as possible. First, all the libraries required to produce code compatible with datagnosis's code style are installed in the step above, when you set up the development environment. Second, these libraries are all triggered by pre-commit, so once you are set up, you don't need to do anything. When you run git commit, any simple changes needed to enforce the style are applied automatically, and other required changes are explained in the stdout for you to go through and fix.

datagnosis uses the black code formatter and the flake8 linter to enforce a common code style across the code base. No additional configuration should be needed (see the black documentation for advanced usage).

Also, datagnosis uses isort to sort imports alphabetically and separate them into sections.
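For illustration, this is the kind of grouping isort produces, with standard-library, third-party, and first-party imports in separate, alphabetized sections (the exact grouping depends on the repository's isort configuration):

import os
from typing import Optional

import numpy as np
import torch

from datagnosis.plugins.core.datahandler import DataHandler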

❕Type Hints

datagnosis is fully typed using Python 3.7+ type hints. This is enforced for contributions by mypy, a static type checker.
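For example, a contribution would be expected to annotate every function fully, along these lines (normalize_scores is a hypothetical helper, shown only to illustrate the style):

from typing import List

def normalize_scores(scores: List[float], eps: float = 1e-12) -> List[float]:
    """Scale scores to [0, 1]; fully annotated so mypy can check call sites."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]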

↩️ Pull Requests

We actively welcome your pull requests.

  1. Fork the repo and create your branch from main.
  2. If you have added code that should be tested, add tests in the same style as those already present in the repo.
  3. If you have changed APIs, document the API change in the PR.
  4. Ensure the test suite passes.
  5. Make sure your code passes the pre-commit checks; this is required in order to commit and push if you have properly installed pre-commit, which is included in the testing extra.

🔶 Issues

We use GitHub issues to track public bugs. Please ensure your description is clear and includes sufficient instructions to reproduce the issue.

📜 License

By contributing to datagnosis, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree. You should therefore make sure that any dependencies you introduce are covered by a license that allows the code to be used by the project and is compatible with the license in the root directory of this project.
