Dataset shift analysis and characterization in Python

Project description

dashi

What is `dashi`?

dashi is a Python library designed to analyze and characterize temporal and multi-source dataset shifts. It provides robust tools for both supervised and unsupervised evaluation of dataset shifts, empowering users to detect, understand, and address changes in data distributions with confidence.

Key Features:

- Supervised Characterization: Enables users to create classification or regression models using Random Forests trained on batched data (temporal or multi-source). This allows for the detailed analysis of how dataset shifts impact model performance, helping to pinpoint areas of potential degradation.
- Unsupervised Characterization: Facilitates the identification of temporal dataset shifts by projecting and visualizing data dissimilarities across time. This process involves:
  - Estimating data statistical distributions over time.
  - Projecting these distributions onto non-parametric statistical manifolds. These projections reveal patterns of latent temporal variability in the data, uncovering hidden trends and shifts.
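
The two steps above can be illustrated without dashi itself. The following is a minimal pure-Python sketch, not dashi's API: it assumes monthly batches of a categorical variable whose values all belong to a known list of categories, and it assumes the Jensen-Shannon distance as the dissimilarity between batches. The function names and the toy data are hypothetical. The resulting pairwise distance matrix is the kind of input that a projection method such as multidimensional scaling could embed in two or three dimensions.

```python
import math
from collections import Counter, defaultdict


def batch_distributions(records, categories):
    """Estimate one relative-frequency distribution per temporal batch.

    records: iterable of (batch_label, value) pairs, e.g. ("2024-01", "A").
    categories: ordered list of values that define the histogram support;
    every value in `records` is assumed to appear in this list.
    """
    counts = defaultdict(Counter)
    for batch, value in records:
        counts[batch][value] += 1
    distributions = {}
    for batch in sorted(counts):
        total = sum(counts[batch].values())
        distributions[batch] = [counts[batch][c] / total for c in categories]
    return distributions


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def dissimilarity_matrix(distributions):
    """Pairwise Jensen-Shannon distances between all batches, in batch order."""
    batches = list(distributions)
    matrix = [
        [jensen_shannon_distance(distributions[a], distributions[b]) for b in batches]
        for a in batches
    ]
    return batches, matrix


# Toy data (hypothetical): the coding of a variable changes in March.
records = (
    [("2024-01", "A")] * 8 + [("2024-01", "B")] * 2
    + [("2024-02", "A")] * 8 + [("2024-02", "B")] * 2
    + [("2024-03", "C")] * 10
)
batches, matrix = dissimilarity_matrix(batch_distributions(records, ["A", "B", "C"]))
```

In this toy example January and February have identical distributions, so their distance is 0, while March shares no category with the earlier months and sits at the maximum distance of 1 from both.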

Visualization Tools:

To aid exploration and interpretation of dataset shifts, dashi includes visual analytics features such as:

- Data Temporal Heatmaps (DTHs): Provide an exploratory visualization for temporal shifts in data distributions.
- Information Geometric Temporal (IGT) plots: Offer a more sophisticated view of temporal data variability by means of embedding temporal batches in their latent statistical manifolds.
- Multi-batch contingency matrices: Compare multiple evaluation metrics (F1-Score, Recall, Precision, AUC, etc.) across training-test combinations between pairwise batches, either temporal or multi-source.
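
The idea behind the multi-batch contingency matrix can be sketched in plain Python. This is an illustration under stated assumptions rather than dashi's implementation: a one-feature threshold classifier stands in for the Random Forest, accuracy stands in for the metrics listed above, and all names and data are hypothetical. Each cell (i, j) holds the score of a model trained on batch i and evaluated on batch j, so a drop away from the diagonal points to a shift between those two batches.

```python
def fit_threshold(samples):
    """Train a stand-in classifier: the midpoint between the two class means.

    samples: list of (feature, label) pairs with labels 0 and 1.
    Returns (threshold, positive_side), where positive_side is +1 if class 1
    lies above the threshold and -1 otherwise.
    """
    mean = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        mean[label] = sum(values) / len(values)
    threshold = (mean[0] + mean[1]) / 2
    positive_side = 1 if mean[1] >= mean[0] else -1
    return threshold, positive_side


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    threshold, positive_side = model
    hits = 0
    for x, y in samples:
        predicted = 1 if (x - threshold) * positive_side > 0 else 0
        hits += predicted == y
    return hits / len(samples)


def contingency_matrix(batches):
    """Score every train-batch/test-batch combination.

    batches: dict mapping a batch name to its list of (feature, label) pairs.
    Returns the batch names and a matrix where row i is the training batch
    and column j is the test batch.
    """
    names = list(batches)
    models = {name: fit_threshold(batches[name]) for name in names}
    matrix = [
        [accuracy(models[train], batches[test]) for test in names]
        for train in names
    ]
    return names, matrix


# Toy batches (hypothetical): in "2024-Q3" the feature-label relation flips.
batches = {
    "2024-Q1": [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)],
    "2024-Q2": [(1.5, 0), (2.5, 0), (7.5, 1), (8.5, 1)],
    "2024-Q3": [(1.0, 1), (2.0, 1), (8.0, 0), (9.0, 0)],
}
names, matrix = contingency_matrix(batches)
for name, row in zip(names, matrix):
    print(name, [f"{score:.2f}" for score in row])
```

With these toy batches, models trained on either of the first two quarters score 1.00 on both of them and 0.00 on the third, which is the block pattern such a matrix makes visible.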

Installation

You can install dashi using pip:

pip install dashi

Or install from source:

git clone https://github.com/bdslab-upv/dashi
cd dashi
pip install .
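
After installing, it can be useful to confirm which version is visible to the current interpreter. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `dashi` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("dashi")
print(f"dashi {found}" if found else "dashi is not installed in this environment")
```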

Usage & Examples

You can find a tutorial on how to use dashi in this link or in the examples directory.

Documentation

Detailed documentation is available at documentation.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Copyright 2024 Biomedical Data Science Lab, ITACA Institute, Universitat Politècnica de València (Spain)

Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for
additional information regarding copyright ownership. The ASF licenses this
file to you under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

Part of the dashi Python library was inspired by the R package EHRtemporalVariability, which is licensed under the Apache 2.0 License and was written by some of the authors of dashi.

Authorship

Authors: David Fernández Narro (UPV), Pablo Ferri Borredá (UPV), Ángel Sánchez-García (UPV), Juan M García-Gómez (UPV), Carlos Sáez (UPV)
Contact: dashi@upv.es

Acknowledgements

Funded by Agencia Estatal de Investigación—Proyectos de Generación de Conocimiento 2022, project KINEMAI (PID2022-138636OA-I00).

References

Sáez, C., Rodrigues, P. P., Gama, J., Robles, M., & García-Gómez, J. M. (2015). Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Data Mining and Knowledge Discovery, 29(4), 950-975. https://doi.org/10.1007/s10618-014-0378-6
Sáez, C., & García-Gómez, J. M. (2018). Kinematics of Big Biomedical Data to characterize temporal variability and seasonality of data repositories: Functional Data Analysis of data temporal evolution over non-parametric statistical manifolds. International Journal of Medical Informatics, 119, 109-124. https://doi.org/10.1016/j.ijmedinf.2018.09.015
Sáez, C., Zurriaga, O., Pérez-Panadés, J., Melchor, I., Robles, M., & García-Gómez, J. M. (2016). Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: A systematic approach to quality control of repositories. Journal of the American Medical Informatics Association, 23(6), 1085-1095. https://doi.org/10.1093/jamia/ocw010
Sáez, C., Gutiérrez-Sacristán, A., Kohane, I., García-Gómez, J. M., & Avillach, P. (2020). EHRtemporalVariability: delineating temporal data-set shifts in electronic health records. GigaScience, 9(8), giaa079. https://doi.org/10.1093/gigascience/giaa079
Sáez, C., Robles, M., & García-Gómez, J. M. (2017). Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances. Statistical Methods in Medical Research, 26(1), 312-336. https://doi.org/10.1177/0962280214545122

Project details

Release history

0.1.0 (this version): Mar 7, 2025
0.0.0: Oct 22, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dashi-0.1.0.tar.gz (42.2 kB)

Uploaded Mar 7, 2025 Source

Built Distribution

dashi-0.1.0-py3-none-any.whl (48.9 kB)

Uploaded Mar 7, 2025 Python 3

File details

Details for the file dashi-0.1.0.tar.gz.

File metadata

Download URL: dashi-0.1.0.tar.gz
Upload date: Mar 7, 2025
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

Hashes for dashi-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`252cbe689cd0c742f145df9ee79dca9156c509f7c7ffb0d72d439384f532c384`
MD5	`89922b3682a57b1f4e262e58321521b0`
BLAKE2b-256	`3d95ec1008435d72b28e8c200f3c7e317bcfa960ed5e5f7d5ff2358747d68d11`
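
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names are hypothetical, and the path must point to wherever the archive was actually saved.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's SHA256 with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# SHA256 published for dashi-0.1.0.tar.gz in the table above.
EXPECTED_SDIST_SHA256 = "252cbe689cd0c742f145df9ee79dca9156c509f7c7ffb0d72d439384f532c384"
# Example call, assuming the archive is in the current directory:
# matches_published_digest("dashi-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)
```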

File details

Details for the file dashi-0.1.0-py3-none-any.whl.

File metadata

Download URL: dashi-0.1.0-py3-none-any.whl
Upload date: Mar 7, 2025
Size: 48.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.5

File hashes

Hashes for dashi-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7cd010361b43e6c0cbd06305a3a318d5fe14648e296b72e0055e53ac53964860`
MD5	`378a5ce4e7a462bb44cf3f599f4cf231`
BLAKE2b-256	`3e2a31423314f0c01c49b8365f97a6530e2eb913c63e5cd83ec71ca6be4c90a8`
