
DataTypical

Explainable Instance Significance Discovery for Scientific Datasets


DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.


Key Features

  • Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously)
  • Shapley Explanations: Feature-level attributions for why samples are significant
  • Formative Discovery: Distinguish samples that ARE significant from those that CREATE the significance structure
  • Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
  • Multi-Modal Support: Tabular data, text, and graph networks through unified API
  • Performance Optimized: Fast exploration mode and efficient Shapley computation

Installation

pip install datatypical

Quick Start

from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')

What DataTypical Does

Three Complementary Lenses

Lens           Finds                            Use Cases
Archetypal     Extreme, boundary samples        Edge case discovery, outlier detection, range understanding
Prototypical   Representative, central samples  Dataset summarization, cluster centers, data coverage
Stereotypical  Target-similar samples           Optimization, goal-oriented selection, phenotype matching

The Power: All three computed simultaneously—different perspectives reveal different insights.
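Because all three ranks land in the same results table, the lenses can be compared directly. A minimal sketch, using a synthetic stand-in for the DataFrame returned by fit_transform (the rank column names follow the Quick Start; the random values are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for fit_transform output: one 0-1 normalized rank per lens.
results = pd.DataFrame(
    rng.random((100, 3)),
    columns=["archetypal_rank", "prototypical_rank", "stereotypical_rank"],
)

# Top-5 samples under each lens -- different lenses surface different rows.
top = {
    lens: results[f"{lens}_rank"].nlargest(5).index.tolist()
    for lens in ("archetypal", "prototypical", "stereotypical")
}
for lens, idx in top.items():
    print(lens, idx)
```

Comparing the three index lists side by side shows immediately which samples are extreme, which are central, and which are target-like.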

Dual Perspective (with Shapley)

When shapley_mode=True, DataTypical reveals two views:

  • Actual Significance (*_rank): Samples that ARE significant
  • Formative Significance (*_shapley_rank): Samples that CREATE the structure

This distinction between what IS significant and what CREATES structure is unique to DataTypical.


Example: Drug Discovery

# Analyze compound library
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism

Discovery: Multiple structural pathways to high activity.


Key Parameters

DataTypical(
    shapley_mode=False,           # True for explanations
    fast_mode=True,               # False for publication quality
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations
    random_state=None,            # Set for reproducible results
    max_memory_mb=8000            # Memory limit
)

Visualization Functions

from datatypical_viz import significance_plot, heatmap, profile_plot

# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')

# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# 3. Individual sample profile
profile_plot(dt, sample_idx, significance='archetypal', order='local')

Multi-Modal Support

Tabular Data

df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

Text Data

texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

Graph Networks

node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

Performance

Dataset Size      Without Shapley   With Shapley
1,000 samples     ~5 seconds        ~5 minutes
10,000 samples    ~30 seconds       ~60 minutes

Optimization Strategy:

  1. Fast exploration (fast_mode=True, no Shapley)
  2. Identify interesting samples
  3. Detailed analysis (shapley_mode=True, subset)
  4. Generate publication figures
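The subset step in this strategy is plain pandas. A minimal sketch of steps 1-3, with the DataTypical calls shown as comments and the fast-mode ranks simulated (column name as in the Quick Start; the cutoff of 500 is an illustrative choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(10_000, 8)))

# Step 1: fast exploration, no Shapley. Stands in for:
#   results = DataTypical(fast_mode=True).fit_transform(data)
results = data.assign(archetypal_rank=rng.random(len(data)))

# Step 2: identify interesting samples, e.g. top 500 by archetypal rank.
interesting = results["archetypal_rank"].nlargest(500).index

# Step 3: detailed analysis on the subset only. Stands in for:
#   DataTypical(shapley_mode=True, fast_mode=False).fit_transform(data.loc[interesting])
subset = data.loc[interesting]
print(subset.shape)  # (500, 8)
```

Running the expensive Shapley pass on 500 rows instead of 10,000 keeps the detailed step in the minutes range rather than the hour range suggested by the table above.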

Use Cases

Scientific Discovery: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification

Dataset Curation: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling

Model Understanding: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations


What Makes DataTypical Different

Unlike outlier detection: finds extremes AND explains why

Unlike clustering: finds representatives maximizing coverage AND explains why

Unlike feature selection: explains which features matter for which samples

Unlike PCA/t-SNE: maintains interpretability in the original feature space

The Novel Contribution: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.
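The formative idea can be illustrated with a generic permutation-sampling Shapley estimator. This is not DataTypical's implementation, just a toy: the "structure" of a subset is the range it spans, so the two extreme points should carry almost all the formative credit while interior points are largely interchangeable:

```python
import numpy as np

def shapley_permutation(n, v, n_perm=200, seed=0):
    """Monte Carlo Shapley estimate of each sample's contribution to v(S).

    v maps a list of sample indices to a scalar structure score; each
    sample's Shapley value is its average marginal gain over random
    arrival orders (standard permutation sampling).
    """
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(n_perm):
        order = rng.permutation(n)
        prev, members = v([]), []
        for i in order:
            members.append(i)
            cur = v(members)
            phi[i] += cur - prev
            prev = cur
    return phi / n_perm

# Toy 1-D feature: structure = the range a subset spans.
x = np.array([0.0, 0.45, 0.5, 0.55, 1.0])
v = lambda S: float(x[list(S)].max() - x[list(S)].min()) if len(S) > 1 else 0.0
phi = shapley_permutation(len(x), v)
print(np.argsort(phi)[-2:])  # the two extreme indices dominate
```

The three central points all ARE close to the middle of the range, yet removing any one of them changes the structure almost not at all; the endpoints are the formative, irreplaceable samples.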


Documentation

Complete documentation, examples, and guides available at:
https://github.com/amaxiom/DataTypical

Includes:

  • Getting started tutorials
  • Comprehensive examples across scientific domains
  • Visualization interpretation guides
  • Advanced usage and computation details
  • Test suite and benchmarks


Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.20
  • Pandas ≥ 1.3
  • SciPy ≥ 1.7
  • scikit-learn ≥ 1.0
  • Matplotlib ≥ 3.3
  • Seaborn ≥ 0.11
  • Numba ≥ 0.55

Citation

If you use DataTypical in your research, please cite:

@software{datatypical2025,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7},
  doi = {10.5281/zenodo.18666410}
}

License

MIT License - Copyright (c) 2026 Amanda S. Barnard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.


Acknowledgments

DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
