DataTypical

Explainable Instance Significance Discovery for Scientific Datasets

Python 3.8+ | License: MIT | DOI: 10.5281/zenodo.18666410

DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.


Key Features

  • Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously, or selectively)
  • Shapley Explanations: Feature-level attributions for why samples are significant
  • Formative Discovery: Distinguish samples that ARE significant from those that CREATE the significance structure
  • Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
  • Multi-Modal Support: Tabular data, text, and graph networks through a unified API
  • Performance Optimized: Fast exploration mode and efficient Shapley computation

Installation

pip install datatypical

Quick Start

from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')
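
The Quick Start describes the rank columns as 0-1 normalized ranks. The exact normalization DataTypical applies is not stated here, so the following is only a minimal sketch, assuming a plain rank-based scaling in which the lowest raw score maps to 0 and the highest to 1. The helper normalized_ranks is hypothetical and not part of the package.

```python
def normalized_ranks(scores):
    """Map raw scores to [0, 1] by rank: lowest score -> 0.0, highest -> 1.0.

    Tied scores share the average of their rank positions.
    (Hypothetical helper for illustration; not DataTypical's code.)
    """
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the run of tied scores starting at `pos`.
        end = pos
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_position = (pos + end) / 2.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_position / (n - 1)
        pos = end + 1
    return ranks


raw_scores = [0.2, 3.5, 1.1, 3.5, 0.9]
print(normalized_ranks(raw_scores))
```

Under this assumed convention, a filter such as rank > 0.8 keeps roughly the top fifth of samples, regardless of how the raw scores are distributed.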

What DataTypical Does

Three Complementary Lenses

Lens | Finds | Use Cases
Archetypal | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding
Prototypical | Representative, central samples | Dataset summarization, cluster centers, data coverage
Stereotypical | Target-similar samples | Optimization, goal-oriented selection, phenotype matching

The Power: All three are computed simultaneously, and each perspective reveals different insights.
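
As a rough intuition for how the three lenses can single out different samples, the toy sketch below scores five made-up points with simple distance-based stand-ins. These stand-ins are assumptions chosen for illustration, not DataTypical's algorithms: distance from the centroid stands in for archetypal, closeness to the centroid for prototypical, and closeness to the maximum of a target value for stereotypical.

```python
import math

# Toy dataset: name -> (feature_1, feature_2, target). Values are made up.
samples = {
    "a": (0.0, 0.0, 1.0),
    "b": (10.0, 0.0, 2.0),
    "c": (5.0, 5.0, 3.0),
    "d": (4.0, 6.0, 9.0),
    "e": (0.0, 10.0, 4.0),
}

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

features = {name: row[:2] for name, row in samples.items()}
center = centroid(list(features.values()))
dist_to_center = {name: math.dist(x, center) for name, x in features.items()}

# Stand-in scores (higher = more significant under that lens).
archetypal = dict(dist_to_center)                                   # far from centre
prototypical = {name: -dist for name, dist in dist_to_center.items()}  # near centre
target_max = max(row[2] for row in samples.values())
stereotypical = {name: -abs(row[2] - target_max) for name, row in samples.items()}

for lens, score in [("archetypal", archetypal),
                    ("prototypical", prototypical),
                    ("stereotypical", stereotypical)]:
    print(lens, "->", max(score, key=score.get))
```

In this toy data each lens ranks a different sample first, which is the kind of disagreement the table above describes.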

Dual Perspective (with Shapley)

When shapley_mode=True, DataTypical reveals two views:

  • Actual Significance (*_rank): Samples that ARE significant
  • Formative Significance (*_shapley_rank): Samples that CREATE the structure

This distinction, between what IS significant vs what CREATES structure, is unique to DataTypical.
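
To make the ARE versus CREATE distinction concrete, here is a toy sketch under stated assumptions. The "structure" is taken to be the range covered by a set of 1-D values, actual significance is distance from the midpoint, and formative significance is each sample's exact Shapley share of the range. The value function and all names are hypothetical and are not taken from DataTypical's implementation.

```python
from itertools import permutations

# Four samples on one axis; the two "high" samples are exact duplicates.
values = {"low": 0.0, "mid": 5.0, "high_1": 10.0, "high_2": 10.0}

def covered_range(names):
    """Toy 'structure' value of a subset: the span of values it covers."""
    if len(names) < 2:
        return 0.0
    xs = [values[n] for n in names]
    return max(xs) - min(xs)

def exact_shapley(players, value_fn):
    """Average each player's marginal contribution over every ordering."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = []
        before = value_fn(coalition)
        for p in order:
            coalition.append(p)
            after = value_fn(coalition)
            totals[p] += after - before
            before = after
    return {p: total / len(orderings) for p, total in totals.items()}

formative = exact_shapley(list(values), covered_range)
midpoint = (max(values.values()) + min(values.values())) / 2
actual = {name: abs(x - midpoint) for name, x in values.items()}

for name in values:
    print(f"{name:7s} actual={actual[name]:.2f} formative={formative[name]:.2f}")
```

The duplicated high samples are exactly as extreme as the low sample, yet each earns a smaller formative share, because either one could be removed without shrinking the range. That is the pattern the text calls replaceable: high actual, low formative.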


Example: Drug Discovery

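from datatypical import DataTypical
from datatypical_viz import profile_plot

# Analyze compound library
# (`compounds` is assumed to be a DataFrame that includes an 'activity' column)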
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # A different feature pattern in each profile suggests a different mechanism

Discovery: Multiple structural pathways to high activity.


Key Parameters

DataTypical(
    shapley_mode=False,           # True for explanations
    fast_mode=True,               # False for publication quality
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    stereotype_target='max',      # 'max', 'min', or numeric value
    selected_significance=None,   # 'archetypal', 'prototypical', 'stereotypical', or None (all)
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations
    random_state=None,            # Set for reproducible results
    max_memory_mb=8000            # Memory limit
)
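
The shapley_n_permutations parameter suggests a permutation-sampling estimator. The sketch below is a generic Monte Carlo Shapley estimate for feature attributions, written under assumptions: the value function (Euclidean distance from a reference point using only the included features), the sample, and the function names are all hypothetical and are not taken from DataTypical's source.

```python
import math
import random

def sampled_shapley(n_features, value_fn, n_permutations=100, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_features
    for _ in range(n_permutations):
        order = list(range(n_features))
        rng.shuffle(order)
        included = set()
        before = value_fn(included)
        for j in order:
            included.add(j)
            after = value_fn(included)
            totals[j] += after - before
            before = after
    return [total / n_permutations for total in totals]

# Hypothetical sample and reference point (for example, a dataset mean).
sample = [4.0, 1.0, 0.5]
reference = [1.0, 1.0, 0.0]

def distance_from_reference(included):
    """Euclidean distance from the reference using only the included features."""
    return math.sqrt(sum((sample[j] - reference[j]) ** 2 for j in included))

attributions = sampled_shapley(len(sample), distance_from_reference,
                               n_permutations=100, seed=42)
print(attributions)
print(sum(attributions), distance_from_reference({0, 1, 2}))
```

Within every sampled ordering the marginal contributions telescope, so the attributions always sum to the value of the full feature set. The split between features carries sampling noise that shrinks as the number of permutations grows, with cost growing in proportion.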

Selective Computation

Set selected_significance when you need only one significance type; the others are skipped, which saves compute time:

# Only compute archetypal rankings and Shapley
dt = DataTypical(selected_significance='archetypal', shapley_mode=True)
results = dt.fit_transform(data)
# → archetypal_rank computed; other ranks are NaN

Visualization Functions

from datatypical_viz import significance_plot, heatmap, profile_plot

# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')

# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# 3. Individual sample profile
profile_plot(dt, sample_idx, significance='archetypal', order='local')

Multi-Modal Support

Tabular Data

df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

Text Data

texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

Graph Networks

node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

Performance

Dataset Size | Without Shapley | With Shapley
1,000 samples | ~5 seconds | ~5 minutes
10,000 samples | ~30 seconds | ~60 minutes

Optimization Strategy:

  1. Fast exploration (fast_mode=True, no Shapley)
  2. Identify interesting samples
  3. Detailed analysis (shapley_mode=True, subset)
  4. Generate publication figures

Use Cases

Scientific Discovery: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification

Dataset Curation: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling

Model Understanding: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations


What Makes DataTypical Different

From outlier detection: Finds extremes AND explains why

From clustering: Finds representatives maximizing coverage AND explains why

From feature selection: Explains which features matter for which samples

From PCA/t-SNE: Maintains interpretability in original feature space
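
One textbook way to pick "representatives maximizing coverage" is greedy facility location selection, in the spirit of the Nemhauser et al. (1978) work cited in the Acknowledgments. The sketch below is a generic version under assumptions: the similarity function 1 / (1 + distance), the coordinates, and the function name are illustrative choices, and it should not be read as DataTypical's implementation.

```python
import math

def greedy_facility_location(points, k):
    """Greedily pick k points maximizing sum_i max_{j in chosen} sim(i, j)."""
    n = len(points)
    sim = [[1.0 / (1.0 + math.dist(points[i], points[j])) for j in range(n)]
           for i in range(n)]
    best_cover = [0.0] * n   # how well each point is covered by the chosen set
    chosen = []
    for _ in range(min(k, n)):
        def gain(j):
            # Total coverage improvement if point j were added to the chosen set.
            return sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
        candidate = max((j for j in range(n) if j not in chosen), key=gain)
        chosen.append(candidate)
        best_cover = [max(best_cover[i], sim[i][candidate]) for i in range(n)]
    return chosen

# Two tight groups plus one isolated point (made-up coordinates).
points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.0), (5.2, 5.0), (5.1, 5.2),
          (10.0, 0.0)]
print(greedy_facility_location(points, k=3))
```

With k=3 the greedy rule picks one point from each tight group and then the isolated point, since a second pick from an already covered group adds little coverage.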

The Novel Contribution: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.


What's New in v0.7.6

  • selected_significance parameter: Compute only one significance type ('archetypal', 'prototypical', or 'stereotypical') and skip the rest, reducing compute time substantially
  • Fixed prototype transform: transform() on new data now uses stored training prototype vectors, not indices into the new data
  • Text Shapley support: Full Shapley analysis (formative + explanations) now runs correctly on text data paths
  • Robustness fixes: Fixed iterator exhaustion in text methods, fixed index misalignment in stereotypical Shapley explanations when subsampling, improved error messages

Documentation

Complete documentation, examples, and guides are available at:
https://github.com/amaxiom/DataTypical

Includes:

  • Getting started tutorials
  • Comprehensive examples across scientific domains
  • Visualization interpretation guides
  • Advanced usage and computation details
  • Test suite and benchmarks


Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.20
  • Pandas ≥ 1.3
  • SciPy ≥ 1.7
  • scikit-learn ≥ 1.0
  • Matplotlib ≥ 3.3
  • Seaborn ≥ 0.11
  • Numba ≥ 0.55

Citation

If you use DataTypical in your research, please cite:

@software{datatypical2026,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7.6},
  doi = {10.5281/zenodo.18666410}
}

License

MIT License - Copyright (c) 2026 Amanda S. Barnard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.


Acknowledgments

DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
