DataTypical

Explainable Instance Significance Discovery for Scientific Datasets

DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.

Key Features

Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously, or selectively)
Shapley Explanations: Feature-level attributions for why samples are significant
Formative Discovery: Distinguish samples that ARE significant from those that CREATE the significance structure
Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
Multi-Modal Support: Tabular data, text, and graph networks through a unified API
Performance Optimized: Fast exploration mode and efficient Shapley computation

Installation

pip install datatypical

Quick Start

from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')
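The rank columns above are described as 0-1 normalized ranks, with higher values meaning more significant. How DataTypical normalizes them is not spelled out here, so the following is only a minimal sketch of one common convention (ordinal position divided by n - 1). The helper name `normalized_ranks` is hypothetical and not part of the package.

```python
def normalized_ranks(scores):
    """Map raw scores to [0, 1]: lowest score -> 0.0, highest -> 1.0.

    Hypothetical helper for illustration; ties are broken by input order.
    """
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    # Indices sorted from lowest to highest raw score.
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for position, i in enumerate(order):
        ranks[i] = position / (n - 1)
    return ranks


raw = [3.2, 0.5, 9.1, 4.4]
print(normalized_ranks(raw))
```

Under this convention the sample with the largest raw score gets rank 1.0, which is why `idxmax()` picks out the most significant sample.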

What DataTypical Does

Three Complementary Lenses

Lens	Finds	Use Cases
Archetypal	Extreme, boundary samples	Edge case discovery, outlier detection, range understanding
Prototypical	Representative, central samples	Dataset summarization, cluster centers, data coverage
Stereotypical	Target-similar samples	Optimization, goal-oriented selection, phenotype matching

The Power: All three are computed simultaneously by default, and different perspectives reveal different insights.
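To make the three lenses concrete, here is a toy sketch using simple distance-based proxies. These proxies are assumptions chosen for illustration and are not DataTypical's algorithms: "extreme" is approximated by distance from the centroid, "representative" by mean distance to every other sample, and "target-like" by closeness to the maximum of a target column.

```python
import math

# Toy data: two features per sample plus a target property ("activity").
features = [(0.0, 0.0), (10.0, 0.0), (5.0, 2.0), (2.0, 3.0), (6.0, 10.0)]
activity = [0.2, 0.4, 0.5, 0.9, 0.1]

n = len(features)
centroid = tuple(sum(p[d] for p in features) / n for d in range(2))

# Archetypal proxy: distance from the centroid (far = extreme).
archetypal = [math.dist(p, centroid) for p in features]

# Prototypical proxy: negative mean distance to every other sample
# (close to everything = representative).
prototypical = [
    -sum(math.dist(p, q) for q in features) / (n - 1) for p in features
]

# Stereotypical proxy: closeness to the target value (here the maximum).
target = max(activity)
stereotypical = [-abs(a - target) for a in activity]


def top(scores):
    """Index of the highest-scoring sample."""
    return max(range(len(scores)), key=lambda i: scores[i])


print(top(archetypal), top(prototypical), top(stereotypical))
```

On this toy data each lens selects a different sample (indices 4, 2, and 3), which is the point of looking through all three.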

Dual Perspective (with Shapley)

When shapley_mode=True, DataTypical reveals two views:

Actual Significance (*_rank): Samples that ARE significant
Formative Significance (*_shapley_rank): Samples that CREATE the structure

This distinction, between what IS significant vs what CREATES structure, is unique to DataTypical.
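The idea of a sample that CREATES structure can be illustrated with Shapley values computed over samples rather than features. The sketch below is a toy, not DataTypical's implementation: the value function (`coverage`, the range spanned by a set of 1-D samples) and the exact enumeration over all orderings are assumptions chosen to keep the example small.

```python
from itertools import permutations


def coverage(values):
    """Toy value function: the range spanned by a set of 1-D samples."""
    return max(values) - min(values) if len(values) >= 2 else 0.0


def exact_sample_shapley(samples, value_fn):
    """Average each sample's marginal contribution over all orderings."""
    n = len(samples)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        coalition = []
        previous = value_fn(coalition)
        for i in order:
            coalition.append(samples[i])
            current = value_fn(coalition)
            phi[i] += current - previous
            previous = current
    return [total / len(orderings) for total in phi]


samples = [0.0, 4.0, 5.0, 10.0]
phi = exact_sample_shapley(samples, coverage)
for x, contribution in zip(samples, phi):
    print(f"sample {x:>4}: formative contribution {contribution:.3f}")
```

The contributions sum to the coverage of the full set, and the two end points receive the largest shares because they are the samples that define the range. Exact enumeration grows factorially with the number of samples, so it is only practical for tiny examples like this one.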

Example: Drug Discovery

# Analyze a compound library (compounds: a pandas DataFrame with an 'activity' column)
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism

Discovery: When critical compounds show different feature profiles, that suggests multiple structural pathways to high activity.
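The two filters in the example amount to a small decision rule over the actual and formative ranks. A minimal sketch of that rule as a plain function follows; the thresholds 0.8 and 0.3 come from the example above, while the function name `classify`, the label "other", and the compound values are made up for illustration.

```python
def classify(actual_rank, formative_rank, high=0.8, low=0.3):
    """Label one sample from its actual and formative ranks (both in [0, 1])."""
    if actual_rank > high and formative_rank > high:
        return "critical"      # significant and structure-defining
    if actual_rank > high and formative_rank < low:
        return "replaceable"   # significant but redundant
    return "other"


# Illustrative (actual, formative) rank pairs; not real results.
pairs = {"cmpd_A": (0.95, 0.91), "cmpd_B": (0.88, 0.12), "cmpd_C": (0.40, 0.75)}
labels = {name: classify(a, f) for name, (a, f) in pairs.items()}
print(labels)
```

Samples with a high actual rank and a formative rank between the two thresholds fall into "other" here; whether to treat them as borderline is a judgment call for the analysis at hand.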

Key Parameters

DataTypical(
    shapley_mode=False,           # True for explanations
    fast_mode=True,               # False for publication quality
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    stereotype_target='max',      # 'max', 'min', or numeric value
    selected_significance=None,   # 'archetypal', 'prototypical', 'stereotypical', or None (all)
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations
    random_state=None,            # Set for reproducible results
    max_memory_mb=8000            # Memory limit
)
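The shapley_n_permutations and random_state parameters suggest a permutation-sampling estimate of Shapley values. The sketch below shows that general technique on a toy three-player game. It is an assumption about the style of estimator, not DataTypical's code, and `sampled_shapley` and `toy_value` are hypothetical names.

```python
import random


def sampled_shapley(n_players, value_fn, n_permutations=100, random_state=None):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(random_state)
    phi = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_permutations):
        rng.shuffle(players)
        coalition = set()
        previous = value_fn(coalition)
        for p in players:
            coalition.add(p)
            current = value_fn(coalition)
            phi[p] += current - previous
            previous = current
    return [total / n_permutations for total in phi]


def toy_value(coalition):
    """Toy game: players 0 and 1 only pay off together; player 2 adds 1 alone."""
    return (4.0 if {0, 1} <= coalition else 0.0) + (1.0 if 2 in coalition else 0.0)


a = sampled_shapley(3, toy_value, n_permutations=100, random_state=0)
b = sampled_shapley(3, toy_value, n_permutations=100, random_state=0)
print(a == b)   # same seed, same estimate
print(sum(a))   # each ordering's gains sum to the full-coalition value
```

More permutations reduce the sampling noise in the per-player estimates at a proportional cost in run time, and fixing the seed makes a run repeatable.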

Selective Computation

Use selected_significance when you only need one type, to skip the others and save compute time:

# Only compute archetypal rankings and Shapley
dt = DataTypical(selected_significance='archetypal', shapley_mode=True)
results = dt.fit_transform(data)
# → archetypal_rank computed; other ranks are NaN

Visualization Functions

from datatypical_viz import significance_plot, heatmap, profile_plot

# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')

# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# 3. Individual sample profile (sample_idx: an index label from results)
profile_plot(dt, sample_idx, significance='archetypal', order='local')

Multi-Modal Support

Tabular Data

df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

Text Data

texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

Graph Networks

node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

Performance

Dataset Size	Without Shapley	With Shapley
1,000 samples	~5 seconds	~5 minutes
10,000 samples	~30 seconds	~60 minutes
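As a quick piece of arithmetic on the table, the relative cost of enabling Shapley can be read off directly. The snippet only restates the approximate timings above in seconds; it does not measure anything.

```python
# Timings copied from the table above, converted to seconds.
timings = {
    1_000: {"without_shapley": 5, "with_shapley": 5 * 60},
    10_000: {"without_shapley": 30, "with_shapley": 60 * 60},
}

slowdowns = {
    n_samples: t["with_shapley"] / t["without_shapley"]
    for n_samples, t in timings.items()
}

for n_samples, factor in slowdowns.items():
    print(f"{n_samples:>6} samples: Shapley is ~{factor:.0f}x slower")
```

On these figures Shapley costs roughly 60x at 1,000 samples and 120x at 10,000, so the gap widens as the dataset grows.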

Optimization Strategy:

1. Fast exploration (fast_mode=True, no Shapley)
2. Identify interesting samples
3. Detailed analysis (shapley_mode=True, subset)
4. Generate publication figures
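The second and third steps of the strategy above come down to choosing a subset from a fast first pass and rerunning the detailed analysis on only those rows. A minimal sketch of the selection step follows; `top_n_indices` is a hypothetical helper and the rank values are made up.

```python
def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Stand-in for a rank column from a fast first pass (values are made up).
archetypal_rank = [0.10, 0.95, 0.40, 0.88, 0.67, 0.02]
subset = top_n_indices(archetypal_rank, 3)
print(subset)
```

With a pandas DataFrame, the selected positions could then be passed to `data.iloc[subset]` before fitting again with shapley_mode=True.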

Use Cases

Scientific Discovery: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification

Dataset Curation: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling

Model Understanding: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations

What Makes DataTypical Different

Compared with outlier detection: Finds extremes AND explains why

Compared with clustering: Finds representatives maximizing coverage AND explains why

Compared with feature selection: Explains which features matter for which samples

Compared with PCA/t-SNE: Maintains interpretability in original feature space

The Novel Contribution: Formative instances distinguish samples that ARE significant from samples that CREATE structure. This enables redundancy detection, identification of structurally important samples, and a clear view of which samples are irreplaceable and which are interchangeable.
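"Representatives maximizing coverage" is the classic facility location objective. The sketch below shows a generic greedy selection for that objective on toy 2-D points. The similarity function and the function name are assumptions for illustration, and nothing here is taken from DataTypical's source.

```python
import math


def greedy_facility_location(points, k):
    """Greedily pick k points that maximize total similarity coverage."""
    n = len(points)
    # Similarity: larger when two points are closer together.
    sim = [[1.0 / (1.0 + math.dist(p, q)) for q in points] for p in points]
    best_cover = [0.0] * n   # best similarity of each point to any selected one
    selected = []
    for _ in range(min(k, n)):
        # Marginal gain in coverage from adding each remaining candidate.
        gains = {
            c: sum(max(sim[c][j] - best_cover[j], 0.0) for j in range(n))
            for c in range(n)
            if c not in selected
        }
        candidate = max(gains, key=gains.get)
        selected.append(candidate)
        best_cover = [max(best_cover[j], sim[candidate][j]) for j in range(n)]
    return selected


# Two tight groups: four points around (0, 0) and three around (5, 5).
points = [
    (0.0, 0.0), (0.2, 0.0), (-0.2, 0.0), (0.0, 0.2),
    (5.0, 5.0), (5.2, 5.0), (4.8, 5.0),
]
print(greedy_facility_location(points, 2))
```

The greedy pass picks the center of the larger group first and then the center of the second group, since a second pick inside the first group would add almost no new coverage.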

What's New in v0.7.6

selected_significance parameter: Compute only one significance type ('archetypal', 'prototypical', or 'stereotypical') and skip the rest, reducing compute time substantially
Fixed prototype transform: transform() on new data now uses stored training prototype vectors, not indices into the new data
Text Shapley support: Full Shapley analysis (formative + explanations) now runs correctly on text data paths
Robustness fixes: Fixed iterator exhaustion in text methods, fixed index misalignment in stereotypical Shapley explanations when subsampling, improved error messages
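The prototype transform fix describes a general pitfall: row indices chosen at fit time mean nothing when applied to new data, so the prototype vectors themselves need to be stored. The toy class below illustrates the corrected pattern. `PrototypeScorer` is a hypothetical stand-in and not DataTypical's class.

```python
import math


class PrototypeScorer:
    """Toy model: score samples by distance to prototypes chosen at fit time."""

    def fit(self, train, prototype_indices):
        # Store the prototype vectors themselves, not just their row indices:
        # indices are only meaningful for the training data.
        self.prototypes_ = [tuple(train[i]) for i in prototype_indices]
        return self

    def transform(self, data):
        # Distance from each row to its nearest stored training prototype.
        return [min(math.dist(row, p) for p in self.prototypes_) for row in data]


train = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scorer = PrototypeScorer().fit(train, prototype_indices=[0, 2])

new_data = [(9.0, 10.0)]          # a single new row: index 2 does not exist here
print(scorer.transform(new_data))
```

Had the scorer kept only the indices `[0, 2]` and looked them up in `new_data`, the lookup would have failed or silently used the wrong rows.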

Documentation

Complete documentation, examples, and guides available at:
https://github.com/amaxiom/DataTypical

Includes:

Getting started tutorials
Comprehensive examples across scientific domains
Visualization interpretation guides
Advanced usage and computation details
Test suite and benchmarks

Support

GitHub Repository: https://github.com/amaxiom/DataTypical
Report Issues: https://github.com/amaxiom/DataTypical/issues
Questions & Discussions: https://github.com/amaxiom/DataTypical/discussions

Requirements

Python ≥ 3.8
NumPy ≥ 1.20
Pandas ≥ 1.3
SciPy ≥ 1.7
scikit-learn ≥ 1.0
Matplotlib ≥ 3.3
Seaborn ≥ 0.11
Numba ≥ 0.55

Citation

If you use DataTypical in your research, please cite:

@software{datatypical2026,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7.6},
  doi = {10.5281/zenodo.18666410}
}

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

Acknowledgments

DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).

Release history

Version	Released	Notes
0.7.7	Jun 3, 2026	
0.7.6	May 19, 2026	This version
0.7.5	Mar 5, 2026	
0.7.4	Feb 17, 2026	
0.7.3	Feb 14, 2026	
0.7.2	Feb 12, 2026	
0.7.1	Feb 4, 2026	
0.7.0	Feb 1, 2026	Yanked (numba dependency error)

Download files

Distribution	File	Size	Uploaded	Tags
Source Distribution	datatypical-0.7.6.tar.gz	46.9 kB	May 19, 2026	Source
Built Distribution	datatypical-0.7.6-py3-none-any.whl	46.8 kB	May 19, 2026	Python 3

File details: datatypical-0.7.6.tar.gz

Download URL: datatypical-0.7.6.tar.gz
Upload date: May 19, 2026
Size: 46.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

Algorithm	Hash digest
SHA256	`8b31879a3279efc03e4d3622be7131c91457a5d9f7d27fdb6205b908fc3977cc`
MD5	`be4f914516669aac470fe77167ff04bb`
BLAKE2b-256	`09285e24b822179f2057b05655cdb29374f9490d17c6a87ab051a0f488b91b63`

File details: datatypical-0.7.6-py3-none-any.whl

Download URL: datatypical-0.7.6-py3-none-any.whl
Upload date: May 19, 2026
Size: 46.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

Algorithm	Hash digest
SHA256	`6209a56acb6966ff2d071dd2c02ca858daf6c3f32b9362bb0fa68ea38a51548a`
MD5	`59cb0330bf9e6f754c3b657a2ddddea6`
BLAKE2b-256	`634a2ce2653270ccd8e37751cc85971953da15181fb15db3da87b83466d91e6a`
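The SHA256 digests listed above can be checked against a downloaded file with the standard library alone. This is a minimal sketch. The helper names `sha256_of` and `matches` are hypothetical, and the local file path is whatever name the download was saved under.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA-256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, a call would look like `matches("datatypical-0.7.6.tar.gz", "8b31879a3279efc03e4d3622be7131c91457a5d9f7d27fdb6205b908fc3977cc")`, assuming the archive sits in the current directory.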
