Explainable instance significance discovery for scientific datasets
DataTypical
DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.
Key Features
- Three Significance Types: Archetypal, prototypical, stereotypical (all computed simultaneously)
- Shapley Explanations: Feature-level attributions for why samples are significant
- Formative Discovery: Distinguish samples that ARE significant from those that CREATE the significance structure
- Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
- Multi-Modal Support: Tabular data, text, and graph networks through unified API
- Performance Optimized: Fast exploration mode and efficient Shapley computation
Installation
pip install datatypical
Quick Start
from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd
# Load your data
data = pd.read_csv('your_data.csv')
# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)
# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])
# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')
# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')
What DataTypical Does
Three Complementary Lenses
| Lens | Finds | Use Cases |
|---|---|---|
| Archetypal | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
| Prototypical | Representative, central samples | Dataset summarization, cluster centers, data coverage |
| Stereotypical | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |
The Power: All three are computed simultaneously, so a single run gives three complementary views of the same dataset.
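To make the three lenses concrete, here is a deliberately crude sketch in plain NumPy. It is not DataTypical's algorithm: it assumes "extreme" means far from the standardized centroid, "representative" means close to it, and "target-like" means a high target value. The helper names `rank01` and `toy_lenses` are hypothetical and exist only for this illustration.

```python
import numpy as np

def rank01(scores):
    """Map raw scores to normalized ranks in [0, 1] (largest score -> 1.0)."""
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort().argsort()  # 0 = smallest score
    return order / max(len(scores) - 1, 1)

def toy_lenses(X, target):
    """Crude stand-ins for the three lenses (assumed definitions, see lead-in)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize features
    dist_to_center = np.linalg.norm(Z, axis=1)
    return {
        "archetypal": rank01(dist_to_center),     # far from the centre = extreme
        "prototypical": rank01(-dist_to_center),  # near the centre = representative
        "stereotypical": rank01(target),          # high target value = target-like
    }

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.2, 0.9], [0.8, 1.1]])
y = np.array([0.1, 0.5, 0.9, 0.4, 0.7])
lenses = toy_lenses(X, y)
for name, ranks in lenses.items():
    print(name, "-> top sample:", int(ranks.argmax()))
```

Even in this toy version the lenses disagree about which sample matters most, which is the reason for looking at all three together.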
Dual Perspective (with Shapley)
When shapley_mode=True, DataTypical reveals two views:
- Actual Significance (*_rank): Samples that ARE significant
- Formative Significance (*_shapley_rank): Samples that CREATE the structure
This distinction between what IS significant and what CREATES structure is unique to DataTypical.
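The idea of a sample "creating" structure can be sketched with a Monte Carlo permutation estimate of Shapley values over samples. The value function below (total feature range spanned by a subset) is an assumption chosen for brevity, not the one DataTypical uses, and `extent` and `permutation_shapley` are hypothetical names.

```python
import numpy as np

def extent(X, members):
    """Value of a subset: total feature range (max - min) spanned by its members."""
    if len(members) < 2:
        return 0.0
    sub = X[members]
    return float((sub.max(axis=0) - sub.min(axis=0)).sum())

def permutation_shapley(X, n_permutations=200, seed=0):
    """Estimate each sample's Shapley share of `extent` by random permutations."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    phi = np.zeros(n)
    for _ in range(n_permutations):
        members, prev = [], 0.0
        for i in rng.permutation(n):
            members.append(int(i))
            value = extent(X, members)
            phi[i] += value - prev  # marginal contribution of sample i
            prev = value
    return phi / n_permutations

# Two near-duplicate points in one corner and a single point in the other.
X = np.array([[0.0, 0.0], [0.05, 0.0], [3.0, 3.0]])
phi = permutation_shapley(X)
print(phi.round(2), "sum =", round(phi.sum(), 6))
```

All three points sit on the boundary of the data, yet the isolated point receives the largest share because nothing else can stand in for it, while the two near-duplicates split the credit for their corner. That is the flavour of the "irreplaceable vs interchangeable" distinction, under the stated assumptions.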
Example: Drug Discovery
# Analyze compound library
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)
# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")
# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")
# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism
Discovery: Multiple structural pathways to high activity.
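The two filters in the example can be folded into one labelling step. The sketch below reuses the column names and the 0.8 / 0.3 thresholds from the example; the helper `label_quadrants` and the compound labels in the demo frame are hypothetical, and the demo values are made up purely to exercise the code.

```python
import pandas as pd

def label_quadrants(results, lens="stereotypical", high=0.8, low=0.3):
    """Label each row as 'critical', 'redundant', or 'other' for one lens."""
    actual = results[f"{lens}_rank"]
    formative = results[f"{lens}_shapley_rank"]
    labels = pd.Series("other", index=results.index, name=f"{lens}_quadrant")
    labels[(actual > high) & (formative > high)] = "critical"   # high actual + high formative
    labels[(actual > high) & (formative < low)] = "redundant"   # high actual + low formative
    return labels

# Synthetic stand-in for the output of fit_transform (values are invented).
demo = pd.DataFrame(
    {
        "stereotypical_rank": [0.95, 0.90, 0.40, 0.85],
        "stereotypical_shapley_rank": [0.92, 0.10, 0.95, 0.50],
    },
    index=["cmpd_a", "cmpd_b", "cmpd_c", "cmpd_d"],
)
quadrants = label_quadrants(demo)
print(quadrants.value_counts())
```

Passing `lens="archetypal"` or `lens="prototypical"` would apply the same split to the other two perspectives, assuming the corresponding `*_shapley_rank` columns are present.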
Key Parameters
DataTypical(
    shapley_mode=False,          # True for explanations
    fast_mode=True,              # False for publication quality
    n_archetypes=8,              # Number of extreme corners
    n_prototypes=8,              # Number of representatives
    stereotype_column=None,      # Target column for stereotypical
    shapley_top_n=500,           # Limit explanations to top N
    shapley_n_permutations=100,  # Number of permutations
    random_state=None,           # Set for reproducible results
    max_memory_mb=8000           # Memory limit
)
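One way to keep runs consistent is to hold these settings in named presets. This is only a sketch: the parameter names come from the listing above, while the preset names, the `make_kwargs` helper, and the seed value 42 are illustrative choices.

```python
# Exploration preset: the documented defaults for the two mode switches.
EXPLORE = dict(shapley_mode=False, fast_mode=True)

# Publication preset: explanations on, fast mode off, fixed seed.
PUBLISH = dict(
    shapley_mode=True,
    fast_mode=False,
    shapley_top_n=500,            # documented default
    shapley_n_permutations=100,   # documented default
    random_state=42,              # arbitrary seed, chosen here for reproducibility
)

def make_kwargs(preset, **overrides):
    """Merge a preset with per-run overrides (overrides win)."""
    return {**preset, **overrides}

kwargs = make_kwargs(PUBLISH, stereotype_column="activity")
print(kwargs)
# dt = DataTypical(**kwargs)  # requires `from datatypical import DataTypical`
```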
Visualization Functions
from datatypical_viz import significance_plot, heatmap, profile_plot
# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')
# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)
# 3. Individual sample profile
profile_plot(dt, sample_idx, significance='archetypal', order='local')
Multi-Modal Support
Tabular Data
df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)
Text Data
texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)
Graph Networks
node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)
Performance
| Dataset Size | Without Shapley | With Shapley |
|---|---|---|
| 1,000 samples | ~5 seconds | ~5 minutes |
| 10,000 samples | ~30 seconds | ~60 minutes |
Optimization Strategy:
- Fast exploration (fast_mode=True, no Shapley) - Identify interesting samples
- Detailed analysis (shapley_mode=True, subset) - Generate publication figures
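The two-stage strategy above might be wired together as follows. The `shortlist` helper is hypothetical, and the demo frame stands in for the output of a fast first pass; only the `*_rank` column names are taken from this README.

```python
import pandas as pd

def shortlist(results, rank_columns, top_n):
    """Union of the top_n row labels under each rank column, in original order."""
    keep = set()
    for col in rank_columns:
        keep.update(results[col].nlargest(top_n).index)
    return [idx for idx in results.index if idx in keep]

# Synthetic stand-in for a fast-mode result (values are invented).
stage1 = pd.DataFrame(
    {
        "archetypal_rank": [0.10, 0.90, 0.50, 1.00, 0.30],
        "prototypical_rank": [1.00, 0.20, 0.80, 0.10, 0.60],
    }
)
picked = shortlist(stage1, ["archetypal_rank", "prototypical_rank"], top_n=2)
print(picked)

# Second stage (not run here): re-analyze only the shortlisted rows.
# subset = data.loc[picked]
# detailed = DataTypical(shapley_mode=True, fast_mode=False).fit_transform(subset)
```

Ranks computed on the subset are relative to the subset rather than the full dataset, so they are best read alongside the first-pass ranks. The `shapley_top_n` parameter may already cover part of this workflow inside the library; the sketch makes no assumption about how it behaves.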
Use Cases
Scientific Discovery: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification
Dataset Curation: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling
Model Understanding: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations
What Makes DataTypical Different
From outlier detection: Finds extremes AND explains why
From clustering: Finds representatives maximizing coverage AND explains why
From feature selection: Explains which features matter for which samples
From PCA/t-SNE: Maintains interpretability in original feature space
The Novel Contribution: Formative instances distinguish samples that ARE significant from samples that CREATE structure, enabling redundancy detection, identifying structurally important samples, and understanding irreplaceable vs interchangeable samples.
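For readers unfamiliar with "representatives maximizing coverage", the facility location work cited in the Acknowledgments suggests a greedy selection like the sketch below. Whether DataTypical uses exactly this procedure is an assumption; the similarity kernel `exp(-distance)` and the function name are illustrative only.

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedily pick k samples maximizing sum_j max_{i in chosen} sim[j, i]."""
    n = sim.shape[0]
    best = np.zeros(n)  # how well each sample is covered so far
    chosen = []
    for _ in range(min(k, n)):
        # Total coverage if each candidate were added, minus current coverage.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        if chosen:
            gains[chosen] = -np.inf  # never pick the same sample twice
        pick = int(gains.argmax())
        chosen.append(pick)
        best = np.maximum(best, sim[:, pick])
    return chosen

# One cluster of three points near the origin, one cluster of two near (5, 5).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sim = np.exp(-d)
reps = greedy_facility_location(sim, k=2)
print(reps)
```

With two representatives, the greedy pass takes one sample from each cluster rather than two from the larger one, which is the coverage behaviour that separates this from simply picking the densest points. For monotone submodular objectives such as this one, the greedy solution is known to be within a factor of 1 - 1/e of the optimum (Nemhauser et al., 1978).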
Documentation
Complete documentation, examples, and guides are available at:
https://github.com/amaxiom/DataTypical
Includes:
- Getting started tutorials
- Comprehensive examples across scientific domains
- Visualization interpretation guides
- Advanced usage and computation details
- Test suite and benchmarks
Support
- GitHub Repository: https://github.com/amaxiom/DataTypical
- Report Issues: https://github.com/amaxiom/DataTypical/issues
- Questions & Discussions: https://github.com/amaxiom/DataTypical/discussions
Requirements
- Python ≥ 3.8
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- scikit-learn ≥ 1.0
- Matplotlib ≥ 3.3
- Seaborn ≥ 0.11
- Numba ≥ 0.55
Citation
If you use DataTypical in your research, please cite:
@software{datatypical2025,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7}
}
License
MIT License - Copyright (c) 2026 Amanda S. Barnard
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
Acknowledgments
DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
File details
Details for the file datatypical-0.7.3.tar.gz.
File metadata
- Download URL: datatypical-0.7.3.tar.gz
- Upload date:
- Size: 43.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ed54ecb68e779d0bf5498a27861885ea1d05e710152241cf0f25d7162d6d6bd |
| MD5 | cad1dda9febd7a2a17373662110c81fe |
| BLAKE2b-256 | 9d5188a13289aa311cc1730310c798411d4922dd15bf69a5749fb619cbc6e137 |
File details
Details for the file datatypical-0.7.3-py3-none-any.whl.
File metadata
- Download URL: datatypical-0.7.3-py3-none-any.whl
- Upload date:
- Size: 44.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e85f490cb36f3955aa9bb3366a5b9ebc79f10bf26716495f4a972d475a2be8e |
| MD5 | 43eeb8af78c187bde9db54f22e1c9ce9 |
| BLAKE2b-256 | 69e4b89a78d8c3d75157993ffc464df4f4b80a9a2505ab57dbb22edf12598287 |